Superlinked

Tweets

Superlinked

@superlinked

Jun 11

Your search stack probably doesn't need more GPUs. You could make efficiency gains with the ones you already have. Modern retrieval runs four model types: dense, sparse, ColBERT, and a cross-encoder reranker. The usual setup gives each its own container and GPU pool, and most of that capacity sits idle while you pay for all of it. In his Berlin Buzzwords 2026 talk, @F_Makraduli shows how to serve all four from one process on a single GPU using SIE, our open-source inference engine. A model like BGE-M3 returns dense, sparse, and multi-vector embeddings from one encode call. Reranking runs through the same API. Same server, same GPU. Real BEIR and L4 benchmark data included, plus the honest tradeoffs: when multi-model serving on one GPU wins, and when a model still deserves its own box. If you run hybrid search or RAG, this one is for you! Watch the talk and read the breakdown: buff.ly/FdevCr6

Superlinked

@superlinked

Jun 4

Shopify's product taxonomy goes 8 levels deep with over 10,000 categories. Our very own Andrey Pikunov systematically evaluated four completely different classification approaches on a 1,790-category slice of it (up to 8 levels deep), so you don't have to: zero-shot NLI, text retrieval, image retrieval with CLIP/SigLIP, and cross-encoder reranking. Text retrieval with Stella won (hierarchical F1 of 0.425 strict). Reranking didn't help. Image retrieval held up surprisingly well on its own. Full walkthrough: buff.ly/97m4heC
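
For readers new to the metric, here is a minimal sketch of one common set-based definition of hierarchical F1: a prediction gets partial credit for every ancestor it shares with the true category. This is an assumption for illustration, not necessarily the exact "strict" variant used in the evaluation, and the category paths are made up.

```python
def ancestors(path):
    """All prefixes of a category path: the node itself plus its ancestors."""
    return {tuple(path[:i]) for i in range(1, len(path) + 1)}


def hierarchical_f1(true_path, pred_path):
    """Set-based hierarchical F1 for a single prediction."""
    true_nodes = ancestors(true_path)
    pred_nodes = ancestors(pred_path)
    overlap = len(true_nodes & pred_nodes)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_nodes)
    recall = overlap / len(true_nodes)
    return 2 * precision * recall / (precision + recall)


# Hypothetical category paths, not taken from Shopify's taxonomy.
truth = ("Electronics", "Audio", "Headphones")
near_miss = ("Electronics", "Audio", "Speakers")   # right branch, wrong leaf
too_shallow = ("Electronics",)                      # correct but not specific

for name, pred in [("exact", truth), ("near miss", near_miss), ("too shallow", too_shallow)]:
    print(name, round(hierarchical_f1(truth, pred), 3))
```

A wrong leaf in the right branch still scores well above zero, which is why hierarchical scores and flat accuracy can tell quite different stories on a deep taxonomy.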

Superlinked

@superlinked

May 21

Give your agent a memory that holds up past the fiftieth markdown file. OpenClaw stores everything your AI agent has ever logged as plain markdown. Readable, portable, and completely unsearchable once the folder hits critical mass. Grep finds the word you typed. SIE finds the meaning you remember. Slot SIE in as the semantic memory layer and your agent can ask "what did we decide about the retry logic last Tuesday" and get the actual chunk back. Local embeddings, content-hash deduping, and no re-embedding the bits that haven't changed. Follow the build here: buff.ly/quvjlAV
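
The content-hash idea can be sketched in a few lines. This is a generic illustration, not the code from the build: `sync_chunks`, the in-memory `index` dict, and `fake_embed` are hypothetical stand-ins for a real store and a real embedding call.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a chunk's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sync_chunks(chunks, index, embed):
    """Embed only chunks whose hash is not already in the index.

    index: dict mapping content hash -> embedding (hypothetical store)
    embed: callable str -> list[float] (stand-in for a real embedding call)
    Returns how many chunks were actually embedded.
    """
    embedded = 0
    for text in chunks:
        key = content_hash(text)
        if key in index:
            continue  # duplicate or unchanged chunk: nothing to do
        index[key] = embed(text)
        embedded += 1
    return embedded


def fake_embed(text):
    # Placeholder vector so the sketch runs without a model.
    return [float(len(text))]


index = {}
first = sync_chunks(["retry logic: use backoff", "deploy on Friday"], index, fake_embed)
second = sync_chunks(["retry logic: use backoff", "deploy on Monday"], index, fake_embed)
print(first, second)
```

On the second sync only the edited chunk is embedded; the unchanged one is recognised by its hash and skipped.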

Superlinked retweeted

Hasan Toor

@hasantoxr

May 13

I'm replacing OpenAI, Cohere, and AWS Comprehend with one open-source server. It's called SIE.
One docker run gets you 85 models behind three API calls:
→ encode() for embeddings (Stella, BGE-M3, SPLADE)
→ score() for reranking (BGE-reranker v2)
→ extract() for named entity recognition (GLiNER, Florence-2)
The cost difference is brutal.
AWS Comprehend entity extraction → $5,000/month
Same workload on a spot A10G with SIE → $5/month
That's the same models, your own cloud, and a 1000x cheaper bill.
It ships the full production stack out of the box:
→ OpenAI-compatible /v1/embeddings (swap the base URL and you're done)
→ KEDA autoscaling on Kubernetes
→ Terraform modules for GKE and EKS
→ Grafana dashboards
→ All 85 models quality-verified against MTEB in CI
Native integrations with LangChain, LlamaIndex, Haystack, DSPy, CrewAI, Chroma, Qdrant, and Weaviate.
Your data never leaves your VPC. Apache 2.0. Built by Superlinked.

Superlinked

@superlinked

May 19

We just launched native @trychroma support for the Superlinked Inference Engine. If you're using ChromaDB, you can now use SIE as your embedding function with a one-line swap. That gives you access to 85 SOTA models, including sparse embeddings for Chroma Cloud's hybrid search and multimodal models like CLIP for image search, all running in your own cloud.
pip install sie-chroma
Check it out in our docs: buff.ly/VieMrxM

Superlinked

@superlinked

May 14

Clone our latest SIE example and you have a full product search engine running on your laptop in five minutes. Type “wireless bluetooth headphones”, get ranked Amazon products back with extracted brand, color, and material filters. All three capabilities (extract, encode, score) run on one local SIE server through three SDK calls. No vector DB to provision. No separate reranker service. No hand-rolled regex for attributes. One Docker container, one SDK, one pipeline. Sounds impressive? Go have a look at the full build: buff.ly/eGa7Pt4

Superlinked

@superlinked

May 12

We just launched native @Weaviate support for the Superlinked Inference Engine. The interesting one here is SIEDocumentEnricher. It combines embedding with entity extraction and classification at index time, which means Weaviate's Query Agent gets a rich metadata surface to work with. So a natural language query like "show me legal documents mentioning Google" resolves into the right vector search plus filters automatically.
pip install sie-weaviate
Check it out in our docs: buff.ly/hAjuC1M

Superlinked retweeted

Antoine Chaffin

@antoine_chaffin

May 7

Just watched this talk from @f_makraduli. It is very interesting because it highlights a key takeaway: specialised models outperform LLMs for specific tasks (routing, retrieval, reranking), but serving them is challenging because it is less explored and there are a lot of different models/architectures/inputs/outputs. As someone who worked on serving my two loved ones (ModernBERT and ColBERT), this resonates!

Superlinked

@superlinked

May 6

Most embedding infrastructure assumes you know exactly which model you want ahead of time. This talk starts where that assumption breaks. @f_makraduli walks through the real profiling mistakes, infrastructure gaps, and production constraints that led to building an embedding inference engine designed for dynamic model loading, hot-swapping, and memory-aware eviction instead of brittle one-model-per-container deployments. If you're working on small-model inference, embeddings, or GPU infrastructure, this is a practical look at what breaks in the real world and how to design around it. Check it out here: buff.ly/S1HZCZB Dive into the SIE repo here: buff.ly/EBnNglg
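
To make "memory-aware eviction" concrete, here is a minimal sketch of a least-recently-used model cache bounded by a memory budget. It is an illustration of the general idea only, not SIE's implementation: the `ModelCache` class, the model names, and the sizes are all hypothetical.

```python
from collections import OrderedDict


class ModelCache:
    """Keeps models resident up to a memory budget, evicting the least recently used."""

    def __init__(self, budget_mb, sizes_mb, load):
        self.budget_mb = budget_mb
        self.sizes_mb = sizes_mb        # name -> size in MB, known before loading
        self.load = load                # callable: name -> model object
        self.resident = OrderedDict()   # name -> model, least recently used first
        self.used_mb = 0

    def get(self, name):
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as most recently used
            return self.resident[name]
        size = self.sizes_mb[name]
        if size > self.budget_mb:
            raise MemoryError(f"{name} needs {size} MB, budget is {self.budget_mb} MB")
        # Evict least recently used models until the new one fits.
        while self.used_mb + size > self.budget_mb:
            evicted, _ = self.resident.popitem(last=False)
            self.used_mb -= self.sizes_mb[evicted]
        self.resident[name] = self.load(name)
        self.used_mb += size
        return self.resident[name]


# Hypothetical model sizes and a fake loader so the sketch runs anywhere.
SIZES = {"dense": 600, "sparse": 500, "reranker": 700}
cache = ModelCache(budget_mb=1500, sizes_mb=SIZES, load=lambda name: f"<{name} model>")

cache.get("dense")
cache.get("sparse")
cache.get("dense")      # touch dense, so sparse becomes the eviction candidate
cache.get("reranker")   # does not fit alongside both, so sparse is evicted
print(list(cache.resident), cache.used_mb)
```

A real engine would also have to deal with in-flight requests and actual GPU memory accounting, which this sketch deliberately leaves out.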

Superlinked

@superlinked

May 4

We're now a native Haystack integration. The sie-haystack package gives you SIE embedders (dense, sparse, ColBERT, image), cross-encoder rerankers, and zero-shot extractors as first-class Haystack 2.0 components. Everything routes through one endpoint, so you can build a full RAG pipeline, swap models with a config change, and not spin up new infrastructure for each one.
pip install sie-haystack
Check it out in our docs: buff.ly/be2Crv7

Superlinked

@superlinked

Apr 14

We sometimes hear clients talking about running 700B-parameter models, but most AI tasks don’t actually need them! A huge amount of real-world work can be done with small, task-specific models. Instead of forcing one giant model to do everything, you combine a few specialized models to solve the problem. Because these models are only a few billion parameters, they fit comfortably on standard 16–24GB GPUs. That means lower latency, dramatically lower cost, and infrastructure that is much easier to run in your own cloud. This shift toward Small Language Models is a big part of what we discuss in our latest guest appearance on AI Powered Search, where @svonava gives a preview of the Superlinked Inference Engine and how we think about running many models in production. If you are building AI systems today, it is worth asking whether the biggest model is really the right tool for the job. Thanks to @treygrainger and @softwaredoug for having us!

Superlinked

@superlinked

Apr 7

How long does it take your team to get a new model into production? If the answer is anything more than a config change, @f_makraduli's talk at AI Engineer Europe is worth your time. On April 10th, Filip will walk through the small-model infrastructure problem we kept finding one layer deeper than expected, and what we built in response. The short version: five small models should not require five GPUs running at single-digit utilisation. A new model on HuggingFace should not require days of Docker builds and infra tickets. And the gap between a working model server and a production system that scales, monitors itself and costs nothing when idle should not require months of in-house work. We fixed all three. Come find us in London to find out how. buff.ly/nnWlpYU #AIEngineer #Embeddings #MLOps #OpenSource #Superlinked

Superlinked

@superlinked

Apr 1

We would like to announce that our co-founders Daniel and Ben have launched a side hustle. SUPER INKED Tattoo Studio will be opening its books to paying customers April 1st at 12pm PST. That’s right, they dropped the L, because in this business *we don’t take no Ls.* Ben has 1 month of experience with a tattoo gun and has been using Daniel as a test-dummy on a daily basis. We call his technique “vibe tatting” and so far the results have been great, enabling Daniel to fulfill his dream of having an entire arm sleeve of vague illegible scribbles. SUPER INKED can be trusted to produce the best quality, definitely not AI-generated flash sheets, with designs that will be professionally embedded, just like a vector (but more permanent) into your skin by one of our tech-team-turned-tattooists. Like this post for 50% off face tattoos!

Superlinked

@superlinked

Mar 19

Right now SO many companies are paying per token for LLM APIs. At scale, that gets expensive very quickly. What’s interesting is that in many cases there are open models with similar capabilities that you can run yourself. The difference is that instead of paying per token, you are paying for GPU infrastructure. The gap between those two models of pricing can easily be one or two orders of magnitude. That is why more teams are starting to look seriously at self-hosting. If you can run the models reliably in your own environment, the cost savings become hard to ignore. @Svonava talks about this shift and why infrastructure for running many specialized models efficiently is becoming an important part of modern AI systems.
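
The per-token versus per-GPU comparison comes down to simple arithmetic. The sketch below uses placeholder prices of $1 per million tokens and $1 per GPU-hour purely to show the shape of the calculation; they are assumptions, not quotes from any provider, and the sketch also assumes one GPU can actually keep up with the workload.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def api_cost(tokens_per_month, usd_per_million_tokens):
    """Monthly bill when paying per token."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def gpu_cost(usd_per_hour, hours=HOURS_PER_MONTH):
    """Monthly bill for one always-on GPU, regardless of traffic."""
    return usd_per_hour * hours


def break_even_tokens(usd_per_million_tokens, usd_per_hour, hours=HOURS_PER_MONTH):
    """Monthly token volume above which the GPU is the cheaper option."""
    return gpu_cost(usd_per_hour, hours) / usd_per_million_tokens * 1_000_000


# Placeholder prices (assumptions for illustration only).
API_PRICE = 1.0   # USD per million tokens
GPU_PRICE = 1.0   # USD per hour

threshold = break_even_tokens(API_PRICE, GPU_PRICE)
for tokens in (100e6, threshold, 10e9):
    print(f"{tokens:>16,.0f} tokens: API ${api_cost(tokens, API_PRICE):,.0f} vs GPU ${gpu_cost(GPU_PRICE):,.0f}")
```

The API bill grows linearly with volume while the GPU bill stays flat, so the gap keeps widening once you are past the break-even point.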

Superlinked

@superlinked

Mar 17

Self-hosting small models has become an increasingly hot topic of late, but where's the evidence? The team was in Belgrade last week, presenting alongside @TopK and @Perplexity, answering that exact question. @f_makraduli presented "The Case for Self-Hosting Small Models". *TLDR: Small models are quietly winning in production AI.* Open source has exploded to over 2.6M models, and open-weight systems are now only about 1 to 3 months behind proprietary frontier models. In some cases, they already match top-tier performance at a fraction of the cost. At the same time, task-specific models consistently outperform general LLMs where it matters. They are faster, cheaper, easier to run, and trained on more relevant data. That is why they power things like search, ranking, and extraction in real systems today. It appears the future is not one giant model, but many smaller models doing specific jobs to a better standard. Thanks to @KayaVC for the invite!

Superlinked

@superlinked

Mar 10

GPUs can deliver hundreds of TFLOPS, so why are they often underutilised during inference? Because the real constraint is often memory bandwidth, not compute. With small batches, GPUs spend much of their time waiting for data to move through memory. The compute cores sit idle because weights and activations cannot be fetched fast enough. Increase the batch size and things start to change. Memory access becomes more efficient, the GPU stays busy doing matrix multiplications, and the bottleneck shifts from memory bandwidth to raw compute. That transition is key to understanding why batching matters so much for inference performance. Filip's article breaks down this shift clearly and explains how it shapes real world GPU utilization. Check it out here: buff.ly/E0dbSHD
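
A back-of-the-envelope version of that shift, as a rough sketch rather than a summary of the article: compare the FLOPs per byte of a single linear layer with the ratio of a GPU's peak compute to its memory bandwidth. The hardware figures below (100 TFLOPS, 1 TB/s) are round illustrative numbers, not the specs of any particular card, and the model assumes each tensor crosses memory exactly once.

```python
def linear_layer_intensity(batch, d_in, d_out, bytes_per_value=2):
    """FLOPs per byte moved for y = x @ W in fp16, each tensor read or written once."""
    flops = 2 * batch * d_in * d_out                 # one multiply and one add per weight per row
    bytes_moved = bytes_per_value * (
        d_in * d_out                                 # weights
        + batch * d_in                               # input activations
        + batch * d_out                              # output activations
    )
    return flops / bytes_moved


# Illustrative round numbers (assumptions, not a real GPU).
PEAK_FLOPS = 100e12        # 100 TFLOPS
MEM_BANDWIDTH = 1e12       # 1 TB/s
RIDGE = PEAK_FLOPS / MEM_BANDWIDTH   # FLOPs per byte needed to keep the cores busy


def regime(batch, d_in=4096, d_out=4096):
    intensity = linear_layer_intensity(batch, d_in, d_out)
    return "compute-bound" if intensity >= RIDGE else "memory-bound"


for batch in (1, 16, 64, 128, 256):
    intensity = linear_layer_intensity(batch, 4096, 4096)
    print(f"batch {batch:>3}: {intensity:6.1f} FLOPs/byte -> {regime(batch)}")
```

At batch size 1 the layer does roughly one FLOP per byte, far below the ridge point, so the cores mostly wait on memory. Larger batches reuse the same weights for more rows, which is what pushes the layer across the line.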

Superlinked

@superlinked

Feb 24

If you are running search or large scale data processing, you have probably experienced:
- Rising API costs.
- Experimenting until something breaks in production.
- Memory constraints and throughput ceilings that block real workloads.
We're working on an alternative...
On Feb 27 at 4 PM GMT, @Svonava will preview the Superlinked Inference Engine, our open source software for running Small Language Models in your own cloud. Join us on Maven Live for Optimizing Search & Data Processing with Self-hosted SLMs.
We’ll cover:
• When SLMs beat LLMs for search and data tasks
• How to support 35 model architectures and LoRAs in production
• Designing a multi model cluster pushing 1M tokens per second
• How teams cut 95% of managed API costs
Daniel will be joining AI-Search masterminds @treygrainger and @softwaredoug for this free lightning lesson! Join us here: buff.ly/amM5yBI

Superlinked

@superlinked

Jan 27

“System X is fast because it’s written in Rust.” Is this true 100% of the time? Most people assume embedding inference speed comes down to the code they write: Python versus Rust, the choice of framework, and so on. In practice, almost none of that is decisive. What really affects embedding latency is memory. GPUs are extremely fast at calculations but comparatively slow at moving data. Generating an embedding is mostly about reading and writing large model weights and intermediate tensors rather than crunching numbers. That is why techniques like Flash Attention (used by the popular inference server TEI) matter. They reorganise computation so more work stays in fast on-chip cache instead of repeatedly hitting slower GPU memory. Quantisation helps for the same reason. Smaller weights mean less data to move. If you want faster embeddings, start thinking about memory, cache locality, and data movement to realise some actual gains. Or better yet, read Filip’s full deep-dive on the matter here: buff.ly/Kq1y8kZ
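
The quantisation point can be put in numbers with a rough lower bound: the time it takes just to stream the weights out of GPU memory once. The model size (500M parameters) and bandwidth (1 TB/s) below are illustrative assumptions, not measurements of any real model or card.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_read_ms(num_params, dtype, bandwidth_bytes_per_s):
    """Lower bound, in milliseconds, on reading every weight from GPU memory once."""
    total_bytes = num_params * BYTES_PER_PARAM[dtype]
    return total_bytes / bandwidth_bytes_per_s * 1000


# Illustrative assumptions only.
NUM_PARAMS = 500_000_000   # a 500M-parameter embedding model
BANDWIDTH = 1e12           # 1 TB/s of memory bandwidth

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: at least {weight_read_ms(NUM_PARAMS, dtype, BANDWIDTH):.2f} ms per forward pass")
```

Halving the bytes per weight halves this floor, whatever language the serving code is written in.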

Superlinked

@superlinked

Jan 22

Using open-source solutions to productionise your embeddings can get you a long way, but the efficiency problem that faces ML and AI Engineers still needs solving…
* Some models can generate dense, sparse, and multi vector embeddings in one pass, but today you usually need multiple API calls because these outputs are handled separately.
* Running and testing multiple models in production is costly and complex, with limited support for serving many models efficiently when VRAM is constrained.
* Differences in embeddings, pooling strategies, and model quirks require careful handling by users, and current systems lack flexible ways to support new model types without code changes.
@f_makraduli takes a deep dive into the existing open source inference solutions, what they do well, and what they’re ultimately missing to make everyone’s jobs easier (and to get the most out of your GPUs). Check out the article here: buff.ly/U4bsFOB

Superlinked

@superlinked

8 Jul 2025

Problems with your text-embedding models? Filip explains the common issues with the traditional approach to search embeddings. Superlinked has a smarter approach, using a MIXTURE of embeddings instead. Check out the video to find out more.
