Why I’m excited about @topk_io semantic_index as someone coming from the recommender systems community
Whether you’re building search or recsys, the journey of results displayed to users always starts at the retrieval layer. And the hardest places to do retrieval are usually hard for the same reason: the catalog of items (documents) never sits still. New items pour in constantly, existing ones change, and anything more than a few hours old is pretty much dead weight. This is true for news feeds, social platforms, and marketplaces. Freshness isn’t a nice-to-have - it defines the product.
In general, any large-scale retrieval system requires three things:
- A way to ingest (and re-ingest) and index (and re-index) your whole dataset at scale
- A scalable way to serve your search or recommendation queries (high QPS, low latency)
- and, of course, high recall
You might be thinking: “but the current solutions already achieve all three” and you are right.
But in achieving this, they usually trade off cost and freshness.
To get high quality, we can take the biggest embedding model we can serve (and accept either low write throughput or higher cost). Then, after the ingest is done, we build an index on top of this data, which means we need to wait for some amount of time until our data becomes available for querying. Put differently, until the index catches up we are serving old data (e.g., not reflecting real-time updates in the posts).
@topk_io semantic_index is different because the compromise on freshness was never something we were willing to accept. Instead, by looking at the system as a whole, we realized there exists another way if we carefully co-design all the core components: the model, the inference engine, and the database.
Half of the issue, we realized, lies in the reliance on a large model producing dense embeddings and then running a simple cosine similarity search at query time. There’s been a lot of work recently showing how this leads to suboptimal quality and that even small models with a more expressive similarity function (late interaction) can match and often even exceed much bigger models in quality. But a somewhat understated consequence of this is a changed balance of cost in the system - by selecting a smaller model, you can more easily scale to higher write throughput, but at the same time your queries become more expensive. Not a free lunch, but it significantly helps alleviate the first freshness bottleneck: slow writes.
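To make the difference concrete, here is a minimal sketch of single-vector cosine scoring next to a late-interaction score. It assumes the common MaxSim formulation (for each query token, take the best-matching document token, then sum), which is an assumption on my side and not necessarily the exact function used in semantic_index. The toy token vectors and the `mean_pool` helper are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_pool(token_vecs):
    """Collapse per-token vectors into one dense vector (hypothetical pooling)."""
    n = len(token_vecs)
    return [sum(col) / n for col in zip(*token_vecs)]

def dense_score(query_tokens, doc_tokens):
    """Single-vector scoring: one cosine comparison per (query, doc) pair."""
    return cosine(mean_pool(query_tokens), mean_pool(doc_tokens))

def maxsim_score(query_tokens, doc_tokens):
    """Late interaction (MaxSim): len(query) * len(doc) comparisons per pair."""
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)

# Toy data: the query has two distinct "aspects".
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # covers each aspect exactly
doc_b = [[1.0, 1.0], [1.0, 1.0]]   # only a blurry mix of both

dense_a, dense_b = dense_score(query, doc_a), dense_score(query, doc_b)
late_a, late_b = maxsim_score(query, doc_a), maxsim_score(query, doc_b)
print(dense_a, dense_b)  # the pooled vectors point the same way, so dense scoring cannot separate them
print(late_a, late_b)    # MaxSim ranks doc_a above doc_b
```

In this toy case the pooled vectors are indistinguishable while the token-level score separates the two documents. The price is visible in `maxsim_score`: each query-document pair costs a number of comparisons proportional to the product of their token counts instead of one.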
Naturally, this makes the second half of the problem - scalable querying *without indexing lag* - even more pressing, because with a more expressive scoring function you make each query more expensive. Our solution here was a radically different form of representation (SMVE) - one where the vectors expose an index structure on their exterior, instead of us constructing an index post-hoc on dense vectors. Now, in semantic_index, a new entry is transformed on write to a form that clicks into an existing index. No rebuilds, no lag - you write and it’s there, ready to be retrieved efficiently.
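The details of SMVE are not spelled out here, so the following is only a generic illustration of the "write and it's there" property, not the actual semantic_index design. It assumes, purely for the sake of the example, that each item arrives as a sparse vector (dimension to weight) and uses a plain inverted index; the class `IncrementalIndex` and its methods are hypothetical names.

```python
from collections import defaultdict

class IncrementalIndex:
    """Toy inverted index: every write lands directly in the posting lists."""

    def __init__(self):
        self.postings = defaultdict(dict)  # dimension -> {doc_id: weight}
        self.docs = {}                     # doc_id -> stored sparse vector

    def delete(self, doc_id):
        # Remove the document's entries from every posting list it appears in.
        old = self.docs.pop(doc_id, None)
        if old:
            for dim in old:
                self.postings[dim].pop(doc_id, None)

    def upsert(self, doc_id, sparse_vec):
        # Replace any previous version, then slot the new one in. No rebuild step.
        self.delete(doc_id)
        self.docs[doc_id] = dict(sparse_vec)
        for dim, weight in sparse_vec.items():
            self.postings[dim][doc_id] = weight

    def search(self, query_vec, k=10):
        # Dot-product scoring that only touches posting lists hit by the query.
        scores = defaultdict(float)
        for dim, q_weight in query_vec.items():
            for doc_id, d_weight in self.postings.get(dim, {}).items():
                scores[doc_id] += q_weight * d_weight
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

index = IncrementalIndex()
index.upsert("post-1", {1: 1.0, 2: 0.5})
first = index.search({1: 1.0})        # post-1 is retrievable right after the write
index.upsert("post-2", {1: 2.0})
second = index.search({1: 1.0})       # post-2 shows up immediately, ranked first
index.upsert("post-1", {3: 1.0})      # an edit replaces the old entry in place
third = index.search({1: 1.0})
```

The point of the sketch is the shape of the write path: when the representation itself carries the index structure, an insert or update is a handful of local operations rather than a trigger for a background index build.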
I’m very excited about this release because with this design you get all three requirements of large-scale retrieval plus the freshness that makes your product engaging.
If freshness is a key requirement in your retrieval, you shouldn’t sleep on this one.