We are very excited to release zerank-2, @ZeroEntropy_AI's newest reranker model. 🔥
It shows major improvements on the five most common RAG failure modes below.
Existing rerankers consistently fail on seemingly “simple” tasks:
🔢 Comparing numbers and dates: “Biggest deals closed after 04/2024.”
🗄️ Aggregation: “Top 10 objections of customer X?”
🌍 Multilingual: Major pain point, especially non-English to non-English.
🙏 Instruction-Following: “Find the *counterargument* to the claim in the transcript.”
🥇 Calibrated scores: You ask “What should I cook for dinner?”, and the relevant note “I am allergic to nuts” scores too low to clear your threshold.
Many rerankers overfit to public benchmarks and don’t generalize to these real-world issues. zerank-2 considerably outperforms existing rerankers on all of these failure modes, in real production environments.
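To make the calibration point concrete, here is a minimal sketch of threshold filtering on reranker scores. The helper name `filter_by_threshold` and all scores are made up for illustration; they are not zerank-2 outputs or part of any real API.

```python
def filter_by_threshold(scored_docs, threshold):
    """Keep (doc, score) pairs at or above the threshold, best first."""
    kept = [(doc, score) for doc, score in scored_docs if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores for illustration only, not real model outputs.
candidates = [
    ("Easy weeknight pasta recipes", 0.82),
    ("I am allergic to nuts", 0.31),
    ("Quarterly sales report", 0.05),
]

# With a fixed cutoff of 0.5, the allergy note is silently dropped
# even though it matters for the dinner question.
kept = filter_by_threshold(candidates, threshold=0.5)
```

A fixed cutoff only works if a given score means roughly the same thing across queries, which is why poorly calibrated scores hurt in production.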
With zerank-2, you get:
* 15% improvement vs Cohere Rerank 3.5 on Arabic/Hindi (MIRACL dataset)
* 12% NDCG@10 improvement on sorting tasks (new open-sourced eval set)
* 7% vs Gemini Flash on instruction-following (MAIR dataset)
* $0.025/1M tokens, 150ms p90 latency at 100KB
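For readers unfamiliar with NDCG@10, here is a minimal sketch of how the metric is typically computed. It assumes linear gain and that the input list contains every judged candidate for the query; the function names are ours, and this is not the evaluation code behind the numbers above.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so the top item is divided by log2(2) = 1
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k=10):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

`relevances` holds the graded relevance labels in the order the reranker returned the documents, so a perfect ordering scores 1.0.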
🤗 We are open-sourcing the model weights, along with new challenging eval sets, on @huggingface. Our Elo-inspired training methodology is already open-source!
We're starting a series of technical deep dives to explain the failure modes zerank-2 fixes, with concrete production examples, methodologies, and benchmarks.
First technical deep dive in the comments.