if you're testing a new retrieval model or long-context LLM, it's a waste of your time (and ours...) to report 0.2% gains on the many saturated and expired benchmarks.
if you're in that position and looking for a way to rescue your great new idea, put it to the test on OBLIQ-Bench.
We set out to build a better retriever, so we looked for the hardest IR benchmarks.
For each, we asked how much headroom remained by running oracle reranking with a frontier LLM. Most had little room left!
So we built OBLIQ-Bench to study much harder search queries than before.
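
For anyone who wants to run the same kind of headroom check on their own benchmark, here is a rough sketch of one way to read it: rerank each query's candidate pool with a strong scorer, measure nDCG@k, and treat the gap to a perfect 1.0 as the room left. Everything here is an assumption for illustration, not our actual pipeline: the function names `ndcg_at_k` and `headroom` are hypothetical, the choice of nDCG@10 is ours for the example, and `rerank_score` is a stand-in for whatever frontier-LLM relevance scoring you would plug in.

```python
import math


def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k for one query; `relevance` maps doc id -> graded gain."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 2) for rank, gain in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0


def headroom(candidates, relevance, rerank_score, k=10):
    """Rerank candidates by `rerank_score` (hypothetical stand-in for an LLM
    judge) and return (reranked nDCG@k, remaining gap to a perfect 1.0)."""
    reranked = sorted(candidates, key=rerank_score, reverse=True)
    score = ndcg_at_k(reranked, relevance, k)
    return score, 1.0 - score


# Toy data: one relevant document and made-up reranker scores.
candidates = ["d1", "d2", "d3"]
relevance = {"d2": 1}
toy_scores = {"d1": 0.1, "d2": 0.9, "d3": 0.3}

score, gap = headroom(candidates, relevance, toy_scores.get)
print(f"reranked nDCG@10 = {score:.3f}, headroom = {gap:.3f}")
```

If the gap averaged over all queries is already close to zero, a new retriever has little left to show on that benchmark.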