Spotlighting our newest benchmark for agentic search: DETOUR
When people try to recall something in conversation, they rarely give a perfect query upfront. They say things like “that movie with the scene where…” or “the paper about…” and the assistant has to ask the right follow-up questions to get there.
Existing search and agent benchmarks often miss this multi-turn, tip-of-the-tongue behavior. To more realistically evaluate it, we introduce DETOUR: Dual-agent based Evaluation Through Obscure Under-specified Retrieval, an interactive benchmark for dual-agent search and reasoning.
DETOUR contains 1,011 prompts across text, image, audio, and video. In the benchmark, a Primary Agent is evaluated on its ability to identify a target entity by querying a consistent Memory Agent, testing whether models can resolve ambiguity through useful follow-up questions.
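As a rough illustration of the interaction described above, here is a minimal Python sketch of one dual-agent episode. It is not the DETOUR implementation: the class names, the attribute-lookup "memory", the candidate list, and the turn limit are all hypothetical stand-ins chosen so the loop is runnable. In the real benchmark both agents would presumably be language models rather than lookup tables.

```python
from dataclasses import dataclass

UNKNOWN = "I don't remember."


@dataclass
class Turn:
    question: str
    answer: str


class MemoryAgent:
    """Hypothetical stand-in: holds the hidden target and answers from fixed facts."""

    def __init__(self, target: str, facts: dict[str, str]):
        self.target = target
        self.facts = facts

    def answer(self, question: str) -> str:
        # A deterministic lookup keeps answers consistent across turns.
        return self.facts.get(question, UNKNOWN)


class PrimaryAgent:
    """Hypothetical stand-in: asks about attributes until one candidate remains."""

    def __init__(self, candidates: dict[str, dict[str, str]]):
        self.candidates = candidates

    def _remaining(self, history: list[Turn]) -> list[str]:
        # Keep candidates that agree with every informative answer so far.
        return [
            name
            for name, attrs in self.candidates.items()
            if all(
                attrs.get(t.question) == t.answer
                for t in history
                if t.answer != UNKNOWN
            )
        ]

    def next_action(self, history: list[Turn]) -> tuple[str, str]:
        remaining = self._remaining(history)
        if len(remaining) == 1:
            return ("guess", remaining[0])
        asked = {t.question for t in history}
        attributes = sorted({a for attrs in self.candidates.values() for a in attrs})
        for attr in attributes:
            if attr not in asked:
                return ("ask", attr)
        # Nothing left to ask: commit to the best remaining guess, if any.
        return ("guess", remaining[0] if remaining else "")


def run_episode(primary: PrimaryAgent, memory: MemoryAgent, max_turns: int = 5):
    """Return (correct, number_of_questions_asked) for one episode."""
    history: list[Turn] = []
    for _ in range(max_turns):
        action, value = primary.next_action(history)
        if action == "guess":
            return value == memory.target, len(history)
        history.append(Turn(value, memory.answer(value)))
    return False, len(history)  # ran out of turns without guessing


def accuracy(outcomes: list[bool]) -> float:
    """Fraction of episodes in which the target was identified."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Toy data, invented for this sketch only.
CANDIDATES = {
    "Inception": {"genre": "sci-fi", "decade": "2010s"},
    "Alien": {"genre": "sci-fi", "decade": "1970s"},
    "Heat": {"genre": "crime", "decade": "1990s"},
}
memory = MemoryAgent("Alien", CANDIDATES["Alien"])
result = run_episode(PrimaryAgent(CANDIDATES), memory)
```

In this toy setup the Primary Agent is scored only on whether its final guess matches the hidden target, and the turn cap means an agent that never commits to a guess is counted as a failure. Both choices are assumptions of the sketch, not statements about how DETOUR scores models.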
Current state-of-the-art models still struggle: performance reaches only 36% accuracy across all modalities, showing that today’s agents remain weak at clarification-seeking in underspecified, real-world search settings.
We hope DETOUR helps push the next generation of search agents toward better reasoning, better questions, and more robust multi-turn retrieval.
arXiv Paper:
arxiv.org/abs/2602.00352