Your AI agent scores 95% on benchmarks, but can it actually synthesize information from multiple sources to answer a real question?
DEEPSYNTH, a new benchmark from @DebjitPaul2, Daniel Murphy, Milan Gritta and team at Huawei Noah's Ark Lab (@HuaweiNoahsArk), @imperialcollege, @ucaboratory, and @Cambridge_Uni, tests exactly this. 120 tasks across 7 domains and 67 countries, requiring agents to gather data, cross-reference sources, and produce structured insights.
The results are sobering: 11 state-of-the-art LLMs and deep research agents top out at 8.97 F1. The best LLM-judge score reaches only 17.5. Models hallucinate freely and collapse when reasoning over large information spaces.
The gap between answering trivia and doing actual research remains massive. Accepted at ICLR 2026.