The paper replaces fixed tests with live chats to measure how humans and models perform together.
The core idea is simple: turn benchmark questions into back-and-forth chats that capture how people actually ask, skip details, and follow up.
The setup measures three modes (person alone, model alone, and person with AI) so the team can compare real outcomes.
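A minimal sketch of how the three modes could be scored side by side. The record layout, mode names, and values are hypothetical and invented for illustration; they are not the paper's data or schema.

```python
from collections import defaultdict

# Hypothetical records: (question_id, mode, is_correct).
# These rows are made up for illustration only.
records = [
    ("q1", "person_alone", False),
    ("q1", "model_alone", True),
    ("q1", "person_with_ai", True),
    ("q2", "person_alone", True),
    ("q2", "model_alone", True),
    ("q2", "person_with_ai", False),
]

def accuracy_by_mode(rows):
    """Return the fraction of correct answers for each mode."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for _question_id, mode, is_correct in rows:
        totals[mode] += 1
        correct[mode] += int(is_correct)
    return {mode: correct[mode] / totals[mode] for mode in totals}

for mode, accuracy in accuracy_by_mode(records).items():
    print(f"{mode}: {accuracy:.2f}")
```

Scoring all three modes on the same question IDs is what makes the comparison meaningful.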
The key claim: model-alone scores are a weak signal for person-plus-AI success on the same items.
Another consistent result: the big gap between a strong model and a weaker one on static tests shrinks once real chat starts.
Why this happens: people rarely paste the exact question. They paraphrase, add context, and sometimes correct the model midway.
To scale beyond small studies, the authors add a simple 2-step user simulator that asks, then follows up like a person.
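A rough sketch of what an ask-then-follow-up loop might look like. This is an assumption about the general shape, not the authors' implementation: `simulate_chat`, the three callables, and the demo stubs are all hypothetical names, and in practice each callable would wrap a language model.

```python
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (speaker, text)

def simulate_chat(
    question: str,
    first_message: Callable[[str], str],
    follow_up: Callable[[str, List[Turn]], Optional[str]],
    assistant: Callable[[List[Turn]], str],
    max_turns: int = 3,
) -> List[Turn]:
    """Run a simulated user against an assistant and return the transcript."""
    transcript: List[Turn] = []
    user_message = first_message(question)  # step 1: open the conversation
    for _ in range(max_turns):
        transcript.append(("user", user_message))
        transcript.append(("assistant", assistant(transcript)))
        # Step 2: follow up like a person would, or stop (None).
        next_message = follow_up(question, transcript)
        if next_message is None:
            break
        user_message = next_message
    return transcript

# Hypothetical stand-ins so the sketch runs without a model.
def demo_first_message(question: str) -> str:
    return "Quick question: " + question.split("?")[0] + "?"

def demo_follow_up(question: str, transcript: List[Turn]) -> Optional[str]:
    user_turns = sum(1 for speaker, _ in transcript if speaker == "user")
    return "So which option is it?" if user_turns < 2 else None

def demo_assistant(transcript: List[Turn]) -> str:
    return f"(reply to turn {len(transcript)})"

transcript = simulate_chat(
    "What is 2 + 2? A) 3 B) 4",
    demo_first_message,
    demo_follow_up,
    demo_assistant,
)
for speaker, text in transcript:
    print(f"{speaker}: {text}")
```

Note that the demo opener drops the answer options, mirroring how people skip details when they ask.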
This simulator predicts person plus AI outcomes far better than raw model-alone accuracy.
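One way to check a claim like this is to compare how far each predictor lands from the observed person-plus-AI accuracy. The sketch below uses mean absolute error as an assumed metric, and every number in it is a placeholder, not a result from the paper.

```python
def mean_absolute_error(predicted, observed):
    """Average absolute gap between predicted and observed accuracies."""
    if len(predicted) != len(observed):
        raise ValueError("inputs must have the same length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)

# Placeholder accuracies per question subset; NOT results from the paper.
person_with_ai_observed = [0.60, 0.50, 0.70]
model_alone = [0.90, 0.80, 0.95]
simulator = [0.65, 0.50, 0.75]

model_alone_error = mean_absolute_error(model_alone, person_with_ai_observed)
simulator_error = mean_absolute_error(simulator, person_with_ai_observed)
print(f"model-alone error: {model_alone_error:.3f}")
print(f"simulator error:   {simulator_error:.3f}")
```

The lower-error predictor is the better proxy for what happens when a person actually uses the model.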
So model choice and tuning should use this chat-based evaluation, not letter-only answers from static benchmarks.
----
Paper – arxiv.org/abs/2504.07114
Paper Title: "ChatBench: From Static Benchmarks to Human-AI Evaluation"