The paper shows reasoning models often answer multi-hop questions while straying from the needed steps.
Multi-hop questions need information from several documents linked in a chain.
The authors track each jump between documents as a hop, check if all required sources are covered, and tag any extra meandering as overthinking.
They build 7 clear error categories by comparing the model's hop count to the gold hop count.
This turns fuzzy explanations into concrete signals about where reasoning goes off track.
That structure is the core contribution.
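To make the hop-counting idea concrete, here is a minimal Python sketch. It assumes each reasoning trace has already been reduced to an ordered list of document IDs. The names `HopDiagnosis`, `diagnose`, and `coarse_label`, and the four coarse labels, are hypothetical illustrations and not the paper's 7 categories.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HopDiagnosis:
    hop_delta: int          # model hop count minus gold hop count
    covered_all: bool       # True if every gold document was visited
    extra_hops: List[str]   # visited documents that are not in the gold chain


def diagnose(model_hops: List[str], gold_hops: List[str]) -> HopDiagnosis:
    """Compare a model's document hops against the gold chain."""
    gold = set(gold_hops)
    visited = set(model_hops)
    return HopDiagnosis(
        hop_delta=len(model_hops) - len(gold_hops),
        covered_all=gold <= visited,
        extra_hops=[doc for doc in model_hops if doc not in gold],
    )


def coarse_label(d: HopDiagnosis) -> str:
    """Hypothetical coarse labels; the paper's own taxonomy is finer."""
    if d.covered_all and d.hop_delta == 0:
        return "match"
    if d.hop_delta > 0:
        return "overhop"
    if d.hop_delta < 0:
        return "underhop"
    return "same_count_wrong_sources"


# Usage: one extra detour through document "X" on a 2-hop question.
result = diagnose(["A", "X", "B"], ["A", "B"])
label = coarse_label(result)
```

In this toy setup the extra entries in `extra_hops` stand in for what the summary calls overthinking, while `covered_all` captures whether the required sources were reached at all.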
They test 6 models across 3 datasets and annotate 1,440 answers, keeping 1,080 after filtering.
They also automate judging with a compact 2-step pipeline that first extracts hops, then classifies errors. It cuts annotation time by about 20x and reaches up to 92% agreement on the simpler datasets.
That makes large-scale diagnosis practical.
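The extract-then-classify shape can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' implementation: `judge_trace`, `percent_agreement`, and the toy stand-ins are hypothetical names, the two steps are passed in as plain callables (in practice they would likely be model calls), and "agreement" is assumed to mean the share of items where the automated label equals the human label.

```python
from typing import Callable, List, Sequence


def judge_trace(
    trace: str,
    gold_hops: List[str],
    extract_hops: Callable[[str], List[str]],
    classify: Callable[[List[str], List[str]], str],
) -> str:
    """Step 1: extract hops from the trace. Step 2: classify against gold."""
    hops = extract_hops(trace)
    return classify(hops, gold_hops)


def percent_agreement(auto_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Share of items (in percent) where the two label lists agree."""
    if len(auto_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not auto_labels:
        raise ValueError("label lists must not be empty")
    matches = sum(a == h for a, h in zip(auto_labels, human_labels))
    return 100.0 * matches / len(auto_labels)


# Toy stand-ins for the two steps (assumptions for illustration only).
def toy_extract(trace: str) -> List[str]:
    # Treat "A -> B" as a trace that hops from document A to document B.
    return [part.strip() for part in trace.split("->") if part.strip()]


def toy_classify(hops: List[str], gold_hops: List[str]) -> str:
    return "match" if hops == gold_hops else "mismatch"


verdict = judge_trace("A -> B", ["A", "B"], toy_extract, toy_classify)
agreement = percent_agreement(["match", "mismatch", "match", "match"],
                              ["match", "mismatch", "match", "mismatch"])
```

Keeping the two steps as separate callables means the hop extractor can be checked against human-annotated hops independently of the error classifier.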
On 2Wiki, most traces match the gold steps and final accuracy is strong.
Correct answers mainly appear when the model's hop count exactly matches the gold hop count, and early irrelevant steps are more damaging than trailing ones.
Smaller models break more often when a step is wrong. Larger models like Claude 3.7 Sonnet are steadier, yet even they overhop on harder questions.
DeepSeek-R1 sometimes gets the answer with lower reasoning fidelity, showing that accuracy can hide messy chains.
----
Paper – arxiv.org/abs/2508.04699
Paper Title: "Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis"