The paper builds a generative stepwise judge that learns to grade reasoning steps and boosts math solving.
The ProcessBench average rises to 61.9, versus 39.7 on the same 7B base model.
The judge first writes a short rationale for the step, then issues a clear verdict.
To make steps meaningful, the solver is trained to self-segment its chain of thought into coherent chunks, each with a single goal.
After each chunk, the system runs rollouts from that point to estimate success, then labels the chunk by whether success rises or falls.
These labels drive reinforcement learning, so the judge learns to reason about steps and mark them Positive or Negative.
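The rollout-based labeling described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `rollout`, `estimate_success`, `label_chunks`, the rollout count `n`, and the `margin` threshold are all hypothetical names and parameters, and treating an unchanged success estimate as Positive is an assumption made here.

```python
from typing import Callable, List

# Hypothetical interface: rollout(prefix) completes the solution from the
# given chunk prefix and returns True if the final answer is correct.
Rollout = Callable[[List[str]], bool]


def estimate_success(rollout: Rollout, prefix: List[str], n: int = 8) -> float:
    """Monte Carlo estimate: fraction of n rollouts from `prefix` that succeed."""
    return sum(1 for _ in range(n) if rollout(prefix)) / n


def label_chunks(chunks: List[str], rollout: Rollout,
                 n: int = 8, margin: float = 0.0) -> List[str]:
    """Label each chunk by how the estimated success rate changes after it."""
    labels = []
    prev = estimate_success(rollout, [], n)  # success estimate before any chunk
    for i in range(len(chunks)):
        curr = estimate_success(rollout, chunks[: i + 1], n)
        # Assumption: a chunk is Negative only if success drops by more than
        # `margin`; an unchanged estimate counts as Positive.
        labels.append("Negative" if prev - curr > margin else "Positive")
        prev = curr
    return labels
```

With a real solver, `rollout` would sample completions from the policy model, so the estimates are noisy and a nonzero `margin` may be a sensible choice.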
At inference, the judge rejects bad chunks and forces a rewrite from the last accepted point, lifting accuracy without longer answers.
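A rough sketch of that reject-and-rewrite loop, under stated assumptions: `propose`, `judge`, and `is_final` are hypothetical callables standing in for the solver, the stepwise judge, and a stop check, and the retry cap and the fallback of keeping the last attempt when every retry is rejected are choices made here for illustration, not details from the paper.

```python
from typing import Callable, List


def guided_solve(
    propose: Callable[[List[str]], str],     # hypothetical: next chunk given accepted prefix
    judge: Callable[[List[str], str], str],  # hypothetical: "Positive" or "Negative"
    is_final: Callable[[str], bool],         # hypothetical: does this chunk end the solution?
    max_chunks: int = 16,
    max_retries: int = 4,                    # must be at least 1
) -> List[str]:
    """Build a solution chunk by chunk, regenerating chunks the judge rejects."""
    accepted: List[str] = []
    for _ in range(max_chunks):
        chunk = ""
        for _ in range(max_retries):
            # Always regenerate from the last accepted point, never from a rejected chunk.
            chunk = propose(accepted)
            if judge(accepted, chunk) == "Positive":
                break
        # Assumption: if every retry was rejected, keep the last attempt
        # so generation can still make progress.
        accepted.append(chunk)
        if is_final(chunk):
            break
    return accepted
```

Because rejected chunks are discarded rather than appended, the final answer contains only the accepted chunks, which fits the claim that accuracy improves without longer answers.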
Relative signals work best because they reward steps that actually improve success odds, not ones that only look locally good.
Taken together, reasoning about the reasoning plus RL yields a judge that catches errors early and guides training and test-time search.
----
Paper – arxiv.org/abs/2508.19229
Paper Title: "StepWiser: Stepwise Generative Judges for Wiser Reasoning"