AI models are incredible at coding and math. Labs like OpenAI and Anthropic make progress in these verifiable domains by training models on tasks that have clear right or wrong answers, like "5/2."
But in domains like finance or law, there is rarely a single right answer. There, labs turn to verifiers, complex systems that use AI, to grade the answers. But these verifiers can make mistakes! Is that an issue?
In our latest research, we show that the verifier can be wrong 15–30% of the time, and the models will learn just as well. This means we can use these imperfect verifiers without losing performance!
Does an imperfect verifier break reinforcement learning with verifiable rewards (RLVR)? Turns out it doesn't!
Why does this matter? As the world moves into reinforcement learning in semi-verifiable domains, perfect verifiers don't exist.
We added controlled and LLM-based noise to RLVR reward signals and found that up to 30% noise barely hurts training; performance stays within 4pp of the clean baseline.
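To make "controlled noise" concrete, here is a minimal sketch of one way to simulate an imperfect verifier: flip a binary reward with a fixed probability. This is an illustration only, not necessarily the setup used in the research; the names `noisy_reward` and `verifier_agreement` and the symmetric-flip noise model are assumptions.

```python
import random


def noisy_reward(true_reward: int, noise_rate: float, rng: random.Random) -> int:
    """Return a binary reward (0 or 1), flipped with probability noise_rate.

    Hypothetical helper: assumes symmetric noise, so correct and incorrect
    answers are equally likely to be mislabeled.
    """
    if rng.random() < noise_rate:
        return 1 - true_reward
    return true_reward


def verifier_agreement(true_rewards, noise_rate: float, seed: int = 0) -> float:
    """Fraction of rewards the simulated noisy verifier leaves unchanged."""
    rng = random.Random(seed)
    noisy = [noisy_reward(r, noise_rate, rng) for r in true_rewards]
    matches = sum(t == n for t, n in zip(true_rewards, noisy))
    return matches / len(true_rewards)


if __name__ == "__main__":
    # Alternating correct/incorrect ground-truth rewards, purely for illustration.
    truth = [i % 2 for i in range(10_000)]
    for rate in (0.0, 0.15, 0.30):
        print(f"noise={rate:.2f} agreement={verifier_agreement(truth, rate):.3f}")
```

With a 30% flip rate, the simulated verifier agrees with the ground truth on roughly 70% of samples, which is the kind of reward signal the training run would see under this assumed noise model.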
This research has already impacted how we build reinforcement learning environments at @joinHandshake. For a major benchmark we are launching tomorrow, we hill-climbed the verifier to 88% accuracy (above the 85% human inter-rater agreement), knowing from this research that this is good enough.
With @andreas_plesner @guzmanhe