Understanding R1-Zero-Like Training: A Critical Perspective
By Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin, of Sea AI Lab, National University of Singapore, and Singapore Management University.
Problem: R1-Zero-style RL looks magical, but what is actually doing the work?
They take DeepSeek-R1-Zero as the starting point, then dissect the two core ingredients of R1-Zero-like training: the base model and the RL algorithm. The goal is to separate true RL-driven gains from artifacts caused by pretraining quirks, prompt templates, or biased optimization.
Approach: stress-test base models, templates, and RL dynamics
They probe multiple base models (including DeepSeek-V3-Base and the Qwen2.5 family) and show that the prompt template can decide whether a base model is “answering questions” or merely “completing text.” One surprising observation is that Qwen2.5 models can look unusually chat-like even without a template, which the authors argue may reflect a bias in the pretraining data.
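To make the template point concrete, here is a minimal sketch of what "with template" and "without template" mean at the prompt level. The `build_prompt` helper and `QA_TEMPLATE` string are hypothetical illustrations, not the templates used in the paper.

```python
# Hypothetical question-answering template; NOT the paper's exact wording.
QA_TEMPLATE = "User: {question}\nAssistant:"

def build_prompt(question, template=None):
    """Return the raw question (no template) or the question wrapped in a template."""
    if template is None:
        # A base model sees only the question and may simply continue the text.
        return question
    # The template frames the input as a dialogue turn that expects an answer.
    return template.format(question=question)

raw_prompt = build_prompt("What is 2+2?")
templated_prompt = build_prompt("What is 2+2?", QA_TEMPLATE)
```

Comparing a base model's outputs on `raw_prompt` and `templated_prompt` is the kind of check that would reveal whether answering behavior comes from the template or is already present in the model.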
RL finding: GRPO has a built-in length and difficulty bias
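This is the part I enjoyed most because it is pure RL hygiene applied to LLM post-training. They show that GRPO’s objective includes two normalization terms, each of which introduces a bias. Response-length normalization divides each response’s loss by its own length, so a long incorrect response receives a smaller per-token penalty than a short one, which can systematically make incorrect answers longer. Per-question std normalization divides advantages by the standard deviation of rewards within a question, which gives extra weight to “too easy/too hard” questions, where that standard deviation is small.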
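Both biases can be seen in a small numeric sketch. This assumes binary rewards, a population standard deviation, and the simplified form "advantage = (reward - mean) / std, then divide by response length"; it leaves out clipping and any KL term, and the function name is illustrative only.

```python
import numpy as np

def grpo_token_weights(rewards, lengths):
    """Per-token weight each response receives within one question's group.

    Assumed GRPO-style form: advantage = (r - mean) / std, and each
    response's token-level sum is then divided by its own length.
    """
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    std = rewards.std()
    if std == 0:  # all responses scored the same: no learning signal
        return np.zeros_like(rewards)
    advantages = (rewards - rewards.mean()) / std
    return advantages / lengths

# Length bias: responses 1 and 2 are both wrong, but response 2 is 4x longer.
weights = grpo_token_weights(rewards=[1, 0, 0, 1], lengths=[100, 100, 400, 100])
short_wrong, long_wrong = weights[1], weights[2]  # -0.01 and -0.0025

# Difficulty bias: dividing by std rescales a whole question by 1/std.
medium_scale = 1 / np.std([1, 1, 1, 1, 0, 0, 0, 0])  # half correct, std = 0.5
easy_scale = 1 / np.std([1, 1, 1, 1, 1, 1, 1, 0])    # mostly correct, lower std
```

The longer wrong response is penalized four times less per token than the shorter one, and the mostly solved question is scaled up more than the evenly split one.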
Fix: Dr. GRPO removes the bias and improves token efficiency
Their solution is intentionally simple: remove the length and std normalization terms, so the optimization matches the unbiased policy-gradient form more closely. The result is better token efficiency while keeping reasoning performance, and they show this in training dynamics and evaluation comparisons.
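A minimal sketch of the advantage computation with both normalization terms removed, under the same simplifying assumptions (binary rewards, no clipping, no KL term). The function name is hypothetical, and any constant normalizer an actual implementation might apply is left out.

```python
import numpy as np

def dr_grpo_token_weights(rewards):
    """Per-token weight with no std division and no per-response length division."""
    rewards = np.asarray(rewards, dtype=float)
    # Only the group mean is subtracted, as a baseline.
    return rewards - rewards.mean()

rewards = [1, 0, 0, 1]
lengths = np.array([100, 100, 400, 100])

per_token = dr_grpo_token_weights(rewards)  # [0.5, -0.5, -0.5, 0.5]
# Summed over tokens, a longer response now carries proportionally more weight
# instead of having its per-token penalty diluted by its own length.
per_response = per_token * lengths          # [50, -50, -200, 50]
```

Here the two incorrect responses get the same per-token weight regardless of length, so writing a longer wrong answer no longer reduces the penalty on each token.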
Why I recommend it
I recommend this as an RL-for-LLMs “reality check” paper. It made me appreciate RLHF and RLVR even more because it highlights the real job: define the right objective, avoid accidental biases, and measure what changes. It is a reminder that “emergence” can be partly optimizer math, and fixing that is part of doing RL right.