AlphaGo’s 10-year anniversary today — huge milestone for RL!
Small serendipity: it’s also 1 year since we released 𝐑𝐀𝐆𝐄𝐍, our LLM Agent RL framework.
Some thoughts on the past decade of RL, plus a major 𝐑𝐀𝐆𝐄𝐍 update on reasoning collapse in Agent RL coming soon.
1/
Ten years ago, on Jan 27, DeepMind brought AlphaGo to the world.
Back then, RL felt mythic. For the first time, it reached professional level in a domain that demands long-horizon planning, having already gone 5–0 against the European champion.
That moment made a lot of people truly believe this: a policy can “grow out of interaction” instead of being hand-coded or hand-taught.
One year ago, on Jan 27, we released RAGEN, an RL codebase for LLM agents.
We started applying RL with verifiable rewards beyond ‘winning a game’ to large reasoning models that can plan and interact with the world. RL is no longer just about winning inside a closed board. It now plays out in a more open, long-horizon training loop that can resemble parts of the real world.
But over this year, we also saw a quieter kind of collapse.
It does not always look like failure. Sometimes it looks stable. Sometimes it even looks safer and more consistent. Yet the policy slowly turns into a “persona”, a “template”, a “low-effort sense of security”.
So I’ve increasingly felt that 𝐑𝐀𝐆𝐄𝐍 isn’t just a system. For me, it reads more like the second half of a decade-long thread I’ve been watching unfold.
The first half: “RL can learn reasoning.”
The second half: “RL can also quietly collapse if we don’t have the right diagnostics.”
It feels like a time marker: ten years later, we’re finally forced to look beyond reward and ask what stays input-conditioned—and what drifts.
2/
If I use this coincidence as an anchor, I would split the last decade of RL into three chapters.
The AlphaGo era: RL proved itself on long-horizon planning. It proved policies can emerge from interaction;
The RLHF era: RL moved from winning games to alignment. It became a core mechanism for making language models track human preferences, and a key ingredient behind many of today’s products;
The LLM Agent RL era: RL enters closed-loop, multi-turn self-training. The LLM agent learns more than answers. It learns plans, tools, revisions, reflection, and behavioral consistency across longer time scales.
Put together, these chapters point to a missing piece for me: we still lack a clear, shared vocabulary and practical gauges for “failure modes in LLM Agent RL”.
Progress has been fast on the capability side. But the language and gauges for how LLM agents degrade—especially in closed-loop training—still feel less settled.
That’s the piece we’ve been trying to put words and measurements to this year.
3/
A decade after AlphaGo, a lot of the attention and resources in RL do seem to be shifting from closed worlds like board games toward systems like LLM agents.
At the same time, closed-loop self-training can introduce a more systemic risk. In a loop of self-sampling and self-updating, a model can gradually settle into a “task-insensitive but cheaper” strategy.
It does not look terrible. It may even look safe and consistent. But it slowly loses prompt “discriminability”. It can lose the property that makes reasoning actually change with the input.
I like to define this with one sentence: “training continues, but learning is idling”.
Rewards still move. Gradients still update. But the information has already run dry. The policy solidifies toward templates, inertia, and risk-avoidance.
One transferable takeaway from our year with 𝐑𝐀𝐆𝐄𝐍 is this:
In LLM Agent RL, it’s not enough to only watch the reward or success rate. You must also watch whether “input-conditioned information” is still flowing. You must watch whether the LLM agent is still sensitive to the task.
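To make “watching input-conditioned information” a bit more concrete, here is a minimal sketch of one possible gauge. It is not RAGEN’s diagnostic, just an illustration under assumptions: each sampled rollout is reduced to a discrete “response signature” (hypothetical; it could be a cluster id or a hash of the reasoning prefix), and we estimate the mutual information between prompt and signature with a simple plug-in estimator. The function names are made up for this example.

```python
import math
from collections import Counter


def entropy_bits(counts):
    """Shannon entropy (in bits) of an empirical count table."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


def prompt_response_mi(pairs):
    """Plug-in estimate of I(prompt; response) = H(Y) - H(Y|X), in bits.

    `pairs` is an iterable of (prompt_id, response_signature) tuples.
    How a rollout is mapped to a signature is an assumption left to the caller.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (prompt, response) pair")
    n = len(pairs)

    marginal = Counter(sig for _, sig in pairs)      # counts of Y
    per_prompt = {}                                   # counts of Y for each X
    for prompt_id, sig in pairs:
        per_prompt.setdefault(prompt_id, Counter())[sig] += 1

    h_y = entropy_bits(marginal)
    h_y_given_x = sum(
        (sum(c.values()) / n) * entropy_bits(c) for c in per_prompt.values()
    )
    return h_y - h_y_given_x


# Responses that depend on the prompt vs. one template reused everywhere.
healthy = [("p1", "plan_A")] * 4 + [("p2", "plan_B")] * 4
collapsed = [("p1", "template")] * 4 + [("p2", "template")] * 4
print(prompt_response_mi(healthy), prompt_response_mi(collapsed))
```

Tracked across training steps next to reward, a number like this falling toward zero would suggest that responses no longer depend on the prompt, even if reward still looks fine. With few samples per prompt the plug-in estimate is biased upward, so treat it as a trend indicator rather than an exact value.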
We are now preparing a new version of 𝐑𝐀𝐆𝐄𝐍. You do not need to believe any result in advance. But we will make this line much clearer: how the battlefield shifts, how the new collapses happen, and which diagnostic view is the most actionable.
4/
Here I want to write something more personal, because this part wasn’t “thought up”. I almost collided with it.
Right before writing this, I was sprinting on the new 𝐑𝐀𝐆𝐄𝐍. After days of deadline pressure, I finally took a breath and noticed the date coincidence. Thinking about the past year, I started crying. By the time I actually began typing, the tears had just stopped.
I looked at the time. It was 5pm, Jan 20, 2026, and my screen had gone dark. The contrast made the point feel sharper.
This year wasn’t about “one more loss term” or “one more trick”. It was about a latent variable that kept showing up in closed-loop LLM Agent RL, but is hard to name cleanly: whether the agent’s reasoning is still tied to the input.
Training can keep running while reasoning drifts into templates, inertia, and avoidance. Reward can still move while prompt discriminability quietly erodes.
“More stable, more certain” can sometimes just mean “less sensitive, less distinctive”. Collapse is rarely a sudden crash. It’s usually a slow drift that looks fine from the outside.
That’s what I mean by a quiet failure mode. Not bad news, just something we’d benefit from better gauges for.
And on a personal note, learning to notice this earlier has changed how I work. The hits still come. I just recover faster, and keep moving.
5/
Then I looked back at the past year’s timeline and noticed another coincidence.
DeepSeek-R1 landed on Jan 20, 2025 — the same date I happened to notice the AlphaGo/RAGEN alignment.
I’ll treat it as coincidence, but it did make the moment feel unexpectedly vivid.
Since then, I’ve been jokingly calling 01/20 my “dark mode day”.