Today's Training Data episode takes us BTS on the infrastructure challenges of doing large RL runs at scale, featuring @ellev3n11 (Composer Lead at @cursor_ai) and @dzhulgakov (Co-Founder at @FireworksAI_HQ).
The Cursor team trained Composer 2 on Fireworks by starting with a strong base model (Kimi 2.5) and performing large-scale mid-training on code tokens and web data to learn common patterns and libraries, followed by a large-scale Reinforcement Learning run to learn how to navigate the Cursor harness, call tools, and write correct code.
Today's episode dives into the systems and infrastructure challenges of making that large RL run happen, and there were many (!!): numerical mismatch, global distribution, synchronizing rollouts across asynchronous pipelines, keeping track of expert activation across runs, and more.
Extremely nerdy in-the-weeds challenges that Federico and Dima were delighted to nerd out on together :)
Beyond RL infra, we also discussed Online vs Simulated rollouts, self-summarization for long-horizon agents, environment design ("the most powerful RL environment is the product itself"), and other technical nuggets.
PS: We filmed this episode before the SpaceX news, while the Cursor team was still compute-constrained. While Cursor now has *all* the flops, the takeaways and the hurdles cleared still ring true for any serious application-level company that is racing to post-train its own models.
I believe that more serious application companies will go the way of Cursor and post-train their own models.
00:00 Introduction
00:53 Why Cursor Trained Composer 2
04:55 Specialization vs Bitter Lesson
06:16 Composer 2 Training Recipe
16:32 Scaling RL Infrastructure Globally
23:32 Floating Point Drift
25:11 MoE Sensitivity Explained
26:25 Router Replay Fix
27:19 Real Time RL Loop
31:49 Long Horizon Agents
34:29 Why RL Everywhere
37:34 LLM as Judge Rewards
39:14 RL in Hard Domains
40:13 Build Your Own Environments
44:34 Closing Thoughts