The biggest bottleneck for computer-use agents just got automated away.
Reinforcement learning broke open math and coding.
But for agents clicking around real software, progress stalled.
The bottleneck was generating verifiable training tasks at scale.
CUA-Gym is a pipeline that solves this.
It synthesizes verifiable tasks for computer-use agents end to end.
The setup uses three coordinated coding agents:
> Generator writes environment setup scripts
> Discriminator drafts the reward function blind
> Orchestrator iterates until both align
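Here is a minimal sketch of what that loop could look like. It is an illustration, not CUA-Gym's actual code: `Candidate`, `aligned`, `orchestrate`, and the toy agents are hypothetical names, and the alignment check (reward is 0 on the fresh state, 1 after a reference solution) is an assumption about what "align" means.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

State = dict  # assumption: environment state modeled as a plain dict


@dataclass
class Candidate:
    setup: Callable[[], State]        # Generator output: builds the initial state
    solve: Callable[[State], State]   # Generator output: reference solution
    reward: Callable[[State], float]  # Discriminator output, written without seeing setup


def aligned(c: Candidate) -> bool:
    """The reward must reject the untouched state and accept the solved one."""
    initial = c.setup()
    solved = c.solve(c.setup())
    return c.reward(initial) == 0.0 and c.reward(solved) == 1.0


def orchestrate(
    task: str,
    generate: Callable[[str, str], Tuple[Callable[[], State], Callable[[State], State]]],
    discriminate: Callable[[str, str], Callable[[State], float]],
    max_rounds: int = 3,
) -> Optional[Candidate]:
    """Re-run both agents with feedback until environment and reward agree."""
    feedback = ""
    for _ in range(max_rounds):
        setup, solve = generate(task, feedback)
        reward = discriminate(task, feedback)  # sees only the task text and feedback
        candidate = Candidate(setup, solve, reward)
        if aligned(candidate):
            return candidate
        feedback = "reward and environment disagree; revise"
    return None


# Toy stand-ins for the two coding agents (hypothetical, for illustration only).
def toy_generate(task: str, feedback: str):
    def setup() -> State:
        return {"inbox": ["hello"], "archived": []}

    def solve(state: State) -> State:
        return {"inbox": [], "archived": list(state["inbox"])}

    return setup, solve


def toy_discriminate(task: str, feedback: str):
    if not feedback:
        # First draft is too lenient: it accepts every state.
        return lambda s: 1.0
    # Revised draft checks the actual end condition.
    return lambda s: 1.0 if not s["inbox"] and s["archived"] else 0.0


result = orchestrate("Archive all inbox messages", toy_generate, toy_discriminate)
```

In this toy run the first reward draft accepts everything, so the check fails and the second round produces a reward that only fires once the inbox is empty.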
The team also built mock versions of 94 popular apps.
These include Slack, Notion, Salesforce, and Gmail clones.
Rewards read app state directly instead of relying on flaky screenshot-based judges.
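To make the idea concrete, here is a small sketch of a state-based reward. It assumes a mock chat app backed by SQLite; the schema, `make_mock_chat_db`, and `reward_message_posted` are hypothetical and not taken from the CUA-Gym release.

```python
import sqlite3


def make_mock_chat_db() -> sqlite3.Connection:
    """Hypothetical backend for a mock chat app: one in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE messages (channel TEXT, author TEXT, body TEXT)")
    conn.commit()
    return conn


def reward_message_posted(conn: sqlite3.Connection, channel: str, body: str) -> float:
    """Return 1.0 if the expected message exists in the channel, else 0.0.

    The check queries the app's stored state; no screenshot is involved.
    """
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM messages WHERE channel = ? AND body = ?",
        (channel, body),
    ).fetchone()
    return 1.0 if count >= 1 else 0.0


conn = make_mock_chat_db()
before = reward_message_posted(conn, "#general", "Standup moved to 10am")

# Simulate the agent completing the task through the app.
conn.execute(
    "INSERT INTO messages VALUES (?, ?, ?)",
    ("#general", "agent", "Standup moved to 10am"),
)
conn.commit()
after = reward_message_posted(conn, "#general", "Standup moved to 10am")
```

A check like this is deterministic: the same final state always yields the same reward, which is the property a screenshot judge cannot guarantee.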
The resulting dataset holds 32,112 verified tuples across 110 environments.
A model trained on this data scores 72.6% on OSWorld-Verified, matching Claude Sonnet 4.6.
A smaller 3B version matches its 17B base with far fewer parameters.
The full system, dataset, and models are open source.