Introducing RapTB (ICML 2026) 🎉
Most RL post-training methods are fundamentally mode-seeking: they optimize expected return. In sparse or binary-reward settings, methods like PPO/GRPO can easily collapse to a few high-reward modes, losing solution diversity.
Our work brings GFlowNets to LLM post-training.
Instead of training the model to find one maximum-reward trajectory, we train it to sample solutions with probability proportional to reward. This encourages broad exploration of the solution space, enabling the model to generate samples that are both diverse and high-reward.
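To make "probability proportional to reward" concrete, here is a minimal sketch in plain Python. It is not the RapTB objective. It shows the standard trajectory balance (TB) residual, specialized to autoregressive generation, where each sequence has exactly one generation path and the backward-policy term drops out. The function names and the toy rewards are made up for illustration, and rewards are assumed to be strictly positive.

```python
import math

def reward_proportional_target(rewards):
    """Target distribution of a GFlowNet-style sampler: p(x) = R(x) / Z."""
    z = sum(rewards.values())
    return {x: r / z for x, r in rewards.items()}

def tb_loss(log_z, token_logprobs, reward):
    """Squared TB residual for one sequence: (log Z + log pi(y) - log R(y))^2.

    token_logprobs are the per-token log-probabilities of the sampled
    sequence under the policy; their sum is log pi(y).
    Assumes reward > 0 (illustrative simplification).
    """
    log_pf = sum(token_logprobs)
    residual = log_z + log_pf - math.log(reward)
    return residual ** 2

# Toy solution space with three valid solutions (hypothetical rewards).
toy_rewards = {"a": 1.0, "b": 1.0, "c": 2.0}
print(reward_proportional_target(toy_rewards))
# {'a': 0.25, 'b': 0.25, 'c': 0.5}

# A policy that samples "c" with probability 0.5 matches the target,
# so the residual vanishes (up to floating-point error) when Z = 4.
print(tb_loss(math.log(4.0), [math.log(0.5)], 2.0))

# A policy that has collapsed onto "c" (probability 0.9) is penalized.
print(tb_loss(math.log(4.0), [math.log(0.9)], 2.0))
```

A return-maximizing policy would be rewarded for putting all mass on "c"; under the TB residual that collapse has nonzero loss, which is what keeps "a" and "b" in play.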
Two key contributions:
1. Rooted-prefix training
Improves credit assignment and reduces trajectory-level variance.
2. Submodular replay selection
Lets the model automatically select replay samples that jointly balance diversity, reward, and length, reducing the risk of getting trapped in local modes.
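As a rough illustration of what greedy selection under a submodular objective can look like, the sketch below scores each candidate by its reward, plus a bonus for tokens not yet covered by the already-selected set, minus a length penalty. The objective, the weights, and the function name are assumptions for illustration, not the selection rule used in RapTB.

```python
def greedy_replay_selection(candidates, k, div_weight=1.0, len_weight=0.01):
    """Greedily pick up to k replay samples from (tokens, reward) pairs.

    Hypothetical objective: reward + div_weight * (newly covered tokens)
    - len_weight * length. Token coverage has diminishing returns, so the
    marginal gain of a near-duplicate shrinks once a similar sample is chosen.
    Returns the indices of the selected candidates in selection order.
    """
    selected = []
    covered = set()
    remaining = set(range(len(candidates)))

    while remaining and len(selected) < k:
        def gain(i):
            tokens, reward = candidates[i]
            new_tokens = set(tokens) - covered
            return reward + div_weight * len(new_tokens) - len_weight * len(tokens)

        # Iterating in sorted order makes ties resolve to the lowest index.
        best = max(sorted(remaining), key=gain)
        selected.append(best)
        covered |= set(candidates[best][0])
        remaining.remove(best)

    return selected

# Toy replay buffer: two identical samples and one distinct sample,
# all with the same binary reward of 1.0.
candidates = [
    (["1", "+", "2"], 1.0),
    (["1", "+", "2"], 1.0),
    (["3", "*", "4"], 1.0),
]
print(greedy_replay_selection(candidates, k=2))
# [0, 2]
```

With equal rewards, the duplicate at index 1 adds no new coverage after index 0 is chosen, so the distinct sample wins the second slot.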
We evaluate on tasks where diversity matters: molecule generation, antimicrobial peptide (AMP) design, Game of 24 expression synthesis with binary rewards, and natural-language tasks.
Compared with PPO/GRPO and with GFlowNet baselines such as TB and SubTB, we observe substantial improvements across these tasks.