🚀 Breakthrough RL Algorithm is Here!
Tackling two critical challenges in off-policy RL training for large models:
1️⃣ Training instability & crashes
2️⃣ Decreased exploration due to entropy decay
Our Discovery:
Through deep analysis of PPO-like objectives, we identified the key issue: imbalanced optimization! Negative-advantage samples contribute far more to the gradient than positive-advantage samples, which leads to "over-punishment" of the policy.
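To make the imbalance concrete, here is a minimal Python sketch (not the authors' code) that splits the unclipped part of a PPO-style clipped surrogate into positive- and negative-advantage shares. The function name `contribution_split`, the clip bounds 0.8 and 1.2, and the toy ratios and advantages are all illustrative assumptions, and |ratio × advantage| is used only as a simple proxy for how much weight a token puts on the gradient.

```python
def contribution_split(ratios, advantages, c_low=0.8, c_high=1.2):
    """Return (positive_share, negative_share) of the unclipped surrogate.

    Hypothetical helper: |ratio * advantage| stands in for a token's
    weight on the log-probability gradient.
    """
    pos = 0.0
    neg = 0.0
    for r, a in zip(ratios, advantages):
        # A positive-advantage token stops contributing once r exceeds c_high.
        if a > 0 and r <= c_high:
            pos += r * a
        # A negative-advantage token stops contributing once r drops below c_low.
        elif a < 0 and r >= c_low:
            neg += r * -a
    total = pos + neg
    if total == 0.0:
        return 0.0, 0.0
    return pos / total, neg / total


# Toy batch (made-up numbers): two positive tokens, four negative tokens.
ratios = [1.0, 1.3, 0.9, 1.1, 1.0, 0.95]
advantages = [1.0, 1.0, -1.0, -1.0, -1.0, -1.0]

pos_share, neg_share = contribution_split(ratios, advantages)
print(f"positive share: {pos_share:.2f}, negative share: {neg_share:.2f}")
```

In this toy batch, one of the two positive tokens is already clipped out (ratio 1.3 > 1.2), so the negative tokens dominate the update.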
The Solution:
By rebalancing the gradient contributions of positive and negative samples toward a 1:1 ratio, we improve stability and preserve the model's exploration capability!
Deep Mechanism:
We trace entropy decay to its root cause and propose an "entropy clipping rule":
Low-probability positive-advantage tokens and high-probability negative-advantage tokens are the key to maintaining entropy!
However, the fixed clipping bounds in existing PPO-style algorithms systematically exclude these tokens.
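One way to see how a fixed upper bound can shut out low-probability positive-advantage tokens is a quick piece of arithmetic, sketched in Python below. The helper name `max_prob_before_clip` and the bound 1.2 are assumptions chosen for illustration; the point is only that a token's ratio is p_new / p_old, so the largest probability it can reach before being clipped scales with p_old.

```python
def max_prob_before_clip(p_old, c_high=1.2):
    """Largest new probability a positive-advantage token can reach
    before its importance ratio p_new / p_old exceeds c_high.

    Hypothetical helper for illustration only.
    """
    return min(1.0, p_old * c_high)


# A rare token versus a common token under the same fixed bound.
rare_old, common_old = 0.01, 0.9

rare_headroom = max_prob_before_clip(rare_old) - rare_old        # about 0.002
common_headroom = max_prob_before_clip(common_old) - common_old  # about 0.1

print(f"rare token headroom:   {rare_headroom:.3f}")
print(f"common token headroom: {common_headroom:.3f}")
```

Under these assumed numbers, the rare token can gain only about 0.002 in probability before its gradient is cut off, while the common token has roughly 50 times more room to move.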
Introducing the BAPO Algorithm!
Core Innovations:
• Dynamic adjustment of clipping boundaries
• Balancing positive and negative sample contributions
• Incorporating key low-probability positive advantage tokens
• Excluding overly negative samples
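The ideas in the list above can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the BAPO implementation: the function names, the 0.5 target share (a 1:1 ratio), the starting bounds, caps, step sizes, and the "widen the upper bound first, then raise the lower bound" order are all hypothetical choices made for the example.

```python
def positive_share(ratios, advantages, c_low, c_high):
    """Share of the unclipped surrogate coming from positive-advantage tokens."""
    pos = 0.0
    neg = 0.0
    for r, a in zip(ratios, advantages):
        if a > 0 and r <= c_high:
            pos += r * a
        elif a < 0 and r >= c_low:
            neg += r * -a
    total = pos + neg
    return pos / total if total > 0.0 else 0.0


def adapt_clip_bounds(ratios, advantages, target=0.5,
                      c_low=0.8, c_high=1.2,
                      c_low_max=0.95, c_high_max=3.0,
                      step_low=0.05, step_high=0.1):
    """Adjust (c_low, c_high) for one batch until positive tokens reach `target`.

    Assumed schedule: first raise c_high to admit more positive tokens,
    then raise c_low to exclude the most heavily penalized negative tokens.
    """
    tol = 1e-12  # guards against floating-point drift at the caps
    while positive_share(ratios, advantages, c_low, c_high) < target:
        if c_high + step_high <= c_high_max + tol:
            c_high += step_high
        elif c_low + step_low <= c_low_max + tol:
            c_low += step_low
        else:
            break  # both bounds are at their caps; stop adjusting
    return c_low, c_high


# Toy batch (made-up numbers): three positive tokens, four negative tokens.
ratios = [1.0, 1.35, 1.75, 0.82, 0.9, 1.0, 1.1]
advantages = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0]

new_low, new_high = adapt_clip_bounds(ratios, advantages)
before = positive_share(ratios, advantages, 0.8, 1.2)
after = positive_share(ratios, advantages, new_low, new_high)
print(f"bounds: ({new_low:.2f}, {new_high:.2f}), "
      f"positive share: {before:.2f} -> {after:.2f}")
```

In this toy batch, widening only the upper bound is enough to bring the positive share past the target, so the lower bound stays where it started.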
Impressive Results:
Tested across various complex scenarios:
✅ Standard Replay Mode
✅ Partial Rollout Mode
Proven on architectures like Qwen and Llama.
On the AIME 2024/2025 benchmarks:
🌟 Outperforms existing SOTA open-source models of the same scale
🌟 Our 32B model surpasses leading proprietary systems
Our analysis and the BAPO algorithm give the LLM RL community a stable, efficient new approach to off-policy training that keeps exploration alive! 🚀
#ReinforcementLearning #LargeModelTraining #AlgorithmInnovation #AIResearch #LLM #DeepLearning #AI