Elf🧚‍♂️（🇨🇳互关）


Elf🧚‍♂️（🇨🇳互关）@lingt1923

Jun 7

Domestic Chip Shatters Myths: Full-Parameter Post-Training of the 1.6-Trillion-Parameter DeepSeek-V4-Pro Completed on Ascend 910C #HuaweiAscend910C #DomesticAI #ComputingPower #Parameters #LargeModelTraining #NVIDIA #DeepSeek

Prof Qi Zhang

@QiZhang_FDU

1 Nov 2025

🚀 Breakthrough RL Algorithm is Here! Tackling the two critical challenges in off-policy training for large models:
1️⃣ Training instability & crashes
2️⃣ Decreased exploration due to entropy decay
Our Discovery: Through deep analysis of PPO-like objectives, we identified the key issue: imbalanced optimization! Negative advantage samples contribute far more to the gradient than positive samples, leading to "over-punishment" of the policy.
The Solution: By adjusting the gradient contribution ratio of positive to negative samples to 1:1, we immediately enhance stability and preserve the model's exploration capability!
Deep Mechanism: We uncover the root cause of entropy decay and propose an "entropy clipping rule": Low-probability positive advantage tokens and high-probability negative advantage tokens are the key to maintaining entropy! However, existing PPO algorithms' fixed clipping mechanisms systematically exclude these tokens.
Introducing the BAPO Algorithm! Core Innovations:
• Dynamic adjustment of clipping boundaries
• Balancing positive and negative sample contributions
• Incorporating key low-probability positive advantage tokens
• Excluding overly negative samples
Impressive Results: Tested across various complex scenarios:
✅ Standard Replay Mode
✅ Partial Rollout Mode
Proven on architectures like Qwen and Llama.
In AIME 2024/2025 Benchmarks:
🌟 Outperformed existing SOTA open-source models of the same scale
🌟 32B model surpasses leading proprietary systems
Our deep analysis and BAPO algorithm offer a stable and efficient new approach for off-policy training in the LLM RL community, making large model training more stable and improving exploration capabilities! 🚀
#ReinforcementLearning #LargeModelTraining #AlgorithmInnovation #AIResearch #LLM #DeepLearning #AI
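To make the "imbalanced optimization" claim concrete, here is a minimal sketch of the standard PPO clipped surrogate evaluated per token, with the surviving terms split by the sign of the advantage. It is not taken from the BAPO paper: the function names, the toy batch, and the use of the surrogate magnitude as a stand-in for "gradient contribution" are all illustrative assumptions.

```python
def ppo_clip_terms(ratios, advantages, clip_low=0.8, clip_high=1.2):
    """Per-token PPO-clip surrogate plus a flag for whether the token still gets gradient.

    ratios are importance ratios pi_new / pi_old; the fixed bounds 0.8 / 1.2
    correspond to the common epsilon = 0.2 setting.
    """
    terms = []
    for r, a in zip(ratios, advantages):
        clipped_r = min(max(r, clip_low), clip_high)
        unclipped, clipped = r * a, clipped_r * a
        surrogate = min(unclipped, clipped)
        # Gradient flows only when the unclipped branch is the one selected by min().
        active = unclipped <= clipped
        terms.append((surrogate, active))
    return terms


def contribution_by_sign(ratios, advantages, clip_low=0.8, clip_high=1.2):
    """Sum |surrogate| over unclipped tokens, split by advantage sign.

    Assumption: |surrogate| is used here only as a rough proxy for how much
    each group of tokens contributes to the update.
    """
    pos = neg = 0.0
    terms = ppo_clip_terms(ratios, advantages, clip_low, clip_high)
    for (surrogate, active), a in zip(terms, advantages):
        if not active:
            continue
        if a > 0:
            pos += abs(surrogate)
        elif a < 0:
            neg += abs(surrogate)
    return pos, neg


# Hypothetical toy batch: one positive token is clipped away (ratio 1.3),
# while two negative tokens stay active.
ratios = [1.0, 1.3, 0.9, 0.7, 1.1]
advantages = [1.0, 1.0, -2.0, -2.0, -1.0]
pos, neg = contribution_by_sign(ratios, advantages)
print(f"positive: {pos:.2f}, negative: {neg:.2f}")
```

In this toy batch the positive-advantage token whose ratio has risen past the upper bound drops out of the update, which is the kind of exclusion by fixed clipping that the thread describes.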


Prof Qi Zhang

@QiZhang_FDU

24 Oct 2025

🔥 Breakthrough RL Algorithm Solves Two Major Off-Policy Issues in Large Models
Core Issues:
1️⃣ Instability: training can collapse
2️⃣ Policy entropy decay: exploration suffers
Our Discovery: Analysis of PPO-like objectives reveals optimization imbalance. Negative-advantage samples dominate gradients more than positive ones, over-punishing the policy.
Solution: Adjusting the positive-to-negative gradient ratio to 1:1 improves stability while preserving exploration.
Mechanistic Insight: Entropy decay arises from PPO’s fixed clipping. Low-probability positive-advantage tokens and high-probability negative-advantage tokens are critical but excluded in standard PPO.
BAPO Algorithm Innovations:
1️⃣ Dynamically adjusts clipping boundaries
2️⃣ Balances positive and negative sample contributions
3️⃣ Includes key low-probability positive-advantage tokens
4️⃣ Excludes overly negative samples
Empirical Results:
✅ Tested in Standard Replay and Partial Rollout modes, supporting Qwen, LLaMA, and other architectures
✅ At AIME 2024/2025 benchmarks:
• Outperforms all comparable open-source SOTA
• 32B models surpass leading proprietary systems
Summary: Our analysis and BAPO algorithm provide a stable and efficient off-policy training solution for LLMs, enhancing both stability and exploration.
🔗huggingface.co/papers/2510.1…
#ReinforcementLearning #LargeModelTraining #AlgorithmInnovation #AIResearch #BAPO
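The idea of adjusting clipping boundaries until positive and negative samples contribute 1:1 can be sketched as a small search loop. This is only one plausible reading of the thread, not the paper's procedure: the function names, the step size, the limits `low_max` and `high_max`, the order in which the bounds are moved, and the use of surrogate magnitude as the contribution measure are all hypothetical.

```python
def positive_share(ratios, advantages, clip_low, clip_high):
    """Fraction of the unclipped PPO surrogate magnitude that comes from positive advantages."""
    pos = neg = 0.0
    for r, a in zip(ratios, advantages):
        clipped_r = min(max(r, clip_low), clip_high)
        if r * a > clipped_r * a:
            continue  # min() picks the clipped branch, so this token gets no gradient
        if a > 0:
            pos += abs(r * a)
        elif a < 0:
            neg += abs(r * a)
    total = pos + neg
    return pos / total if total > 0 else 0.0


def balance_clip_bounds(ratios, advantages, target=0.5,
                        low=0.8, high=1.2,
                        low_max=0.95, high_max=2.0, step=0.05):
    """Move the clip bounds until positive tokens reach `target` share (0.5 means 1:1).

    Assumed strategy: first raise the upper bound to admit more positive-advantage
    tokens, then raise the lower bound to drop strongly negative ones.
    """
    while positive_share(ratios, advantages, low, high) < target:
        if high < high_max:
            high = min(high + step, high_max)
        elif low < low_max:
            low = min(low + step, low_max)
        else:
            break  # both bounds exhausted; return the best effort
    return low, high


# Hypothetical toy batch: the positive token at ratio 1.28 is excluded by the default bound 1.2.
ratios = [1.0, 1.28, 0.9, 0.85]
advantages = [1.0, 1.0, -1.0, -1.0]
before = positive_share(ratios, advantages, 0.8, 1.2)
low, high = balance_clip_bounds(ratios, advantages)
after = positive_share(ratios, advantages, low, high)
print(f"share before: {before:.3f}, after: {after:.3f}, bounds: ({low:.2f}, {high:.2f})")
```

Raising the upper bound re-admits positive-advantage tokens that fixed clipping would drop, and raising the lower bound removes the most heavily penalized negative tokens, which matches the "includes" and "excludes" items in the list above.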

Paper page - BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy...

Join the discussion on this paper page

huggingface.co
