Prof Qi Zhang

@QiZhang_FDU

1 Nov 2025

🚀 Breakthrough RL Algorithm is Here!

Tackling the two critical challenges in off-policy training for large models:
1️⃣ Training instability & crashes
2️⃣ Decreased exploration due to entropy decay

Our Discovery: Through deep analysis of PPO-like objectives, we identified the key issue: imbalanced optimization! Negative advantage samples contribute far more to the gradient than positive samples, leading to "over-punishment" of the policy.

The Solution: By adjusting the gradient contribution ratio of positive to negative samples to 1:1, we immediately enhance stability and preserve the model's exploration capability!

Deep Mechanism: We uncover the root cause of entropy decay and propose an "entropy clipping rule": Low-probability positive advantage tokens and high-probability negative advantage tokens are the key to maintaining entropy! However, existing PPO algorithms' fixed clipping mechanisms systematically exclude these tokens.

Introducing the BAPO Algorithm!

Core Innovations:
• Dynamic adjustment of clipping boundaries
• Balancing positive and negative sample contributions
• Incorporating key low-probability positive advantage tokens
• Excluding overly negative samples

Impressive Results:
Tested across various complex scenarios:
✅ Standard Replay Mode
✅ Partial Rollout Mode
Proven on architectures like Qwen and Llama.

In AIME 2024/2025 Benchmarks:
🌟 Outperformed existing SOTA open-source models of the same scale
🌟 32B model surpasses leading proprietary systems

Our deep analysis and BAPO algorithm offer a stable and efficient new approach for off-policy training in the LLM RL community, making large model training more stable and improving exploration capabilities! 🚀

#ReinforcementLearning #LargeModelTraining #AlgorithmInnovation #AIResearch #LLM #DeepLearning #AI
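A minimal sketch of the idea in the post above, not the authors' implementation: in a PPO-style clipped objective, a positive-advantage token stops contributing gradient once its importance ratio exceeds the upper clip bound, and a negative-advantage token stops contributing once its ratio falls below the lower bound. The names `Token`, `positive_share`, and `balance_bounds`, the `|ratio * advantage|` proxy for gradient contribution, the step size, and the bound limits are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Token:
    ratio: float      # importance ratio pi_new / pi_old for this token
    advantage: float  # estimated advantage for this token


def contributes(tok, c_low, c_high):
    """True if the clipped surrogate still passes gradient for this token."""
    if tok.advantage > 0:
        return tok.ratio <= c_high  # positive tokens are cut off above c_high
    return tok.ratio >= c_low       # negative tokens are cut off below c_low


def positive_share(tokens, c_low, c_high):
    """Share of the total |ratio * advantage| weight (an assumed proxy for
    gradient contribution) that comes from positive-advantage tokens."""
    pos = neg = 0.0
    for tok in tokens:
        if not contributes(tok, c_low, c_high):
            continue
        weight = abs(tok.ratio * tok.advantage)
        if tok.advantage > 0:
            pos += weight
        else:
            neg += weight
    total = pos + neg
    return pos / total if total > 0 else 0.0


def balance_bounds(tokens, c_low=0.8, c_high=1.2, target=0.5,
                   step=0.05, c_low_max=0.95, c_high_max=2.0):
    """Widen the bounds until positive tokens reach the target share.

    Hypothetical schedule: raise c_high first (admits more positive tokens),
    then raise c_low (drops the most down-weighted negative tokens).
    """
    while positive_share(tokens, c_low, c_high) < target:
        if c_high + step <= c_high_max + 1e-12:
            c_high += step
        elif c_low + step <= c_low_max + 1e-12:
            c_low += step
        else:
            break  # limits reached; accept the best balance available
    return c_low, c_high


# Toy batch: one positive token with a large ratio is clipped at first.
tokens = [Token(1.5, 2.0), Token(1.0, 0.5),
          Token(0.9, -1.0), Token(1.1, -1.0), Token(0.7, -2.0)]
before = positive_share(tokens, 0.8, 1.2)
c_low, c_high = balance_bounds(tokens)
after = positive_share(tokens, c_low, c_high)
```

In this toy batch the fixed bounds leave negative-advantage tokens with most of the weight; widening only the upper bound is enough to bring the large-ratio positive token back in and reach the 1:1 target (`target=0.5`).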

174

Prof Qi Zhang

@QiZhang_FDU

24 Oct 2025

🔥 Breakthrough RL Algorithm Solves Two Major Off-Policy Issues in Large Models

Core Issues:
1️⃣ Instability: training can collapse
2️⃣ Policy entropy decay: exploration suffers

Our Discovery: Analysis of PPO-like objectives reveals optimization imbalance. Negative-advantage samples dominate gradients more than positive ones, over-punishing the policy.

Solution: Adjusting the positive-to-negative gradient ratio to 1:1 improves stability while preserving exploration.

Mechanistic Insight: Entropy decay arises from PPO's fixed clipping. Low-probability positive-advantage tokens and high-probability negative-advantage tokens are critical but excluded in standard PPO.

BAPO Algorithm Innovations:
1️⃣ Dynamically adjusts clipping boundaries
2️⃣ Balances positive and negative sample contributions
3️⃣ Includes key low-probability positive-advantage tokens
4️⃣ Excludes overly negative samples

Empirical Results:
✅ Tested in Standard Replay and Partial Rollout modes, supporting Qwen, LLaMA, and other architectures
✅ At AIME 2024/2025 benchmarks:
• Outperforms all comparable open-source SOTA
• 32B models surpass leading proprietary systems

Summary: Our analysis and BAPO algorithm provide a stable and efficient off-policy training solution for LLMs, enhancing both stability and exploration.

🔗huggingface.co/papers/2510.1…

#ReinforcementLearning #LargeModelTraining #AlgorithmInnovation #AIResearch #BAPO

Paper page - BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy...

huggingface.co
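For reference, the "fixed clipping" the post refers to can be written as the standard PPO-style clipped surrogate. The symbols $c_{\text{low}}$ and $c_{\text{high}}$ are notation assumed here for the two clip bounds; standard PPO fixes them at $1-\epsilon$ and $1+\epsilon$, whereas the post describes adjusting them dynamically.

```latex
J(\theta) = \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,A_t,\ \operatorname{clip}\big(r_t(\theta),\,c_{\text{low}},\,c_{\text{high}}\big)\,A_t\Big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t})}

% The gradient with respect to \theta vanishes for a token when
%   A_t > 0 and r_t(\theta) > c_{\text{high}}   (positive token clipped from above), or
%   A_t < 0 and r_t(\theta) < c_{\text{low}}    (negative token clipped from below).
```

A positive-advantage token whose ratio has grown past $c_{\text{high}}$ is therefore dropped from the update, which is how a fixed upper bound can shut out the tokens the post calls critical for maintaining entropy.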

116

Dustin

@r0ck3t23

30 Jan 2025

Revolutionizing GPU Performance: The 800-Fold Leap with a New Chinese Algorithm

A groundbreaking development from researchers at Shenzhen MSU-BIT University has catapulted the performance of Nvidia GPUs by an astounding 800-fold increase in scientific computing scenarios. Here's a comprehensive look at this technological leap:

How They Did It:
The algorithm in question focuses on enhancing the computational efficiency of peridynamics (PD), a non-local theory used for modeling complex physical phenomena like cracks, damage, and fractures in materials. The traditional computational complexity of PD simulations has been a bottleneck due to high memory usage and slow processing speeds. The researchers have cleverly optimized this by:
- Parallel Processing: By leveraging the parallel computing capabilities of GPUs, the algorithm distributes computational tasks across numerous GPU cores, reducing the time taken for each calculation significantly.
- Memory Optimization: The new method reduces memory requirements by employing more efficient data structures and algorithms that minimize redundancy and optimize data access patterns, allowing for larger scale simulations without proportional increases in memory usage.

How It Works:
The algorithm works by:
- Breaking Down Complex Problems: It subdivides large-scale peridynamic simulations into manageable, parallel tasks. Each task is processed concurrently on different GPU cores, utilizing the inherent parallelism of GPUs which are designed for handling thousands of threads simultaneously.
- Adaptive Meshing: Instead of using a static grid for calculations, this approach dynamically adjusts the computational mesh to focus computational power where it's most needed, thus enhancing the efficiency of each computation step.
- Data Compression Techniques: By compressing data on the fly, the algorithm ensures that the GPU's high-bandwidth memory (HBM) is used more effectively, allowing for faster data transfer between the GPU's processing units and memory.

What It Means:
- Enhanced Problem-Solving: This breakthrough means that complex mechanical problems in industries such as aerospace, military applications, and bridge design can now be simulated with unprecedented speed and accuracy on consumer-grade GPUs.
- Cost-Effective Solutions: By achieving such performance gains with existing hardware, there's potential for significant cost savings in computational resources. Organizations can now tackle larger and more complex simulations without needing to invest in exponentially more powerful and expensive hardware.
- Broader Access to Advanced Simulations: With this algorithm, even those with budget constraints can perform high-level simulations, democratizing access to advanced computational tools.

Significance:
- Technological Advancement: It signifies a massive leap in how we utilize GPU technology for scientific computing, potentially reshaping the landscape of computational simulations.
- Global Innovation: This is an example of how international collaboration (between Chinese and Russian institutions in this case) can lead to significant technological breakthroughs, pushing the boundaries of what's possible with current hardware.
- Strategic Importance: For Nvidia, this could mean renewed interest in their GPUs for scientific and industrial applications, potentially offsetting some of the challenges posed by U.S. export controls on advanced chips to China.

In summary, this new algorithm not only showcases the untapped potential of Nvidia GPUs but also sets a new benchmark for what can be achieved in scientific computing with algorithmic innovation. This could herald a new era where computational power is less about hardware alone and more about the ingenuity of the software driving it.

#GPUPerformance #Nvidia #AlgorithmInnovation #ComputationalScience #Peridynamics #HighPerformanceComputing #TechBreakthrough #AIResearch #ScientificComputing #BigDataSimulation
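To make the "parallel tasks" point concrete, here is a minimal sketch of a textbook 1D bond-based peridynamic force computation. It is not the researchers' algorithm, and it runs serially on the CPU. The function name `bond_forces_1d` and the parameters `horizon`, `c` (bond stiffness), and `volume` are assumptions chosen for illustration. The outer loop over points has no cross-iteration dependencies, which is the property that lets a GPU assign one thread per material point.

```python
def bond_forces_1d(x, u, horizon, c, volume):
    """Force density at each point from all bonds within the horizon.

    x: reference positions, u: displacements (same length).
    Each bond pulls with magnitude c * stretch, where
    stretch = (deformed length - reference length) / reference length.
    """
    n = len(x)
    forces = [0.0] * n
    for i in range(n):  # each i is independent: one GPU thread per point
        for j in range(n):
            if i == j:
                continue
            xi = x[j] - x[i]             # reference bond vector
            if abs(xi) > horizon:
                continue                 # non-local, but only within the horizon
            eta = u[j] - u[i]            # relative displacement
            stretch = (abs(xi + eta) - abs(xi)) / abs(xi)
            direction = 1.0 if (xi + eta) > 0 else -1.0
            forces[i] += c * stretch * direction * volume
    return forces


x = [0.0, 1.0, 2.0, 3.0, 4.0]

# Rigid translation: no bond changes length, so no forces arise.
rigid = bond_forces_1d(x, [0.3] * 5, horizon=1.5, c=1.0, volume=1.0)

# Uniform 10% stretch: interior points are pulled equally from both sides,
# while the end points feel a net pull toward the interior.
stretched = bond_forces_1d(x, [0.1 * p for p in x], horizon=1.5, c=1.0, volume=1.0)
```

The double loop also shows why memory and runtime grow quickly with the horizon: every point interacts with every neighbor inside it, not just adjacent grid cells.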

119

The Quantum Insider

@QuantumDaily

16 Sep 2024

- @KipuQuantum has released an updated roadmap focused on achieving Commercial Quantum Advantage. 🛣️ thequantuminsider.com/2024/0… #QuantumAdvantage #AlgorithmInnovation #CommercialQuantum

734