We propose DRPO: a soft version of DPPO🔥
Since PPO, clipping- and mask-based trust regions have consistently outperformed smooth divergence regularization such as KL penalties, even though the latter feels more principled. 👺
We found two missing pieces:👇
1️⃣ Weight the regularizer by |advantage|
- Otherwise, the trust-region geometry changes dynamically and optimization becomes unstable.
2️⃣ Use the right divergence
- What matters is not just “regularization”, but the trust-region geometry induced by the gradient. DPPO-style geometry works much better than PPO-style geometry for LLMs.
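To make 1️⃣ concrete, here is a minimal NumPy sketch of an |advantage|-weighted soft penalty. This is not the paper's objective: the function name, `beta`, and the squared probability difference used as the divergence are all placeholder assumptions chosen for illustration only.

```python
import numpy as np

def advantage_weighted_penalty_loss(logp_new, logp_old, advantages, beta=0.1):
    """Illustrative per-token loss: surrogate plus an |advantage|-weighted penalty.

    Hypothetical helper for this post, not the DRPO implementation.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # Importance ratio pi_new / pi_old and the usual policy-gradient surrogate.
    ratio = np.exp(logp_new - logp_old)
    surrogate = ratio * advantages

    # Placeholder divergence (an assumption): squared difference of token
    # probabilities. Swap in whichever divergence you want to study.
    divergence = (np.exp(logp_new) - np.exp(logp_old)) ** 2

    # Point 1: weight the regularizer by |advantage|, so the penalty is on
    # the same scale as the surrogate term it is meant to constrain.
    penalty = beta * np.abs(advantages) * divergence

    # Loss to minimize: penalty minus the surrogate we want to maximize.
    return float(np.mean(penalty - surrogate))
```

In this sketch a token with zero advantage contributes neither a surrogate term nor a penalty, and the penalty grows in step with the advantage magnitude instead of being a fixed-strength term.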
These insights lead to DRPO, which is the most robust and the best-performing overall among the algorithms we compared, even outperforming the original mask-based DPPO. 🚀
This project is an amazing collaboration with @ExplainMiracles, @NickZhou523786, Wee Sun Lee, Liefeng Bo, and @TianyuPang1. Do follow them if you are interested in this work!
📄 Paper: arxiv.org/pdf/2606.09821
💻 Code: github.com/Tencent-Hunyuan/U…