GenSI

@hello_gensi

Feb 2

Why does your LLM struggle with multiple rewards? 🧩 The secret lies in "Reward Compression." GRPO sums rewards before normalization, making it blind to whether the model hit one goal or all of them.

Introducing GDPO (Group reward-Decoupled Normalization Policy Optimization) from NVIDIA:

🔹 The Problem: Group-wise normalization of the summed reward maps different reward combinations to the same advantage pattern. Signal loss = Training instability.
🔹 The Fix: Decouple the process. Normalize each reward dimension separately first, then aggregate.
🔹 The Result:
✅ Precise distinction between "correct format" and "correct answer."
✅ Stable convergence in complex Tool Calling, Math, and Coding tasks.
✅ Prevents the model from "cheating" by only optimizing easy rewards.

Incredible work by @nbasyl_tw @shizhediao @NVIDIAAI!

Check out our deep-dive video below to see how GDPO redefines Multi-reward RL! 👇

#LLMs #AI #MachineLearning #GDPO #GRPO #NVIDIA #ReinforcementLearning #MultiObjectiveRL #AIResearch #LLMAlignment
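To make "Reward Compression" concrete, here is a minimal Python sketch of the two orderings described above: sum-then-normalize versus normalize-then-sum. It is an illustration based only on this post, not NVIDIA's implementation. The function names, the two-rollout groups, and the `(format_ok, answer_ok)` reward layout are all hypothetical, and any further steps the actual method may apply are omitted.

```python
from statistics import mean, pstdev

def normalize(values, eps=1e-8):
    """Group-wise z-score: subtract the group mean, divide by the group std."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def grpo_advantages(rewards):
    """Sum each rollout's rewards first, then normalize the totals across the group."""
    totals = [sum(r) for r in rewards]
    return normalize(totals)

def gdpo_advantages(rewards):
    """Normalize each reward dimension across the group, then sum per rollout."""
    per_dim = [normalize(list(col)) for col in zip(*rewards)]
    return [sum(col) for col in zip(*per_dim)]

# Hypothetical groups of two rollouts, each scored as (format_ok, answer_ok).
group_a = [(0, 0), (1, 0)]  # second rollout hit one goal
group_b = [(0, 0), (1, 1)]  # second rollout hit both goals

for name, group in (("A", group_a), ("B", group_b)):
    print(name, "sum-then-normalize:", grpo_advantages(group))
    print(name, "normalize-then-sum:", gdpo_advantages(group))
```

Under these assumptions, sum-then-normalize gives both groups roughly (-1, +1), so "hit one goal" and "hit both goals" look identical. Normalize-then-sum gives roughly (-1, +1) for group A and (-2, +2) for group B, so the difference survives.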
