@emnlpmeeting / #EMNLP2025 Accepted Paper: Direct Judgement Preference Optimization
📝 Paper: arxiv.org/abs/2409.14664
This work introduces SFR-Judges, a family of generative judge models trained with Direct Preference Optimization (DPO) to enhance LLM evaluation capabilities across diverse tasks. The approach moves beyond traditional supervised fine-tuning by learning from both positive and negative evaluation examples, addressing the limitation that SFT only learns from correct judgements without avoiding incorrect ones.
Key contributions:
➡️ Novel DPO training approach with three preference data types: Chain-of-Thought critique, standard judgement, and response deduction
➡️ Response deduction auxiliary task that teaches judges to understand what good/bad responses contain by deducing original responses from evaluations
➡️ Comprehensive evaluation across 13 benchmarks covering single rating, pairwise comparison, and classification tasks
➡️ State-of-the-art performance with SFR-LLaMA-3.1-70B-Judge achieving best results on 10/13 benchmarks, outperforming GPT-4o
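For readers new to DPO, here is a minimal sketch of the standard DPO loss on one preference pair, to show how a "good" and a "bad" judgement both enter the objective. This is not the authors' training code: the function name `dpo_loss`, the `beta` value, and the log-probabilities in the usage lines are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trained policy and
    a frozen reference model. beta=0.1 is an assumed, illustrative value.
    """
    # How much more (or less) likely each output is under the policy
    # than under the reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)), written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Hypothetical log-probs: the policy favors the chosen judgement
# more than the reference does, so the loss drops below log(2).
loss_good = dpo_loss(-1.0, -3.0, -2.0, -2.0)
loss_bad = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

The loss falls as the policy raises the chosen judgement and lowers the rejected one relative to the reference, which is the "learn from negatives too" signal that plain SFT lacks.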
Results show the largest model (70B) achieves 92.7% on RewardBench, making it the first generative judge to exceed 90% accuracy. The approach effectively counters evaluation biases like position and length bias while providing flexible adaptation to different evaluation protocols. Additional analysis shows the models can serve as effective reward models for downstream development, improving AlpacaEval win rates from 39.25% to 44.29% through AI feedback refinement.
👥 Authors: Peifeng Wang @PeifengWang3, Austin Xu @austinsxu, Yilun Zhou @YilunZhou, Caiming Xiong @CaimingXiong, Shafiq Joty @JotyShafiq
#FutureOfAI #EnterpriseAI #LLMEvaluation #PreferenceOptimization #RewardModeling #AIFeedback #MachineLearning #AutoEvaluation
5️⃣ Final thoughts?
Reward modeling isn’t perfect, and no method completely eliminates reward hacking.
But with a structured inductive bias, trajectory-consistent learning, and lower complexity, GenARM makes a strong case for more reliable token-level alignment.
Let’s discuss! 👇
Do you think reward hacking can ever be fully solved? 🤔
#AI #LLMs #ICLR2025 #RewardModeling #TestTimeAlignment
🦃 At the end of Thanksgiving holidays, I finally finished the piece on reward hacking. Not an easy one to write, phew.
Reward hacking occurs when an RL agent exploits flaws in the reward function or env to maximize rewards without learning the intended behavior. This is imo a major blocker for real-world deployment of more autonomous use cases of AI models.
Also would like to call for more research on mitigation strategies for reward hacking, especially in the context of LLMs and RLHF.
👉lilianweng.github.io/posts/2…
🚀 Want to level up your AI skills? In just 2 hours, learn to train LLMs for Reward Modeling and fine-tune models with LoRA.
Perfect for anyone looking to optimize AI for complex tasks. Join now!
shorturl.at/UIVbb #AI #MachineLearning #LLM #RewardModeling