Check out our new work on making reasoning models think broadly! 🤔
We find a minimalist, surprisingly effective recipe to THINK for CHAT: RLVR + a strong reward model, trained on real-world prompts.
This project was fun and surprised me in a few ways 👇
📌 We can run RL directly on a base model (no SFT), showing base models might already chat well.
Llama-3.1-8B-Base with only 7K prompts ends up chatting well, matching Llama-3.1-8B-Instruct. This is interesting since Instruct was trained with a complex multi-stage pipeline. Also nice to see this working on Llama, while most RLVR papers only show success on Qwen.
📌 Interesting findings about rewards. Leaderboard scores of reward models aren’t always the best indicator of downstream performance. We also tested checklist-based rewards, which helped on synthetic instruction-following tasks (IFEval) but didn’t generalize well to chat. I still believe in this direction, and would love to see more open-source efforts.
📌 Real user prompts (shout out to WildChat, @wzhao_nlp) were the most effective.
These prompts often require “thinking before answering,” which makes them a good fit for teaching models general thinking. The recipe is simple, but we need good ingredients to cook better.
📌 The choice of algorithm (e.g., GRPO vs. PPO) has a bigger impact when training directly from base models; once warm-started with SFT, models are less sensitive to it.
Overall, my feeling is: if we start with a strong base LM and put it in the right “chat environment” (good prompts + good rewards), simple RL training goes a long way. So we are quite excited to explore pretraining and reward design further!
Language models that think, chat better.
We used long CoT (w/ a reward model) for RLHF instead of math, and it just works. Llama-3.1-8B-Instruct + 14K examples beats GPT-4o (!) on chat & creative writing, & even Claude-3.7-Sonnet (thinking) on AlpacaEval2 and WildBench!
Read on. 🧵
1/8