Current reasoning LLMs do all their thinking upfront, before writing any code. But developers don't work that way. We pause during implementation when things get tricky.
Think Anywhere, from Xue Jiang, Yihong Dong, Ge Li and collaborators at Peking University and @TongYi_China (Alibaba Tongyi Lab), lets LLMs invoke reasoning at any token position during code generation, not just before it.
The method uses cold-start training followed by reinforcement learning (GRPO) so the model learns WHEN to pause and think. Analysis shows it triggers reasoning at positions with high entropy, exactly where bugs tend to appear.
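To make "high entropy" concrete, here is a minimal sketch of how you might flag uncertain positions in a generation. It is an illustration, not the authors' code: in Think Anywhere the model learns where to think through RL, and entropy shows up only in the analysis. The function names, the toy distributions, and the 1.0-nat threshold are all assumptions.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def high_entropy_positions(distributions, threshold=1.0):
    """Return indices of steps whose entropy exceeds a (hypothetical) threshold."""
    return [
        i for i, probs in enumerate(distributions)
        if token_entropy(probs) > threshold
    ]

# Toy next-token distributions over a 4-token vocabulary (made-up numbers).
steps = [
    [0.97, 0.01, 0.01, 0.01],  # model is confident
    [0.25, 0.25, 0.25, 0.25],  # model is maximally uncertain
    [0.90, 0.05, 0.03, 0.02],  # model is fairly confident
]

print(high_entropy_positions(steps))  # [1]
```

Only the uniform step crosses the threshold, which is the kind of position where a mid-generation pause would plausibly help most.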
Results on Qwen2.5 Coder 7B: 9.3% absolute improvement over the base model across LeetCode, LiveCodeBench, HumanEval, and MBPP. It beats both distillation methods (OlympicCoder) and other RL approaches (CodeRL, CodeBoost).
The model also generalizes to math reasoning benchmarks without any math training. Code and data are open source.