"Always reasoning" isn't the optimal strategy for LLM agents! 🧠
Our new work from UCL DARK identifies a "Goldilocks" effect: planning too frequently, or not enough, degrades performance. We show how to train agents to dynamically allocate test-time compute for best results. 👇
Almost all agentic pipelines prompt LLMs to explicitly plan before every action (ReAct), but turns out this isn't optimal for Multi-Step RL 🤔 Why?
In our new work we highlight a crucial issue with ReAct and show that we should make and follow plans instead🧵