Agents do not scale because they spend more compute.
They scale because they turn interaction into usable feedback.
A sharp new preprint by Xuanliang Zhang, Dingzirui Wang, Keyan Xu, Qingfu Zhu, and Wanxiang Che introduces:
Scaling Laws for Agent Harnesses via Effective Feedback Compute
This matters because agent performance is no longer determined only by the base model.
It depends on the harness:
how the model calls tools
how it receives feedback
how it verifies intermediate states
how it stores memory
how it repairs errors
how it decides when to stop
But most test-time scaling analysis still measures crude expenditure:
tokens
tool calls
operations
wall time
cost
That is like measuring a research lab by electricity consumed instead of valid evidence produced.
The authors propose a better scaling coordinate:
Effective Feedback Compute, or EFC.
A feedback event only receives credit if it is:
informative
valid
non-redundant
retained for later decisions
That last condition is crucial.
A unit test that reveals a bug and changes the agent’s next action is effective feedback.
A repeated tool call that returns redundant information the agent ignores is not.
Same raw budget.
Different epistemic value.
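To make the credit rule concrete, here is a minimal sketch in Python.
It is not the paper's estimator: the `FeedbackEvent` schema, the boolean flags, and the unit-weight count are all assumptions made for illustration.
It only shows how two runs with the same raw budget can earn different credit.

```python
from dataclasses import dataclass


@dataclass
class FeedbackEvent:
    """One piece of feedback returned to the agent (hypothetical schema)."""
    raw_cost: float     # tokens, seconds, or dollars spent to obtain it
    informative: bool   # tells the agent something relevant to the task
    valid: bool         # the signal is correct, not flaky or misleading
    redundant: bool     # repeats what an earlier event already established
    retained: bool      # still available when a later decision is made


def is_effective(event: FeedbackEvent) -> bool:
    """Credit an event only if all four conditions hold."""
    return (event.informative and event.valid
            and not event.redundant and event.retained)


def effective_feedback(events: list[FeedbackEvent]) -> int:
    """Count credited events: a unit-weight stand-in for EFC (assumption)."""
    return sum(1 for e in events if is_effective(e))


def raw_budget(events: list[FeedbackEvent]) -> float:
    """Total raw spend, regardless of whether the feedback was useful."""
    return sum(e.raw_cost for e in events)


# Run A: two unit tests, each revealing something new that the agent keeps.
run_a = [
    FeedbackEvent(100, True, True, False, True),
    FeedbackEvent(100, True, True, False, True),
]

# Run B: one useful test, then a repeated tool call whose output is ignored.
run_b = [
    FeedbackEvent(100, True, True, False, True),
    FeedbackEvent(100, True, True, True, False),
]

print(raw_budget(run_a), effective_feedback(run_a))
print(raw_budget(run_b), effective_feedback(run_b))
```

Both runs spend 200 units of raw budget, but run A earns credit for two events and run B for one.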
The results are striking.
In controlled scaling experiments, raw tokens and tool calls explain little of the variation in failure rates: R² = 0.33 and 0.42, respectively.
A strong multivariate SAS baseline reaches 0.88.
Oracle-EFC and Estimated-EFC reach 0.94.
And task-demand-normalized Oracle-EFC reaches 0.99.
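For readers who want to see what those R² numbers measure, here is a small Python sketch.
It assumes a plain least-squares line, which may not be the functional form the paper fits.
The data points are made-up toy values, not results from the paper.

```python
def r_squared(xs, ys):
    """R² of an ordinary least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variation left after the fit vs. total variation in ys.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Toy values (hypothetical): log failure rate for four harness settings.
log_failure = [0.0, -1.0, -2.0, -3.0]

# A coordinate that tracks failure closely, and one that tracks it loosely.
tight_coordinate = [1.0, 2.0, 3.0, 4.0]
loose_coordinate = [1.0, 3.0, 2.0, 4.0]

print(round(r_squared(tight_coordinate, log_failure), 2))  # 1.0
print(round(r_squared(loose_coordinate, log_failure), 2))  # 0.64
```

A higher R² means the coordinate accounts for more of the variation in failure rates, which is the sense in which EFC is a better scaling axis than raw tokens.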
Even more important: in matched-budget interventions, where raw cost and tool calls are held fixed, improving feedback quality raises success from 0.27 to 0.90.
That is the whole paper in one lesson:
the unit of progress in agent systems is not the token.
It is durable, task-sufficient feedback.
This reframes agent design.
More tools can hurt.
More turns can be waste.
More tokens can create noise.
More memory can preserve the wrong state.
The harness is a feedback converter.
The real question is not “how much compute did we spend?”
It is:
how efficiently did the harness convert raw budget into information the agent could actually use?
For pretraining, we have scaling coordinates: parameters, data, FLOPs.
For agents, we need a different coordinate.
EFC may be a step toward that.
Full credit to the authors:
Xuanliang Zhang, Dingzirui Wang, Keyan Xu, Qingfu Zhu, Wanxiang Che.
Paper:
Scaling Laws for Agent Harnesses via Effective Feedback Compute
arxiv.org/abs/2605.29682
I’m attaching the first page because the abstract is worth reading closely.
The future of agents may not belong to systems that spend the most.
It may belong to systems that learn the most from each step.
#AIResearch #AIAgents #LLM #MachineLearning #TestTimeCompute #ArtificialIntelligence