Brilliant new paper from Meta, CMU and other labs.
It shows agent speed is mostly a full-system problem, not a “faster model” problem, and with a coordinated stack.
AgentInfer, proposed in this paper, cuts wasted tokens by over 50% and speeds up real agent task completion by about 1.8x to 2.5x.
AgentInfer is a system that makes Large Language Model agents finish tool tasks faster.
A Large Language Model writes chatbot text, and an agent makes it loop, think, call tools like web search, read results, then write again.
These loops get slow because the chat history keeps growing, so every new step has more old text to reread.
AgentCollab uses 2 models, the big model plans and fixes stalls, and the small model does most steps after quick self checks.
AgentCompress keeps the important tool outputs but trims noisy search junk, and it summarizes in the background so the input stays smaller.
AgentSched avoids throwing away cached context when memory is tight, and AgentSAM reuses repeated text from past sessions to draft the next chunks the main model checks.
The punchline is that agent speed comes from coordinating reasoning, memory, and server scheduling, meaning which request runs next, not from faster decoding alone.
----
Paper Link – arxiv. org/abs/2512.18552
Paper Title: "Toward Training Superintelligent Software Agents through Self-Play SWE-RL"