There's a common assumption in AI right now that if one language model can do a task reasonably well, having several of them collaborate — splitting up the work, checking each other's outputs, debating answers — should do it better.
This paper tests that assumption in a controlled experiment across 180 configurations and finds that the reality is messier and more interesting: multi-agent setups improved performance by up to 81% on some tasks and degraded it by up to 70% on others. The difference comes down to whether the task can be broken into parallel pieces or whether each step depends on the previous one.
In a financial analysis, one agent can look at regulatory filings while another reads market news and a third examines earnings data — none of them need to wait for the others.
In a Minecraft crafting puzzle, on the other hand, each action changes the inventory that the next action depends on, so the steps have to happen in order and splitting them across agents just adds overhead without any benefit.
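To make the contrast concrete, here is a toy Python sketch. It is not taken from the paper: the `analyze` "agent", the `craft` function, and the simplified `RECIPES` list are all hypothetical stand-ins chosen for illustration. The first half fans independent subtasks out to concurrent workers; the second half threads one inventory through steps that must run in order.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an agent: it only labels its input.
def analyze(source: str) -> str:
    return f"summary of {source}"

def run_parallel(sources):
    # Independent subtasks: no worker needs another's output,
    # so all of them can run at the same time.
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        return list(pool.map(analyze, sources))

def craft(inventory: dict, recipe: dict) -> dict:
    # A step can only run if earlier steps already produced its inputs.
    for item, count in recipe["needs"].items():
        if inventory.get(item, 0) < count:
            raise ValueError(f"missing {item}")
    updated = dict(inventory)
    for item, count in recipe["needs"].items():
        updated[item] -= count
    for item, count in recipe["makes"].items():
        updated[item] = updated.get(item, 0) + count
    return updated

def run_sequential(inventory: dict, recipes: list) -> dict:
    # Each step reads the inventory the previous step wrote,
    # so the order is fixed and nothing can be handed off in parallel.
    for recipe in recipes:
        inventory = craft(inventory, recipe)
    return inventory

# Simplified, illustrative crafting chain (an assumption, not the paper's benchmark).
RECIPES = [
    {"needs": {"log": 1}, "makes": {"plank": 4}},
    {"needs": {"log": 1}, "makes": {"plank": 4}},
    {"needs": {"plank": 2}, "makes": {"stick": 4}},
    {"needs": {"plank": 3, "stick": 2}, "makes": {"pickaxe": 1}},
]

if __name__ == "__main__":
    print(run_parallel(["regulatory filings", "market news", "earnings data"]))
    print(run_sequential({"log": 2}, RECIPES))
```

Running the recipes in any other order fails at the first step whose inputs do not exist yet, which is the dependency that makes splitting the work across agents pointless in this kind of task.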
The paper fits an equation that correctly predicts the best-performing architecture for a new task 87% of the time.
For anyone building or thinking about building systems where multiple AI models work together, this replaces a lot of hand-waving with something concrete.
Read with an AI tutor:
chapterpal.com/s/5c02af66/to…
Download the PDF:
arxiv.org/pdf/2512.08296