Everybody wants a multi-agent system. A planner talks to a researcher talks to a coder talks to a reviewer — it feels sophisticated, it photographs well for architecture diagrams, and according to a paper from Google and MIT, it might be making your system 70% worse at the thing you actually need it to do. The paper — "Towards a Science of Scaling Agent Systems" — is the first serious attempt to put numbers on a question the industry has been hand-waving about: when does adding more agents actually help?
What They Tested
The team ran 180 agent configurations across five architectures (single-agent, independent parallel, centralized hub-and-spoke, decentralized mesh, hybrid), four benchmarks, and three model families (GPT, Gemini, Claude). The benchmarks spanned financial analysis, web browsing, Minecraft-style crafting plans, and general tool use. Not a narrow test suite designed to prove a point — a genuine attempt to cover the landscape of tasks people actually throw agents at.
Task Shape Is Everything
Results split along one axis: can the work be divided into independent chunks?
On financial reasoning — a naturally parallelizable workload where you analyze ten companies independently and merge the results — centralized coordination improved performance by 80.9% over a single agent. Substantial. If your problem decomposes into N independent sub-problems, multi-agent coordination is the right call.
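The "analyze N companies independently, then merge" shape can be sketched in a few lines. This is an illustrative skeleton, not the paper's harness: `analyze_company` is a hypothetical stand-in for a model call, and the orchestrator's only job is fan-out and merge, so no worker ever pays a coordination tax.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_company(ticker: str) -> dict:
    # Hypothetical stand-in for one agent's independent model call.
    return {"ticker": ticker, "summary": f"analysis of {ticker}"}

def fan_out_merge(tickers: list[str]) -> list[dict]:
    # Sub-tasks share no state, so workers run without talking to
    # each other; the orchestrator merges results at the end.
    with ThreadPoolExecutor(max_workers=len(tickers)) as pool:
        return list(pool.map(analyze_company, tickers))

results = fan_out_merge(["AAPL", "MSFT", "NVDA"])
```

The key property is that the merge step is the only point of contact between agents, which is exactly the shape where the paper's +80.9% shows up.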
Sequential reasoning told a different story. On PlanCraft, where step 3 depends on step 2 depends on step 1, every multi-agent variant degraded performance by 39-70%. Not marginal. Catastrophic.
The paper calls this the "cognitive budget" problem. Each model invocation has finite reasoning capacity. When agents spend tokens coordinating — parsing messages from peers, maintaining shared state, resolving conflicts — that overhead eats directly into the budget available for actual problem-solving. On planning tasks, the coordination tax left agents unable to think clearly about the plan itself.
This isn't a framework bug you can engineer around. It's structural. Adding more agents to a sequential problem is like adding more cooks to a recipe whose steps must happen in order: you don't get speed, you get confusion.
Three Effects That Govern the Outcome
The paper identifies three dominant dynamics. Understanding them is more useful than any framework comparison chart.
Tool-Coordination Trade-off. Tasks requiring many tools — database queries, API calls, file operations — create overhead when distributed across agents. Each agent needs tool access, context about when to use each tool, and awareness of what other agents have already done. As tool count grows, coordination cost outpaces the benefit of parallelism.
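A toy cost model shows why tool count tips the balance. The functional form and coefficients here are assumptions, not fitted from the paper: parallel gain grows sublinearly with agents, while coordination cost grows with agent pairs multiplied by the tools each pair must stay aware of.

```python
import math

def net_benefit(n_agents: int, n_tools: int,
                coord_cost_per_link: float = 0.05) -> float:
    """Toy model (coefficients assumed): sublinear parallelism gain
    minus a coordination cost that scales with agent pairs x tools."""
    gain = math.log2(n_agents + 1)
    links = n_agents * (n_agents - 1) / 2
    return gain - coord_cost_per_link * links * n_tools

# With 20 tools, going from 2 agents to 6 destroys the benefit.
two = net_benefit(2, n_tools=20)
six = net_benefit(6, n_tools=20)
```

Whatever the real coefficients are, the structure is the same: gain grows like log(N), cost grows like N², and tool count multiplies the cost term, so tool-heavy workflows saturate at very small agent counts.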
Capability Saturation. There's a performance threshold above which adding agents yields diminishing returns. If your model already solves the task 85% of the time solo, a second agent adds noise more often than value. The saturation point varies by task but is predictable from the single-agent baseline score. Before reaching for a second agent, check whether you're already past it.
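The saturation check reduces to a one-line gate. The 0.85 threshold below is an assumption standing in for the task-specific saturation point, which the paper says you can estimate from the single-agent baseline.

```python
def worth_adding_agent(single_agent_score: float,
                       saturation_threshold: float = 0.85) -> bool:
    """Heuristic gate (threshold assumed, not the paper's fitted
    value): past saturation, a second agent adds noise, not value."""
    return single_agent_score < saturation_threshold

assert worth_adding_agent(0.60)        # room to improve
assert not worth_adding_agent(0.90)    # already saturated: stay solo
```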
Topology-Dependent Error Amplification. Independent (uncoordinated) multi-agent systems amplify errors 17.2x. Centralized systems: 4.4x. The mechanism is simple — without a checkpoint between agents, one agent's hallucination becomes another's confident input. Centralized orchestrators act as validation filters. Decentralized mesh architectures fall somewhere in between, but in practice most don't implement robust cross-validation and drift toward the 17.2x end.
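The checkpoint mechanism can be demonstrated with a small Monte Carlo sketch. The error and catch rates are assumptions for illustration; the simulation only shows the direction of the effect the paper measures, not its magnitude.

```python
import random

def chain_error_rate(n_agents: int, p_err: float, p_catch: float,
                     trials: int = 10_000, seed: int = 0) -> float:
    """Toy simulation (rates assumed): each agent may inject an
    error; p_catch is the chance a checkpoint between agents filters
    a corrupted handoff before the next agent builds on it."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        corrupted = False
        for _ in range(n_agents):
            if corrupted and rng.random() < p_catch:
                corrupted = False   # orchestrator repairs the handoff
            if rng.random() < p_err:
                corrupted = True    # this agent hallucinates
        bad += corrupted
    return bad / trials

independent = chain_error_rate(4, p_err=0.05, p_catch=0.0)
centralized = chain_error_rate(4, p_err=0.05, p_catch=0.8)
```

With no checkpoint, one early hallucination survives to the end; with a validating orchestrator, most corrupted handoffs get repaired before they compound, which is the 4.4x-versus-17.2x gap in miniature.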
When to Use What
| Task Type | Recommended | Performance Impact |
|---|---|---|
| Parallelizable (independent sub-tasks) | Centralized multi-agent | +80.9% over single agent |
| Sequential (dependent reasoning chain) | Single agent | Multi-agent degrades 39-70% |
| Tool-heavy workflows | Centralized or hybrid | Orchestrator manages tool routing |
| High-reliability requirements | Centralized | 4.4x vs 17.2x error amplification |
| Open-ended exploration | Decentralized | Diversity outweighs coordination cost |
What This Actually Means
The paper's predictive model identifies optimal architecture with 87% accuracy on unseen tasks. For a lot of production workloads involving sequential reasoning — code generation, multi-step planning, document analysis with dependencies — the optimal number of agents is one.
There's a practical trap here. Teams build multi-agent systems not because the task demands it, but because the architecture feels like a natural decomposition of responsibilities. "The researcher searches, the analyst reasons, the writer outputs." Clean separation of concerns. But if those responsibilities execute sequentially — and most real workflows do — you've turned a single coherent reasoning chain into a game of telephone between models that each have incomplete context.
The data doesn't care about your architecture diagram. On chain-of-thought-heavy tasks, every agent you add makes the system measurably dumber.
A simpler heuristic than the full nine-variable model: if your task requires step-by-step reasoning where each step depends on the last, use one capable agent with good tools. If your task splits into genuinely independent chunks, use centralized coordination with the minimum number of agents that covers the workload. Don't default to multi-agent because it sounds more advanced.
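That heuristic fits in a single function. This is a sketch of the article's rule of thumb, not the paper's nine-variable predictive model, and the saturation threshold is an assumed placeholder.

```python
def choose_architecture(decomposes_independently: bool,
                        single_agent_score: float) -> str:
    """The article's heuristic as code (0.85 threshold assumed):
    sequential chains stay single-agent; independent work gets a
    centralized orchestrator only if the solo baseline leaves room."""
    if not decomposes_independently:
        return "single agent"           # sequential: coordination tax
    if single_agent_score >= 0.85:
        return "single agent"           # capability saturation
    return "centralized multi-agent"

assert choose_architecture(False, 0.40) == "single agent"
assert choose_architecture(True, 0.40) == "centralized multi-agent"
assert choose_architecture(True, 0.92) == "single agent"
```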
The field has been running on architectural intuition and conference-talk aesthetics. Now there's math. Whether teams will actually use it — run the single-agent baseline before reaching for a swarm — remains to be seen.