Most multi-agent failures trace back to orchestration bugs and handoff engineering, not model capability: UC Berkeley's MAST framework found that 79% of production breakdowns are specification and coordination problems.
I keep seeing the same post-mortem pattern: multi-agent system breaks in production, team spends a week evaluating whether to swap Claude for GPT or Gemini, runs benchmarks, maybe fine-tunes something. The system still breaks. Because the model was never the problem.
UC Berkeley's MAST framework analyzed 1,642 execution traces across production multi-agent systems and found something that should be printed on every AI team's wall: 79% of failures trace back to specification and coordination problems. Not hallucination. Not reasoning gaps. Not context window limits. Plain old orchestration bugs.
The kind of bugs we've been writing about in distributed systems textbooks for forty years.
The taxonomy of how your agents actually break
The MAST research splits failures into three buckets, and the distribution is damning:
Specification and system design: 41.8%. This is the biggest category and it's entirely self-inflicted. Task misinterpretation because the system prompt was ambiguous. Duplicate agent roles where two agents think they own the same subtask. Missing termination conditions — the agent literally doesn't know when to stop. A procurement agent at one company deleted half its vendor records because someone wrote "clean up outdated entries" without defining what "outdated" meant. That's not an AI problem. That's a requirements problem wearing an AI costume.
Inter-agent misalignment: 36.9%. This is where multi-agent systems earn their reputation for being brittle. Context evaporates at handoff boundaries. Agent A produces output in one format, Agent B expects another, and nobody catches it until a customer gets a garbled email. The research found that problems requiring more than four handoffs between agents almost always fail. Four. That's your budget.
Free-text handoffs are the worst offender. Teams pass raw conversation history between agents and wonder why downstream agents hallucinate. The conversation history is a lossy, ambiguous, ever-growing blob of text — it's the worst possible interface contract.
Task verification and termination: 21.3%. Agents that quit early. Agents that loop forever. A document processing agent that analyzed half a contract and declared victory. An editing agent caught in an infinite refinement cycle, burning tokens on changes that made the output worse with each pass. These failures are boring and completely preventable, which makes them the most frustrating category.
The real villain: handoff engineering
Here's what I've seen kill the most multi-agent deployments. It's not the model. It's not even the framework. It's the handoff.
When Agent A finishes work and passes state to Agent B, three things can go wrong, and in production all three eventually will:
Discovery failure. Agent B doesn't know Agent C exists, so when a request falls outside B's scope, it either hallucinates an answer or tells the user it can't help. The capability is right there in the system — the routing just doesn't reach it.
Context collapse. The context window fills up with prior conversation, and the agent loses track of decisions made six turns ago. It re-asks questions. It contradicts earlier commitments. The user notices before the system does.
The "you handle it" loop. Two agents keep bouncing a task between each other. Neither commits. Tokens burn. The user waits. I've seen this eat $40 in API calls on a single customer request before a timeout killed it.
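The "you handle it" loop is cheap to detect if the orchestrator logs every transfer. Here is a minimal sketch of one way to flag it: count how often a task bounces between the same pair of agents and trip an alarm past a threshold. The function name, log shape, and the `max_bounces` threshold are all hypothetical, not from any particular framework.

```python
from collections import Counter

def detect_ping_pong(handoff_log, max_bounces=2):
    """Flag a task bouncing between the same pair of agents.

    handoff_log: list of (from_agent, to_agent) tuples recorded by
    the orchestrator. Direction is ignored, so a->b and b->a count
    against the same pair. max_bounces is an illustrative threshold.
    """
    pair_counts = Counter(frozenset(pair) for pair in handoff_log)
    return any(count > max_bounces for count in pair_counts.values())
```

When the detector fires, the right move is usually the same as for a budget overrun: stop delegating and escalate to a human rather than letting the pair keep negotiating.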
PwC found that adding structured validation loops with independent judge agents improved their code generation accuracy from 10% to 70% — a 7x improvement from changing the orchestration, not the model.
What actually fixes this
The fix isn't a better model or a fancier framework. It's treating agent boundaries like API contracts.
Structured handoffs, not conversation forwarding. Stop passing chat history between agents. Define a JSON schema for what crosses the boundary. Agent A's output schema is Agent B's input schema. If they don't match, the system fails at deploy time, not at 3 AM when a customer is waiting.
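As a sketch of what "schema as contract" can look like in practice, here is one way to do it with a plain dataclass. The field names (`topic`, `key_findings`, `sources`, `confidence`) and the `ResearchHandoff` contract are hypothetical; the point is that Agent A must produce exactly this shape and Agent B consumes exactly this shape, so drift fails loudly instead of producing a garbled downstream result.

```python
from dataclasses import dataclass

# Hypothetical contract: the research agent's output IS the
# writer agent's input. One schema, owned by the boundary.
@dataclass(frozen=True)
class ResearchHandoff:
    topic: str
    key_findings: list[str]
    sources: list[str]
    confidence: float  # 0.0 - 1.0

def parse_handoff(payload: dict) -> ResearchHandoff:
    """Parse Agent A's raw output into the shared schema.

    Missing or unexpected keys raise TypeError immediately,
    instead of surfacing three agents later as a hallucination.
    """
    record = ResearchHandoff(**payload)
    if not 0.0 <= record.confidence <= 1.0:
        raise ValueError("confidence out of range")
    return record
```

In a real system you would likely reach for a validation library or JSON Schema, but even this stdlib version enforces the key property: a malformed handoff fails at the boundary, not downstream.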
Four-handoff rule. If your workflow requires more than four agent-to-agent transfers, redesign it. Collapse agents. Remove unnecessary delegation. The research says this is where reliability falls off a cliff, and my experience matches. Every handoff is a potential failure point, and failure probability compounds.
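The compounding argument is worth making concrete. Assuming each handoff independently succeeds with some fixed probability (the 95% figure below is illustrative, not from the research), end-to-end reliability decays geometrically with handoff count:

```python
def pipeline_reliability(per_handoff_success: float, handoffs: int) -> float:
    """End-to-end success probability if each handoff
    independently succeeds with the given probability."""
    return per_handoff_success ** handoffs

# At an optimistic 95% per handoff:
#   4 handoffs -> ~81% end-to-end
#   8 handoffs -> ~66% end-to-end
```

Real handoff failures are not independent, so in practice the cliff tends to be steeper than this naive model suggests.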
Termination contracts. Every agent needs an explicit definition of "done." Not "when the output looks good" — a checkable condition. Did the document get fully processed? Did all required fields get populated? Is the downstream system acknowledging receipt? If you can't write it as a boolean, your agent doesn't know when to stop.
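The "write it as a boolean" test is literal. A sketch for the document-processing case, with a hypothetical state shape and field list:

```python
# Hypothetical required fields for a contract-processing agent.
REQUIRED_FIELDS = {"parties", "effective_date", "term", "signatures"}

def is_done(state: dict) -> bool:
    """Termination contract: a checkable boolean, not a vibe.

    Done means every page was processed AND every required
    field exists AND none of them is empty.
    """
    fields = state.get("fields", {})
    return (
        state.get("pages_processed") == state.get("pages_total")
        and REQUIRED_FIELDS <= set(fields)
        and all(v is not None for v in fields.values())
    )
```

The orchestrator calls `is_done` after every agent turn; the agent never gets to declare victory on its own.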
Kill switches with token budgets. Set a hard ceiling on how many tokens any single agent invocation can consume. When the budget runs out, the system escalates to a human or fails gracefully. This prevents the infinite-loop-of-politeness from turning a $0.02 request into a $40 incident.
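A budget enforcer can be a thin wrapper around whatever calls the model. The interface below is hypothetical: it assumes the underlying call returns both the reply and a token count, and the 50,000-token ceiling is an arbitrary example.

```python
class TokenBudgetExceeded(Exception):
    pass

class BudgetedAgent:
    """Wrap an agent call with a hard token ceiling.

    call_fn is assumed to return (reply_text, tokens_used).
    The call that crosses the ceiling is allowed to finish;
    every call after that raises and forces escalation.
    """
    def __init__(self, call_fn, max_tokens=50_000):
        self.call_fn = call_fn
        self.max_tokens = max_tokens
        self.spent = 0

    def invoke(self, prompt):
        if self.spent >= self.max_tokens:
            raise TokenBudgetExceeded(
                f"budget of {self.max_tokens} tokens exhausted"
            )
        reply, tokens_used = self.call_fn(prompt)
        self.spent += tokens_used
        return reply
```

The orchestrator catches `TokenBudgetExceeded` and routes to a human instead of retrying, which is exactly the graceful-failure path the ping-pong loop needs.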
The uncomfortable implication
If 79% of your agent failures are orchestration problems, then 79% of your reliability investment should go into orchestration engineering. Not model selection. Not prompt optimization. Not RAG pipeline tuning.
Yet most teams I talk to spend 80% of their time on the model layer and 20% on everything else. They're optimizing the part that's already working and ignoring the part that's actually broken.
The multi-agent future everyone's building toward — with MCP standardizing tool access, A2A enabling peer-to-peer agent collaboration, frameworks like LangGraph and CrewAI making orchestration accessible — all of it is bottlenecked on the same mundane engineering discipline we've needed since the first microservice called the second one: clear contracts, explicit failure handling, and knowing when to stop.
Your model is probably fine. Fix the plumbing.