Over the past three months, OpenAI retired Swarm and shipped the Agents SDK with first-class handoffs. Microsoft added handoff orchestration to its Agent Framework. AutoGen rebuilt its multi-agent coordination around delegate tools. Three independent engineering teams arrived at the same answer: agents should transfer conversations to each other like a phone call getting routed to the right department — not like a manager assigning tasks to subordinates.
Why everyone landed here
The hierarchical pattern — one orchestrator dispatching work to sub-agents — was the first thing every team tried. It makes intuitive sense: a smart coordinator breaks problems down, assigns them, collects results. The trouble is the coordinator becomes a bottleneck. It needs to understand every sub-agent's capabilities well enough to route correctly, hold context for all parallel threads, and handle partial failures gracefully. In practice, that central orchestrator turns into the most complex, fragile component in the system. It's the god object of multi-agent design.
Handoffs sidestep this by eliminating the coordinator entirely.
What the transfer looks like in code
The core mechanism is almost disappointingly simple. An agent has a special function that, instead of returning data, returns another agent:
triage_agent = Agent(
name="Triage",
instructions="Route the user to the right department.",
functions=[transfer_to_billing, transfer_to_support],
)
def transfer_to_billing():
return billing_agent
def transfer_to_support():
return support_agent
When the triage agent calls transfer_to_billing, the execution loop swaps out the active agent. The billing specialist picks up the full conversation history and continues as if it had been there from the start — same experience as when a support rep transfers your call with a quick "let me connect you to billing."
No central coordinator holding state. No message bus wiring agents together. Just one agent recognizing "this isn't my job" and pointing to someone who can handle it. AutoGen implements the same idea through delegate tools that return topic type strings, routing messages through a pub-sub layer with session IDs threaded through for isolation.
The triage hub in production
The most battle-tested deployment of this architecture is what I think of as the triage hub. A lightweight classifier agent sits at the front, reads intent, and routes to specialists.
This works better than a monolithic super-agent for a reason that's easy to overlook: tool selection degrades with tool count. Give one agent 30 tools and a sprawling system prompt, and it starts picking the wrong tool 10-15% of the time. Split that into a router with four transfer functions and four specialist agents carrying 7-8 tools each, and accuracy jumps noticeably. The model isn't getting smarter — it just has fewer wrong choices to make.
Microsoft's Agent Framework makes the routing explicit with typed topic subscriptions. Each specialist registers for specific topic types, and transfers publish messages to matching topics. The pub-sub detail matters for production: you can scale specialists independently, run them across machines, and add new ones without touching the router. The triage agent just needs a new transfer function and a one-line description of when to invoke it.
One pattern I keep seeing in customer-facing systems: every specialist gets a transfer_back_to_triage() function. If a billing agent receives a technical question, it kicks the conversation back to the router rather than hallucinating outside its domain. This creates a self-correcting loop — the system admits uncertainty instead of confabulating, which is a much better failure mode than confidently wrong answers.
Where this architecture breaks down
The obvious failure is circular delegation. Agent A transfers to Agent B, which doesn't understand the request and transfers back to A. Without a cycle detector, you burn tokens in an infinite loop. AutoGen's current implementation has no built-in safeguard — you need to bolt on a hop counter or a visited-agents set yourself.
Context bloat is subtler. Each transfer carries the full conversation history forward. After five handoffs in a complex session, you're passing thousands of tokens of irrelevant context to agents that only need the last few messages. Some teams solve this with handoff summaries — the transferring agent writes a brief recap instead of forwarding the entire transcript — but none of the three major frameworks enforce this yet. It's left as an exercise for the deployer.
Third: observability gets harder. When a user complaint leads you to a conversation that bounced across five specialists, debugging means stitching together five separate agent execution logs. OpenAI's Agents SDK added trace IDs for exactly this reason, but reconstructing the full picture is still more painful than reading a single orchestrator's decision log. Distributed tracing tools help, but the ecosystem hasn't caught up to multi-agent-specific debugging yet.
When to keep the hierarchy
Handoffs aren't universally superior. If your workflow needs to combine partial results — "query three data sources in parallel, then synthesize" — a hierarchical coordinator is the right tool. The transfer pattern works for conversations that have one active thread at a time, where exactly one agent is in charge. It falls apart when you need parallel execution with aggregation.
The heuristic I keep reaching for: if the user talks to one agent at a time, use handoffs. If the system needs multiple agents working simultaneously on overlapping inputs, use a coordinator.
The boring conclusion
When three teams with different codebases, different design philosophies, and different customer bases all converge on the same abstraction, the abstraction is probably load-bearing. The surviving multi-agent production pattern looks a lot like a phone menu with department transfers. It's not the autonomous swarm intelligence anyone pitched in 2024. It's a call center — and it works.