Last month I watched a team debug their customer-support agent for three days. Hallucinations, wrong tool calls, invented parameters. They tried prompt engineering, few-shot examples, temperature tuning. The actual problem? They'd registered 127 MCP tools in a single static toolset. The model wasn't broken — it was drowning.
The 40-Tool Cliff
Every model has a practical ceiling for simultaneous tool definitions, and it's lower than the spec sheets suggest. Speakeasy's benchmarking found that static toolsets with 200+ tools exceeded Claude's 200k context window entirely. But the real damage starts much earlier. Once roughly 40 similarly named tools are in play, selection accuracy degrades noticeably — the model confuses create_deal with create_contact, picks update_account when it means update_opportunity.
This isn't a context-window problem. It's an attention problem. Tool schemas are verbose — each adds 500–1,000 tokens of parameter descriptions, types, and constraints. A 400-tool server burns 405,000 tokens before the user says a word. Even when the window technically fits everything, the model spreads its attention across hundreds of similar JSON schemas and the signal-to-noise ratio collapses.
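The arithmetic is worth making explicit. A minimal sketch, using the upper end of the 500–1,000 token range as an assumed per-schema cost (real servers add a little framing overhead per tool, which is why measured totals land slightly higher):

```python
# Back-of-envelope token budget for static tool registration.
# TOKENS_PER_SCHEMA and CONTEXT_WINDOW are illustrative assumptions,
# not measurements from any specific server.
TOKENS_PER_SCHEMA = 1_000   # upper end of the 500-1,000 range
CONTEXT_WINDOW = 200_000    # e.g. a 200k-token model window

def static_overhead(num_tools: int, per_schema: int = TOKENS_PER_SCHEMA) -> int:
    """Tokens consumed by tool definitions before the user says a word."""
    return num_tools * per_schema

for n in (40, 200, 400):
    used = static_overhead(n)
    print(f"{n:>3} tools -> {used:>7,} tokens "
          f"({used / CONTEXT_WINDOW:.0%} of a 200k window)")
```

At 200 tools the definitions alone fill the window; at 400 they overflow it twice over before any conversation happens.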
Three Ways Out
The industry has converged on three patterns for dynamic tool selection, each with different tradeoffs.
| Approach | Mechanism | Initial Tokens | Sweet Spot |
|---|---|---|---|
| Progressive Discovery | Hierarchical meta-tools: list_tools → describe_tools → execute_tool | ~2,500 | Structured APIs with clear namespaces |
| Semantic Search | Embedding-based find_tools matches natural language intent to descriptions | ~1,300 | Large, flat toolsets with diverse capabilities |
| Embedding-Anchored Selection | Model produces a selection rationale + predicted embedding; tools ranked via softmax distance | Varies | Evolving toolsets where new tools appear at inference |
Progressive discovery gives the model a file-system-like interface. It calls list_tools with a prefix like /hubspot/deals/*, gets a summary, then drills into specific schemas only when needed. The token budget stays between 1,600 and 2,500 regardless of whether the underlying server exposes 40 or 400 tools — roughly 100x fewer tokens than static registration. For complex multi-step tasks across 100 tools, Speakeasy measured costs around $0.10 per task versus $0.37 for the static approach.
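In code, the pattern looks roughly like this. The meta-tool names come from the table above; the in-memory registry and its entries are illustrative stand-ins for a real MCP server's catalog:

```python
# Minimal sketch of progressive discovery: meta-tools in front of a
# large registry. The model sees one-line summaries first and pays the
# schema cost only for the tools it drills into.
from fnmatch import fnmatch

# Hypothetical registry -- a real server would back this with its catalog.
REGISTRY = {
    "/hubspot/deals/create_deal": {
        "summary": "Create a deal in the pipeline",
        "schema": {"name": "str", "amount": "float"},
    },
    "/hubspot/contacts/create_contact": {
        "summary": "Create a contact record",
        "schema": {"email": "str"},
    },
}

def list_tools(prefix: str) -> list[dict]:
    """Return one-line summaries only -- no schemas yet."""
    return [
        {"path": path, "summary": meta["summary"]}
        for path, meta in REGISTRY.items()
        if fnmatch(path, prefix)
    ]

def describe_tools(paths: list[str]) -> dict:
    """Drill into full schemas for the few tools the model picked."""
    return {p: REGISTRY[p]["schema"] for p in paths}

def execute_tool(path: str, args: dict):
    """Dispatch to the real implementation (stubbed here)."""
    raise NotImplementedError

print(list_tools("/hubspot/deals/*"))
```

The model pays ~20 tokens per summary line instead of ~1,000 per full schema, which is where the flat token budget comes from.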
Semantic search skips the hierarchy entirely. The agent describes what it needs in plain language, and a retrieval layer surfaces the best matches by embedding similarity. A recent paper on vector-based MCP tool selection reported a 97.1% hit rate at K=3 with a mean reciprocal rank of 0.91 — the right tool almost always lands in the top three results. Spring AI's implementation of this pattern measured 34–64% token savings even on moderately sized toolsets.
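The retrieval layer reduces to ranking tool descriptions by similarity to the query. A toy sketch: real systems use a learned embedding model, but a bag-of-words vector stands in here so the example is self-contained, and the tool names and descriptions are invented:

```python
# Toy find_tools: rank tool descriptions against a natural-language
# query by cosine similarity and return the top K.
import math
from collections import Counter

TOOLS = {
    "create_deal": "create a new sales deal in the CRM pipeline",
    "create_contact": "add a new person contact record",
    "update_opportunity": "modify fields on an existing sales opportunity",
}

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: word-count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def find_tools(query: str, k: int = 3) -> list[str]:
    q = embed(query)
    ranked = sorted(TOOLS, key=lambda t: cosine(q, embed(TOOLS[t])), reverse=True)
    return ranked[:k]

print(find_tools("start a new deal for this sales lead", k=2))
```

Only the top-K schemas get injected into context, so the token cost scales with K, not with catalog size.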
Embedding-anchored selection is the most radical departure. AutoTool, a framework that gained serious traction after its late-2025 paper, teaches the model to produce an embedding representation of the tool it wants rather than generating a tool name. This lets the agent generalize to completely unseen tools — the researchers tested against 886 tools the model never encountered during training and performance held. Their dual-phase optimization (trajectory stabilization followed by Plackett–Luce ranking refinement) yielded 4.5–7.7% accuracy gains across math, code generation, and search-based QA.
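The ranking step can be sketched as a softmax over distances in embedding space. Everything below is a simplified illustration, not AutoTool's actual implementation: the 2-d vectors, tool names, and temperature are toy assumptions standing in for learned representations:

```python
# Sketch of embedding-anchored selection: the model emits a point in
# embedding space rather than a tool name, and candidates are ranked
# by softmax over negative distance to that point.
import math

# Hypothetical tool embeddings; new tools can be appended at inference
# time and rank immediately, with no retraining of the model.
TOOL_EMBEDDINGS = {
    "calculator": [0.9, 0.1],
    "web_search": [0.1, 0.9],
    "code_runner": [0.7, 0.6],
}

def rank_tools(predicted: list[float], temperature: float = 0.5):
    """Return (tool, probability) pairs, highest probability first."""
    logits = {
        name: -math.dist(predicted, vec) / temperature
        for name, vec in TOOL_EMBEDDINGS.items()
    }
    z = sum(math.exp(l) for l in logits.values())
    probs = {name: math.exp(l) / z for name, l in logits.items()}
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

print(rank_tools([0.85, 0.2]))
```

Because selection happens in a continuous space, a tool added five minutes ago competes on equal footing with tools the model trained against, which is the property the unseen-tool results depend on.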
Knowing When Not to Call
The flip side matters just as much. NVIDIA released the When2Call benchmark specifically to evaluate three decision types: when to invoke a tool, when to ask a follow-up question, and when to tell the user the available tools can't help. Most production agents handle the first case adequately but fail on the other two — either hallucinating a tool call that doesn't make sense, or silently guessing when they should be asking for clarification.
Alignment research frames this as a knowledge-boundary problem. The model carries an implicit confidence distribution over its own knowledge. High confidence means calling a tool wastes latency and money. Low confidence means skipping the tool produces hallucinations. The dangerous zone is the uncertain middle, and that's exactly where the majority of tool-calling errors cluster. Speculative tool calling — firing the tool request early, in parallel with response generation, and discarding the results if the generated response ends up contradicting them — is one way voice agents are already attacking this latency gap.
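The decision logic reduces to a small router over the three When2Call outcomes plus the high-confidence direct answer. A minimal sketch, assuming the agent exposes a calibrated confidence score; the threshold values here are illustrative and would need per-model calibration:

```python
# Confidence-gated routing across the tool-call decision space.
# Thresholds (0.85 / 0.4) are illustrative assumptions, not measured values.
def route(confidence: float, tool_available: bool,
          high: float = 0.85, low: float = 0.4) -> str:
    if confidence >= high:
        return "answer_directly"        # tool call wastes latency and money
    if not tool_available:
        return "say_tools_cannot_help"  # never hallucinate a call
    if confidence <= low:
        return "call_tool"              # too little knowledge to guess
    return "ask_clarifying_question"    # the dangerous uncertain middle

print(route(confidence=0.6, tool_available=True))
```

The point of writing it out is the middle branch: most production agents effectively collapse it into one of the other three, which is exactly the failure mode the benchmark measures.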
Picking Your Architecture
If you're running under 30 tools, stop reading. Static registration works fine and you'd be over-engineering.
30–200 tools with clear namespaces — progressive discovery. If your tools organize naturally by service (/github/*, /slack/*, /salesforce/*), the hierarchical model maps directly. You get predictable token costs and the LLM only ever sees the 3–5 schemas it actually needs per turn.
200+ tools or flat namespaces — semantic search. Let embeddings handle the routing. The 97% hit-rate numbers are strong enough for production, and you avoid forcing organizational structure onto a tool catalog that may not have one.
Rapidly evolving toolsets — keep an eye on embedding-anchored selection. It's still closer to research than production, but the ability to handle unseen tools without retraining or re-indexing is precisely what the growing MCP ecosystem demands. With 10,000 live MCP servers and counting, "a couple hundred tools" isn't an edge case anymore.
The models didn't get worse at tool calling. The menus got unreasonable, and we kept blaming the waiter.