Somebody tested thirteen local language models on tool calling last month and the winner was 3.4 gigabytes. Not the runner-up. The winner. Qwen3.5 4B scored 97.5% on forty deterministic test cases, beating models five times its size. Three models specifically marketed as tool-calling specialists — Salesforce's xLAM-2, MadeAgents' Hammer, and Mistral Small 3.2 — scored 15%, 20%, and 42.5% respectively.

That result alone should make you reconsider how you pick models for your agent stack.

The Benchmark That Broke Assumptions

JD Hodges ran the eval through LM Studio using Q4_K_M quantization across five categories of tool-calling tasks. The results challenged what most teams assume about model selection for agentic workloads:

Model               Size      Pass Rate
Qwen3.5 4B          3.4 GB    97.5%
GLM-4.7-Flash       18 GB     95.0%
Nemotron Nano 4B    4.2 GB    95.0%
Mistral Nemo 12B    7.5 GB    92.5%
Qwen3 8B            ~5 GB     85.0%
GPT-OSS 20B         ~12 GB    85.0%
Mistral Small 3.2   ~15 GB    42.5%
Hammer 2.1 7B       ~4.5 GB   20.0%
xLAM-2 8B           ~5 GB     15.0%

The bottom three are worth staring at. These aren't generic chat models that happen to stumble on function calling. xLAM-2 and Hammer were purpose-built for tool use. Mistral Small 3.2 lists tool calling as a headline feature. All three failed.

The caveat matters: Hodges attributes some failures to chat template compatibility issues with LM Studio rather than inherent model limitations. But that's actually the point. In production, your model doesn't run in a vacuum. It runs through inference servers, quantization layers, and template parsers that each introduce their own failure modes. A model that only works under ideal conditions isn't a production model.
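To see how a template mismatch turns a valid call into a scored failure, here is a minimal sketch of the parsing problem. It assumes two common wire formats (Hermes-style `<tool_call>` tags and bare JSON) purely for illustration; real stacks have more variants, and the function name is hypothetical.

```python
import json
import re

def parse_tool_call(raw: str):
    """Extract a tool call from raw model output.

    Handles two wire formats (illustrative, not exhaustive):
    Hermes-style <tool_call>{...}</tool_call> tags and bare JSON.
    A template mismatch means the model emits one format while the
    server's parser expects another -- the call itself is fine, but
    the eval records a failure.
    """
    tagged = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", raw, re.DOTALL)
    candidate = tagged.group(1) if tagged else raw.strip()
    try:
        call = json.loads(candidate)
    except json.JSONDecodeError:
        return None  # unparseable output: counted as a failed call
    # A well-formed call must at least name the tool it targets.
    if isinstance(call, dict) and "name" in call:
        return call
    return None
```

A model that emits the tagged format through a server expecting bare JSON (or vice versa) fails every test case without ever making a wrong decision, which is one plausible mechanism behind the specialist models' scores.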

Parallel Greed vs. Sequential Patience

The eval surfaced something more interesting than raw pass rates. Top parallel-calling performers — models that eagerly fired off multiple tool calls at once — struggled when tasks required sequential reasoning. Qwen3.5 and GLM-4.7 would try to parallelize steps that needed to happen in order, calling tools before results from previous calls had returned.

Meanwhile, Nemotron Nano waited. One call, process the result, decide the next step. Lower parallelism throughput, higher correctness on dependent chains.

This is the tradeoff that benchmarks love to hide. If your eval only measures "did the model call the right functions with the right arguments," both approaches look fine on isolated tasks. The divergence shows up when you chain five tools together and step three depends on step two's output. For agent builders, the real question isn't whether a model can call tools — it's whether it can decide when to wait.
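The sequential pattern Nemotron Nano follows can be sketched as a loop that feeds each result into the next call. The tool names and the plan structure here are hypothetical, for illustration only; in a real agent the model, not a fixed plan, decides the next call.

```python
def run_sequential(plan, tools):
    """Execute a dependent tool chain one call at a time.

    `plan` is a list of (tool_name, make_args) pairs, where make_args
    builds the arguments from the previous result -- exactly the
    dependency that eager parallel dispatch gets wrong. Toy sketch
    with hypothetical tool names.
    """
    result = None
    for tool_name, make_args in plan:
        args = make_args(result)        # step N needs step N-1's output
        result = tools[tool_name](**args)
    return result

# Hypothetical two-step chain: look up a user's city, then its weather.
tools = {
    "get_city": lambda user: {"alice": "Austin"}[user],
    "get_weather": lambda city: {"Austin": "38C"}[city],
}
plan = [
    ("get_city", lambda _prev: {"user": "alice"}),
    ("get_weather", lambda prev: {"city": prev}),  # depends on step 1
]
```

A parallel dispatcher would fire `get_weather` before `get_city` returned and have no city to pass — the single-call benchmark never exercises that failure.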

Why Training Methodology Eats Parameters for Lunch

The finding that architecture and training matter more than raw parameter count isn't limited to local models. It maps onto what's happening at the frontier.

DeepSeek V3.2 is the clearest case. It's the first major model to integrate reasoning directly into tool use — not as an afterthought bolted onto a chat model, but as a core design decision. The team generated over 1,800 distinct environments and 85,000 complex prompts specifically for training agentic capabilities, investing more than 10% of pre-training cost into agentic post-training. That investment in training data shaped around tool-use scenarios is what produces a model performing in the same tier as GPT-5 on agent tasks at a fraction of the cost. The parameter count didn't do the work. The 85,000 scenarios where the model learned what "call a tool and wait" actually means — that did.

At the small end, Mistral's Ministral-3-3B packs function calling and structured output into 3.4 billion parameters by designing the architecture around those tasks from day one. The architecture was shaped by the requirement to reliably produce structured output, not retrofitted to support it.

The pattern repeats across the 2026 landscape. Seven of the ten leading open models now use Mixture-of-Experts architectures, routing to specialized parameter subsets rather than throwing the full model at every token. For tool calling, this means the parameters responsible for schema adherence and argument extraction can be disproportionately skilled without inflating total model size. A well-trained 4B MoE can dedicate a larger fraction of its active parameters to structured output than a dense 20B model where those capabilities compete for capacity with poetry generation and trivia recall.
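The routing idea can be shown with a toy gate. This is a stdlib-only sketch of top-k softmax routing, not any particular model's router; real MoE gates run per token inside the network with learned logits.

```python
import math

def top_k_route(logits, k=2):
    """Toy MoE gate: softmax over expert logits, keep the top-k.

    Returns (expert_index, renormalized_weight) pairs. The remaining
    experts' parameters are never touched for this token, which is how
    a small active-parameter budget can still hold specialized
    tool-calling capacity.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]       # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]
```

With four experts and logits favoring experts 0 and 2, only those two subsets of parameters would activate for the token.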

The Frontier Ceiling

At the top of the leaderboard, tool calling is essentially solved. Gemini 3.1 Pro, Claude Opus 4.6, and LongCat-Flash-Thinking all cluster at 99.3% on the Berkeley Function Calling Leaderboard. But drop to local inference and the spread runs from 97.5% down to 15% — on the same hardware, through the same server. The capability is fragile in ways that clean API benchmarks never reveal.

What This Means for Your Agent Stack

Three takeaways from the data.

Evaluate at your actual deployment conditions. Run benchmarks at your quantization level, through your inference server, with your tool schemas. A model scoring 95% on the vendor's native API might drop to 60% through Ollama with Q4 quantization. The eval that matters is the one running on your hardware, not the one on the vendor's blog.
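A deterministic scorer for such an eval is small. This sketch assumes tool calls have already been parsed into dicts from your stack's output; the function names and strict exact-match policy are illustrative choices, not the benchmark's actual code.

```python
def score_case(expected, actual):
    """Deterministic pass/fail for one tool-calling test case.

    `expected` and `actual` are lists of {"name": ..., "arguments": {...}}
    dicts. Strict mode: exact match on names, arguments, and call
    count. An order-insensitive comparison would be a separate,
    looser mode.
    """
    if len(expected) != len(actual):
        return False
    return all(
        e["name"] == a["name"] and e["arguments"] == a["arguments"]
        for e, a in zip(expected, actual)
    )

def pass_rate(cases):
    """cases: list of (expected_calls, actual_calls) pairs -> percent."""
    passed = sum(score_case(e, a) for e, a in cases)
    return 100.0 * passed / len(cases)
```

Feed it transcripts captured from your own inference server at your own quantization level, and the number you get is the one that predicts production behavior.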

Test sequential reasoning separately from parallel dispatch. If your agent chains tools — and most production agents do — you need evals that specifically test dependent calls. A model that parallelizes everything will ace single-call benchmarks and quietly corrupt multi-step pipelines where order matters.
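One way to test this is to record the order of calls the model emits and check it against a dependency map. A minimal sketch, with hypothetical tool names:

```python
def respects_dependencies(trace, deps):
    """Check a recorded tool-call trace against a dependency map.

    `trace` is the ordered list of tool names the model actually
    called; `deps` maps a tool to the tools whose results it needs.
    A model that eagerly parallelizes emits a dependent call before
    its prerequisite appears in the trace.
    """
    seen = set()
    for call in trace:
        if any(d not in seen for d in deps.get(call, ())):
            return False  # called before a prerequisite returned
        seen.add(call)
    return True

# Hypothetical constraint: "get_weather" needs "get_city"'s result first.
deps = {"get_weather": ["get_city"]}
```

Run it over every multi-step case in your eval and you separate ordering failures from argument failures — two problems that a single pass-rate number conflates.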

General instruction-following beats narrow specialization right now. That the purpose-built tool-calling models underperformed general-purpose ones is probably a compatibility story, not a capability story. But until inference servers mature their template handling, the safest local choice is a model with strong general instruction following and solid structured output support — not a specialist that falls apart when your inference stack deviates from the training setup.

The framework wars get all the conference talks. But the model sitting inside your agent — the thing deciding which tool to call, which argument to pass, and whether to wait for results — that's where the real variance lives. And right now, the variance is staggering.