A 3.4 GB model just posted a 97.5% pass rate on a tool calling benchmark, beating models five times its size. If you're still defaulting to the biggest model you can afford for your agent stack, you might be lighting money on fire for the wrong capability.
The Benchmark That Broke Assumptions
JD Hodges ran 13 local models through 40 deterministic test cases covering tool selection, argument accuracy, multi-tool calling, edge cases, and format compliance. Eight tool schemas — weather, currency conversion, reminders, email, calendar events — the kind of mundane plumbing that real agents spend most of their time doing.
| Model | Size | Pass Rate |
|---|---|---|
| Qwen3.5 4B | 3.4 GB | 97.5% |
| GLM-4.7-Flash | 18 GB | 95.0% |
| Nemotron Nano 4B | 4.2 GB | 95.0% |
| Mistral Nemo 12B | 7.5 GB | 92.5% |
Qwen3.5 4B — the smallest model in the top four — won outright. Not on some subjective "vibes" benchmark. On deterministic tests where the function call is either correct or wrong.
And this isn't a fluke. An AWS research team fine-tuned Facebook's OPT-350M (yes, 350 million parameters) on the ToolBench dataset and hit a 77.55% pass rate. For perspective: ChatGPT with chain-of-thought scored 26% on the same eval. ToolLLaMA at 7B managed 30.18%. A model that fits on a Raspberry Pi outperformed a 175-billion-parameter commercial system at the specific task of selecting and invoking functions.
Why This Makes Sense If You Think About It
Tool calling is fundamentally a structured output problem. The model reads a user request, matches it against available function schemas, extracts the right arguments, and formats a JSON call. That's pattern matching and slot filling — not open-ended reasoning, not creative writing, not nuanced debate.
A frontier model's advantages — deep world knowledge, subtle language understanding, long-range coherence — are mostly irrelevant when the job is producing `{"function": "get_weather", "args": {"city": "Tokyo"}}`. You're paying for capabilities you don't use.
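Because the output is structured, grading it is mechanical. Here's a minimal sketch of what a deterministic check looks like; the schema shape and `validate_call` helper are illustrative, not from the benchmark itself:

```python
import json

# Hypothetical schema for the weather tool from the example above.
WEATHER_SCHEMA = {
    "name": "get_weather",
    "required": {"city": str},
}

def validate_call(raw: str, schema: dict) -> bool:
    """Deterministically grade a model's tool call: right function,
    right argument names, right argument types. Correct or wrong,
    no subjective scoring."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if call.get("function") != schema["name"]:
        return False
    args = call.get("args", {})
    return all(
        name in args and isinstance(args[name], typ)
        for name, typ in schema["required"].items()
    )

print(validate_call('{"function": "get_weather", "args": {"city": "Tokyo"}}', WEATHER_SCHEMA))  # True
print(validate_call('{"function": "send_email", "args": {}}', WEATHER_SCHEMA))  # False
```

Pattern matching and slot filling against a fixed schema is exactly the kind of task a small fine-tuned model can master.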
Fine-tuning sharpens this dynamic. The AWS team trained their 350M model on 187,542 ToolBench examples for a single epoch. One pass through the data. The model internalized tool selection and argument extraction patterns so thoroughly that it achieved 74–80.5% success across all six evaluation categories. Consistent performance, not a lucky spike on one task type. Targeted training on a narrow capability demolishes general-purpose in-context learning.
Where the Wheels Come Off
Don't rip the big model out of your stack just yet.
The same evaluation surfaced a critical weakness: sequential multi-tool dependencies. The best parallel callers sometimes ignored words like "then" or "after that" and fired every tool simultaneously. User says "check my calendar for Thursday, then if I'm free, book a restaurant." Small model calls both tools at once, happily booking a table regardless of whether Thursday is packed.
Single calls and parallel independent calls? Small models crush it. Multi-step chains where tool A's output gates whether tool B fires? Parameter count and architectural depth start mattering again. Production agents rarely make one tool call — they chain five or ten with conditional logic woven between them. That 97.5% headline number deflates fast once you're testing actual workflows instead of isolated invocations.
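The failure is a missing data dependency: tool B's execution is gated on tool A's output. A sketch of the guard an orchestration layer has to enforce, using stub functions standing in for the calendar/restaurant example above (all names hypothetical):

```python
# Stub tools for illustration only.
def check_calendar(day: str) -> bool:
    # Pretend Thursday is fully booked.
    return day != "Thursday"

def book_restaurant(day: str) -> str:
    return f"table booked for {day}"

def handle_request(day: str) -> str:
    # The "then" in the user's request is a sequential dependency:
    # the booking tool must not fire until the calendar check says it may.
    # Firing both in parallel books a table regardless of availability.
    if not check_calendar(day):
        return "calendar is full; not booking"
    return book_restaurant(day)

print(handle_request("Thursday"))  # calendar is full; not booking
print(handle_request("Friday"))   # table booked for Friday
```

A model that treats both calls as independent skips the `if` entirely, which is precisely the failure the evaluation surfaced.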
Decompose Your Model Stack
The practical implication isn't "small models good, big models bad." It's that tool calling is separable from orchestration, and you should treat them as different infrastructure layers.
Dispatch layer — a compact, specialized model handles the hot path. Every user interaction triggers tool selection and argument formatting. Qwen3.5 4B or Nemotron Nano responds in milliseconds on consumer hardware. This is your high-frequency, low-complexity layer. Optimize for latency and cost.
Orchestration layer — a mid-sized model manages workflow logic. What happens when a tool call fails? Should the agent retry, ask for clarification, or try an alternative? How do sequential dependencies between calls get resolved? This fires less often but demands actual reasoning. Mistral Nemo or a 12B-class model earns its keep here.
Planning layer — for genuinely complex tasks requiring decomposition of a user request into a dependency graph of tool calls, a frontier model. This layer activates rarely — maybe once per complex session — and the cost per invocation is tolerable because volume is low.
Three layers. Three different cost profiles. Three different latency budgets. Instead of running a 70B model that sits idle between tool calls, each layer uses exactly the capability it needs.
The Numbers That Matter
Running Qwen3.5 4B locally: effectively free, fits in 4 GB of VRAM. If your agent makes 10,000 tool calls daily through a cloud API at $0.01 each, that's $3,000/month. Move dispatch to a local small model and you're paying for electricity.
The MoE revolution compounds this. Five of six major open-weight families now use mixture-of-experts. Llama 4 Scout has 109B total parameters but only activates 17B per token. Mistral Small 4 packs 119B total parameters with just 6.5B active. Frontier-class orchestration models that used to need a cluster now fit on a single H100. Self-hosting costs dropped 4–8x in under a year.
The orchestration layer — maybe 1,000 calls daily to a bigger model — costs a fraction of what you'd spend running everything through one oversized endpoint.
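The back-of-envelope math, using the figures above plus one assumption I'm adding: the orchestration calls also price at roughly $0.01 each (the article doesn't state a rate):

```python
# Monthly cost comparison under the stated assumptions.
# 10,000 dispatch calls/day; 1,000 orchestration calls/day; 30-day month.
PRICE_PER_CALL = 0.01  # assumed flat rate for both API tiers

api_dispatch   = 10_000 * PRICE_PER_CALL * 30  # everything through the cloud API
local_dispatch = 0.0                           # local small model: electricity only
orchestration  = 1_000 * PRICE_PER_CALL * 30   # bigger model, low volume

print(f"monolithic: ${api_dispatch + orchestration:,.0f}/month")   # monolithic: $3,300/month
print(f"decomposed: ${local_dispatch + orchestration:,.0f}/month")  # decomposed: $300/month
```

Even with a generous per-call rate for the orchestration layer, the bill is dominated by the high-frequency dispatch path, which is exactly the part a local 4B model absorbs.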
You're Probably Benchmarking Wrong
Most agent evaluations measure end-to-end task completion. Model X resolved 73% of SWE-bench issues. Model Y closed 81% of support tickets. Useful, but these numbers hide where failure actually happens.
Is the model picking the wrong tool? Mangling arguments? Botching multi-step logic? Choking on error recovery? Each failure mode has a different fix. "Upgrade to a bigger model" only addresses some of them — and for the most common one, getting the tool call itself wrong, the fix might actually be a smaller, dedicated model that does nothing else.
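Instead of one pass/fail per task, grade each stage separately so the failure distribution tells you what to fix. A minimal sketch; the `grade` helper and category names are my own, not from any specific harness:

```python
from collections import Counter

def grade(expected: dict, actual: dict) -> str:
    """Classify a tool call by which stage broke, not just pass/fail."""
    if actual.get("function") != expected["function"]:
        return "wrong_tool"
    if actual.get("args") != expected["args"]:
        return "bad_arguments"
    return "pass"

# Two toy cases: one correct call, one wrong-tool failure.
cases = [
    ({"function": "get_weather", "args": {"city": "Tokyo"}},
     {"function": "get_weather", "args": {"city": "Tokyo"}}),
    ({"function": "get_weather", "args": {"city": "Tokyo"}},
     {"function": "send_email", "args": {}}),
]
results = Counter(grade(e, a) for e, a in cases)
print(dict(results))  # {'pass': 1, 'wrong_tool': 1}
```

A histogram of `wrong_tool` vs. `bad_arguments` vs. chain-level failures points at the right fix, where a single end-to-end score just says "try a bigger model."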
Stop throwing parameters at a pattern-matching problem.