Everyone obsesses over hallucinations. Meanwhile, the most common failure mode in production agent systems is far more mundane: the LLM returned a JSON object with an integer where the downstream tool expected a string, and the whole pipeline silently ate garbage for three hours before anyone noticed.

The Rot Nobody Tracks

Michael Lanham coined a useful term for this: tool argument rot. It's what happens when an LLM generates structurally invalid outputs for tool calls — missing required fields, wrong types, extra properties the consumer doesn't expect. The model understood the task fine. It just formatted the answer wrong.

Without schema enforcement, older GPT-4 models achieved less than 40% compliance with output schemas. More than half of all tool calls were structurally broken before the model even attempted to reason about the problem.

The typical failure looks deceptively small:

{"customer_id": 12345, "include_history": "yes"}

Two errors hiding in one object. customer_id should be a string, include_history should be a boolean. The tool might accept this silently, coerce the types internally, and return something that looks plausible. Or it throws an error three function calls downstream when a string comparison fails on an integer. Either way, you're debugging type mismatches at 2 AM while the on-call Slack channel fills up with increasingly creative profanity.

Schema Gates — Embarrassingly Simple, Unreasonably Effective

The fix doesn't require a new framework or a PhD. Validate every LLM output against a JSON Schema before it touches anything downstream.

import json

from jsonschema import validate, ValidationError

tool_schema = {
    "type": "object",
    "properties": {
        "customer_id": {"type": "string", "pattern": "^CUS-\\d{6}$"},
        "include_history": {"type": "boolean"}
    },
    "required": ["customer_id", "include_history"],
    "additionalProperties": False
}

try:
    data = json.loads(agent_output)
    validate(instance=data, schema=tool_schema)
except ValidationError as e:
    return error_to_agent(e)  # let the model self-correct

That additionalProperties: false is load-bearing. Without it, the model can stuff extra fields into the object — fields that downstream consumers might inadvertently depend on, creating invisible coupling between agents that were never designed to know about each other. You end up with a distributed system held together by accidental JSON properties, and nobody realizes it until someone changes a prompt and three unrelated pipelines break.
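To make that failure mode concrete, here's a minimal sketch reusing the schema above. The `internal_notes` field and the `is_schema_valid` helper are invented for illustration:

```python
from jsonschema import validate, ValidationError

tool_schema = {
    "type": "object",
    "properties": {
        "customer_id": {"type": "string", "pattern": "^CUS-\\d{6}$"},
        "include_history": {"type": "boolean"},
    },
    "required": ["customer_id", "include_history"],
    "additionalProperties": False,
}

def is_schema_valid(obj):
    """Return True if obj passes the schema gate, False otherwise."""
    try:
        validate(instance=obj, schema=tool_schema)
        return True
    except ValidationError:
        return False

clean = {"customer_id": "CUS-123456", "include_history": True}
# Same payload with a stowaway field a downstream agent might start depending on.
smuggled = {**clean, "internal_notes": "nobody was supposed to see this"}
```

With `additionalProperties: False`, the smuggled variant fails loudly instead of quietly becoming part of someone's interface.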

With native structured output enforcement — now available across GPT-4o, Claude, and Gemini — compliance jumps from under 40% to effectively 100%. The model's token generation gets constrained at inference time so only schema-valid tokens can be produced. Syntax errors become impossible by construction, not by hope.
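For reference, the request shape looks roughly like this for OpenAI's json_schema response format, where `"strict": True` is what switches on constrained decoding. Field names are a sketch to check against your provider's current docs, not a definitive API reference:

```python
# Sketch of a structured-output request body in OpenAI's json_schema mode.
# The schema itself is the same contract we validate against elsewhere.
request_body = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Look up customer CUS-123456"}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "tool_call",
            "strict": True,  # constrains token generation to schema-valid output
            "schema": {
                "type": "object",
                "properties": {
                    "customer_id": {"type": "string"},
                    "include_history": {"type": "boolean"},
                },
                "required": ["customer_id", "include_history"],
                "additionalProperties": False,
            },
        },
    },
}
```

Note that you should still keep the validation gate even with strict mode on: it costs almost nothing, and it protects you when someone swaps providers or a model falls back to unconstrained generation.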

The numbers from production tell the story. A financial services team reported dropping their error rate from roughly 5% to under 0.3% after adding schema gates. Another team saw multi-step workflow accuracy climb from 10% to 70%. Those aren't marginal improvements you celebrate in a sprint retro. Those are the difference between a demo you show investors and a system you'd trust with customer data.

Structural Validity Is Not Semantic Correctness

Here's where teams get overconfident. A schema gate catches {"customer_id": 12345} but happily passes {"customer_id": "CUS-000000", "include_history": true} — structurally perfect, pointing at a customer that doesn't exist.

One team ran a sentiment classifier for weeks before noticing it returned 0.99 confidence on every single input. The schema validated flawlessly every time. The model had simply learned to game the constraint by always committing to high confidence. Structurally perfect. Semantically, a coin flip wearing a lab coat.

You need a second checkpoint:

def validate_semantics(data):
    if not customer_exists(data["customer_id"]):
        raise ValueError(f"Unknown customer: {data['customer_id']}")
    if data.get("amount", 0) > 50 and not data.get("override_approved"):
        raise ValueError("Amount exceeds limit without approval")

Schema gate catches formatting errors. Semantic validator catches lies. Run both.
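Wired together, the two checkpoints might look like this minimal sketch — jsonschema for structure, then a hand-rolled semantic pass. The `KNOWN_CUSTOMERS` set is a stand-in for a real database lookup:

```python
import json

from jsonschema import validate

tool_schema = {
    "type": "object",
    "properties": {
        "customer_id": {"type": "string", "pattern": "^CUS-\\d{6}$"},
        "include_history": {"type": "boolean"},
    },
    "required": ["customer_id", "include_history"],
    "additionalProperties": False,
}

KNOWN_CUSTOMERS = {"CUS-123456"}  # stand-in for a real lookup

def validate_semantics(data):
    """Catch structurally valid payloads that point at nothing real."""
    if data["customer_id"] not in KNOWN_CUSTOMERS:
        raise ValueError(f"Unknown customer: {data['customer_id']}")

def gate(agent_output: str) -> dict:
    """Run both gates; raise on failure so the caller can bounce it to the model."""
    data = json.loads(agent_output)              # gate 0: parseable at all
    validate(instance=data, schema=tool_schema)  # gate 1: structure
    validate_semantics(data)                     # gate 2: meaning
    return data
```

Any exception propagates to the caller, which can feed the error back to the model for self-correction, exactly as with the schema gate alone.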

Schemas as Agent-to-Agent Contracts

This is where the architectural implications get interesting. In multi-agent systems, the schema stops being just a validation mechanism and becomes a data contract — the same concept data engineering teams have used for years to manage interfaces between producers and consumers.

When Agent A hands results to Agent B, the schema defines exactly what B can expect. No parsing heuristics. No "let me try to understand what A meant." A typed, versioned interface that either validates or fails loudly.

The A2A protocol community is actively debating this tradeoff. The romantic vision — agents having rich natural language conversations, negotiating, clarifying ambiguities — sounds compelling in conference talks. In production, it means every receiving agent needs custom parsing logic, error recovery for ambiguous phrasing, and token budgets that balloon because you're sending paragraphs where a 200-byte JSON object would do the job.

Schema versioning becomes non-negotiable at this point. As one practitioner put it bluntly: schema changes are breaking changes. Record both schema_version and prompt_version with every event. Need to add a field? Version the schema. Need to change a type? Version the schema. Treat these interfaces exactly like REST APIs, because functionally that's what they are — just with an LLM on one end instead of a microservice.
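A sketch of that discipline, with invented field names: every event carries both versions, and the consumer rejects anything it wasn't built to parse rather than guessing:

```python
# Hypothetical event envelope; the version fields travel with every payload.
SUPPORTED_SCHEMA_VERSIONS = {"1.0", "1.1"}

def wrap_event(payload, schema_version="1.1", prompt_version="2024-06-01"):
    return {
        "schema_version": schema_version,
        "prompt_version": prompt_version,
        "payload": payload,
    }

def consume(event):
    """Fail loudly on unknown versions instead of parsing heuristically."""
    version = event["schema_version"]
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"Unsupported schema_version: {version}")
    return event["payload"]
```

Recording `prompt_version` alongside `schema_version` matters for debugging: when outputs degrade, you want to know whether the schema changed or the prompt did.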

The Field Ordering Footgun

One subtle gotcha that bites people who otherwise do everything right: the order of fields in your schema definition affects model behavior. If you put the answer field before the reasoning field, the model commits to an answer before thinking through the problem. Autoregressive generation means tokens flow left to right, and earlier tokens constrain later ones.

Always structure schemas so reasoning and explanation fields precede conclusion fields. This isn't a provider-specific quirk. It's a fundamental consequence of how these models generate text, and it applies whether you're using OpenAI, Anthropic, or a local model with grammar-based decoding.
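In practice that means ordering the schema like this sketch — Python dicts preserve insertion order, which is what gets serialized into the schema the model sees:

```python
# Reasoning comes first, so the model generates its working before it is
# forced to commit to a conclusion. The enum values are illustrative.
classification_schema = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},  # generated first
        "answer": {"type": "string", "enum": ["approve", "deny", "escalate"]},
    },
    "required": ["reasoning", "answer"],
    "additionalProperties": False,
}

field_order = list(classification_schema["properties"])
```

Flipping the two properties produces a schema that validates identically but reliably degrades answer quality, which makes this one of the cheapest wins available.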

When Free Text Still Wins

Not everything should be a schema. Agents exploring ambiguous problem spaces, brainstorming approaches, or synthesizing findings from diverse sources lose nuance when forced into rigid JSON structures. A research agent pulling together themes from fifteen different papers would produce worse output if constrained to {"themes": [{"name": "string", "confidence": "number"}]}.

The heuristic is straightforward. If the receiving agent or system needs to take a deterministic action based on the output — route a ticket, trigger a deployment, update a database — use a schema. If the receiving agent needs to reason further about the content, free text is probably the right call.

Most production agent communication falls squarely into the first category. Which is why most of it should be schema-gated, versioned, and validated twice before anything downstream gets to touch it.