Last week a team I advise pushed a support-ticket agent to production. Latency looked fine. Error rate: 12%. The Datadog dashboard showed clean HTTP 200s across every service. Twelve percent of users were getting nonsense answers, and the traces told them absolutely nothing about why.
They'd instrumented it like a microservice. That was the mistake.
## Agent traces have a shape problem
Microservice traces are trees. Request hits gateway, fans out to three services, each calls a database, responses roll back up. The shape is predictable — you can eyeball a waterfall and spot the slow query in seconds.
Agent traces are ragged loops. A single user request might trigger this sequence: LLM call → tool lookup → LLM call → two parallel tool calls → LLM call → final response. Next request with slightly different input? Three LLM calls and one tool invocation. The trace tree is a different shape every time.
This breaks every assumption your existing observability stack makes. Alert on "span count > N"? Useless — span count varies by design. Set latency thresholds per service? The same agent service might take 800ms or 14 seconds depending on how many reasoning loops the model decides it needs. Dashboard averages smooth out exactly the variance you need to see.
The deeper issue: without semantic meaning on each span, you can't distinguish whether a slow trace is slow because the model was thinking hard (fine) or because a tool call hung for 9 seconds (not fine). Standard HTTP instrumentation gives you timings. It doesn't give you why.
## OpenTelemetry's GenAI semantic conventions
OpenTelemetry shipped semantic conventions specifically for generative AI and agent systems earlier this year. They're not just "add some custom attributes" — they define a vocabulary for what agent operations mean.
The core span types:
| Span name | `gen_ai.operation.name` | When to use |
|---|---|---|
| `invoke_agent {name}` | `invoke_agent` | Root span for any agent invocation |
| `gen_ai.chat` | `chat` | Each LLM call within the agent loop |
| `agent.tool_call` | varies | Tool execution: API calls, DB lookups, retrieval |
| `gen_ai.embeddings` | `embeddings` | Vector generation for RAG steps |
Each span carries attributes that actually matter for debugging:
- `gen_ai.usage.input_tokens` / `gen_ai.usage.output_tokens` — cost per step, not just per request. A single `invoke_agent` span might contain five `chat` spans burning wildly different token counts.
- `gen_ai.response.finish_reasons` — did the model stop because it was done, hit a length limit, or timed out? This single attribute would have caught that team's 12% failure rate immediately: the model was hitting `max_tokens` on complex tickets and returning truncated reasoning.
- `gen_ai.conversation.id` — ties multi-turn interactions together across separate invocations, so you can trace a user's entire session instead of staring at isolated requests.
The span hierarchy captures the agent's decision loop: a root `invoke_agent` span contains child `gen_ai.chat` spans for each reasoning step, each of which may contain `agent.tool_call` children. You get a waterfall that shows the model's decision-making process, not just the HTTP calls underneath it.
Span kind matters too. Use `CLIENT` for remote agent services like OpenAI Assistants or Bedrock Agents — anything crossing a network boundary. Use `INTERNAL` for in-process agents running via LangChain or CrewAI. Getting this wrong doesn't break anything, but it confuses your trace UI into misrepresenting where time is actually spent.
## Context propagation is where it falls apart
Auto-instrumentation packages exist for the major providers — OpenAI, Anthropic, LangChain, LlamaIndex. Drop them in, get spans without code changes. That's the happy path.
The unhappy path is anything involving MCP servers or custom tool infrastructure. Span context doesn't automatically propagate from your agent framework to remote tool servers. Llama Stack 0.2.x and 0.3.x? Context dies at the MCP boundary. You end up with two disconnected trace trees instead of one.
The fix is manual injection — annoying but straightforward:
```python
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

# Agent side: inject trace context into the outgoing tool call
tool_headers = {}
inject(tool_headers)  # writes the W3C traceparent header
response = call_mcp_tool(params, headers=tool_headers)

# Tool server side: extract the parent context from incoming headers
parent_ctx = extract(request.headers)
with tracer.start_as_current_span("tool.execute", context=parent_ctx) as span:
    span.set_attribute("gen_ai.operation.name", "execute_tool")
    result = do_the_thing()
```
A few lines of plumbing. But without them, you're debugging with half a trace — which is worse than no trace at all, because it looks complete until you realize the 9-second gap is a missing subtree, not a slow operation.
If you're running FastMCP or similar frameworks that lack native OTel support, a decorator pattern keeps this from polluting your business logic. Wrap tool functions once, propagate context automatically, move on.
## The bill nobody budgeted for
A typical RAG agent pipeline generates 10–50x more telemetry per request than an equivalent REST endpoint. Every reasoning loop creates spans. Every tool call creates spans. Token counts, finish reasons, prompt content — it compounds fast.
Teams adding agent observability to existing Datadog or Grafana setups report 40–200% increases in their monthly bill. That number isn't hypothetical — OneUptime published a detailed breakdown in April showing how AI workloads are blowing up observability budgets across the industry.
The practical answer is tail-based sampling. Capture 100% of error traces and long-running agent invocations. Sample 5–10% of successful ones. You lose some baseline visibility but keep the traces that actually matter for debugging. One more trick: store prompt and completion text as span events, not attributes. This lets you filter at the collector level before content hits storage — critical both for cost and for not accidentally shipping PII into your telemetry backend.
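That policy can be sketched as an OpenTelemetry Collector config using the `tail_sampling` processor from the contrib distribution — the thresholds and policy names here are assumptions to tune against your own traffic:

```yaml
processors:
  tail_sampling:
    # Must exceed your longest agent run, or the sampler decides
    # before the trace has finished arriving
    decision_wait: 30s
    policies:
      - name: keep-all-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow-agent-runs
        type: latency
        latency: {threshold_ms: 10000}
      - name: sample-the-rest
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
```

A trace is kept if any policy matches, so errors and slow runs always survive while the healthy baseline gets thinned to 10%.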
## Start here
According to the latest State of Agent Engineering report, 89% of organizations claim they've implemented "some form" of agent observability. But only 62% have step-level tracing. That gap — between knowing an agent call happened and knowing what it decided at each step — is exactly where production bugs live.
If you're deploying agents right now: auto-instrument your LLM provider first, manually wire context propagation across any tool or MCP boundaries, and set gen_ai.response.finish_reasons as your first alert condition. That one attribute catches more silent failures than any latency threshold you'll ever configure.