Your agent scored 82% on Terminal-Bench 2.0. Congrats. Nobody on that leaderboard tells you what it cost to get there, and the missing column is hiding a 50x variance in production economics.

The Missing Column

A multi-dimensional evaluation study published last quarter — the CLEAR framework — ran six leading systems across 300 enterprise tasks and found that configurations achieving similar accuracy differed by up to 50x in per-task cost. The range ran from about $0.10 to $5.00 for the same job. Not "similar job." Same job. Same correct output.

This shouldn't surprise anyone who has actually shipped one of these things, but it keeps surprising people because the leaderboards do not show it. Terminal-Bench 2.0 reports task completion. SWE-bench Verified reports patch correctness. GAIA reports answer accuracy. None of them publish the token bill. So when your team says "let's use the one at the top of the list," they are picking based on one axis of a problem that has at least four.

Why the Spread Is So Wide

The gap exists because accuracy scaling in agentic systems is bought with API calls, not parameters. Reflexion-style refinement loops will issue up to 2,000 model calls on a single task. Tree-of-thoughts variants branch and prune. Planner-executor-critic stacks triple the per-turn spend before the first tool call even fires. Two stacks can converge on the same output — one makes 12 calls, the other makes 1,800, and both show a green check on the leaderboard.

Complex architectures extract marginal accuracy at exponential cost. In the CLEAR numbers, the systems with the highest raw accuracy cost 4.4 to 10.8 times more than the Pareto-efficient alternatives sitting a point or two below them on the scoreboard. A 2-point bump in accuracy worked out to roughly $50,000 in extra spend per 10,000 tasks. For most enterprise workloads that trade never makes sense, but it gets made anyway because "best on the leaderboard" is a sticky heuristic nobody questions in a vendor review meeting.
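The arithmetic behind that $50,000 figure is worth making explicit. A back-of-envelope sketch, with per-task costs that are hypothetical but chosen to sit inside the article's $0.10–$5.00 range:

```python
# Illustrative math only: these per-task dollar figures are assumed,
# not taken from the CLEAR study itself.
pareto_cost = 0.50   # $/task, Pareto-efficient stack at ~78% accuracy
top_cost = 5.50      # $/task, leaderboard leader at ~80% accuracy
tasks = 10_000

extra_spend = (top_cost - pareto_cost) * tasks
print(f"Extra spend for a 2-point accuracy bump: ${extra_spend:,.0f}")
# → Extra spend for a 2-point accuracy bump: $50,000
```

Five dollars of extra spend per task, ten thousand times over, buys two points of accuracy. Whether that trade clears depends entirely on what a failure costs you, which is the question the leaderboard never asks.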

The incentive structure reinforces this. Vendors want the top line. Researchers want publishable deltas. Neither has a reason to advertise the denominator.

Pareto Is a Real Place

The useful mental model is not "pick the best agent." It is "pick the one on your frontier." The frontier is the set of configurations where no alternative is strictly better on both accuracy and cost. Everything else is dominated — it costs more for equal or worse quality, and you have no reason to run it except that it sits above other names on a webpage.
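The dominance check is mechanical once you have accuracy and cost per configuration. A minimal sketch, with configuration names and numbers that are hypothetical:

```python
from typing import NamedTuple

class Config(NamedTuple):
    name: str
    accuracy: float  # fraction correct on your task slice
    cost: float      # dollars per task

def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Keep only configurations not dominated by another: dominated means
    some alternative is at least as accurate AND at least as cheap, and
    strictly better on one of the two axes."""
    frontier = []
    for c in configs:
        dominated = any(
            o.accuracy >= c.accuracy and o.cost <= c.cost
            and (o.accuracy > c.accuracy or o.cost < c.cost)
            for o in configs
        )
        if not dominated:
            frontier.append(c)
    return frontier

# Hypothetical configurations for illustration only.
configs = [
    Config("single-call", 0.74, 0.10),
    Config("reflexion", 0.80, 1.50),
    Config("planner-critic", 0.79, 2.40),  # worse AND pricier than reflexion
]
print([c.name for c in pareto_frontier(configs)])
# → ['single-call', 'reflexion']
```

The planner-critic stack drops out not because it is bad in absolute terms, but because another option beats it on both axes at once. That is the whole meaning of "dominated."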

This reframes the architecture decision. If you are choosing between a single-call tool-use runner at 74% and a Reflexion variant at 80%, the question isn't "which is better." It's: does the six-point gap earn the 15x call multiplier at your volume? If your task stream is 10 per hour and a failure costs a support ticket, maybe. If it's a million calls a day running background triage, almost certainly not.

Cost-Normalized Accuracy

The CLEAR authors propose a metric called cost-normalized accuracy — accuracy divided by cost, or accuracy at a fixed budget. Not fancy. ML benchmarks have tracked FLOPs alongside accuracy for years. But for agent evals, publishing a cost figure is still rare, and the handful that do use wildly different accounting: tokens, wall-clock latency, dollars at some model's list price. Without a standard, you cannot compare two stacks honestly, let alone predict what 200k requests will look like on the invoice.
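Both forms of the metric are a few lines each. A sketch under the definitions above, with hypothetical numbers:

```python
def cost_normalized_accuracy(accuracy: float, cost_per_task: float) -> float:
    """Accuracy per dollar: the first form the CLEAR authors describe."""
    return accuracy / cost_per_task

def accuracy_at_budget(runs: list[tuple[float, float]], budget: float) -> float:
    """The second form: best accuracy among configurations whose per-task
    cost fits under a fixed budget. `runs` is (accuracy, cost) pairs."""
    feasible = [acc for acc, cost in runs if cost <= budget]
    return max(feasible, default=0.0)

# Hypothetical configurations: the expensive stack wins on raw accuracy,
# the cheap one wins once you normalize.
print(round(cost_normalized_accuracy(0.80, 2.00), 2))  # → 0.4 points/$
print(round(cost_normalized_accuracy(0.74, 0.10), 2))  # → 7.4 points/$
print(accuracy_at_budget([(0.74, 0.10), (0.80, 2.00)], budget=1.00))  # → 0.74
```

The accounting still has to be standardized for this to travel across vendors — tokens, latency, and list-price dollars all give different rankings — but within your own shop, picking one denominator and sticking to it is enough to compare stacks honestly.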

Pass@1 Is Also Lying

Cost is the loudest problem, but reliability is the quieter one. The same research found that runners scoring 60% on a single pass dropped to 25% when measured across eight consecutive attempts on identical inputs. For anything touching a user or a tool call with side effects, pass@1 is advertising, not measurement. If the system only works three out of eight times on the same prompt, 60% is a ceiling under perfect conditions, not a floor you can plan against.

Most teams don't find this out until after rollout. A staged canary where the same input goes through ten times and you diff the outputs catches it in an afternoon.
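The canary itself is an afternoon of plumbing. A sketch of the repeated-run diff, where `run_fn` is a stand-in for your agent's entry point:

```python
import collections

def consistency_report(run_fn, prompt, k: int = 10) -> dict:
    """Run the same input k times and diff the outputs. Any drift across
    identical inputs shows up as more than one distinct result."""
    outputs = [run_fn(prompt) for _ in range(k)]
    counts = collections.Counter(outputs)
    modal_output, freq = counts.most_common(1)[0]
    return {
        "distinct_outputs": len(counts),
        "modal_output": modal_output,
        "modal_rate": freq / k,  # how often the most common answer appears
    }

# Stand-in for a flaky agent: succeeds 3 times out of 8 on the same prompt.
flaky = iter(["ok", "fail", "ok", "fail", "fail", "ok", "fail", "fail"])
report = consistency_report(lambda _: next(flaky), "same prompt", k=8)
print(report)
# → {'distinct_outputs': 2, 'modal_output': 'fail', 'modal_rate': 0.625}
```

For free-text outputs you would diff on a normalized or semantically-compared form rather than exact string equality, but the shape of the check is the same: one input, k runs, count the spread.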

What to Actually Do

Before committing to an architecture, run the same 200-task slice through each candidate and log three things per task: correctness, total tokens in + out, and latency. Plot them. Look for points on your frontier. Ignore the public rankings that don't carry a cost axis.
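The logging loop is the unglamorous part, so here is a minimal sketch. Everything in it — the candidate interface, the `check` callable, the CSV filename — is a stand-in for your own harness:

```python
import csv
import time

def evaluate(candidates: dict, tasks: list, check) -> None:
    """Run each candidate over the same task slice and log the three
    numbers per task: correctness, total tokens in + out, and latency.
    `candidates` maps a name to a callable returning
    (answer, tokens_in, tokens_out); `check(task, answer)` returns a bool."""
    with open("eval_log.csv", "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["candidate", "task_id", "correct", "tokens", "latency_s"])
        for name, run in candidates.items():
            for i, task in enumerate(tasks):
                t0 = time.monotonic()
                answer, tok_in, tok_out = run(task)
                w.writerow([name, i, check(task, answer),
                            tok_in + tok_out,
                            round(time.monotonic() - t0, 3)])
```

From the resulting CSV, a scatter of mean cost against mean accuracy per candidate gives you the frontier directly; anything up and to the left of another point is dominated and out.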

For reliability, run each candidate at k=8 on a stratified sample and record consistency, not just top-line accuracy. The stack that wins at pass@1 is not always the one that wins at pass@8.

When a vendor sells you on "we topped Terminal-Bench 2.0," the follow-up question is: at what cost-normalized score, with what variance across repeated runs? If they don't have those numbers, they haven't measured the thing you are about to buy.

The leaderboard is a marketing surface. Your production invoice is not.