AI agent benchmarks in 2026: separating real signal from marketing

In February 2026, OpenAI published a post with a remarkable title: “Why SWE-bench Verified no longer measures frontier coding capabilities.” Remarkable because SWE-bench Verified was OpenAI’s own benchmark — the company built it in 2024, with the original authors, precisely to fix the flaws in the original SWE-bench. Eighteen months later it retired the thing in public, explaining that scores had crawled from 74.9% to just 80.9% in six months, that a manual review of 138 persistently-failing tasks found 59.4% had material defects (35.5% tests too strict, 18.8% tests checking behavior the issue never mentioned), and — the kicker — that every frontier model they tested could reproduce the benchmark’s gold-standard patches verbatim. The tasks were in the training data.

Meanwhile Anthropic’s latest flagship announcement, Claude Opus 4.8 (May 2026), still headlines SWE-bench Verified.

So as of mid-2026, the two labs whose models you’re most likely choosing between no longer report the same primary coding benchmark. One abandoned the field’s standard citing contamination; the other kept it. Every leaderboard screenshot you see on X stitches these incomparable numbers together anyway. This piece is about how we got here, what the decay mechanisms actually are, and how to read any benchmark claim without getting played.

Benchmarks are software, and they ship with bugs

The original SWE-bench (Princeton, October 2023) was a genuinely great idea: 2,294 real GitHub issues from 12 popular Python repositories — fix the issue, pass the repo’s tests. No toy puzzles. When it launched, the best model of the day solved 1.96% of it. That number was the point: finally, a benchmark with headroom.

But real-world tasks import real-world mess. When OpenAI built SWE-bench Verified in August 2024, it paid 93 professional developers to annotate a 1,699-task sample of the original, three annotators per task. The audit found 38.3% of tasks flagged for underspecified problem statements and 61.1% flagged for tests that could unfairly reject valid solutions. After filtering, 68.3% of sampled tasks were thrown out, leaving the 500-task Verified set, on which GPT-4o’s score promptly doubled, from 16% to 33.2%, with no change to the model.

Sit with that: the same model, same scaffold, doubled its score because the test got fixed. A large share of the failures the original benchmark recorded were benchmark bugs, not missing capability. That’s not a scandal; it’s the normal lifecycle. The scandal is treating any benchmark as a fixed standard of truth after the field has had two years to train against it.

The three ways a benchmark dies

1. Contamination — the model has seen the answers. SWE-bench tasks are public GitHub history, and so are their fixes. A 2025 Microsoft paper with a perfect title — The SWE-Bench Illusion — measured what that means: state-of-the-art models could name the buggy file from the issue text alone, no repository access, 76% of the time on SWE-bench repos — dropping to 53% on comparable repos outside the benchmark. Five-gram analysis showed roughly double the verbatim overlap with SWE-bench Verified solutions versus other benchmarks (35% vs 18%). By 2026, OpenAI’s retirement post said the quiet part out loud: frontier models can emit the gold patches verbatim. At that point a rising score measures recall, not engineering.

2. Saturation — the ceiling becomes the floor. HumanEval, the 164-problem set that launched code benchmarking in 2021 (the original Codex scored 28.8%), is the cautionary tale. Frontier models pushed it so close to the ceiling that it stopped differentiating anything — neither the GPT-5.5 nor the Claude Opus 4.8 announcement so much as mentions it. A benchmark doesn’t have to be wrong to become useless; it just has to stop separating the models you’re choosing between. OpenAI’s stated reason for dropping SWE-bench Verified included exactly this: six months of frontier progress moved it six points.

3. Scaffold divergence — same benchmark, different game. The number you see is never just the model; it’s model + harness + prompting + retry budget. The footnotes say so, if you read them:

Anthropic’s Claude Opus 4.5 SWE-bench number came with the footnote that it was achieved “using parallel test-time compute, a method that aggregates multiple ‘tries’ from the model and selects from among them”.
Anthropic’s Opus 4.8 announcement reports every model’s Terminal-Bench score through one public harness (Terminus-2) — and footnotes that “GPT-5.5’s reported score with the Codex CLI harness is 83.4%”, i.e., OpenAI’s own harness yields a meaningfully different number for the same model on the same benchmark.
OpenAI’s GPT-5.5 announcement footnotes its own headline SWE-bench Pro score with a warning that labs have found evidence of memorization in that eval, and notes its τ²-bench number was run with raw prompts, deliberately ignoring other labs’ prompt-tuned results.

None of this is cheating — every footnote is disclosed. But it means a one-decimal-place comparison chart between two vendors’ self-reported numbers is noise wearing a lab coat. The harness difference alone (83.4 vs a Terminus-2 number for the same model) is bigger than most claimed model-over-model gaps.
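To get a feel for how much a retry budget alone can move a headline number, here is a minimal sketch. It assumes every try is independent, every task has the same single-try success rate, and a perfect selector always picks a passing try when one exists. None of those hold in practice (hard tasks tend to fail on every try, and real selectors miss), so read the output as an upper bound on the inflation, not as any vendor's actual method. The function name and the 60% starting rate are illustrative, not taken from any report.

```python
def best_of_k_upper_bound(single_try_rate: float, k: int) -> float:
    """Chance that at least one of k independent tries passes.

    Assumptions (all optimistic): tries are independent, every task shares
    the same single-try success rate, and selection among tries is perfect.
    """
    if not 0.0 <= single_try_rate <= 1.0:
        raise ValueError("single_try_rate must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - single_try_rate) ** k


if __name__ == "__main__":
    # Hypothetical single-try pass rate, chosen only for illustration.
    rate = 0.60
    for tries in (1, 2, 4, 8):
        bound = best_of_k_upper_bound(rate, tries)
        print(f"tries={tries}: at most {bound:.1%}")
```

Under those assumptions a model that passes 60% of tasks in one try could report up to 84% with two. The real effect of parallel test-time compute will be smaller, but the direction is the point: two numbers produced with different retry budgets are not measuring the same thing.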

Where the two labs actually stand, mid-2026

Reported in the flagship announcement	OpenAI — GPT-5.5 (Apr 2026)	Anthropic — Claude Opus 4.8 (May 2026)
SWE-bench Verified	Dropped — formally retired Feb 2026	Still headlined
SWE-bench Pro	Yes — 58.6% (public set), with a memorization-warning footnote	Yes
Terminal-Bench	Yes — v2.0, 82.7%	Yes — v2.1, via the Terminus-2 public harness
House evals	Expert-SWE (internal, ~20-hour tasks)	CursorBench, Vending-Bench, τ²-bench, Online-Mind2Web

The overlap that survives is Terminal-Bench and SWE-bench Pro — and even there, the harness footnote above shows the numbers aren’t directly comparable. If you remember one thing from this table: a cross-vendor benchmark chart with no harness disclosure is marketing, full stop.
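Harness differences are not the only reason small gaps are hard to read; task count matters too. The sketch below estimates the sampling noise on a pass rate with a plain binomial standard error and a normal-approximation 95% interval. It assumes each task is an independent pass/fail draw from a single run, which is a simplification (reported scores are often averaged over several runs). The 89-task count is the Terminal-Bench v2.0 figure cited in this piece; the function names are illustrative.

```python
import math


def pass_rate_stderr(pass_rate: float, n_tasks: int) -> float:
    """Binomial standard error of a pass rate measured on n_tasks tasks."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be in [0, 1]")
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


def approx_95_interval(pass_rate: float, n_tasks: int) -> tuple[float, float]:
    """Normal-approximation 95% interval, clipped to [0, 1]."""
    half_width = 1.96 * pass_rate_stderr(pass_rate, n_tasks)
    return max(0.0, pass_rate - half_width), min(1.0, pass_rate + half_width)


if __name__ == "__main__":
    # 82.7% on an assumed 89 tasks (the v2.0 task count cited in this piece).
    low, high = approx_95_interval(0.827, 89)
    print(f"82.7% on 89 tasks: roughly {low:.1%} to {high:.1%}")
```

By this rough approximation, 82.7% on 89 tasks carries an interval of roughly 75% to 91%, so the other figure quoted here for the same model, 83.4%, sits well inside it. A sub-point gap on a benchmark this size cannot be told apart from sampling noise without repeated runs.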

What still carries signal

The benchmarks worth your attention in 2026 share design properties that resist the three decay modes:

Fresh questions, on a schedule. LiveBench (Abacus.AI + NYU, with Yann LeCun among the authors) releases new questions roughly monthly, drawn from recent papers and competitions, with objective ground-truth scoring — the paper calls it “contamination-limited,” which is also the honest ceiling of what’s achievable. You can’t memorize what didn’t exist at training time.

Versioned, audited, and hard. Terminal-Bench (Stanford × Laude Institute) does the unglamorous thing: v2.0 shipped as 89 curated end-to-end terminal tasks (compile, debug, set up servers), each audited for ~3 reviewer-hours and calibrated so that frontier models scored under 65% at release. Versioning matters: when contamination or saturation creeps in, the fix ships as v3, not as silent rot.

Task formats that measure the loop, not the snippet. Aider’s polyglot benchmark is 225 of the hardest Exercism exercises across six languages, two attempts allowed — the model sees its own test failures and must self-correct. That’s much closer to what your agent actually does all day than single-shot patch generation.

A different axis entirely. METR’s time-horizon work sidesteps “score on a fixed set” by asking: how long a task, measured in human working time, can the model complete at 50% reliability? Their March 2025 result, that the horizon has doubled roughly every 7 months for six years, became one of the most-cited capability charts of the decade. The January 2026 update (228 tasks, rebuilt infrastructure) finds post-2024 models doubling even faster, roughly every 89 days, with the top measured model at a 320-minute (~5-hour) horizon. Time horizons saturate much more slowly than pass rates, because the task ladder extends upward.
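The gap between "every 7 months" and "every 89 days" is easier to feel once it is compounded. A minimal sketch of the arithmetic follows, assuming clean exponential growth and an average month of 30.44 days; the 40-hour target is an arbitrary illustration, and none of this is a forecast or METR's own calculation.

```python
import math

DAYS_PER_MONTH = 30.44  # assumption: average calendar month length


def growth_factor(elapsed_days: float, doubling_days: float) -> float:
    """Multiplier on the time horizon after elapsed_days of steady doubling."""
    return 2.0 ** (elapsed_days / doubling_days)


def days_to_reach(current_minutes: float, target_minutes: float,
                  doubling_days: float) -> float:
    """Days of steady doubling needed to grow from current to target horizon."""
    return doubling_days * math.log2(target_minutes / current_minutes)


if __name__ == "__main__":
    slow = 7 * DAYS_PER_MONTH  # "roughly every 7 months"
    fast = 89.0                # "roughly every 89 days"
    for label, doubling in (("7-month", slow), ("89-day", fast)):
        yearly = growth_factor(365.0, doubling)
        # 320 minutes is the top horizon cited above; 40 hours is illustrative.
        wait = days_to_reach(320.0, 40 * 60.0, doubling)
        print(f"{label} doubling: x{yearly:.1f} per year, "
              f"{wait:.0f} days from 320 min to 40 h")
```

Under these assumptions the slower rate compounds to a bit over 3x per year and the faster one to roughly 17x. That spread is why the choice of fitting window matters so much when someone extrapolates the chart.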

The five questions that separate signal from marketing

Next time a launch post, a leaderboard, or a reply-guy screenshot hands you a number, run it through these:

1. Who ran the harness? Vendor self-reported, or a third party with a public harness? Is the harness named at all?
2. Same scaffold for every model in the chart? If one column used parallel test-time compute or a custom CLI and another didn’t, the chart is fiction. (The disclosed 83.4-vs-Terminus-2 gap for one model is your calibration for how much this matters.)
3. Could the tasks be in training data? When were they created vs. the model’s cutoff? Has anyone published a contamination probe (à la SWE-Bench Illusion)?
4. Has the test itself been audited? SWE-bench Verified exists because 61% of sampled original tasks had tests that could reject correct answers. If nobody has audited the benchmark, assume some of the headroom is bugs.
5. Is it still separating the frontier? If the top five models sit within a couple of points (or near 100%), the benchmark has retired itself, whatever the marketing says.

A number that survives all five is signal. Most numbers in launch threads survive zero.
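The contamination question is the one you can poke at yourself if you have a model's output and a benchmark's reference solution side by side. The sketch below is a crude stand-in for the five-gram overlap idea mentioned earlier, not a reproduction of any published method: it splits on whitespace (an assumption; a real probe would use a proper tokenizer) and reports what fraction of the reference's distinct five-word sequences also appear in the model output. Function names and the sample strings are hypothetical.

```python
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """All distinct runs of n consecutive tokens (empty if too few tokens)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(candidate: str, reference: str, n: int = 5) -> float:
    """Fraction of the reference's distinct n-grams found in the candidate.

    Tokenization is plain whitespace splitting, which is a simplification.
    Returns 0.0 when the reference is too short to contain any n-gram.
    """
    reference_grams = ngrams(reference.split(), n)
    if not reference_grams:
        return 0.0
    candidate_grams = ngrams(candidate.split(), n)
    return len(reference_grams & candidate_grams) / len(reference_grams)


if __name__ == "__main__":
    # Hypothetical strings standing in for a gold patch and a model's patch.
    gold = "if value is None : return default_value else : return value"
    model = "if value is None : return default_value"
    print(f"5-gram overlap: {ngram_overlap(model, gold):.0%}")
```

A high overlap on one task proves little, since short fixes often have only one natural spelling. The signal is in the comparison: overlap on benchmark tasks that is consistently higher than on comparable tasks outside the benchmark is what points toward memorization.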

Companion reading

Claude vs ChatGPT for coding in May 2026 — where we first hit the “vendors no longer report the same benchmark” wall in practice, and chose by agent ergonomics instead.
State of AI Coding Agents — May 2026 — the monthly ledger of what actually shipped.
Every popular Claude Code workflow is the same five steps — why verification gates in your own workflow beat trusting anyone’s leaderboard.
Claude’s “Dreams,” accurately — the “6x task completion” claim is a textbook unsourced multiplier; here’s the skeptic’s read.

Sources: OpenAI, “Why SWE-bench Verified no longer measures frontier coding capabilities” (Feb 23, 2026); OpenAI, “Introducing SWE-bench Verified” (Aug 13, 2024); Jimenez et al., SWE-bench (ICLR 2024); Liang, Garg & Zilouchian Moghaddam, “The SWE-Bench Illusion” (Microsoft, 2025); Anthropic, Claude Opus 4.5 and Claude Opus 4.8 announcements; OpenAI, GPT-5.5 announcement (Apr 2026); LiveBench; Aider polyglot benchmark; Terminal-Bench (paper); METR and Time Horizon 1.1. GPT-5.5 footnote wording paraphrased from a non-English-served page; substance verified.