Verdict · June 9, 2026 · 9 min read

Claude Code Agent Teams: Are They Worth the Tokens?

Every guide shows you how to spin up a team. None tell you whether you should. The honest answer: agent teams are a token multiplier you point at a narrow class of high-value, genuinely parallel work — and token theater everywhere else. Here's where the line is, with the real cost numbers attached.

There is no shortage of “how to set up Claude Code agent teams” tutorials. Search the phrase and you’ll get a dozen, plus the official docs. What almost none of them do is answer the question you actually have before you turn the feature on: is it worth it? Agent teams multiply your token bill. The setup guides quietly skip past whether that multiplier buys you anything.

This page judges. We’ll start with the cost — because most of what’s written about it is wrong — then draw the line between the work where teams earn their tokens and the work where they’re an expensive way to do what one session already does.

The verdict in 30 seconds

  • Worth it for high-value work that genuinely splits into independent, file-disjoint chunks and benefits from agents comparing notes: research and review, a feature with separable modules, competing-hypothesis debugging, cross-layer (frontend/backend/tests) coordination.
  • Token theater for sequential work, same-file edits, dependency chains, or anything routine. A single session or subagents do it for a fraction of the cost.
  • The cost: plan on roughly N× a single session for an N-teammate team. Anthropic’s own figure is “approximately 7× more tokens… when teammates run in plan mode,” and otherwise “roughly proportional to team size.” The “3–4×” number you’ll see quoted everywhere is a third-party rule of thumb, not Anthropic’s — and the famous 4×/15× figures are from a different product (more on that below).
  • The catch: it’s experimental, disabled by default, the evidence that it’s faster for coding is still anecdotal, and the team lead sometimes just does the work itself instead of delegating.

If your task isn’t in the first bullet, you can stop reading and save the tokens. If it is, the rest of this page is about making the multiplier pay.

What it actually costs (read the labels)

The cost conversation around agent teams is a mess of numbers lifted from the wrong source. Three buckets, kept separate:

1. Anthropic’s actual Agent Teams number. The Claude Code costs doc states agent teams “use approximately 7× more tokens than standard sessions when teammates run in plan mode, because each teammate maintains its own context window and runs as a separate Claude instance.” Outside plan mode, Anthropic declines to publish a single multiplier, saying only that teams use “significantly more tokens than a single session” and that “token usage is roughly proportional to team size.” So: budget for a multiplier on the order of your team size, and treat 7× as the one figure Anthropic actually documents, which applies to plan-mode teams.

2. The “3–4×” rule of thumb is not Anthropic’s. It appears in setup guides as a paraphrase (“a 3-teammate team will use roughly 3–4× the tokens”). It’s a reasonable back-of-envelope, but it’s third-party, and independent estimates range from ~3× to ~7× depending on what’s measured. Don’t cite it as a fact; we don’t.

3. The 4×/15× numbers are from a different system. Anthropic’s widely quoted line, “agents typically use about 4× more tokens than chat interactions, and multi-agent systems use about 15× more tokens than chats,” comes from its multi-agent research system, a Claude.ai research product, measured against chat. That is not Claude Code agent teams, and the baseline isn’t a Claude Code session. Anyone splicing 15× into an Agent Teams cost estimate is comparing two different things.

On top of the headline multiplier, two costs hide in the footnotes:

  • The context-reload tax. When a teammate spawns, it “loads the same project context as a regular session: CLAUDE.md, MCP servers, and skills.” Each teammate pays that base-context load independently — so a bloated CLAUDE.md is now a tax you pay per agent, not once. (If yours is heavy, this is the week to read the token mistakes that inflate every session — they compound by team size here.)
  • Idle burn and round-trips. Anthropic warns that “active teammates continue consuming tokens even if idle,” so a team you forget to shut down keeps spending. And every message between agents is processed by a model on both ends — CloudZero characterizes it bluntly: “every inter-agent message is a round trip through the model.” Coordination isn’t free; it’s billed.
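To make those pieces concrete, here is a back-of-envelope sketch in Python. It is our own planning arithmetic, not an Anthropic formula: it assumes each teammate does about one session’s worth of work (the “roughly proportional to team size” line), pays the base-context load once, and that every inter-agent message is processed on both ends. Every name and number in it is hypothetical, and idle burn is not modeled at all.

```python
from dataclasses import dataclass


@dataclass
class TeamEstimate:
    work_tokens: int     # task work summed across teammates
    reload_tokens: int   # base context (CLAUDE.md, MCP, skills) summed across teammates
    message_tokens: int  # inter-agent messages, counted on both ends

    @property
    def total(self) -> int:
        return self.work_tokens + self.reload_tokens + self.message_tokens


def estimate_team_tokens(
    teammates: int,
    work_tokens_per_teammate: int,
    base_context_tokens: int,
    messages: int = 0,
    tokens_per_message: int = 0,
) -> TeamEstimate:
    """Rough planning estimate; every input is a guess you supply."""
    if teammates < 1:
        raise ValueError("teammates must be at least 1")
    # Assumption: each teammate does about one session's worth of work.
    work = work_tokens_per_teammate * teammates
    # Each teammate loads the project context independently.
    reload = base_context_tokens * teammates
    # Assumption: a message costs its length once to write and once to read.
    chatter = messages * tokens_per_message * 2
    return TeamEstimate(work, reload, chatter)


# Hypothetical numbers: a 3-teammate team with a 20k-token base context.
estimate = estimate_team_tokens(
    teammates=3,
    work_tokens_per_teammate=180_000,
    base_context_tokens=20_000,
    messages=40,
    tokens_per_message=500,
)
single_session = 180_000 + 20_000  # the same work done by one session

print(estimate.total)                   # 640000
print(estimate.reload_tokens)           # 60000
print(estimate.total / single_session)  # 3.2
```

With these made-up inputs the team lands a little above 3× a single session, and 60,000 of those tokens are nothing but the base context loaded three times. Trim the CLAUDE.md and that term shrinks for every teammate at once.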

What you’re actually buying

The multiplier only makes sense if you know what it buys that a cheaper option doesn’t. Here’s the real difference, from the official architecture:

A subagent runs within your session, does a scoped job, and reports a summary back to the main agent. Subagents “never talk to each other.” Cheap, because the result is summarized back into one context.

An agent team is a set of full, independent Claude Code sessions — a fixed team lead plus teammates, coordinating through a shared task list (with file-locking) and a mailbox that lets “any teammate message any other” directly, without going through the lead. Each teammate has its own context window; the lead’s history doesn’t carry over.

So you’re paying for exactly two things subagents can’t give you:

  1. Agents that talk to each other — challenge a plan, hand off, react to what a peer found — instead of all reporting to a single coordinator.
  2. Escaping one context window — sustained parallel work that, in aggregate, exceeds what a single session can hold.

Anthropic’s own one-liner is the cleanest decision rule anyone has written: “Use subagents when you need quick, focused workers that report back. Use agent teams when teammates need to share findings, challenge each other, and coordinate on their own.” If your task doesn’t need points 1 or 2, you’re buying capability you won’t use.
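Written out as a triage helper, the rule looks something like the sketch below. The function name, its parameters, and the “one chunk means one session” shortcut are our own illustration of the rule, not anything Claude Code exposes.

```python
def pick_mode(parallel_chunks: int, peers_must_talk: bool, exceeds_one_context: bool) -> str:
    """Hypothetical triage helper; the inputs are judgment calls you make up front."""
    # Nothing to parallelize: a team or subagents would only add overhead.
    if parallel_chunks <= 1:
        return "single session"
    # The two things only a team provides: peer-to-peer messaging,
    # and sustained work that outgrows a single context window.
    if peers_must_talk or exceeds_one_context:
        return "agent team"
    # Parallel but independent: scoped workers that report back are enough.
    return "subagents"


print(pick_mode(1, False, False))  # single session
print(pick_mode(4, False, False))  # subagents
print(pick_mode(3, True, False))   # agent team
```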

When it’s worth it

Anthropic is unusually candid about the economics, and the single best sentence for deciding is from the research-system write-up: “multi-agent systems require tasks where the value of the task is high enough to pay for the increased performance.” That’s the whole test. For a given team size the token multiplier is roughly fixed; what varies is the value of the thing you point it at.

Concretely, teams earn their tokens when the work is valuable, parallel, and separable:

  • Research and code review across a large surface — many files read at once, findings compared. Anthropic lists this first, and it’s where the “agents challenge each other” model shines.
  • A feature with independent modules — “teammates can each own a separate piece without stepping on each other.” The key word is separate: disjoint files.
  • Debugging with competing hypotheses — point three teammates at three theories in parallel instead of testing them in sequence.
  • Cross-layer changes — one teammate on the API, one on the UI, one on tests, coordinating through the task list.

The common thread: the work breaks into chunks that don’t touch the same files, each chunk is worth real money to finish faster, and the chunks benefit from talking. Hit all three and the multiplier is a bargain. Anthropic’s sizing guidance: start with 3–5 teammates; “three focused teammates often outperform five scattered ones.”

When it’s token theater

The mirror image — where you’re paying N× for nothing:

  • Sequential work. If step two needs step one’s output, parallelism buys you nothing and you pay for idle agents waiting.
  • Same-file edits. “Two teammates editing the same file leads to overwrites.” The file-lock forces them to take turns — you’ve bought a traffic jam.
  • Dependency-heavy tasks. Lots of cross-dependencies means lots of coordination messages, each a billed round-trip, with the agents mostly waiting on each other.
  • Routine tasks. Anthropic’s own line: “For routine tasks, a single session is more cost-effective.” A bug fix, a rename, a small refactor — one session, full stop.

And even on work that looks suitable on paper, the feature’s experimental rough edges can eat the savings. From the official limitations: the lead “may decide the team is finished before all tasks are actually complete,” teammates “sometimes fail to mark tasks as completed, which blocks dependent tasks,” and the lead sometimes “starts implementing tasks itself instead of waiting.” Session resumption is also broken with in-process teammates: /resume and /rewind “do not restore” them. This is a v2.1.32-era experimental feature, and it behaves like one.

The honest part: the coding evidence is thin

Here’s what the setup guides won’t tell you, and what our no-leaderboard-worship policy demands we say out loud: there is no rigorous, reproducible public benchmark of Claude Code agent teams speeding up coding work versus what they cost.

The one hard performance number Anthropic publishes — a multi-agent system beating single-agent Opus by 90.2% on an internal eval — is, again, the research system, on a research benchmark. Impressive, and genuinely relevant to “multi-agent can win when parallelism is high,” but it is not Claude Code shipping a feature.

For coding specifically, the best evidence is a hands-on day-one field report: a developer who built an OAuth feature with a team and felt it took “roughly half the time my subagent workflow would have taken,” then immediately added, “This is one test on day one. I’m not claiming Agent Teams are universally faster,” and noted “a five-teammate debugging session burns through tokens fast.” That’s an honest n=1: directionally promising, a felt speedup of about 2×, real cost. It is not proof, and we won’t dress it up as proof.

So the verdict rests on mechanism and economics, not a benchmark: teams should win on valuable, parallel, separable work, and the people who’ve tried them on exactly that work report they do. Outside that envelope, you’re funding overhead.

If you do use it: make the multiplier pay

Turning it on (it’s off by default): set CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 in your settings.json or environment, on Claude Code v2.1.32 or later. Then follow Anthropic’s own cost-control checklist, every item of which is about not wasting the multiplier:

  • Use Sonnet for teammates, not Opus — it “balances capability and cost,” and you rarely need five Opus-class agents in parallel.
  • Keep the team small (3–5). Token cost is roughly linear in team size; scattered teammates add coordination cost without adding throughput.
  • Keep spawn prompts focused. Everything in the spawn prompt is context each teammate pays for from the first token — on top of the CLAUDE.md/MCP/skills it already reloads.
  • Split work by files, not by feature. Disjoint file ownership avoids the lock contention that turns parallel work serial.
  • Shut the team down when it’s done. Idle teammates keep billing.
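The “split work by files” item is the easiest one to check before you spend a token. A minimal sketch, assuming you can write the plan down as a mapping from teammate to the files it will own; the teammate names and paths are hypothetical, and this is a pre-flight check of our own, not part of Claude Code.

```python
def find_file_conflicts(plan: dict[str, set[str]]) -> dict[str, list[str]]:
    """Return {file: [teammates]} for every file claimed by more than one teammate."""
    owners: dict[str, list[str]] = {}
    for teammate, files in plan.items():
        for path in files:
            owners.setdefault(path, []).append(teammate)
    # Keep only files with more than one owner; sort names for stable output.
    return {path: sorted(names) for path, names in owners.items() if len(names) > 1}


# Hypothetical cross-layer split with one overlap.
plan = {
    "api": {"server/routes.py", "server/auth.py"},
    "ui": {"web/Login.tsx"},
    "tests": {"tests/test_auth.py", "server/auth.py"},
}

print(find_file_conflicts(plan))  # {'server/auth.py': ['api', 'tests']}
```

An empty result means the ownership is disjoint. Any entry is a file where two teammates would end up taking turns, so reassign it to one owner before spawning the team.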

What would change our mind

  • A reproducible coding benchmark — speedup vs. token cost on a fixed, real task — showing >2× faster at less than team-size cost would move teams from “worth it on a narrow class of work” to a default for parallel features.
  • Per-teammate model selection at spawn (today permissions and much of the setup are inherited from the lead) plus prompt-cache credit for the shared base context would cut the reload tax and widen the worth-it envelope.
  • Graduation out of experimental — reliable resumption, dependable task-completion marking, a lead that delegates instead of doing the work — would remove the “rough edges eat the savings” caveat.
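The first bar above is simple enough to state as a predicate. This is only our threshold written out in code, with hypothetical inputs: a measured speedup, a measured token multiplier against a single session, and the team size.

```python
def clears_our_bar(speedup: float, token_multiplier: float, team_size: int) -> bool:
    """True when a benchmark result is >2x faster at less than team-size token cost."""
    return speedup > 2.0 and token_multiplier < team_size


# Hypothetical results, not measurements.
print(clears_our_bar(speedup=2.5, token_multiplier=3.0, team_size=4))  # True
print(clears_our_bar(speedup=2.0, token_multiplier=5.0, team_size=5))  # False
```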

Until then: agent teams are a sharp tool for a specific job. Aim them at valuable, parallel, separable work and they’re a bargain. Point them at everything and you’ve just bought a team-sized token bill, up to roughly 7× in plan mode, for a single session’s output.

