April 5, 2026 · 15 min read

The Handoff Tax: Why Your Team's AI Productivity Numbers Are Wrong

The 55% individual productivity gain from AI coding tools is real, but it does not account for what happens after the task ends. DORA 2024 found that individual speed gains come with declining team delivery stability.


Akshay Saraswat

Founder, Evols

The Handoff Tax

There is a number that gets cited in nearly every enterprise AI presentation right now: developers who use AI coding tools complete tasks 55% faster. It comes from a GitHub study. It has appeared in board decks, in procurement justifications, in analyst reports. It is, in the strict sense of what it measured, accurate.

Here is what it didn't measure: what happens after the task is done.

The study (Peng, Kalliamvakou, Cihon, and Demirer, 2023; arXiv:2302.06590) asked 95 developers to implement an HTTP server in JavaScript, with or without Copilot. The Copilot group finished 55.8% faster. This is real data from a real study. It is also a single task, with a single developer, in a controlled environment, with no downstream handoffs, no integration with other people's work, and no measurement of the code's quality, maintainability, or what it cost the next developer who had to work with it.

The Paradox: Individual velocity spikes while team throughput remains flat, crushed by the gears of context loss.

The DORA 2024 report, which surveys thousands of development teams and tracks actual software delivery metrics, found something that didn't get nearly as much coverage: AI tool adoption "significantly increases individual productivity, flow, and job satisfaction" while simultaneously "negatively impacting software delivery stability and throughput." Individual speed goes up. Team delivery performance goes sideways.

This is not a paradox. It is what you'd expect when you have tools that are very good at optimizing for individual sessions and say nothing about what happens when that session has to connect with someone else's work.


What the 16% Problem Actually Means

The 16% Trap: AI optimizes the smallest segment of a developer’s day, leaving the massive coordination overhead untouched.

Research from Microsoft, cited in a DX analysis from March 2026 covering 400 companies, found that coding represents approximately 16% of a developer's time. If AI tools make that coding work 50% faster, the total time saved is about 8%, before accounting for any additional overhead the AI creates.
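The arithmetic behind that ceiling is worth making explicit. A minimal sketch, following the article's reading of "50% faster" as halving the time spent on the coding portion; the variable names are illustrative:

```python
# Back-of-envelope ceiling on total time saved when AI only accelerates coding.
# The 16% and 50% figures come from the studies cited above; the rest is illustrative.
coding_share = 0.16     # ~16% of a developer's time goes to coding
coding_speedup = 0.50   # treat "50% faster" as halving time on that portion

total_time_saved = coding_share * coding_speedup
print(f"Ceiling on total time saved: {total_time_saved:.0%}")  # -> 8%
```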

Developers in the same study were candid about the ceiling this creates:

"The bigger issue I keep running into is that most codebases and systems aren't set up for AI to actually help. Not an architecture problem, more that the knowledge of how things work lives in people's heads."

And on the emerging bottleneck:

"AI can very significantly speed up initial engineering time, but often that saved time is spent on extended reviews, fact checking or issue remediation."

The DX longitudinal study found that across 400 companies from November 2024 to February 2026, pull request throughput increased by approximately 10%, not the 2-3x often cited. The efficiency gain is real. It is also much smaller than marketed, because the measurement captures only the coding portion of the work and it cannot capture what gets lost in handoffs.


The Three Places Knowledge Dies in Transit

The handoff tax takes three specific forms in AI-native teams. Each one is a place where knowledge generated in one person's AI session has to be rebuilt from scratch by the next person.

The PM-to-Engineer Handoff

Lost in Transit: A PM builds rich context in Claude (left), but only shuffles a flat document (the PRD) to the engineer, forcing them to rebuild the reasoning from scratch.

A PM spends an afternoon in Claude doing customer research: analyzing interview transcripts, synthesizing pain points, identifying priority patterns. The output is a summary document. The document captures conclusions. It does not capture the 40 minutes of exploratory conversation, the alternative framings that were considered and rejected, the specific customer quotes that shaped the final prioritization, the reasoning chain that led from data to decision.

The engineer who reads the PRD gets the conclusions. When they have questions, and they will, they start a new conversation with Claude and try to reconstruct the context the PM already built. Two people, two AI sessions, one knowledge base that was built twice.

This isn't a communication failure. The document was written. The information was shared. The problem is that documents capture outputs, not the reasoning context that makes outputs useful when they need to be extended or questioned.

The Session-to-Documentation Gap

The bug was fixed. The understanding wasn't.

A developer spends 45 minutes debugging an authentication issue. They figure it out. They might write a comment in the code or a brief note in a ticket. What they don't write, because it would take another 30 minutes after the task is done, is the reasoning path: what they checked first, what red herrings they ruled out, why the actual fix works, and what would cause this to break again.

The next developer to touch that code encounters the same issue. The debugging context exists in one developer's AI session history and nowhere else. Their session history probably has a retention period. After that, the knowledge is gone.

The getdx.com "Cognitive Debt" analysis (April 2026), citing Dr. Margaret-Anne Storey's research, describes this as a structural risk: "velocity can outpace understanding" in AI-native development. The diagnostic question it doesn't answer, because the literature genuinely doesn't have data yet, is how much of each developer's time goes to rebuilding context that someone on the team already built. The researchers themselves call for measurement that doesn't yet exist.

The Onboarding Cold Start

A developer joins a team that has been using Claude Code for 18 months. In that time, the team has made hundreds of architectural decisions, surfaced dozens of codebase-specific patterns, debugged obscure issues that took days to resolve, and built up a body of institutional knowledge about how to work with this codebase using AI effectively.

None of this is in the CLAUDE.md file. Very little of it is in comments. Some of it is in PRs if the engineer knows where to look. Most of it is in the heads and session histories of their new teammates, and in the AI contexts those teammates have accumulated individually.

The DX analysis covering ramp-up time (April 2026, 400 companies) found that new developers now take about 33 days to reach their 10th merged PR, down from around 67 days in early 2024, a real improvement driven partly by AI tools. But the study's authors note that this metric doesn't capture "depth of understanding." Getting to 10 PRs faster doesn't mean the developer understands the system. It means they can produce output. The two are different things.


Putting a Number on It

Doubling the Cost: When one developer must reconstruct 12,000 tokens of reasoning context (right) that already exists (left), the knowledge cost of the task doubles.

There is no published study that directly measures the cost of context reconstruction in AI-native teams. The research gap is real: the DX cognitive debt report explicitly calls for measurement that doesn't exist yet.

But a rough framework is possible.

If a developer's AI session generates, say, 12,000 tokens of relevant context about a codebase decision, and that context must be reconstructed by a teammate starting fresh, at roughly the same token cost, then every handoff that doesn't transfer context doubles the cost of the work that was done. Not the cost of the code that was shipped. The cost of the knowledge that informed the code.

Consider a 10-person engineering team where each developer runs three substantive AI sessions per day, each generating knowledge that one or more teammates will need to partially reconstruct. The daily tax isn't on one session. It is distributed across dozens of partial reconstructions, each smaller than the original, none of which appear in any productivity metric.
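A rough sketch of that back-of-envelope model, using the figures above. The reconstruction fraction is an assumption for illustration, not a measured value:

```python
# Rough model of the daily "handoff tax" for one team, under the assumptions
# in the text. None of these numbers are measured; they illustrate the shape
# of the cost, not its true size.
team_size = 10                       # developers
sessions_per_dev_per_day = 3         # substantive AI sessions per developer
context_tokens_per_session = 12_000  # reasoning context generated per session
reconstruction_fraction = 0.5        # assume teammates rebuild ~half of that context

context_generated = team_size * sessions_per_dev_per_day * context_tokens_per_session
context_rebuilt = context_generated * reconstruction_fraction

print(f"Reasoning context generated per day: {context_generated:,} tokens")
print(f"Context rebuilt by teammates:        {context_rebuilt:,.0f} tokens")
print(f"Knowledge-cost overhead:             {reconstruction_fraction:.0%}")
```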

The 10% PR throughput improvement found in the DX longitudinal study is consistent with this framing. The individual sessions are faster. The coordination overhead clips most of the gain before it reaches team-level output.


The Metrics That Actually Measure What Matters

The Stack Overflow Developer Survey 2025, with 49,000 respondents, found something worth sitting with: only 17% of developers who use AI tools say those tools have improved team collaboration. Eighty-four percent are using AI tools. Seventeen percent see team-level benefit. The gap between those two numbers is not a product gap. It is a measurement gap: teams are tracking the wrong things.

Individual-level AI productivity metrics (tasks completed faster, time saved per session, lines of code generated) are measuring at the wrong level. They capture the input cost and miss the coordination cost. A framework oriented toward team-level value would measure different things:

Context transfer rate: how much of what one developer's AI session produced is available to the next developer who needs it, without reconstruction? This is currently near zero for most teams: all knowledge transfer goes through documents, which capture outputs but not reasoning.

Reconstruction frequency: how often do teammates start AI sessions on problems where someone on the team already has relevant context? This would capture duplicate work in the form that AI tools produce it, not identical code, but redundant reasoning.

Handoff quality: does the receiving developer have enough context to continue the work without a synchronous conversation? This measures what documents actually deliver, not what their authors intended.

None of these are tracked today because no tool surfaces them. But they are the metrics that separate teams where AI investment compounds over time from teams where it produces individual speed gains that don't add up at the team level.
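For illustration only, here is one shape this tracking could take if handoff-level data existed. No tool exposes these fields today; every name below is hypothetical, not an existing API:

```python
# Hypothetical definitions of the three team-level metrics described above,
# computed over per-handoff records that no current tool actually collects.
from dataclasses import dataclass

@dataclass
class Handoff:
    context_tokens_reused: int       # reasoning context the receiver could pick up directly
    context_tokens_rebuilt: int      # reasoning context the receiver had to recreate
    needed_sync_conversation: bool   # did the receiver need a live conversation to proceed?
    prior_context_existed: bool      # did a teammate already hold relevant context?

def context_transfer_rate(handoffs: list[Handoff]) -> float:
    """Share of needed context that arrived without reconstruction."""
    total = sum(h.context_tokens_reused + h.context_tokens_rebuilt for h in handoffs)
    return sum(h.context_tokens_reused for h in handoffs) / total if total else 0.0

def reconstruction_frequency(handoffs: list[Handoff]) -> float:
    """Share of handoffs where relevant context already existed on the team."""
    return sum(h.prior_context_existed for h in handoffs) / len(handoffs) if handoffs else 0.0

def handoff_quality(handoffs: list[Handoff]) -> float:
    """Share of handoffs completed without a synchronous conversation."""
    return sum(not h.needed_sync_conversation for h in handoffs) / len(handoffs) if handoffs else 0.0
```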


Why This Matters Now

AI made each step smarter. Not the connection between them.

The DX study's most pointed observation was not about speed: "Having at least one other teammate who is bought in to using AI tools is essential, being an isolated solo-adopter does not allow you to materialize the gains in a meaningful way."

The implication runs deeper than it appears. It's not just that you need teammates using AI tools. It's that the value of AI tools is partially social: it accrues when people can build on each other's AI work, not just when they do their own work faster. That kind of value requires infrastructure that doesn't currently exist.

The teams getting the best results from AI aren't the ones where every individual is using it most. They're the ones where AI work is designed to feed forward, where sessions produce artifacts that reduce the cost of the next person's session. Right now that design is manual, ad hoc, and dependent on individual discipline. The teams that figure out how to make it systematic will be the ones where the 10% throughput improvement turns into something closer to the 2–3x that gets put in the pitch decks.

That gap, between what AI tools deliver for individuals and what they could deliver for teams, is where the next generation of work infrastructure gets built.

AI Productivity · Engineering · Team AI
