The copilot paradox: when coding faster makes your codebase worse

AI coding assistants promise dramatic productivity gains, but independent research tells a contradictory story. When controlled studies, delivery metrics, and 211 million lines of analysed code all point in the same uncomfortable direction, the industry narrative deserves scrutiny.

In early 2025, researchers at METR asked experienced open-source developers to predict how much AI tools would speed up their work. The developers—averaging five years of experience on their specific projects—estimated a 24% time savings. After careful measurement across hundreds of real tasks, the researchers found the opposite: developers using AI tools were 19% slower1. Not marginally. Not within the margin of error. Meaningfully, measurably slower—whilst believing they were faster.

This single finding encapsulates a problem the software industry has been reluctant to confront. AI coding assistants have been adopted faster than almost any development tool in history. Sixty-three percent of professional developers now use them2. GitHub claims Copilot makes developers 55% faster3. The marketing is relentless, the adoption is massive, and the measured reality is far more complicated than the headline numbers suggest.

The complication is not that AI tools are useless. It is that the metrics most commonly used to evaluate them—speed of code generation, lines produced, tasks completed in isolation—miss the dimensions that matter most for long-term codebase health. When independent researchers measure what happens to codebases after widespread AI adoption, the pattern that emerges is not one of accelerating productivity. It is one of accelerating decay.

The productivity claim

The gap between marketed and measured productivity gains is the first indication that something more nuanced is happening.

GitHub's internal research—published and promoted extensively—found that developers using Copilot completed coding tasks 55% faster3. The study measured task completion in controlled environments, using specific coding exercises. It was well-designed for what it measured. But what it measured was narrow: isolated task completion speed, not end-to-end software delivery.

Three controlled field experiments conducted by Cui and colleagues across Microsoft, Accenture, and a Fortune 100 electronics company found a more modest result: a 26% increase in completed tasks when developers used AI assistance4. The studies involved experienced developers working on real codebases, not isolated exercises. The productivity gain was real but less than half the marketed figure. Separate Microsoft research found that users of AI assistants require approximately eleven weeks before reporting consistent improvements in productivity and work quality5—the initial period involves learning to work with the tool, verifying its suggestions, and developing the judgement to know when to accept and when to reject.

The most sobering data comes from DORA's 2024 Accelerate State of DevOps report, the largest annual study of software delivery performance, surveying over 39,000 professionals across thousands of organisations6. Their finding: a 25% increase in AI adoption was associated with a 1.5% decrease in delivery throughput and a 7.2% reduction in delivery stability. The more heavily teams leaned on AI, the less reliably they delivered.

These three data points—55% faster in lab conditions, 26% faster in controlled field studies, and measurably worse at the organisational level—are not contradictory if you understand what each is measuring. Individual coding speed can increase whilst overall delivery degrades, because software delivery is not primarily a code generation problem. It is a coordination problem, a quality problem, a maintenance problem. Generating code faster solves the part that was already easiest, whilst potentially making the harder parts worse.

The instinct is to dismiss the DORA finding as a transitional cost—teams are still learning, processes haven't adapted. But the DORA data covers a broad population well past the initial adoption curve. The survey captures established patterns, not early struggles. Something structural is happening, and the GitClear data helps explain what.

The quality erosion

In 2025, GitClear published their analysis of 211 million lines of changed code across thousands of repositories, tracking quality indicators from 2021 through 2024—a period that coincides precisely with the rise of AI coding assistants7. The findings describe a codebase ecosystem that is growing faster but ageing worse across every measured dimension.

Code duplication—copy-pasted or near-identical code blocks—rose from 8.3% of all changes in 2021 to 12.3% in 2024. Duplicated code blocks specifically grew eightfold over the same period. This is not a minor statistical artefact. It represents a fundamental shift in how code is being produced. A 2019 study in the Journal of Systems and Software found that cloned code is a significant vector for bug propagation, with bugs frequently spreading through co-changed clone pairs8. Each duplicate is a future inconsistency waiting to manifest. When you fix a bug in one copy, the other copies remain broken. When you update a pattern, the duplicates remain outdated.

The mechanism is straightforward. When a developer familiar with a codebase needs to implement a pattern, they know where existing implementations live and they reuse them. When an AI generates code, it produces plausible implementations from its training data without knowledge of what already exists in the specific project. The result is multiple functions that all accomplish the same thing slightly differently, utility code scattered across modules instead of consolidated in shared libraries, error-handling patterns that contradict each other within the same codebase.
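
A minimal, hypothetical sketch of the shape this takes (the module and function names are invented for illustration, not drawn from any of the cited studies):

```python
# utils/strings.py -- the helper that already exists somewhere in the project
def normalise_email(address: str) -> str:
    return address.strip().lower()


# orders/checkout.py -- a plausible AI-generated near-duplicate, produced
# without knowledge of utils/strings.py. It behaves the same today, but the
# two copies drift apart the first time one gains validation or a bug fix
# and the other does not.
def clean_email(raw_email: str) -> str:
    return raw_email.lower().strip()
```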

Churn rate—code that gets revised or reverted within two weeks of being committed—nearly doubled, climbing from 3.1% to 5.7%. Code that needs revision that quickly was either wrong, incomplete, or insufficiently understood by the person who committed it. A developer who writes code themselves builds a mental model of what it does and why. A developer who accepts AI-generated code has a weaker model—they reviewed it, perhaps tested the happy path, but they did not construct the understanding that comes from writing each line with intent. When that code breaks, they are debugging someone else's logic with less context than if they had written it themselves. The revision cycle is longer and more expensive.

Research on context switching confirms why this matters at scale. Studies show developers lose approximately 23 minutes of productive time per context switch9, and debugging unfamiliar code forces exactly this kind of expensive re-orientation. When a significant portion of your codebase was written by an intelligence that understood syntax but not your system's history, every debugging session becomes a context switch into foreign territory.

The refactoring collapse

The drop in refactoring activity—from 25% of all changed lines to less than 10%—is the most consequential finding in the GitClear data and the one that receives the least attention7.

Refactoring is the practice of improving code structure without changing its behaviour. Renaming a variable to be clearer. Extracting a function that has grown too complex. Consolidating duplicate logic. Removing dead code. These activities produce no new features and no visible progress on any dashboard or velocity chart. They are pure maintenance—the software equivalent of structural repairs on a building. Nobody celebrates them, nobody measures them, and without them, codebases decay.
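
A small sketch of the "extract a function" case, using an invented domain model so it stands alone; the point is that behaviour does not change, the rule simply gains a single, named home:

```python
from dataclasses import dataclass

@dataclass
class User:
    plan: str
    suspended: bool
    unpaid_invoices: int

# Before the refactor, call sites repeat the raw condition inline:
#   if user.plan == "pro" and not user.suspended and user.unpaid_invoices == 0: ...
# After extraction, behaviour is identical and nothing new ships, but the
# business rule now lives in exactly one place with a name reviewers can trust.
def is_in_good_standing(user: User) -> bool:
    return user.plan == "pro" and not user.suspended and user.unpaid_invoices == 0
```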

Research consistently shows that developers spend approximately 33% of their working time dealing with technical debt10. This figure has been stable across multiple studies and organisations. When refactoring collapses to less than 10% of changes, technical debt accumulates unchecked. Features take longer to build because the code they depend on is harder to work with. Bugs multiply because duplicated logic creates inconsistencies. New developers struggle to understand a codebase that has grown without structural discipline. Eventually the accumulated debt reaches the tipping point described in the architecture erosion literature, where incremental fixes no longer work and teams face the expensive, risky prospect of a rewrite11.

AI tools accelerate this collapse through a subtle economic mechanism. When generating new code is fast and cheap, the incentive to refactor existing code diminishes. Why spend thirty minutes restructuring an existing module when you can generate a new implementation in two minutes? The economic logic of refactoring depends on the cost of writing new code being high enough to justify maintaining what exists. AI tools undermine that logic fundamentally. They make creation cheap whilst leaving maintenance costs unchanged—or, given the increased duplication and churn, making maintenance more expensive.

The irony is that AI tools could theoretically help with refactoring. They can identify patterns, suggest consolidations, and automate repetitive structural changes. But that is not how the industry is using them. They are being used to generate new code faster, not to maintain existing code better. The tools optimise for what developers ask them to do, and developers—under the same delivery pressure that has always existed—ask them to build new things. The refactoring that nobody was celebrating before AI became the refactoring that nobody is doing after AI.

This pattern has a historical parallel. When the cost of disk storage dropped dramatically in the 1990s, organisations stopped maintaining careful data architectures. Storage was cheap, so they stored everything, everywhere, in whatever format was convenient. Two decades later, the data management crisis is one of the most expensive problems in enterprise technology. Cheap production costs do not eliminate the need for curation. They increase it, because the volume that cheap production enables creates complexity that eventually overwhelms the organisation's ability to manage it.

The security dimension

The quality problem extends beyond maintainability into security, and the data here is particularly stark.

GitGuardian's 2025 State of Secrets Sprawl report found that repositories using AI coding assistants experience 40% more secret leaks than those that do not12. Hardcoded API keys, database credentials, and authentication tokens appear more frequently in AI-assisted code. The mechanism is similar to the duplication problem: AI tools generate code that follows common patterns, and those patterns often include placeholder credentials, example API keys, and configuration values. A developer writing code from scratch is more likely to recognise these as placeholders that need replacement. A developer reviewing AI-generated code may not notice the embedded credential amongst hundreds of lines of plausible-looking output.
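
A hypothetical before-and-after of that failure mode (the connection string and variable names are invented, not taken from any assistant's actual output):

```python
import os

# What AI-generated boilerplate often looks like: a working example value
# that quietly becomes a leaked credential the moment it is committed.
# DB_URL = "postgresql://admin:SuperSecret123@prod-db.internal:5432/app"

# The pattern a reviewer should insist on: the secret is injected at deploy
# time through the environment and never enters version control.
DB_URL = os.environ["DATABASE_URL"]
```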

GitHub detected 39 million leaked secrets across their platform in 202413. This is not a theoretical risk. These are real credentials providing real access to real production systems. Seventy percent of secrets leaked in 2022 were still active in 2024—organisations detected the leaks and then simply did not rotate the credentials12. Private repositories were nine times more likely to contain secrets than public ones, demolishing the assumption that private means secure.

Analysis of Fortune 50 companies found that AI-assisted developers produced three to four times more code but generated ten times more security issues14. The multiplication factor deserves emphasis: even taking the upper bound, code volume increased fourfold whilst security problems increased tenfold, so the rate of security issues per line roughly multiplied by 2.5. At scale, across thousands of developers, this means AI tools are generating security vulnerabilities faster than security teams can identify and remediate them.

The supply chain dimension compounds the problem. Sonatype's 2024 State of the Software Supply Chain report logged 512,847 malicious packages in a single year—a 156% year-over-year surge15. When AI tools suggest dependencies or generate code that introduces new packages, they do so without awareness of whether those packages are maintained, secure, or even legitimate. The AI cannot distinguish between a well-maintained library and an abandoned package with known vulnerabilities. The developer reviewing the suggestion often lacks the context to make that distinction either, particularly when the suggestion arrives with the fluency and confidence that AI output characteristically projects.

The experienced developer paradox

The METR study deserves closer examination because its findings challenge the most basic assumption about AI tool adoption: that faster code generation translates to faster work1.

The researchers studied experienced open-source developers—people with an average of five years working on the specific projects they were tasked with modifying. These were not novices struggling with unfamiliar codebases. They were experts working in their domain of greatest competence. The AI tools provided were current and capable: Cursor Pro with Claude 3.5 and 3.7 Sonnet.

Before beginning their tasks, developers predicted AI would save them 24% of their time. After completing the study—having used AI tools extensively—they estimated the tools had saved them approximately 20%. Their subjective experience was consistent and positive. They felt faster.

The objective measurement showed the opposite. Task completion time increased by 19% when using AI tools. The perception gap—a 39 percentage point discrepancy between felt speed and measured speed—is extraordinary. It suggests that the subjective experience of AI assistance is fundamentally misleading, at least for experienced developers working in familiar codebases.

The likely mechanism involves the hidden costs of AI interaction. Time spent prompting, evaluating suggestions, correcting AI misunderstandings, and integrating generated code into existing architecture is real time that does not feel like wasted time. It feels like productive interaction. The code appearing on screen creates a sensation of velocity. But the time saved generating code is consumed—and then some—by the time spent on these integration activities.

Even Andrej Karpathy, who coined the term 'vibe coding' for AI-assisted development, set AI tools aside for his latest project. 'I tried to use Claude/Codex agents a few times but they just didn't work well enough at all,' he posted16. The person who named the phenomenon stepped away from it when quality mattered. That is a documented, public statement from one of the field's most prominent advocates, not a cherry-picked anecdote.

The METR finding applies specifically to experienced developers on familiar projects. For unfamiliar codebases, boilerplate generation, or exploratory prototyping, AI tools may genuinely accelerate work. The problem is that the developers most likely to adopt AI tools enthusiastically—experienced professionals at established companies—may be the ones least likely to benefit. They already know where existing implementations live. They already understand the architectural patterns. For them, AI assistance adds a translation layer between their intent and the code, and that layer has a cost.

The compound effect

Each of these findings—modest productivity gains overstated by marketing, rising code duplication, collapsing refactoring, increased security vulnerabilities, experienced developers slowing down—is concerning in isolation. Together, they describe a systemic shift in how codebases evolve under AI-assisted development.

The pattern works like compound interest in reverse. AI tools generate more code faster. Some of that code duplicates existing functionality because the tool lacks project context. The increased volume makes thorough review harder—research shows that code review effectiveness drops sharply beyond 400 lines of code, with reviewers scanning faster than 450 lines per hour missing defects in 87% of cases17. More code per review means less effective review per line. Quality issues slip through that would previously have been caught.

Refactoring declines because generating new code is easier than restructuring old code. Security vulnerabilities increase because review cannot keep pace with generation. Technical debt accumulates faster than it is paid down. The codebase grows in volume but degrades in structural quality. Each week, the maintenance burden increases slightly. Each month, new features take slightly longer because the code they depend on is slightly worse. The degradation is gradual, which makes it easy to ignore. Until it isn't gradual anymore.

The industry joke that 'two engineers can now create the tech debt of fifty' captures something real. We have dramatically increased our capacity to generate code without proportionally increasing our capacity to maintain it. The creation-to-maintenance ratio has shifted in a direction that every experienced developer recognises as dangerous, and that the technical debt literature has been warning about for decades10.

This is not a temporary growing pain that will resolve as tools improve. It is a structural change in the economics of software development. When code is cheap to produce and expensive to maintain, the rational short-term decision is always to produce more rather than maintain what exists. Each individual decision makes sense. The cumulative effect is a codebase that grows without structural discipline—rapidly, without coherence, consuming resources that could have been spent on sustainable growth.

Research on technical debt confirms the trajectory. DORA's longitudinal data demonstrates that speed without quality is self-defeating—low-performing teams suffer from both slower delivery and higher change failure rates6. The failures require fixes. The fixes require context switches. The context switches—each costing approximately 23 minutes of productive focus9—reduce the capacity for new work. The reduced capacity creates pressure to use AI tools to generate more code faster. The cycle accelerates.

The organisational incentive problem

The most difficult aspect of the copilot paradox is that it is structurally invisible to the organisations experiencing it.

Engineering leadership evaluates AI tool adoption through the metrics that are easiest to measure: pull requests merged, code output per developer, time to first commit on new features. By these metrics, AI tools look like an unqualified success. More code is being produced. More pull requests are being merged. Developers report feeling more productive. The dashboards confirm the narrative.

The metrics that would reveal the problem—code duplication rates, churn within two weeks, refactoring frequency as a percentage of changes, defect density trends, mean time to resolve production incidents—are rarely tracked at the organisational level. They require instrumentation that most teams have not built and analysis that most leadership teams have not requested. The absence of measurement is not an accident. It reflects the same short-term orientation that has always characterised software management: ship features, count output, defer quality.

Meyer and colleagues documented this perceptual gap in their research on developer productivity18. Developers' perceptions of their own productivity correlate poorly with objective measures of delivery effectiveness. The METR finding—developers feeling 20% faster whilst being 19% slower—is an extreme example of a well-established pattern. Subjective productivity assessment is unreliable even without AI tools. With AI tools generating a constant stream of visible output, the subjective experience becomes even more misleading.

The DORA framework offers the instrumentation needed to see the full picture. Their four key metrics—deployment frequency, lead time for changes, change failure rate, and mean time to recovery—capture delivery effectiveness rather than output volume6. Organisations that track these metrics will see the copilot paradox manifesting in their data: change failure rates rising, mean time to recovery increasing, lead times extending even as raw code output climbs. The ones that only track output will see nothing wrong until the accumulated quality degradation becomes a crisis.
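
As a rough sketch of what that instrumentation can look like, the snippet below computes the four metrics from a hand-rolled deployment record. The `Deployment` shape is an assumption for illustration; a real pipeline would feed this from CI/CD and incident tooling rather than a Python list.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median

@dataclass
class Deployment:
    committed_at: datetime               # when the change was committed
    deployed_at: datetime                # when it reached production
    failed: bool                         # did it degrade service and need remediation?
    restored_at: datetime | None = None  # when service recovered, if it failed

def dora_snapshot(deploys: list[Deployment], window_days: int = 30) -> dict:
    """Approximate the four DORA key metrics over a reporting window."""
    if not deploys:
        return {}
    failures = [d for d in deploys if d.failed]
    recoveries = [d for d in failures if d.restored_at is not None]
    return {
        "deployment_frequency_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(
            (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deploys
        ),
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_recovery_hours": median(
            (d.restored_at - d.deployed_at).total_seconds() / 3600 for d in recoveries
        ) if recoveries else None,
    }
```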

The path forward

The solution is not to abandon AI coding tools. They represent a genuine capability expansion with real applications. The solution is to change how we use them and what we measure, grounding adoption in evidence rather than marketing.

First, measure what matters. If your only metrics are code output and developer self-reports, AI tools will look like an unqualified success. If you measure code duplication rates, churn within two weeks, refactoring frequency, defect density, and the DORA delivery metrics, you will see the full picture. The organisations that thrive with AI assistance will be the ones that track quality alongside quantity, and that take action when quality metrics degrade.
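
For teams starting from nothing, even a crude proxy beats no signal. The sketch below flags files that were modified again within two weeks of a previous change using plain `git log`; it is file-level rather than the line-level churn GitClear measures, and the repository path and window are assumptions to adjust.

```python
import subprocess
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(days=14)  # the two-week window used by the churn metric discussed above

def file_change_dates(repo="."):
    """Map each file path to the commit dates that touched it."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--name-only", "--pretty=format:@%cI"],
        capture_output=True, text=True, check=True,
    ).stdout
    dates = defaultdict(list)
    current = None
    for line in out.splitlines():
        if line.startswith("@"):            # commit header carrying the commit date
            current = datetime.fromisoformat(line[1:])
        elif line and current is not None:  # everything else is a file path
            dates[line].append(current)
    return dates

def churned_files(repo="."):
    """Files modified again within WINDOW of an earlier change (a coarse proxy)."""
    churned = set()
    for path, changes in file_change_dates(repo).items():
        changes.sort()
        if any(later - earlier <= WINDOW for earlier, later in zip(changes, changes[1:])):
            churned.add(path)
    return churned

if __name__ == "__main__":
    hits = churned_files()
    print(f"{len(hits)} files re-modified within 14 days of a prior change")
```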

Second, invest in refactoring alongside generation. Treat AI-generated code as a first draft that requires editorial work, not a finished product. For every sprint spent generating new features with AI assistance, allocate explicit time to consolidate, deduplicate, and restructure what was generated. The refactoring collapse revealed by GitClear's data is not inevitable—it is a choice made by teams that treat generation as the end of the process rather than the beginning.

Third, strengthen code review rather than weaken it. The increased volume of AI-generated code makes review more important, not less. Reviewers need to look specifically for the failure modes that AI introduces: duplication of existing functionality, embedded credentials, patterns that contradict codebase conventions, and edge cases that require domain knowledge the model lacks. The SmartBear/Cisco research on review effectiveness suggests keeping reviews under 400 lines of code17. When AI tools make it easy to generate 1,500-line pull requests, the review process must push back.
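
One lightweight way to make that push-back automatic is a CI step that refuses to stay quiet about oversized diffs. A minimal sketch, assuming the pull request targets `origin/main` and using the 400-line figure from the review research as a soft default:

```python
import subprocess
import sys

THRESHOLD = 400       # changed lines per review, per the guidance cited above
BASE = "origin/main"  # assumption: the branch the pull request targets

def changed_lines(base: str = BASE) -> int:
    """Count insertions plus deletions relative to the merge base."""
    out = subprocess.run(
        ["git", "diff", "--numstat", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    total = 0
    for line in out.splitlines():
        added, deleted, _path = line.split("\t", 2)
        if added != "-":  # binary files report "-" for both counts
            total += int(added) + int(deleted)
    return total

if __name__ == "__main__":
    size = changed_lines()
    if size > THRESHOLD:
        print(f"Diff touches {size} changed lines; consider splitting it (soft limit {THRESHOLD}).")
        sys.exit(1)
    print(f"Diff size OK: {size} changed lines.")
```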

Fourth, differentiate contexts. AI tools genuinely help with boilerplate generation, unfamiliar codebases, and exploratory prototyping. They may actively hinder work in mature codebases where deep context matters more than generation speed. The blanket adoption of AI tools across all contexts misses this nuance. Let teams decide where AI assistance adds value and where it subtracts it, and trust the METR data that experienced developers on familiar projects may be better off without it.

Fifth, be honest about productivity measurement. Self-reported gains from AI tools are unreliable. Measure actual delivery outcomes—cycle time from commit to production, defect escape rates, time spent on unplanned work—not subjective assessments of how fast things feel. The 39 percentage point gap between perceived and measured productivity in the METR study should give every engineering leader pause before accepting developer satisfaction surveys as evidence that AI tools are working.

The copilot paradox is not that AI tools are bad. It is that they optimise for the wrong metric. They make code generation faster. But software development is not primarily a code generation problem. It is a code maintenance problem, a code understanding problem, a system design problem. Making generation faster without making maintenance easier is like making a car go faster without improving its brakes. The increased speed feels exhilarating right up until the moment it becomes catastrophic.

Two hundred and eleven million lines of code are telling us something. The question is whether we are willing to measure what they reveal before our codebases pay the price.


Footnotes

  1. METR. (2025). "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity." METR. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/

  2. Stack Overflow. (2024). "Developer Survey 2024: AI Tools in Development." Stack Overflow Annual Survey. https://survey.stackoverflow.co/2024/

  3. GitHub. (2024). "Research: Quantifying GitHub Copilot's impact on developer productivity and happiness." GitHub Blog. https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/

  4. Cui, Z. K., Demirer, M., Jaffe, S., Musolff, L., Peng, S., & Salz, T. (2024). "The Effects of Generative AI on High Skilled Work: Evidence from Three Field Experiments with Software Developers." SSRN Working Paper No. 4945566. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4945566

  5. Microsoft WorkLab. (2024). "AI Data Drop: The 11 by 11 Tipping Point." Microsoft. https://www.microsoft.com/en-us/worklab/ai-data-drop-the-11-by-11-tipping-point

  6. DORA. (2024). "Accelerate State of DevOps Report 2024." Google Cloud. https://dora.dev/research/2024/dora-report/

  7. GitClear. (2025). "AI Copilot Code Quality: 2025 Data Suggests Downward Pressure on Code Quality." GitClear. https://www.gitclear.com/ai_assistant_code_quality_2025_research

  8. Mondal, M., Roy, C. K., Roy, B., & Schneider, K. A. (2019). "An empirical study on bug propagation through code cloning." Journal of Systems and Software, 158, 110407. https://doi.org/10.1016/j.jss.2019.110407

  9. Mark, G., Gudith, D., & Klocke, U. (2008). "The cost of interrupted work: more speed and stress." Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

  10. Stripe. (2018). "The Developer Coefficient." Stripe. https://stripe.com/files/reports/the-developer-coefficient.pdf

  11. De Silva, L., & Balasubramaniam, D. (2012). "Controlling software architecture erosion: A survey." Journal of Systems and Software, 85(1), 132-151.

  12. GitGuardian. (2025). "State of Secrets Sprawl 2025." GitGuardian. https://www.gitguardian.com/state-of-secrets-sprawl-report-2025

  13. Toulas, B. (2025). "GitHub expands security tools after 39 million secrets leaked in 2024." BleepingComputer. https://www.bleepingcomputer.com/news/security/github-expands-security-tools-after-39-million-secrets-leaked-in-2024/

  14. The Register. (2025). "AI code assistants improve production of security problems." The Register.

  15. Sonatype. (2024). "2024 State of the Software Supply Chain." 10th Annual Report. https://www.sonatype.com/state-of-the-software-supply-chain/Introduction

  16. Karpathy, A. (2025). "I tried to use claude/codex agents a few times but they just didn't work well enough at all." X (formerly Twitter).

  17. SmartBear/Cisco. (2006). "Best Practices for Peer Code Review." SmartBear Software. https://static1.smartbear.co/support/media/resources/cc/book/code-review-cisco-case-study.pdf

  18. Meyer, A. N., Fritz, T., Murphy, G. C., & Zimmermann, T. (2014). "Software developers' perceptions of productivity." Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering.

TL;DR

Controlled studies measured a 26% productivity increase—less than half GitHub's marketed 55%. DORA's 2024 survey of 39,000 professionals found AI adoption correlates with 7.2% worse delivery stability. GitClear's code analysis revealed duplication rising from 8.3% to 12.3% whilst refactoring collapsed from 25% to under 10%. Experienced developers were objectively 19% slower with AI tools despite perceiving themselves faster. The pattern suggests AI accelerates code production whilst degrading the maintenance practices that determine long-term codebase health.
