The architecture autopsy: when 'we'll refactor later' becomes 'we need a complete rewrite'
Early architectural decisions compound over time, creating irreversible constraints that transform minor technical debt into catastrophic system failures. Understanding how seemingly innocent choices cascade into complete rewrites reveals why future-proofing architecture requires balancing immediate needs with long-term reversibility.
The phrase "we'll refactor later" appears in software teams with remarkable consistency. It emerges during deadline pressure, in sprint planning meetings where features compete with architecture work, in conversations where the proper solution would take weeks but the expedient one takes days. The promise seems reasonable at the time: ship the working solution now, fix it properly when there's breathing room.
But "later" rarely comes. And when teams finally confront the accumulated architectural debt, they discover the cost has multiplied exponentially. What would have taken weeks in year one requires months in year three and often demands a complete rewrite by year five.
Research on architectural erosion confirms this pattern isn't accidental 1. Teams under pressure take shortcuts that violate architectural principles. One shortcut signals that disorder is acceptable. More follow. Violations accumulate. The gap between what the architecture was supposed to be and what it actually is grows until it becomes what researchers carefully call "architectural mismatch." Eventually the mismatch is so severe that incremental fixes don't work anymore.
The broken windows theory applies directly to software architecture 2. One poor architectural choice creates permission for others. When developers see that authentication is already fragmented across multiple services, they don't feel bad about fragmenting it further. The standard has already been set. They're not making it worse—they're just following the existing pattern. This isn't malicious or lazy. It's rational behavior in the context established by earlier decisions.
Cost escalates steeply the later a problem is found. By one frequently cited estimate, a bug that costs 1x to fix during design costs about 6x during implementation and between 15x and 100x after release 3. For fundamental architectural changes, the multipliers can be larger still, by as much as an order of magnitude. An authentication system that would have taken two weeks to build properly in year one might require fourteen months of rewrite work by year five. The arithmetic is unforgiving.
The Pattern
Netscape's browser rewrite in the late 1990s demonstrates the catastrophic cost of accumulated architectural debt. They'd dominated the browser market with over 80% market share in 1996, but their codebase had accumulated years of patches and workarounds 4. The architecture was fragmented. They decided to rewrite it from scratch for version 6.0, believing they could build something better while maintaining market position.
The rewrite took three years. During those three years, Microsoft's Internet Explorer captured the market. By the time Netscape 6.0 shipped in 2000, their market share had collapsed to under 10%. They never recovered. Joel Spolsky later called it "the single worst strategic mistake that any software company can make" 4. The team had fallen victim to second-system syndrome—trying to fix everything wrong with the old system, adding features they'd always wanted, losing focus on shipping 5.
IBM's replacement of Queensland Health's payroll system in 2010 cost the Queensland government an estimated A$1.2 billion 6. The project was meant to replace 22-year-old software with a modern system. Instead, it created chaos. Thousands of employees were underpaid, overpaid, or not paid at all. The old system contained decades of business logic, edge cases, and regulatory requirements that weren't properly documented. The rewrite team rediscovered this knowledge one production incident at a time, through angry workers who couldn't pay their bills and unions threatening legal action.
TSB Bank's platform migration in 2018 affected 1.9 million customers and eventually cost over £330 million when their switch to a new banking platform failed 7. Customers couldn't access accounts for weeks. A parliamentary inquiry found that the bank had ignored repeated warnings about migration risks, underestimated the complexity of switching systems, and inadequately tested the new platform. The CEO resigned. The bank's reputation suffered damage that persisted for years.
These aren't isolated incidents. They represent a pattern that repeats because the conditions creating it are structural. Companies optimize for short-term delivery. Roadmaps are aggressive. Deadlines are tight. Architecture work is invisible until its absence causes disasters. Technical debt accumulates faster than it gets paid down. Eventually the debt compounds to the point where incremental fixes don't work anymore and rewrites seem like the only option.
But rewrites are expensive, risky, and often fail. Teams underestimate the complexity embedded in the old system. They fall victim to second-system syndrome and try to solve every historical problem. Competitors ship new features while the rewrite team rebuilds existing functionality. Migration turns out to be far more complex than anyone planned. The question isn't whether to rewrite. The question is how to avoid creating the conditions that make rewrites necessary in the first place.
Prevention
Jeff Bezos categorized decisions into one-way doors and two-way doors 8. One-way doors are consequential and nearly irreversible—database choices, authentication architecture, API contracts, core domain models. These decisions need to be made carefully because getting them wrong creates years of pain. Two-way doors can be changed relatively easily—UI frameworks, logging implementations, caching strategies. These decisions can be made quickly.
The problem is that teams often invert this. They agonize over which UI framework to use and rush database decisions. They spend hours debating code style but make authentication decisions in brief meetings under deadline pressure. Once an architectural choice goes into production and other systems depend on it, changing course becomes far more expensive. By year three, the architecture is effectively permanent, not because it's good, but because it's load-bearing. Everything depends on it.
The reversibility principle suggests making decisions that preserve options rather than foreclose them 9. Delay decisions as long as possible—not through procrastination, but through intentional design that doesn't require premature commitment. When decisions must be made early, architect for reversibility. Abstract away implementation details. Design interfaces, not concrete implementations. Build in the ability to swap things out later. A team shipping an authentication feature could build a thin interface that all services use, even if the initial implementation is just a simple wrapper. That preserves the option to change the implementation later without touching every service.
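The thin-interface idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: `Authenticator`, `Identity`, `StaticTokenAuthenticator`, and `handle_request` are hypothetical names introduced here, and the in-memory token table stands in for whatever the first implementation actually wraps.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass(frozen=True)
class Identity:
    """Minimal result type that every service can depend on (hypothetical)."""
    user_id: str


class Authenticator(Protocol):
    """The only authentication surface services are meant to import."""

    def authenticate(self, token: str) -> Optional[Identity]:
        ...


class StaticTokenAuthenticator:
    """Day-one implementation: a simple wrapper around a token table."""

    def __init__(self, tokens: Dict[str, str]) -> None:
        self._tokens = tokens  # maps token -> user_id

    def authenticate(self, token: str) -> Optional[Identity]:
        user_id = self._tokens.get(token)
        return Identity(user_id) if user_id is not None else None


def handle_request(auth: Authenticator, token: str) -> str:
    """Example service code: it knows the interface, not the implementation."""
    identity = auth.authenticate(token)
    if identity is None:
        return "401 Unauthorized"
    return f"200 OK for {identity.user_id}"
```

Because `handle_request` only sees `Authenticator`, a later move to a different mechanism would mean writing one new class that satisfies the same protocol, rather than editing every caller.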
Different architectural decisions have different planning horizons. UI frameworks and deployment tooling need one-year thinking—switching costs are relatively low. Database technology and authentication architecture need three-year thinking—switching costs are moderate. Data consistency models and API versioning strategies need five-year thinking—switching costs are extremely high. Core architectural principles and system boundaries need ten-year thinking—these become organizational bedrock.
The challenge is that planning horizons longer than the current quarter feel abstract. When the deadline is Thursday and the VP wants to know why the feature isn't shipping, talking about five-year architecture consequences feels like academic hand-waving. But the consequences are real. They just don't manifest until later, when later has become now and the cost to fix things has multiplied by a hundred.
Successful teams allocate explicit capacity to architectural runway 10—the technical infrastructure needed to implement features without major redesign. A common pattern is 70% capacity to features, 20% to architectural improvements, 10% to exploration. Architectural quality gets included in the definition of done. Features aren't complete until they're implemented in ways that extend the runway rather than consume it.
This requires executive sponsorship. Without it, teams will always sacrifice architecture for immediate delivery because that's what the incentives reward. Product managers need to understand that architectural debt compounds. A day of refactoring deferred in year one becomes a month of refactoring in year three and a year-long rewrite in year five. The interest rate on this debt is savage.
It also requires making architectural health visible. Track coupling metrics. Measure change amplification: how many files need to change for a typical feature. Monitor defect density trends in unchanged code. Watch developer velocity on similar work. When these metrics degrade, that's entropy accumulating. Address it before it becomes irreversible. Without visible metrics, nobody acts, because architecture degradation is gradual. Each quarter is only slightly worse than the last. This is the boiling frog problem. By the time the pain is acute enough that everyone agrees something must be done, the only remaining option is the expensive, risky, often-failing rewrite.
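One rough way to put a number on change amplification is to count the files touched per commit and watch the median over time. The sketch below assumes commits are an acceptable proxy for "typical features" (often they are not, so grouping by pull request or ticket may be more honest) and that the input is text shaped like the output of `git log --name-only --pretty=format:"commit %H"`. The function names are hypothetical.

```python
from statistics import median
from typing import List


def files_per_commit(git_log_text: str) -> List[int]:
    """Count changed files for each commit in the log text.

    Assumed input shape: a "commit <hash>" line followed by one changed
    path per line, with optional blank lines between commits.
    """
    counts: List[int] = []
    for raw_line in git_log_text.splitlines():
        line = raw_line.strip()
        if line.startswith("commit "):
            counts.append(0)      # start a new commit
        elif line and counts:
            counts[-1] += 1       # a changed path belonging to that commit
    return counts


def change_amplification(git_log_text: str) -> float:
    """Median number of files touched per commit (0.0 for an empty log)."""
    counts = files_per_commit(git_log_text)
    return float(median(counts)) if counts else 0.0


SAMPLE_LOG = """commit a1
auth/login.py

commit b2
auth/login.py
billing/invoice.py
orders/cart.py

commit c3
README.md
"""
```

Computed per quarter, a rising median suggests that ordinary changes are spreading across more of the codebase, which is the kind of gradual signal that otherwise goes unnoticed.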
When architectural problems have accumulated but rewrites seem too risky, the strangler pattern offers a middle path 11. Named after strangler fig vines that gradually overtake host trees, it involves building new implementations alongside old ones and gradually routing traffic to the new system. Old and new run in parallel during transition. As new implementations prove stable, old code progressively retires.
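The routing idea behind the strangler pattern can be sketched as a small facade. This is an illustration only: `StranglerFacade`, `legacy_system`, and the route strings are hypothetical, and a real migration would usually route at a proxy or gateway rather than inside one process.

```python
from typing import Callable, Dict

LegacyHandler = Callable[[str, str], str]  # (route, payload) -> response
NewHandler = Callable[[str], str]          # payload -> response


class StranglerFacade:
    """Sends each route to the new implementation once it has been migrated,
    and to the legacy system otherwise."""

    def __init__(self, legacy: LegacyHandler) -> None:
        self._legacy = legacy
        self._migrated: Dict[str, NewHandler] = {}

    def migrate(self, route: str, handler: NewHandler) -> None:
        """Start routing one slice of traffic to the new code."""
        self._migrated[route] = handler

    def rollback(self, route: str) -> None:
        """Send the route back to the legacy system if the new code misbehaves."""
        self._migrated.pop(route, None)

    def handle(self, route: str, payload: str) -> str:
        new_handler = self._migrated.get(route)
        if new_handler is not None:
            return new_handler(payload)
        return self._legacy(route, payload)


def legacy_system(route: str, payload: str) -> str:
    """Stand-in for the old monolith."""
    return f"legacy:{route}:{payload}"
```

Callers only ever talk to the facade, so routes can move to the new system one at a time, and `rollback` restores the old path for a single route without touching the others.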
Amazon's transition from monolithic architecture to microservices followed these principles over many years 12. Netflix evolved from a DVD-rental application to streaming microservices through incremental extraction 13. Both migrations took years but avoided the catastrophic risks of big-bang rewrites. The parallel running provided confidence. The incremental approach revealed edge cases gradually rather than all at once in a disastrous cutover. The ability to roll back when things went wrong prevented disasters.
But even successful strangler migrations consume substantial time and resources. Teams spend months or years rebuilding functionality they already have because they made expedient choices earlier. The opportunity cost is real. Features that could have been shipped aren't. Competitors gain ground. Engineers burn out working on unglamorous infrastructure projects that customers never see.
The Mathematics
The phrase "we'll refactor later" operates on the assumption that later costs about the same as now. It doesn't. One longitudinal study found that developers waste, on average, 23% of their working time because of technical debt 14. Every feature built on flawed architecture inherits its constraints and complexity. Teams caught in architectural dysfunction spend 60-80% of their time on reactive problem-solving rather than proactive development 15.
The accumulation compounds rather than growing linearly. As an illustration: in year one, a team might spend 5% of its time dealing with architectural complexity. By year three, it's 30%. By year five, 60%. The gap between debt accumulation and debt paydown grows until the only option is a reset.
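A toy model makes the gap between accumulation and paydown concrete. Every number in it is an assumption chosen for illustration, not a measurement: debt is counted in arbitrary effort units, each year's shortcuts add a fixed amount, outstanding debt grows by a fixed "interest" rate, and a fixed amount is paid down through scheduled refactoring.

```python
from typing import List


def simulate_debt(years: int, new_debt: float, interest: float,
                  paydown: float) -> List[float]:
    """Return outstanding debt at the end of each year (arbitrary units)."""
    debt = 0.0
    history: List[float] = []
    for _ in range(years):
        # Existing debt compounds, then this year's shortcuts are added.
        debt = debt * (1 + interest) + new_debt
        # Scheduled refactoring pays part of it back (never below zero).
        debt = max(0.0, debt - paydown)
        history.append(debt)
    return history


# Hypothetical parameters: 10 units of new debt per year, 50% yearly interest.
never_refactor = simulate_debt(years=5, new_debt=10, interest=0.5, paydown=0)
steady_paydown = simulate_debt(years=5, new_debt=10, interest=0.5, paydown=8)
```

With these made-up parameters, the team that never pays anything down ends year five with five times the outstanding debt of the team that pays down 8 units a year, even though both take on the same shortcuts. The exact ratio depends entirely on the assumed rates; the shape of the curve is the point.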
Software entropy increases unless work is specifically done to reduce it 16. Systems undergoing continuous change experience increasing complexity. This isn't an unbreakable law of nature; it's a tendency that takes energy to counteract. Without explicit investment in entropy reduction, systems degrade. The degradation is gradual, which makes it easy to ignore. Until suddenly it isn't gradual anymore.
Every rewrite represents an autopsy. A detailed examination of a dead system to understand what killed it. The cause of death is rarely a single catastrophic decision. It's the accumulation of thousands of small compromises, each one defensible in isolation, catastrophic in combination. The problem isn't individual decisions made under deadline pressure. The problem is the cultural and structural conditions that make those decisions inevitable. Aggressive roadmaps. Tight deadlines. Incentives that reward shipping over architecture. Lack of protected time for refactoring. Lack of executive understanding that architecture debt compounds. Lack of visibility into degradation until it becomes acute.
Those conditions create the same outcome repeatedly. Different companies, different technologies, different teams—same pattern. Expedient choices. Deferred refactoring. Gradual accumulation. Sudden crisis. Expensive, risky rewrite.
The teams that avoid this aren't the ones with perfect foresight or brilliant architects. They're the ones that treat architecture as continuous investment rather than upfront cost. They allocate capacity to runway extension and entropy reduction. They make "later" a scheduled commitment rather than an indefinite deferral. They understand that in architecture, there often are no second chances.
The decisions made today become the constraints lived with tomorrow, the technical debt serviced next year, the complete rewrite forced in five years. Choose wisely. The architecture's future depends on it.
Footnotes
1. De Silva, L., & Balasubramaniam, D. (2012). "Controlling software architecture erosion: A survey." Journal of Systems and Software, 85(1), 132-151.
2. Levén, W., Broman, H., Besker, T., & Torkar, R. (2024). "The broken windows theory applies to technical debt." Empirical Software Engineering, 29, Article 73. https://doi.org/10.1007/s10664-024-10456-6
3. Systems Sciences Institute, IBM. (2006). "The Economic Impacts of Inadequate Infrastructure for Software Testing." National Institute of Standards and Technology Report.
4. Spolsky, J. (2000). "Things You Should Never Do, Part I." Joel on Software. https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/
5. Seibel, P. (2009). "Duct tape context: A tale of two rewrites." A billion monkeys can't be wrong. https://gigamonkeys.wordpress.com/2009/09/28/a-tale-of-two-rewrites/
6. Morphis Technologies. "Rewrite Code from Scratch? Case Studies and Conundrums." Morphis Insights Blog. https://morphis-insights.com/rewrite-case-studies/
7. BBC News. (2019). "TSB boss Paul Pester quits over IT chaos." https://www.bbc.com/news/business-45778468
8. Papadimitriou, G. (2018). "Technical Decisions Series - Decision Reversibility and Lean Software Architecture." LinkedIn Article.
9. Malan, R. (2015). "Architecture Clues: Heuristics, Part II. Decisions and Change." LinkedIn Article.
10. Scaled Agile Framework. (2024). "Architectural Runway." SAFe Framework Documentation. https://scaledagileframework.com/architectural-runway/
11. Microsoft Learn. (2024). "Strangler Fig Pattern." Azure Architecture Center. https://learn.microsoft.com/en-us/azure/architecture/patterns/strangler-fig
12. Fowler, M. (2004). "StranglerFigApplication." Martin Fowler's Bliki. https://martinfowler.com/bliki/StranglerFigApplication.html
13. Microservices.io. (2024). "Pattern: Strangler application." Microservices Pattern Library. https://microservices.io/patterns/refactoring/strangler-application.html
14. Besker, T., Martini, A., & Bosch, J. (2018). "Technical debt cripples software developer productivity: a longitudinal study on developers' daily software development work." Proceedings of the 2018 International Conference on Technical Debt.
15. Repenning, N. P. (2001). "Understanding fire fighting in new product development." Journal of Product Innovation Management, 18(5), 285-300.
16. Lehman, M. M. (1980). "Programs, Life Cycles, and Laws of Software Evolution." Proceedings of the IEEE, 68(9), 1060-1076.
TL;DR
Software teams consistently defer architectural decisions with "we'll refactor later," creating irreversible technical debt that accumulates into system-wide rewrites. Research shows that early architectural decisions become far more expensive to reverse over time, costing up to 100x more to fix post-release than during design. The broken windows theory applies to software: one poor architectural choice leads to others, creating entropy that compounds until rewrites become the only option. Major failures like Netscape's rewrite disaster and the A$1.2 billion Queensland Health payroll system replacement demonstrate catastrophic costs. The strangler pattern and incremental refactoring offer alternatives, but require disciplined investment in architectural runway, the intentional technical foundation for future features. Teams that ignore early architectural planning spend 60-80% of their time fighting fires instead of building features, ultimately facing the choice between living with dysfunction or starting over.