The 3,000 Incident Postmortem: Why Caches Are Actually the Enemy

April 20, 2026 | 17:14 | Debug Log

This episode explores Marc Brooker's controversial claim that caching, often a default scaling solution, is a major cause of catastrophic "metastable" system failures. It delves into the importance of deep postmortem analysis, moving beyond superficial root causes to question observability, testing, and fundamental architectural assumptions. Listeners will learn how unquestioning reliance on caching can create systems prone to persistent, unrecoverable breakdowns.

Key Takeaways

Primary source: https://open.spotify.com/episode/1qX2GfpbzxzGpGvDZVINdO

* Insights from Marc Brooker's analysis of 3,000 AWS incident postmortems, detailed in "The 3,000 Incident Postmortem: Why Caches Are Actually the Enemy" (https://open.spotify.com/episode/1qX2GfpbzxzGpGvDZVINdO), challenge the conventional wisdom around caching and system resilience.
* Caches, often considered a scaling solution, frequently act as performance crutches that mask underlying capacity deficits and can trigger catastrophic "metastable failures."
* Deep postmortems are crucial for system resilience, requiring engineers to move beyond fixing immediate bugs to interrogate observability, testing, and fundamental architectural assumptions.
* The rise of generative AI is shifting the most valuable engineering skill from coding proficiency to a profound understanding of system context, failure modes, and architectural weaknesses.

Detailed Report

Marc Brooker, a VP and Distinguished Engineer at AWS, has analyzed somewhere between three and four thousand internal incident postmortems, revealing critical insights into how large-scale systems truly fail. His findings challenge decades of conventional wisdom in software engineering, particularly the unquestioning reliance on caching as a universal scaling solution.

## The Flawed Approach to Postmortems

Brooker argues that the tech industry often stops its root-cause analysis too early, halting at what he calls the "proximal cause," an immediate code bug or configuration error. This approach fixes symptoms without addressing underlying conditions.

He outlines a hierarchy for effective postmortems, pushing teams to dig deeper:

* Observability Gaps: Why couldn't the issue be seen coming? Were logs and metrics insufficient?
* Validation Failures: Why did standard testing and QA pipelines fail to catch the problem?
* Architectural Assumptions: The deepest layer involves questioning fundamental beliefs about how the system should behave, confronting the delta between whiteboard theory and production reality.

This deeper analysis moves beyond "what broke?" to "why did we *believe* it couldn't break this way?" The true value of a postmortem lies in exposing the gap between how an engineering team *thinks* their system works and how it *actually* behaves under duress.

## Caches: A Liability, Not a Panacea

Brooker's extensive dataset points to caching, a common architectural reflex, as a primary trigger for some of the most catastrophic, unrecoverable system failures. Caches, in his view, are often performance crutches that mask underlying capacity deficits, sweeping bad database performance under the rug.

### Metastable Failures and the Herd Effect

He introduces the concept of "metastable failures," borrowed from physics, where a distributed system enters a bad state that persists even after the initial trigger is removed. While individual engineers might rarely encounter a full metastable collapse, Brooker states they are an underlying cause in a majority of the biggest system postmortems across the industry.

The mechanism of collapse, dubbed "the herd effect," typically unfolds as follows:

1. A cache reduces load on a backend database, hiding its true capacity limits.
2. The cache empties due to an eviction, restart, or network blip.
3. All requests that were hitting the cache now blast the unprepared underlying database.
4. Latency skyrockets, and in distributed systems, clients increase concurrency by retrying requests.
5. The system enters a death spiral, thrashing on useless retries, consuming resources without completing useful work. "Goodput" drops to zero.

Caches often create "open-loop

Show Notes

Works Referenced

The 3,000 Incident Postmortem: Why Caches Are Actually the Enemy: This podcast episode discusses Marc Brooker's insights on system failures, particularly the role of caching, based on his analysis of thousands of incident postmortems.
Marc Brooker: VP and Distinguished Engineer at AWS, whose extensive analysis of incident postmortems forms the foundation of the episode's discussion on system resilience and failure modes.
Good Performance for Bad Days (re:Invent 2022): A keynote by Marc Brooker highlighting the flaws in traditional performance evaluation and load testing, emphasizing the need to design for system behavior under duress.
Amazon Web Services (AWS): A comprehensive cloud platform where Marc Brooker serves as a VP and Distinguished Engineer, and the source of the thousands of incident postmortems analyzed.
Redis: An open-source, in-memory data structure store frequently used as a cache, mentioned as a common example of a caching solution.
TPC-C Benchmark: A standard industry benchmark for online transaction processing (OLTP) workloads, used to evaluate the performance of database systems.
YCSB (Yahoo! Cloud Serving Benchmark): An open-source benchmarking tool for evaluating the performance of 'cloud' data serving systems, often used for NoSQL databases.

Glossary

Cache: A temporary storage area that holds frequently accessed data, allowing for faster retrieval than accessing the original source.
Metastable Failure: A system failure state in distributed systems where a trigger causes the system to enter a bad state that persists even after the trigger is removed, often leading to unrecoverable collapse.
Postmortem: A structured analysis conducted after an incident or outage to understand its root causes, learn from it, and prevent recurrence.
Proximal Cause: The immediate, surface-level reason for an incident, such as a specific code bug or configuration error, often identified early in an investigation.
Herd Effect: A specific failure mechanism where a cache clearing causes a sudden, massive surge of requests to an underlying backend system, overwhelming it.
Open System (in load testing): A system where the rate of incoming requests is independent of the system's ability to process them, meaning requests can pile up if the system slows down.
Closed System (in load testing): A system where the rate of incoming requests is controlled by the system's processing speed, typically by waiting for a response before sending the next request.
Coordinated Omission: A flaw in load testing where the load generator waits for a slow response before sending its next request, so the requests that real clients would have sent during the stall are never issued or measured, leading to an overly optimistic view of latency under stress.
Backpressure: A mechanism in distributed systems to prevent overload by signaling upstream components to slow down or stop sending requests when a downstream service is nearing its capacity.
Load Shedding: The deliberate act of rejecting incoming requests or reducing functionality to protect a system from overload and prevent a complete collapse.
Goodput: The rate at which a system successfully processes and completes useful work, distinct from total throughput which might include failed or retried requests.


Full Transcript

Host: For decades, the default answer to any scaling problem in software has been simple: add a cache. It's almost a knee-jerk reaction, a universal panacea.

Expert: Except, according to Marc Brooker, a VP and Distinguished Engineer at AWS, that "universal panacea" is actually a primary trigger for some of the most catastrophic, unrecoverable system failures seen in production today.

Host: He's arguing that caching, often seen as a scaling cure-all, is less a solution and more a ticking time bomb, wiring systems for what he calls "metastable" failures.

Expert: Precisely. And his analysis isn't theoretical. It's grounded in 16 years of carrying a pager at AWS and an examination of somewhere between three and four *thousand* internal incident postmortems. That’s a massive, verifiable dataset.

Host: Three thousand postmortems. That's an astonishing number. It suggests a level of insight that goes far beyond what most individual teams or even companies ever accumulate. What kind of picture does that paint about how systems really fail?

Expert: It paints a picture where the tech industry, by and large, stops its root-cause analysis far too early. Brooker's recurring theme is that most postmortems halt at what he calls the "proximal cause": the immediate code bug or the configuration error. "Oh, a null pointer exception here, a misconfigured timeout there," and then they declare the incident closed.

Host: So, they fix the symptom, but not the underlying condition.

Expert: Exactly. He outlines a hierarchy for effective postmortems, pushing teams to dig deeper. The first question, if you can’t understand what happened, is about observability gaps. Why couldn't we see this coming? Why weren't our logs and metrics sufficient?

Host: That makes sense. You can't fix what you can't see.

Expert: Then, once the mechanics of the failure are understood, the next question becomes: why did our standard validation pipelines, our testing, our QA, fail to catch this? Why was this missed in testing?

Host: And that implies a problem with the testing *strategy*, not just an individual bug.

Expert: Right. But the deepest, most valuable layer, the one that most teams never reach, is questioning the fundamental architectural assumptions. "Why did we assume a certain thing about the behavior of the system that we wouldn't have assumed before?" It's interrogating the foundational beliefs of the engineering team.

Host: That's a crucial distinction. It's moving from "what broke?" to "why did we *believe* it couldn't break this way?" That's a much harder question to answer, because it challenges deeply ingrained mental models.

Expert: It forces a confrontation between whiteboard theory and production reality. On a whiteboard, systems often appear to scale linearly. Double the load, double the resources, everything is fine. Brooker's dataset, those thousands of postmortems, proves that's a fallacy. Real-world systems behave non-linearly. You hit a hidden bottleneck, and things don't just slow down proportionally; they cascade into collapse.
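
One hedged way to see the non-linearity the Expert describes is the simplest textbook queueing model, M/M/1 (a single server with random arrivals and random service times). The model, the capacity of 1,000 requests per second, and the helper name `mm1_mean_latency` are illustrative assumptions, not figures from Brooker's postmortems.

```python
def mm1_mean_latency(arrival_rate: float, service_rate: float) -> float:
    """Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        # At or past saturation the queue grows without bound.
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)


if __name__ == "__main__":
    service_rate = 1000.0  # assumed capacity in requests per second
    for utilization in (0.5, 0.9, 0.99):
        latency_s = mm1_mean_latency(utilization * service_rate, service_rate)
        print(f"utilization {utilization:.0%}: mean latency {latency_s * 1000:.1f} ms")
```

Under these assumptions, moving from 50% to 99% utilization multiplies mean latency roughly fifty-fold, and at 100% the model has no finite answer at all. Doubling the load does not double the delay.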

Host: So, the true value of a postmortem isn't just patching a bug, it's exposing the delta between how an engineering team *thinks* their system works and how it *actually* works under duress. Patching a null pointer is easy. Patching a flawed organizational mental model about system behavior is the real work.

Expert: It's the difference between debugging code and debugging a culture. And one of the most significant flawed assumptions Brooker points to, the one that disproportionately leads to these catastrophic failures, is the unquestioning reliance on caching.

Host: The sacred cow of caching. It truly is treated as an unquestionable best practice. Slow database? Add Redis. API bottleneck? Throw a CDN in front of it.

Expert: Precisely. It’s the default architectural reflex. But Brooker's findings identify this reflex as a massive architectural liability. Caches, in his view, are not a scaling cure-all; they are often performance crutches that mask underlying capacity deficits. They sweep bad database performance under the rug, to paraphrase a common observation.

Host: But how does something designed to *improve* performance actually become a liability? What's the mechanism here?

Expert: It comes down to what he calls "metastable failures." This is a concept borrowed from physics and control systems. In distributed systems, a metastable failure occurs in "open systems with an uncontrolled source of load where a trigger causes the system to enter a bad state that persists even when the trigger is removed."

Host: So, it's not just a transient glitch. It's a system getting stuck in a permanently broken loop.

Expert: Exactly. And while an individual engineer might go years without seeing a full metastable collapse, Brooker states that if you look across the biggest, most impactful system postmortems across the industry, these metastable failures have been an underlying cause in probably a majority of them.

Host: A majority? That's a staggering claim, given how ubiquitous caching is. What does that spiral look like in practice?

Expert: He details a very specific mechanism of collapse, what he calls "the herd effect." It typically starts with a system using a cache to reduce load on a backend database. This lowers latency, it makes things feel fast, and critically, it hides the true capacity limits of that underlying database.

Host: So, you're running along, thinking your database is handling thousands of QPS, but it's really the cache doing the heavy lifting.

Expert: Exactly. Then, the trigger. The cache empties. Maybe it's an eviction, a restart, or a network blip that clears it. Suddenly, all those requests that were hitting the cache now blast through to the underlying database.

Host: And the database, which was never provisioned to handle that native load, just gets slammed.

Expert: It's the "herd effect." The offered load, the raw number of requests, instantly spikes to a level the database was never designed to handle. Latency skyrockets. And here's where it gets insidious: in distributed systems, as latency increases, the number of requests *in flight* – the concurrency – also increases. Clients are waiting longer for responses, so they retry requests.

Host: And those retries just make everything worse.

Expert: The system enters a death spiral. It starts churning on useless retries, consuming more memory and CPU just trying to manage the backlog. What happens then is that your "goodput" (the actual successful completion of requests) drops to zero. The system is at 100% utilization, but it's not doing any useful work. It's just thrashing, processing retries and timeouts.

Host: So, it's effectively dead, but still alive and consuming resources. That's the definition of a stable, but broken, loop.

Expert: And the mordant irony is that caches create these "open-loop" feedback cycles. In a well-designed system, there's a feedback loop: if the system is slow, it pushes back, clients stop sending requests. But bad caches are typically open-loop. When they fail, clients, or automated microservices, just keep retrying. They hammer the vulnerable database further.

Host: Because the cache was insulating them from the actual state of the backend. They don't know it's overwhelmed until they start timing out, and by then it's too late.

Expert: The database is trapped in a state of constant overload. It can never serve enough successful requests to refill the cache because it's too busy dealing with the herd. The system is stuck. The only way out is for operators to brutally shed load, often by completely turning off traffic, letting the database recover, manually warming the cache, and slowly reintroducing traffic. It's a manual, painful resuscitation.
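
The spiral described in this exchange can be reproduced in a deliberately tiny, tick-based toy model. Everything in the sketch below is an assumption chosen for illustration (the load of 100 requests per tick, the backend capacity of 40, the 80% warm hit ratio, the five-tick client timeout, and the `simulate` function itself); it is not taken from any AWS system. The cache is flushed exactly once, and the cache can only refill from requests the backend answers before the client gives up.

```python
from collections import Counter, deque


def simulate(ticks=120, flush_at=20, max_queue=None):
    """Return backend goodput per tick for a toy cache-plus-backend system."""
    offered = 100        # client requests per tick (assumed)
    capacity = 40        # requests the backend can serve per tick (assumed)
    total_keys = 1000    # cache is fully warm at this many keys (assumed)
    warm_hit_pct = 80    # hit ratio in percent when fully warm (assumed)
    timeout = 5          # ticks a client waits before giving up and retrying

    cached_keys = total_keys
    queue = deque()      # arrival tick of every request waiting at the backend
    waiting = Counter()  # arrival tick -> how many of those are still queued
    goodput = []

    for now in range(ticks):
        if now == flush_at:
            cached_keys = 0  # the one-off trigger: the cache empties

        hit_pct = warm_hit_pct * cached_keys // total_keys
        misses = offered * (100 - hit_pct) // 100
        # Clients whose request just hit the timeout send a retry. The original
        # stays queued because the backend cannot tell that it is now stale.
        retries = waiting[now - timeout]
        for _ in range(misses + retries):
            if max_queue is None or len(queue) < max_queue:
                queue.append(now)
                waiting[now] += 1
            # else: the request is shed (rejected immediately, never queued)

        done = 0
        for _ in range(min(capacity, len(queue))):
            arrived = queue.popleft()
            waiting[arrived] -= 1
            if now - arrived < timeout:
                done += 1  # the client is still waiting, so this is useful work
        cached_keys = min(total_keys, cached_keys + done)
        goodput.append(done)
    return goodput


if __name__ == "__main__":
    collapsed = simulate()
    protected = simulate(max_queue=200)
    print("unbounded queue, final goodput:", collapsed[-1])
    print("bounded queue, final goodput:", protected[-1])
```

In this toy model the unbounded run ends with zero backend goodput long after the flush, because the backend spends all of its capacity on requests whose clients have already given up. The run with a bounded queue sheds the excess, keeps serving fresh requests, refills the cache, and returns to its pre-flush level. The bound of 200 is simply capacity times timeout, so nothing admitted can go stale.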

Host: So, the industry's reliance on caches isn't fixing slow systems; it's just ensuring that when the system finally *does* break, it breaks so catastrophically that it requires human operators to manually intervene. It's sweeping the problem under the rug until the rug itself explodes.

Expert: A very expensive, very painful explosion. And this idea of understanding deep system context and failure modes is becoming even more critical with the rise of generative AI.

Host: That's a fascinating pivot. Most of the conversation around AI in software engineering is about how many lines of code it can write. How does Brooker connect AI to these architectural failures?

Expert: He cuts through the hype with a very pragmatic thesis: AI is fundamentally shifting the primary engineering bottleneck. It's moving it away from writing code and toward understanding system context. He calls it an "extinction-level event for rules of thumb."

Host: "Extinction-level event for rules of thumb." That's a strong statement. What does he mean by that?

Expert: For decades, senior engineers have built up this vast library of heuristics, shortcuts, and implicit knowledge: knowing which flags to pass, which syntax to use, how long a specific implementation should take. AI, with its ability to generate code, suddenly commoditizes that entire implementation phase.

Host: So, all that muscle memory, that accumulated knowledge of "how to do it," becomes less valuable because an AI can now do it.

Expert: Exactly. Brooker's perspective aligns with the idea that 90% of current engineering skills (the syntax, the implementation patterns) are going to zero in value, while the remaining 10% (the vision, managing complexity under uncertainty, understanding constraints, and especially failure modes) are going up a thousand-fold in value.

Host: So, the highest-value engineering skill isn't how fast you can type, it's how deeply you comprehend the system's architecture, its interaction points, and its inherent weaknesses.

Expert: Precisely. An AI can whip up a microservice in seconds. But it cannot deduce the complex, non-linear failure modes, like the metastable collapse we just discussed, that will occur when that microservice interacts with a legacy database under peak load. That requires human comprehension, the ability to ask the right questions.

Host: That's a direct challenge to the "10x engineer" trope, isn't it? If your entire professional identity is wrapped up in knowing the exact syntax for a React hook or a Dockerfile, AI is indeed coming for your job.

Expert: But if your identity is knowing *why* the system will catch fire when the third-party payment API adds 50 milliseconds of latency, then you're going to be a VP. Brooker's advice is that engineers must focus on tasks AI cannot do: reading postmortems, understanding the business and economic constraints, and writing clear, rigorous technical documents.

Host: It also implies a fascinating shift for junior engineers. You'd think they'd be at a disadvantage, lacking that accumulated experience.

Expert: Counter-intuitively, Brooker suggests they might actually be at an advantage. Senior engineers are burdened with "miscalibrated priors." They're confident in implementation reflexes that are no longer the bottleneck. He uses a bobsledding metaphor: a senior engineer knows the physics of the ice track perfectly, but AI has changed the sport entirely, making that specific track knowledge useless.

Host: Because the track itself is no longer the challenge; it's the entire ecosystem around it.

Expert: Junior engineers, on the other hand, enter the industry knowing they need to learn. If they're directed to learn system architecture, customer needs, and failure modes, rather than just grinding syntax speed, they will thrive. Brooker warns that the future may be "frustrating" for engineers who just want a "pure" coding career where they "start typing and don't stop for eight hours." Customer interaction and system-level thinking will become non-negotiable.

Host: So, the real engineering happens before the code is written, and after it breaks in production. Reading a five-page incident report is now mathematically more valuable to your career than memorizing a new JavaScript framework.

Expert: Absolutely. And this brings us back to the fundamental flaw in how the industry often designs and tests systems: the failure of load testing.

Host: You're saying that even when teams *try* to prepare for these failures, their methods are often insufficient?

Expert: Consistently. Brooker's 3,000 postmortems highlight that teams are repeatedly surprised by how their systems behave under overload. This surprise stems from a fundamental flaw in how performance evaluation and load testing are conducted. He gave a keynote on this titled "Good Performance for Bad Days," arguing that the performance evaluation community is overly obsessed with "happy path" performance.

Host: Measuring throughput and latency when everything is working perfectly, rather than when it's under duress.

Expert: The critical flaw lies in the difference between "closed" and "open" systems. Standard industry benchmarks, like TPC-C or YCSB, operate in a "closed loop." They send a request, wait for a response, and then send the next one. If the server slows down, the benchmark *kindly* slows down its rate of requests. Brooker notes, "These benchmarks suck... They're unrealistically kind."

Host: Unrealistically kind. That's a damning assessment.

Expert: Because almost all modern cloud architectures (APIs, web services) are "open systems." In the real world, if a server slows down, the internet does not kindly wait. Clients time out. They retry. The work doesn't disappear; it piles up. As he puts it: "Slowed down? Hope you can deal with more work later!"

Host: This disconnect leads to "coordinated omission," where load tests massively underestimate the impact of tail latency and completely fail to predict how a system will behave beyond its saturation point.
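
The gap between the two kinds of load generator can be shown with a small deterministic sketch. The scenario is an assumption made up for illustration: a single-threaded server that normally answers in 1 ms freezes for one second, while the test intends to send one request every 10 ms for ten seconds. The constants and the function names `closed_loop`, `open_loop`, and `p99` are hypothetical and do not come from TPC-C, YCSB, or any real tool.

```python
STALL_START_MS, STALL_END_MS = 5_000, 6_000  # assumed one-second server freeze
SERVICE_MS = 1                               # assumed healthy service time
DURATION_MS = 10_000                         # length of the test run
INTERVAL_MS = 10                             # intended rate: one request per 10 ms


def finish_time(start_ms):
    """When a request that begins service at start_ms completes."""
    if STALL_START_MS <= start_ms < STALL_END_MS:
        start_ms = STALL_END_MS  # the server is frozen until the stall ends
    return start_ms + SERVICE_MS


def closed_loop():
    """Send the next request only after the previous response has arrived."""
    latencies, now = [], 0
    while now < DURATION_MS:
        done = finish_time(now)
        latencies.append(done - now)
        now = done + (INTERVAL_MS - SERVICE_MS)  # think time, then the next request
    return latencies


def open_loop():
    """Send on a fixed schedule whether or not the server keeps up."""
    latencies, server_free = [], 0
    for sent in range(0, DURATION_MS, INTERVAL_MS):
        start = max(sent, server_free)  # wait behind earlier requests if needed
        server_free = finish_time(start)
        latencies.append(server_free - sent)
    return latencies


def p99(latencies):
    return sorted(latencies)[len(latencies) * 99 // 100]


if __name__ == "__main__":
    for name, run in (("closed", closed_loop), ("open", open_loop)):
        lat = run()
        slow = sum(1 for ms in lat if ms > 100)
        print(f"{name}: {len(lat)} requests, {slow} slower than 100 ms, p99 = {p99(lat)} ms")
```

Both generators face the same one-second freeze. The closed-loop run records a single slow sample, because it politely stops sending while it waits, and so its p99 stays at the healthy 1 ms. The open-loop run keeps to its schedule, so every request that arrives during the stall queues up and is measured, and the reported tail latency is far worse.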

Expert: It's why boasting about a system's throughput during a load test is like boasting about how fast a car can drive on an empty track, completely ignoring what happens when the brakes fail in traffic. A highly touted microservice architecture can be a brittle house of cards waiting for a single cache miss to trigger a total, unrecoverable outage.

Host: So, the solution is to embrace the hostility of real-world workloads, not try to pretend they're closed-loop.

Expert: Exactly. Brooker advocates for designing "closed-loop" architectures to protect against "open-loop" reality. Systems must be built with strict backpressure mechanisms and concurrency limits. When a system reaches its saturation point, it must actively and ruthlessly reject new work.

Host: Shedding load, rather than trying to process requests it can't handle.

Expert: Because accepting requests that the system cannot process in a timely manner only increases memory pressure, spikes concurrency, and risks pushing the system into that metastable collapse. Brooker's research highlights that unpredictable system performance under overload is the single largest contributor to system unavailability.
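
A concurrency limit that rejects instead of queueing can be sketched in a few lines. This is a minimal illustration built on Python's standard `threading.BoundedSemaphore`; the `LoadShedder` class, the `Overloaded` exception, and the limit of two in-flight requests are hypothetical names and values chosen for the example, not an API the episode or Brooker describes.

```python
import threading


class Overloaded(Exception):
    """Raised when the service refuses new work instead of queueing it."""


class LoadShedder:
    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def run(self, handler, *args):
        # Non-blocking acquire: if every slot is busy, reject immediately
        # rather than letting the request wait and go stale.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("at concurrency limit, request rejected")
        try:
            return handler(*args)
        finally:
            self._slots.release()


if __name__ == "__main__":
    shedder = LoadShedder(max_in_flight=2)

    def handler(depth):
        # Re-entering the shedder stands in for another request arriving
        # while this one is still in flight.
        if depth == 0:
            return "ok"
        return shedder.run(handler, depth - 1)

    print(shedder.run(handler, 1))  # two requests in flight: accepted
    try:
        shedder.run(handler, 2)     # a third request in flight: shed
    except Overloaded as exc:
        print("rejected:", exc)
```

The design choice that matters is the non-blocking acquire. A blocking acquire would quietly rebuild the unbounded queue this pattern is meant to avoid, whereas a fast rejection gives the caller an immediate signal to back off.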

Host: And yet, the industry remains obsessed with theoretical "nines" of uptime based on happy-path architecture diagrams, ignoring the worst-day scenario. The mark of a truly robust system isn't how fast it goes, but how gracefully it fails and rejects work on its worst day.

Expert: It’s a complete inversion of priorities for many. And it circles back to the "best practices" trap.

Host: The best practices trap. Brooker explicitly says, "Best practices are seldom the best."

Expert: You didn't fix the database query, Chad. You just hid it behind a cache, and now when that cache reboots, the database is going to get hit by a freight train of retries, only this time you’re calling it a “best practice.” The physics of distributed systems are unforgiving.

Host: It's wild to think a system can be entirely free of code bugs, have no hardware failures, no configuration errors, yet still suffer a massive, hours-long outage simply because the physics of the load pushed it into a metastable state. It challenges the binary "bug vs. feature" mindset of most developers.

Expert: And that leads to the AI reality check. If your entire professional identity is wrapped up in knowing the exact syntax for a new framework, AI is coming for your job.

Host: Which brings us back to the philosophy of the postmortem itself. What can listeners take away from Brooker’s 3,000 incident reports?

Expert: A postmortem is fundamentally a confession of ignorance. It’s a document that proves the whiteboard was wrong. An outage is an expensive tuition fee paid to learn the actual limits of your system. Stopping at the proximal bug means you paid the fee but skipped the class.

Host: So, the takeaway is: understand your system's actual failure modes, not just its happy path. Recognize that common "best practices" like caching can introduce more risk than they solve. And critically, that AI is shifting the value proposition for engineers from syntax to system comprehension.

Expert: The shift is happening now. If you're not learning how to design for the worst day, or how to write a postmortem that goes beyond the superficial, then you're actively mispreparing for the future.

Host: It makes you wonder, then, how many companies are actively wasting their outages by not performing this deeper analysis. And what does that mean for the resilience of the systems we all rely on?