
You may call it RCA. Your last company called it COE. We call it CAPA — Corrective and Preventive Action. The name doesn't matter. The philosophy does.
And most organisations get the philosophy catastrophically wrong.
I was sitting with my team, reviewing the same incident. Same facts. Same timeline. Same data.
I had five parallel chains of why. They had one. Maybe two.
It wasn't that they were lazy or careless. They were asking a different question. They were asking: how do we stop this specific thing from happening again? I was asking: at how many points did our system fail to prevent this, and at how many points did it fail to even tell us it was happening?
Same document. Two completely different mental models of what a CAPA is for.
That's when I understood the problem.
Peter Drucker wrote about this in The Effective Executive, and it is something I keep coming back to. He said the most common mistake an executive makes is treating a generic situation as a series of unique events: trying to solve each incident on its own merits instead of recognising that the incident is a symptom of a system gap.
The executive who does this, he wrote, ends up frustrated. Fixing leaks without ever getting control of the situation.
That is precisely what a bad CAPA looks like.
Someone misused a discount coupon. Six months after the symposium it was issued for. Codes that should have expired, hadn't. People who shouldn't have had access, did. And we found out not because any of our systems flagged it — but because someone stumbled across a transaction that looked off.
I'm one of the functional owners of that investigation. And the question I keep returning to isn't "how do we prevent coupon misuse." That's the one-chain question. The real question is: at how many points should our system have caught this, and why did each of those points fail?
The coupon should have had restricted access. It should have auto-expired. There should have been instrumentation watching discount rates by coupon cohort. The monthly financial review should have surfaced that discount spend was tracking above plan. Any one of these catching it would have been enough. None of them did.
That's not one failure. That's five or six failures, and each needs its own 5 Whys. Fixing only the coupon expiry, the obvious surface fix, leaves all the other holes open.
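To make the layered view concrete, here is a minimal Python sketch of the coupon case. Everything in it is hypothetical: the field names, the dates, the access list, the 30-day validity window and the 5% cohort threshold are assumptions for illustration, not our actual systems or data. It only shows the shape of the argument: several independent controls, any one of which could have caught the redemption on its own.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Redemption:
    # Hypothetical record of one coupon use; every field is illustrative.
    user: str
    coupon_issued_on: date
    redeemed_on: date
    cohort_discount_rate: float    # observed discount rate for this coupon cohort
    monthly_discount_spend: float  # actual discount spend for the month
    monthly_discount_plan: float   # planned discount spend for the month


ALLOWED_USERS = {"attendee-001", "attendee-002"}  # assumed access list
VALIDITY = timedelta(days=30)                      # assumed expiry window
MAX_COHORT_RATE = 0.05                             # assumed alert threshold

# Each control returns True when it would have flagged the redemption.
CONTROLS = {
    "restricted access": lambda r: r.user not in ALLOWED_USERS,
    "auto-expiry": lambda r: r.redeemed_on - r.coupon_issued_on > VALIDITY,
    "cohort instrumentation": lambda r: r.cohort_discount_rate > MAX_COHORT_RATE,
    "monthly financial review": lambda r: r.monthly_discount_spend > r.monthly_discount_plan,
}


def controls_that_should_have_fired(r):
    """Return the name of every control that would have flagged this redemption."""
    return [name for name, check in CONTROLS.items() if check(r)]


# Made-up incident: an outsider redeems the coupon roughly six months late.
incident = Redemption(
    user="outsider-042",
    coupon_issued_on=date(2024, 1, 15),
    redeemed_on=date(2024, 7, 20),
    cohort_discount_rate=0.12,
    monthly_discount_spend=130_000.0,
    monthly_discount_plan=100_000.0,
)

missed = controls_that_should_have_fired(incident)
print(len(missed), "controls should have caught this:", missed)
```

Every name in the returned list is a control that either did not exist or did not fire, and each one gets its own chain of whys.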
Here is the core philosophy, stated plainly.
Every system you own is built for a specific environment with specific boundary conditions. Those conditions change. The environment changes. New failure modes emerge that the original design never anticipated. The system that worked perfectly eighteen months ago develops gaps — not because anyone did anything wrong, but because the world moved.
A CAPA is the mechanism by which the system owner discovers those gaps and redesigns around them.
Not documents the incident. Not assigns blame. Not closes a ticket.
It redesigns the system.
This is why Amazon requires a Director — an L8 — to own every CAPA, even when an L4 writes the first draft. Because only the person who owns the system has enough perspective to see all the interfaces, all the instrumentation points, all the places where the design assumed something that no longer holds. A junior person can document what happened. Only the system owner can see what the system should have done — and why it didn't.
We built this same principle into our own CAPA format. The accountability block at the top of every CAPA reads:
A CAPA represents a system failure. The Line Head / Function Head is accountable for this failure and must personally send this CAPA to leadership.
By sending this CAPA, I acknowledge that: a system under my ownership has failed. I commit to completing all action items by the dates specified. I am accountable for ensuring this failure will be prevented in the future.
Every word is intentional. Not "my team failed." Not "the process failed." A system under my ownership has failed. The Line Head signs it. The Line Head sends it. The Line Head does not delegate it down and then sign off on the output.
And then this:
Random CAPAs will be selected for review by CAPA Bar Raisers. If a CAPA is found to lack rigor, depth, or quality, the Line Head / Function Head is answerable for this failure in knowledge and rigor — not the CAPA author.
Not the CAPA author. The Line Head.
Because if the bar raiser reads it and finds one chain of 5 Whys, one root cause, one corrective action — the system owner didn't do their job. Doesn't matter how well the junior wrote it. The system owner is the only person with enough context to know whether the analysis reached deep enough. Signing without engaging is the failure.
I didn't always understand this.
For nineteen years, I thought I did. I'd written root cause analyses. I'd run post-mortems. I'd signed off on corrective action plans. I believed I was doing it properly.
Then I joined Amazon in 2019, and Rahat Patel became my guide.
My first COE at Amazon was about a prepaid wallet top-up failure. Customers couldn't purchase using Amazon Wallet for two hours because we missed a top-up trigger. Straightforward enough, I thought. Find the gap, fix the gap, close the document.
Rahat didn't let me close it in a day. Or two. It took three days of him reading every section, asking questions at every step, refusing to accept the first answer to any why. Not because he was difficult. Because he knew what a proper COE looked like — and he knew I wasn't there yet.
By the end of it I understood something I'd been getting wrong for nineteen years: I had been tracing one chain. He was making me see the whole fishbone.
I'm naming him here because he didn't crystallise the framework for me. He held a standard high enough that I had to crystallise it myself. That's the harder and better kind of teaching.
The question that separates a real CAPA from documentation theatre is this:
Even with the worst human intent — even if someone actively tried to break this system — why did the design allow it?
When you ask that question, you stop looking for the person who made the mistake. You start looking for every layer of the system that should have caught the mistake and didn't. Prevention. Detection. Signal. Response. Each layer. Each failure point. Each its own parallel chain of why.
That's what I was doing when I sat with five parallel 5 Whys while my team had one.
I wasn't smarter than them. I had a different frame. I was mentally running a fishbone — looking at the whole array of failure across every dimension of the system — while they were tracing a single path from incident to fix.
Both are valid analytical moves. Only one produces a system redesign.
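The four layers named above (prevention, detection, signal, response) can be sketched as a simple structure that makes a single-chain CAPA visible at a glance. This is a rough illustration, not our actual CAPA format: the class names, the fields and the `ready_to_sign` rule are assumptions made up for the example.

```python
from dataclasses import dataclass, field

LAYERS = ("prevention", "detection", "signal", "response")


@dataclass
class WhyChain:
    # One chain of whys for one failure point (hypothetical structure).
    layer: str
    failure_point: str
    whys: list = field(default_factory=list)


@dataclass
class Capa:
    title: str
    chains: list = field(default_factory=list)

    def uncovered_layers(self):
        """Return the layers that have no chain of whys at all."""
        covered = {chain.layer for chain in self.chains}
        return [layer for layer in LAYERS if layer not in covered]

    def ready_to_sign(self):
        """Assumed rule of thumb: more than one chain, and every layer examined."""
        return len(self.chains) > 1 and not self.uncovered_layers()


# A draft that traces a single path from incident to fix.
draft = Capa(
    title="Coupon misuse",
    chains=[
        WhyChain(
            layer="prevention",
            failure_point="coupon did not auto-expire",
            whys=["no expiry date was set when the code was issued"],
        )
    ],
)

print("Chains:", len(draft.chains))
print("Layers with no analysis:", draft.uncovered_layers())
print("Ready to sign:", draft.ready_to_sign())
```

A draft like this one answers "how do we stop this specific thing" and leaves detection, signal and response unexamined, which is exactly the gap the system owner is supposed to notice before signing.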
This is Part 1 of a five-part series on CAPA done properly. In the next piece, I'll walk through why fishbone comes before 5 Whys — and why doing it the other way around is the single most common mistake in root cause analysis.
The template — our full CAPA format, including detection actions, prevention actions, post-implementation measurement, and closure criteria — comes in Part 4.
For now: the next time someone on your team brings you a CAPA to sign, ask them one question.
How many parallel failure chains does this have?
If the answer is one, the work isn't done.
~Discovering Turiya@work@life


