A Mostly Sane Guide to Root Cause Analysis
It finally happened. On a Friday afternoon not long before a huge launch, somebody deployed a last-minute change that crashed the database, and you were on-call. It took a harrowing three hours to roll back the code, find the backups, and run the restore, but by sheer force of will you were finally able to get it all back up.
But now the real work begins. Customers are mad, leadership is anxious about something like this happening again, and as the on-call, you’re the one everyone is looking at to write a post-mortem explaining what happened and how it will never happen again.
So today, we’re gonna do a root cause analysis, the heart and engine of the post-mortem discussion.
Why we root cause
At its heart, a root cause analysis is the process of taking an incident and determining what systemically caused it. It’s not enough to explain why the incident happened; you have to be able to explain why it was only a matter of time before it happened. It’s about making the best of whatever downtime, lost data, or customer pain you’ve endured: an attempt for the team to learn as much as possible from it and, if you’re lucky, to find things you can do to prevent incidents like this from happening again.
From the top
The way that I’ve always been brought up to do root cause analysis is with a bastardized version of the Five Whys technique. It’s far from perfect, but I think it works well because it’s a simple, flexible way to think about root cause analysis.
Here’s our v0 of the root cause analysis from the story above:
- The production database crashed.
- Why? An engineer pushed code that added an expensive query that exhausted the database’s memory.
To start, this answer shouldn’t feel satisfying. Human error contributes to almost every incident, but for that exact reason, pinning the blame on a person doing something dumb isn’t productive.
We can strengthen this analysis by starting from the crash itself and shrinking the size of each logical leap:
- The production database crashed.
- Why? The production database ran out of memory.
- Why? Query XYZ exhausted the production database’s memory.
- Why? Executing the query required storing millions of records in memory. (A hypothetical sketch of this kind of query follows the list.)
  - Why did we have this query? The query is a key part of our new reports tool.
  - Why does it store so much? We didn’t optimize the query before deploying it.
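To make that failure mode a little more concrete, here’s a rough sketch of the shape of query that tends to do this. The schema (“accounts”, “events”), the SQL, and the helper below are all invented for illustration; this isn’t the actual query from the incident. The first version forces the database to materialize and sort every row before it can return anything, which is how a reporting query quietly eats a server’s memory; the second lets the database aggregate as it scans.

```python
# Hypothetical example -- the schema and queries are invented for illustration,
# not taken from the incident above.

# The naive report query joins and sorts the entire events table, so the
# database has to materialize millions of rows before returning anything.
REPORT_QUERY_NAIVE = """
    SELECT a.name, e.event_type, e.created_at, e.payload
    FROM accounts a
    JOIN events e ON e.account_id = a.id
    ORDER BY e.created_at DESC
"""

# The optimized version aggregates inside the database and bounds the result
# to the reporting window, so memory use grows with the number of
# (account, event_type) pairs instead of the number of events.
REPORT_QUERY_AGGREGATED = """
    SELECT a.name, e.event_type, COUNT(*) AS event_count
    FROM accounts a
    JOIN events e ON e.account_id = a.id
    WHERE e.created_at >= :window_start
    GROUP BY a.name, e.event_type
"""

def fetch_report(conn, query, params=None):
    """Run a report query through a DB-API connection and return all rows."""
    cur = conn.cursor()
    cur.execute(query, params or {})
    return cur.fetchall()
```

If your database supports it, running `EXPLAIN` on a report query during review is often enough to spot the unbounded sort before it ships.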
While the analysis (so far) doesn’t come to a different conclusion from our v0, it’s getting more robust, and we can better justify our assessment. Let’s extend our line of questioning a couple more steps:
- The production database crashed.
- Why? The production database ran out of memory.
- Why? Query XYZ exhausted the production database’s memory.
- Why? Executing the query required storing millions of records in memory.
  - Why did we have this query? We used the query to build a new report.
  - Why does it store so much? We didn’t optimize the query before deploying it.
    - Why? We don’t have any tools to test our code against data at scale.
  - Why wasn’t this caught in code review? The reviewer did not have SQL performance expertise.
    - Why didn’t someone else review? The engineers with the strongest SQL expertise were out of office.
    - Why didn’t this reviewer have expertise? We don’t have documented best practices on common pitfalls in SQL.
Unlike the more “standard” Five Whys analysis you’ll usually see, I really like this branching format. Instead of hunting for a single “root” cause, I can pursue each line of causation independently. That helps because incidents rarely have just one cause: the contributing factors matter, and each branch can generate its own suggestions for the post-mortem.
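If you want to keep the branching analysis somewhere more structured than a bullet list, a tiny tree is all it takes. This is just a sketch of one possible way to jot it down; the class and field names are my own invention, not part of any tool or template.

```python
from dataclasses import dataclass, field

@dataclass
class Why:
    """One node in a branching five-whys analysis."""
    question: str
    answer: str
    children: list["Why"] = field(default_factory=list)

    def leaves(self):
        """Yield the end of every branch -- the candidate root causes."""
        if not self.children:
            yield self
        for child in self.children:
            yield from child.leaves()

# A fragment of the analysis above, written down as a tree.
analysis = Why(
    "Why did the production database crash?",
    "It ran out of memory executing query XYZ.",
    children=[
        Why(
            "Why did the query use so much memory?",
            "We didn't optimize it before deploying.",
            children=[
                Why("Why not?", "We have no tools to test code against data at scale."),
            ],
        ),
        Why(
            "Why wasn't this caught in code review?",
            "The reviewer did not have SQL performance expertise.",
        ),
    ],
)

# Print the end of each branch -- the list you'd bring to the post-mortem.
for cause in analysis.leaves():
    print(cause.answer)
```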
What makes a cause, a “root” cause?
Some of the branches ended at the fifth Why, and some ended earlier. It really doesn’t matter whether you stop at five. What matters is that at the end of each branch, you’ve reached what you’d consider to be the root cause.
This is easier said than done. When researching for this post, I found most definitions of “root cause” to be incredibly vague and hand-wavy. Rather than philosophizing about the nature of cause and effect in our reality, I’ll take a more functional view: a root cause is something that, had it not existed, the incident could not have occurred. (Okay, 95% of the time.)
There’s a few general questions I like to use to feel out if I’m satisfied with a root cause:
- Was it one person’s silly mistake? If a single mistake is all it takes to cause an incident, that mistake isn’t the root cause. Instead, look deeper at the systems (technical or organizational) that allowed or enabled that person to make that mistake.
- Does the effect feel inevitable given the cause? In the above example, without having any way to test a query’s performance at scale, it was only a matter of time before the database melted down over a bad query. If it feels like something was bound to happen given the cause you’re presenting, you’re probably onto something.
- Was it a deliberate decision? Deliberate policy or architecture decisions are useful as root causes because they dictate how our day-to-day, smaller decisions are made. Not requiring code review or picking the wrong approach for building a feature are both good examples of this; if I pick a system design that makes it hard to write performant code, less performant code will be written. Fixing these key decisions has massive downstream effects.
The hardest part of this exercise will always be knowing when to stop. Given enough iterations, you can almost always root-cause any incident to “we had to ship this thing quickly and didn’t budget enough time,” for example. Going back to the functional viewpoint, you can generally stop once your analysis is giving you good ideas about what to fix or improve. It’s a delicate balance between being intellectually honest and acknowledging that you’re here to fix problems, not write philosophy.
What’s Next
You brought your root cause analysis to the incident review, and it was met with a hearty round of applause. (I kid. I’ve never seen a post-mortem get that emotional.) Now you lay out the root causes and let the team brainstorm ways to mitigate them:
- The problematic query was used for a minor feature in the reports tool.
- The reviewer did not have SQL performance expertise.
- We don’t have any tools to test our code against data at scale.
One engineer offered to go back to product and see if there was a less intensive way to build a similar report, while another offered to run a small class on SQL performance. Since you already had a small staging environment with synthetic data, you offered to beef up the amount of synthetic data in staging, making it easier to test reports at scale before a bad query takes down production. A fantastic start, and another incident in the books.
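For what it’s worth, that staging fix doesn’t have to be fancy. Here’s a minimal sketch of the kind of seed script that makes “test the report at scale” possible; the schema, the row counts, and the use of sqlite3 as a stand-in for the real staging database are all assumptions for illustration.

```python
"""Seed a staging database with enough synthetic data to exercise report queries.

Hypothetical sketch: the schema, row counts, and sqlite3 stand-in are invented
for illustration -- swap in your real staging database and tables.
"""
import random
import sqlite3
from datetime import datetime, timedelta

N_ACCOUNTS = 1_000
N_EVENTS = 2_000_000  # enough volume that an unoptimized report query visibly hurts

def seed(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE IF NOT EXISTS accounts (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "  id INTEGER PRIMARY KEY,"
        "  account_id INTEGER,"
        "  event_type TEXT,"
        "  created_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO accounts (id, name) VALUES (?, ?)",
        [(i, f"account-{i}") for i in range(N_ACCOUNTS)],
    )
    start = datetime(2024, 1, 1)
    # Generate events lazily so the seed script itself doesn't hold millions of rows in memory.
    events = (
        (
            random.randrange(N_ACCOUNTS),
            random.choice(["login", "purchase", "report_viewed"]),
            (start + timedelta(seconds=i)).isoformat(),
        )
        for i in range(N_EVENTS)
    )
    conn.executemany(
        "INSERT INTO events (account_id, event_type, created_at) VALUES (?, ?, ?)",
        events,
    )
    conn.commit()

if __name__ == "__main__":
    seed(sqlite3.connect("staging_synthetic.db"))
```

With something like this in place, “did we try the query against staging-scale data?” becomes a checklist item in code review rather than a judgment call.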
There are a lot of ways to do root cause analysis, and to be honest, the methods I’ve seen in software engineering wouldn’t be considered root cause analysis in other fields. What I’ve described here is just one method I’ve seen work in software engineering, and it strikes a balance between rigorous thinking and the constant demands on our time. Compared to the kind of analysis a flight investigator would do, say, this is barely an analysis at all, but in software these write-ups happen in a couple of days, not several months. It’s really about extracting as much learning in as little time as possible.
Always learning, ✌️
Some Further Reading
The importance of psychological safety in incident management
Incident.io’s basic treatise on how to conduct incident management with psychological safety, and why it matters. I touched on the concept here, but this one goes into a bit more depth.
Inhumanity of Root Cause Analysis
This is a cynical, but nevertheless very informed opinion of the way root cause analysis is used in organizations.
PagerDuty’s main guide on running an incident analysis. By default, PagerDuty’s guides are what I use when thinking about incident management from the ground up.