In my past experience as an SRE, I’ve learned some valuable lessons about how to respond to and learn from incidents.
Declare and run retros for the small incidents. It's less stressful, and action items become much more actionable.
Decrease the time it takes to analyze an incident. You'll remember more, and will learn more from the incident.
Alert on pain felt by people — not computers. The only reason we declare incidents at all is because of the people on the other side of them.
Let’s dive into each of these lessons a little deeper, and how they can help you build a better system for pragmatic incident response.
1. Focus on the Small Incidents First
The habit of ignoring small issues often leads to bigger ones. You should run retrospectives for small incidents (slowdowns, minor bugs, etc.) because they often have the most actionable takeaways, instead of producing a “re-architect async pipeline” Jira ticket that never happens. Focusing on low-stakes incidents and retrospectives is also a great introduction to behavior change across your organization.
Let’s look at an everyday example of small-incident focus: my apartment. I live in an old candy factory retrofitted into apartment units. We have an elevator (thank god), but some of the buttons don't light up when you press them, and the LED display shows floor numbers that don't match the button labels. The elevator still goes up and down, but you can generally tell that things are wrong.
One day I came home and noticed an OUT OF ORDER notice on the elevator door. Was I surprised? Not at all. Of course an elevator with mislabeled buttons and broken LEDs would stop working.
This isn’t far off from an important software lesson: ignoring small issues often leads to much bigger ones.
This is why you should focus on the small incidents first. I see a lot of companies say, “We need to fix incident response — every time we have an incident it’s just chaos!” They’re usually talking about high-severity unexpected downtime, and that’s what they want to improve.
I say that’s the wrong incident to use to fix your incident response system. Focus on the small ones first. Running retros for small incidents helps you build strong incident response habits, because small incidents have the most actionable takeaways and are the best way to change behavior.
If you have a high-stakes, 12-hour incident and you run a one-hour retrospective, you're not going to get the results you want. You need to start small. Run retros for bugs that were introduced, or a bad data migration that didn't really impact anything but took up a couple of hours of your day.
Heidi Waterhouse captured this idea really well in her piece on reliability.
A plane with many malfunctioning call buttons may also be poorly maintained in other ways, like faulty checking for turbine blade microfractures or landing gear behavior.
I couldn't agree more. The small things are typically indicators of bigger things down the road.
2. Track Mean Time to Retro (MTTR)
It’s important to think about what you measure in your organization. You should be measuring how you're improving, and the most important metric here is Mean Time to Retro (MTTR). Everyone should be tracking MTTR. It’s a great statistic for improving incident response in your team because it exposes the delay between incidents and their retrospectives.
The easiest way to have a bad incident retro is to wait two weeks. It’s better to get into a room quickly than to wait a long time for everything to be perfectly prepared.
Tracking MTTR can help you hold prompt and consistent retrospectives after incidents. Set a timer and make an SLO or SLA for yourself that says, “This is how long we take for retros.”

Retro time will vary by severity. If it’s a SEV1, clear schedules, because you need to hold the retro within 24 hours. For a SEV3 you have much more leniency.
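As a concrete sketch, here’s one way you might track this, assuming incidents are logged with an end timestamp and a retro timestamp. The record format, dates, and per-severity targets below are all illustrative assumptions, not a prescribed schema:

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (severity, incident_end, retro_held_at).
incidents = [
    ("SEV1", datetime(2023, 5, 1, 14, 0), datetime(2023, 5, 2, 10, 0)),
    ("SEV3", datetime(2023, 5, 3, 9, 0), datetime(2023, 5, 9, 16, 0)),
    ("SEV3", datetime(2023, 5, 10, 11, 0), datetime(2023, 5, 12, 15, 0)),
]

# Per-severity targets for how quickly a retro should happen (assumed values).
RETRO_SLO = {"SEV1": timedelta(hours=24), "SEV3": timedelta(days=5)}

def mean_time_to_retro(records):
    """Average delay between an incident ending and its retrospective."""
    delays = [retro - end for _, end, retro in records]
    return sum(delays, timedelta()) / len(delays)

def slo_breaches(records):
    """Incidents whose retro happened later than the severity's target."""
    return [(sev, retro - end) for sev, end, retro in records
            if retro - end > RETRO_SLO[sev]]

print("MTTR:", mean_time_to_retro(incidents))
for sev, delay in slo_breaches(incidents):
    print(f"{sev} retro took {delay}, over the {RETRO_SLO[sev]} target")
```

The point isn’t the tooling — a spreadsheet works too — it’s that the delay is recorded and compared against an explicit target.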
I also like tracking the ratio of retros to declared incidents. This is a metric that should go up: you want to see your ratio of retrospectives to incidents increasing over time. You can split that number by severity as well. If your retro ratio is lower for SEV1s than SEV3s, that might be okay at the beginning (remember, start small), but you want them to eventually become equal.
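A rough sketch of that ratio, assuming each declared incident is logged with its severity and whether a retro was held. The data and field names here are hypothetical:

```python
# Hypothetical log of declared incidents and whether a retro was held.
incidents = [
    {"severity": "SEV1", "retro_held": True},
    {"severity": "SEV3", "retro_held": False},
    {"severity": "SEV3", "retro_held": True},
    {"severity": "SEV3", "retro_held": False},
]

def retro_ratio(incidents, severity=None):
    """Fraction of incidents (optionally filtered by severity) that got a retro."""
    pool = [i for i in incidents if severity is None or i["severity"] == severity]
    if not pool:
        return 0.0
    return sum(i["retro_held"] for i in pool) / len(pool)

print(f"overall: {retro_ratio(incidents):.0%}")
print(f"SEV1:    {retro_ratio(incidents, 'SEV1'):.0%}")
print(f"SEV3:    {retro_ratio(incidents, 'SEV3'):.0%}")
```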
3. Alert on Degraded Experience with the Service, Not Much Else
The severity of incidents is directly linked to customer pain. We would not declare SEV1s if there weren't a lot of people feeling a lot of pain.
Alerting on computer vitals is an easy way to create alert fatigue. As your company scales, you're going to use more CPU and more memory; that's growth, not a problem. Tying alerts to computer vitals means they'll fire more and more often as you grow, regardless of whether anything is actually wrong.
If I go for a run, my heart beats faster; it’s just doing its job. Paging people at 2:00 a.m. because disk capacity is at 80% and won't actually run out of space until next month is a good way to lose great teammates. I have worked with people who left companies strictly because they got paged too many times for stuff that didn’t matter.
This is why you need to alert on a degraded experience with the service, and not much else. CPU burning hot at 90% is not necessarily a bad thing. Create SLOs that are tied to customer experience and alert on those. People experiencing problems with the service is, for the most part, the only thing you should alert on.
SoundCloud developers wrote about this, explaining that you should alert on symptoms, not causes. My fast heartbeat is not necessarily a problem. But if my elevated heart rate leads to lightheadedness and I fall, that’s a problem, and that’s what I need to alert on. You can apply the same thinking to other potential causes of an outage: paging alerts that wake you up in the night should be based only on symptoms.
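A minimal sketch of symptoms-over-causes paging logic. The thresholds and metric names below are illustrative assumptions, not recommendations; the point is only that machine vitals like CPU never appear in the paging condition:

```python
# Assumed SLO thresholds tied to what customers actually feel.
SLO_ERROR_RATE = 0.01      # page if more than 1% of requests fail
SLO_P99_LATENCY_MS = 500   # page if p99 latency exceeds 500 ms

def should_page(metrics: dict) -> bool:
    """Page on symptoms customers feel; ignore causes like CPU or disk."""
    return (
        metrics["error_rate"] > SLO_ERROR_RATE
        or metrics["p99_latency_ms"] > SLO_P99_LATENCY_MS
    )

# CPU at 90% with a healthy error rate and latency does not page...
print(should_page({"error_rate": 0.001, "p99_latency_ms": 120, "cpu": 0.90}))
# ...but a breached error-rate SLO does, even with CPU idling.
print(should_page({"error_rate": 0.05, "p99_latency_ms": 120, "cpu": 0.40}))
```

In a real system this decision usually lives in your monitoring stack's alert rules rather than application code, but the shape of the condition is the same.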
Want to revisit all of this, but as a video instead? Check out Bobby’s original presentation here.