That's a wrap!
We hosted "WTF is Incident Management" on May 12, 2021. We invited four very knowledgeable panelists to discuss how they define incident management, what changes they'd make if they could start again from scratch, how to manage team stress after an incident, and other subjects. Our panelists were: host Matt Stratton (Staff Developer Advocate at Pulumi), Emily Ruppe (Incident Commander at Twilio), Alina Anderson (Sr. TPM Site Reliability Engineering at Outreach), and our own CEO and co-founder, Robert (Bobby) Ross. Here are the highlights!
Matt: What is incident management?
Bobby: Incident management is the whole cycle, from receiving an alert (for example, from your customers or from an automated system that pages the person on call), to the point where you’ve called the fire department (and maybe you’re part of the fire department) and they come to help mitigate and resolve the issue, all the way to the incident retrospective. It’s a pretty broad surface area in terms of timeline.
Emily: I see it as the process of managing the people and expectations around incident response. We have responders, we have the pages that get kicked off, our understanding of what’s wrong, mitigating that, finding a resolution, and doing analysis afterwards. The overarching management is making sure we have the people we need and that they understand what they're supposed to be doing; that we’re communicating and updating the expectations around that response.
Alina: There isn't one "true" definition. So much is dependent upon the organization. It’s an organizational muscle for executing the highest urgency of work that impacts your business.
Matt: We throw around the terms “Incident Management” and “Incident Response.” It’s important to remember that incident management is inclusive of incident response.
Matt: What are the biggest pain points, or toil generators, around incident management?
Alina: We can get stuck chasing low-value post-incident admin tasks that aren't driving insight. At the end of the day, what are the insights that we can feed back to move the ball forward?
Emily: There’s this impossible battle between the complexity of our systems and the need for a simple process. Complex systems fail in complex ways, but in order for us to recall something under duress, it has to be easy to recall. We need a clear process, and we have to iterate. Ask if the process is getting in the way. Can we have a simple response to complex failure, and can we change things easily?
Matt: During post-incident reviews, we talk a lot about “what happened,” but part of the post-incident review should be “let’s look at the process.” Not just what went wrong, but also: “What do we learn from our distributed system? What logging was bad? What was cumbersome?”
Alina: In a high-growth organization, sometimes trying to find out who owns a task and finding the right person to page can create delay and struggle. Just reducing the time it takes to figure out who to contact can make a huge difference.
Matt: If you had the opportunity to start completely fresh, if you had a magic incident wand, what tools or processes would you call essential, and what would you not?
Emily: If I could start over, I’d try to change the mindset. Incidents aren’t bad. An incident is an opportunity for us to learn about different facets of our organization. Starting some sort of call when an incident is kicked off (not trying to communicate over chat or email, but a video call or a conference call) can catch people up and enable quicker and more nuanced communication. Building a process around communication that’s fast, and that also gives you context about how other team members are handling an incident, creates understanding a lot faster.
Matt: Incidents are a gift!
Emily: Maybe not to our customers...
Matt: There's so much that goes into the measure of stability, or quality of our work, based on the number of incidents. A CIO might ask, “how many P1s did we have?” But the number of incidents is not necessarily a reflection of quality of your system.
Bobby: I like what Emily said. The Latin root of the word “incident” means “an event,” and it isn’t necessarily bad. If I could start over, I’d measure how many retros a team has. How many retros for P3s? How many for P1s? I’d also look at diversity of responders. If the same person is always responding to an incident, that’s probably bad. I’d also start looking for pieces tied to human behavior during and after the incident; that may unearth a lot of other conversations. Like: why wasn’t there a retro? Why did the retro take two weeks? Tracking your mean time to resolution is not the right metric. If you’re going to track MTTR, that should be mean time to retrospective. Too many companies are willing to have retrospectives happen a few weeks after major outages. You lose a lot of context as the days go on.
Matt: You run into that expected “long distance to retro” when you think about what the retro is for. If a retro is for root cause analysis, it might take a long time. But if it’s considered a learning experience, it can take less time.
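To make the retro-focused metrics Bobby suggests a little more concrete, here is a minimal sketch of how they might be computed. It assumes a hypothetical `Incident` record with a severity, a single primary responder, a resolution time, and an optional retrospective time. None of these names come from the panel or from any particular tool, and the sample data is invented purely for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Optional


@dataclass
class Incident:
    # Hypothetical record shape; field names are assumptions for this sketch.
    severity: str                        # e.g. "P1" or "P3"
    responder: str                       # primary responder
    resolved_at: datetime
    retro_at: Optional[datetime] = None  # None means no retro was held


def retro_counts(incidents):
    """Count incidents at each severity that had a retrospective."""
    return Counter(i.severity for i in incidents if i.retro_at is not None)


def mean_days_to_retro(incidents):
    """Mean days from resolution to retrospective, skipping incidents without one."""
    gaps = [
        (i.retro_at - i.resolved_at).total_seconds() / 86400
        for i in incidents
        if i.retro_at is not None
    ]
    return mean(gaps) if gaps else None


def top_responder_share(incidents):
    """Fraction of incidents handled by the single most frequent responder."""
    if not incidents:
        return 0.0
    counts = Counter(i.responder for i in incidents)
    return counts.most_common(1)[0][1] / len(incidents)


# Invented sample data: four incidents, one of which never got a retro.
incidents = [
    Incident("P1", "ana", datetime(2021, 5, 1), datetime(2021, 5, 3)),
    Incident("P1", "ana", datetime(2021, 5, 4), datetime(2021, 5, 18)),
    Incident("P3", "ana", datetime(2021, 5, 6)),
    Incident("P3", "raj", datetime(2021, 5, 9), datetime(2021, 5, 11)),
]

print(dict(retro_counts(incidents)))
print(mean_days_to_retro(incidents))
print(top_responder_share(incidents))
```

With this sample data, the mean gap to retrospective is 6 days and one responder handled 75% of the incidents. Those are the kinds of signals that might prompt the follow-up questions Bobby mentions, such as "why wasn't there a retro?" or "why is the same person always responding?"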
Matt: How do we experience stress after an incident?
Alina: I’ve observed that because your brain is working so hard in an incident, you often don’t realize that your nervous system is engaged and tense. I’ve observed teams staying on the call and just kind of exhaling a little bit, maybe talking about non-work stuff for a few minutes. It helps to come down from adrenaline and establish a connection to others. Make space for the "team exhale" and integration back into the day-to-day.
Emily: It’s important to acknowledge that incidents are physically stressful. Your body has a rush of adrenaline. If you don’t acknowledge that, your brain won't be functioning the way it needs to. In order to be resilient, you have to have a period of recovery. It’s important to build in some breathing room. In order to do good retros, take a minute post-incident to reset.
Alina: If you have tooling that can capture key info, that creates breathing room for you. If you can trust the tools doing this for you, it’s much easier to take a break.
Bobby: On breathing room: there’s a blog post called "Efficiency is the Enemy." You can’t react to things effectively if you don’t have breathing room on either side of something. When you try to be efficient, you react poorly.
Matt: To normalize incident response, if you practice it in a non-stressful way, you get better at handling it. Game days are good for this, and everyone should know what’s broken. Don’t make them a surprise. Engineering teams already have to deal with enough surprises.
Matt: Let’s pivot to incoming questions. A panel attendee is asking: coming from an org where dev teams feel frustrated, how would you make sure your IM teams have visibility rather than being a box that has to be checked?
Alina: In an incident commander role, there's no world where every IM on call is going to know everything that’s going on. They have to map a complex matrix.
Matt: Firefighters will say that if you see the commander in the white helmet picking up the wrench, take it away, because they’re not supposed to be doing that work. They're doing the commanding, not the on-the-spot fixes.
Emily: As an incident commander, one of my strengths is to be able to ask questions. If these are pain points for you, we can help you dig into it. I don’t have to know the full depth and breadth of all of these things.
Bobby: Incident commanders are excellent question-askers!
Emily: That’s my whole job. My job is to not have fear around asking the stupid questions. Ask all of the obvious questions, because maybe the answer isn’t obvious.
Bobby: Also, asking why your assumption is different from another person’s assumption.
Matt: In one of my previous roles, for a long time, only engineers were incident commanders. Then someone in product wanted to be an incident commander. Turns out product people are amazing at this! They know how the product is designed, and they know how to prioritize work.
Watch the panel on-demand
Thanks again to all of our fabulous panelists and to everyone who attended this discussion! Click here to watch the panel in its entirety, on-demand.