This is the first in a series of interviews with experts about reliability, incident response, and related topics.
Jeff Smith has been in the technology industry for over 20 years, oscillating between management and individual contributor roles. Jeff currently serves as the Director of Production Operations for Centro, an advertising software company headquartered in Chicago, Illinois. Previously he served as the Manager of Site Reliability Engineering at Grubhub.
Jeff is passionate about DevOps transformations in organizations large and small, with a particular interest in the psychological aspects of problems in companies. He lives in Chicago with his wife, Stephanie, and their two kids, Ella and Xander. Jeff is also the author of Operations Anti-Patterns, DevOps Solutions, published by Manning.
Q1. People can feel a lot of pain from being on call. What are some of the things your teams have done to make it easier to deal with?
The biggest thing you can do to take away the pain of on-call is to actively work to make on-call not painful. This sounds incredibly simple, but it’s amazing how many organizations resign themselves to the state of on-call as an intractable byproduct of their environment. Prioritizing fixes to on-call problems is the only fix. Find ways to carve out time to specifically tackle those recurring issues. More specifically, empower the on-call person to prioritize their own work during the on-call shift. Tackling the work while it’s top of mind keeps the context of the issue front and center. Submitting a ticket and having it wallow in the backlog could mean that when it does come into the sprint, people have lost some of the specific context that could have helped resolve the issue permanently.
How do you approach getting management to give your team the bandwidth they need to prioritize that work?
The thing to recognize about priorities, whether you’re setting them or requesting time for them, is the nature of tradeoffs. When you’re working with any member of leadership, you have to craft your argument under the framework of “what you get” versus “what you have to give up.” Too many people treat prioritization as some sort of magic time fairy, where you just declare something is a priority, and then suddenly the days have 28 hours in them. You have to think about the gain in terms of what the decision-maker gets out of the tradeoff.
When it comes to on-call, the best argument to make is job satisfaction. It sounds like a soft, squishy metric, but it’s something you can actually create proxy metrics for. The proxy metric I use is “Interruptions per day.” An interruption is basically any time an on-call event diverts you from what you were doing in order to deal with the on-call scenario. (You could use “contacts per day” as well, but the word interruption tends to evoke the undesirability of that contact.) Those interruptions often correlate with job satisfaction and focus. For added emphasis, you can graph those interruptions across time. If you have a cluster of pages that happen off-hours, the story pretty much tells itself.
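As a rough illustration of the “Interruptions per day” metric, here is a minimal Python sketch. The page timestamps, the working-hours window (09:00 to 18:00, Monday to Friday), and the function names are all assumptions made up for this example; in practice the timestamps would come from whatever paging tool you use.

```python
from collections import Counter
from datetime import datetime

# Hypothetical page timestamps (local time), invented for illustration.
PAGES = [
    "2024-03-04 02:15",
    "2024-03-04 14:30",
    "2024-03-05 23:50",
    "2024-03-06 03:05",
    "2024-03-06 03:40",
    "2024-03-06 11:00",
]

# Assumption: working hours are 09:00-17:59, Monday to Friday.
WORK_START, WORK_END = 9, 18


def is_off_hours(ts: datetime) -> bool:
    """True if the page landed on a weekend or outside working hours."""
    return ts.weekday() >= 5 or not (WORK_START <= ts.hour < WORK_END)


def interruptions_per_day(pages):
    """Return {date: (total_interruptions, off_hours_interruptions)}."""
    totals, off = Counter(), Counter()
    for raw in pages:
        ts = datetime.strptime(raw, "%Y-%m-%d %H:%M")
        totals[ts.date()] += 1
        if is_off_hours(ts):
            off[ts.date()] += 1
    return {day: (totals[day], off[day]) for day in sorted(totals)}


summary = interruptions_per_day(PAGES)
for day, (total, off_hours) in summary.items():
    # A crude text "graph": one '#' per interruption.
    print(f"{day}  {'#' * total:<5} total={total} off-hours={off_hours}")
```

Even a crude text chart like this makes an off-hours cluster visible at a glance, which is the point of graphing the metric over time.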
Now you can compare the level of effort to correct these recurring issues versus the time it takes to fill an open position. Not just the job search but also the onboarding and training before that engineer becomes a fully productive member of the team. You can also compare it to the pressure your team could face if the size of the team ever contracted, whether through natural attrition or as a direct result of the state of on-call. If four team members find the on-call process intrusive to their lives, then three team members will feel the intrusion more acutely.
With this in hand, find out what your team can give up or postpone to make room for this new effort. Remember, it’s not a matter of just being allowed to work on the on-call scenario. You have to create the time for it, which means trading out some other committed work. When you’re looking at work, you can think of it along three axes:
- Importance - What impact will the work have on the business?
- Urgency - How soon does this work need to be done? Urgency and importance are not always correlated.
- Effort - How much energy is it going to take to complete this task? Effort interacts with urgency. A task that’s needed in 3 months is an eligible candidate for delay, unless of course that work is going to take 2.5 months to complete.
Using these three axes, you can begin to evaluate the work on your plate and identify possible candidates for delay or outright cancellation in support of your on-call objectives.
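One way to make the three axes concrete is to compute the slack each task has (time until it is due minus the effort it needs) and treat only tasks with comfortable slack as candidates for delay. The sketch below is an illustration only: the task names, the 1 to 5 importance scale, and the four-week slack threshold are all assumptions, not anything prescribed in the interview.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    importance: int          # assumed scale: 1 (low impact) to 5 (high impact)
    weeks_until_due: float   # urgency
    effort_weeks: float      # effort

    @property
    def slack_weeks(self) -> float:
        """Weeks the task could be postponed and still finish on time."""
        return self.weeks_until_due - self.effort_weeks


def delay_candidates(tasks, min_slack_weeks=4.0):
    """Tasks with enough slack to postpone, least important first."""
    eligible = [t for t in tasks if t.slack_weeks >= min_slack_weeks]
    return sorted(eligible, key=lambda t: (t.importance, -t.slack_weeks))


# Hypothetical backlog, invented for illustration.
tasks = [
    Task("Billing migration", 5, 13, 11),     # due in ~3 months, needs ~2.5: slack 2
    Task("Dashboard refresh", 2, 13, 2),      # slack 11
    Task("Log retention cleanup", 3, 8, 1),   # slack 7
    Task("Certificate rotation", 4, 2, 1),    # slack 1
]

candidates = delay_candidates(tasks)
for t in candidates:
    print(f"{t.name}: importance={t.importance}, slack={t.slack_weeks} weeks")
```

Note that the billing migration is due in three months yet is not a candidate, because its effort consumes nearly all of that time.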
Q2. What suggestions do you have for people who want to hold better incident retrospectives and learn more from incidents?
Approach the incident from a purely learning perspective. Don’t go into it looking to assign blame or with the goal of “ensuring that this never happens again.” Blame is seldom useful and usually laid at the wrong feet. Preventing the issue in the future will be a natural outcome of the learning process.
You’ll also need to be prepared to dig deep. You’ll have to go beyond the surface-level questions of who did what and focus in on the why. What made an engineer think that was the right action to take? How did the engineer’s idea of how the system worked compare to how it actually works? Why were those two mental models so divergent? You’ll be amazed at how frequently someone’s perspective on a system is not only flawed, but shared across many engineers. You’ll also learn that simple things, such as how a component is named, can lead an engineer down an incorrect path of thought. A parameter flag named “active,” which takes a boolean value, can easily be misinterpreted so that an engineer provides the inverse of the value they intended. Rooting out these misconceptions is one of the easiest ways not only to improve people’s understanding of the system but also to enlighten them on how people think through problems. Be sure to challenge any assumptions that are made so that the group understands where those assumptions come from.
Running a retrospective isn’t easy, however. It requires a delicate balance of persistence, curiosity, and guidance. Find someone who embodies these traits, preferably someone who was not involved with the incident response. You may think that a technical person needs to run these meetings, but in my experience, that’s not necessary.
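The “active” flag problem is easy to see in code. The sketch below is hypothetical: the function names, the node name, and the idea of draining a node are invented for illustration. It contrasts a bare boolean, where the reader has to guess what `True` means, with an explicit named state.

```python
from enum import Enum


def set_node(name: str, active: bool) -> str:
    """Ambiguous: does active=True mean 'keep serving' or 'activate the drain'?"""
    return f"{name}: {'serving traffic' if active else 'drained'}"


class NodeState(Enum):
    SERVING = "serving traffic"
    DRAINED = "drained"


def set_node_state(name: str, *, state: NodeState) -> str:
    """Explicit: the call site names the outcome, so there is nothing to invert."""
    return f"{name}: {state.value}"


# An engineer who wants to drain web-1 might reasonably pass True here,
# reading the flag as "make the drain active", and get the opposite result.
ambiguous = set_node("web-1", True)

# With a named state, the intent is visible at the call site.
explicit = set_node_state("web-1", state=NodeState.DRAINED)

print(ambiguous)
print(explicit)
```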
Q3. There’s been an explosion of infrastructure tools over the last five years. How do you learn about new tools, and how do you evaluate them?
I think tools are overrated. More specifically, I think the discovery of new tools receives too much energy. I discover new tools as I encounter problems that I can’t solve. At that point, I go out and I look for tools that address my problems. I find that many people in the field find a tool and then back their way into the problem that it solves for them. But if a problem is a big enough pain point, I assure you, you’ll have no problem finding tools that meet your needs. Finding tools is only difficult when you haven’t properly identified your problem.
As far as evaluating tools goes, I go back to the pain points discussion. If you understand your pain points, then choosing a tool is pretty easy. You’ll have a clear understanding of what your requirements are and what your “nice to haves” are. After you’ve used that to eliminate options, your remaining palette should be evaluated on the momentum behind the tool and the community that surrounds it. I don’t remember who said it, but there was a speaker who said something along the lines of “When you pick a tool, you’re also picking a community.” That stuck with me. Make sure you’re choosing a community that you’re happy to participate in and interact with, especially if your tool is open source.
Once you’ve identified requirements, community, and viability, you can just throw a dart at the remaining options and run with it. I think there’s very little value in over-scrutinizing a tool choice. The more you work with a tool, the more inevitable it is that you’ll eventually become frustrated by it. It’s just the nature of things. You’ll make decisions that seem sound in the moment, but future you will be cursing you for them. I feel that journey is all but inescapable, so wasting time debating the syntactic sugar of a tool bears no fruit.
Q4. Systems are getting more complex all of the time. How do you think about the tradeoffs around complexity when building and managing systems?
Complexity is always top of mind for me. Complexity is a beast that seems to only grow, and the more you feed it, the more complex things get. I think one thing to be cognizant of is the floor for complexity. Casey Rosenthal and Nora Jones nail it in their book Chaos Engineering where they say, “In fact, a consequence of the Law of Requisite Variety is that any control system must have at least as much complexity as the system that it controls.” This is an example of how, in certain circumstances, complexity has to become an accepted reality and attempts to “simplify” things are really just ignoring facets of the problem.
As you begin building systems, the complexity conversation becomes a part of the lifecycle of the tool you’re building. When we design a management system at Centro, we start off with the simplest thing possible. Taking an agile approach, we’re trying to make sure that the problem we’re solving for is actually going to work. Lines of code is not a good predictor of success. We might start with something as simple as a shell script. Over time, not only do we learn how the system is used, but we uncover edge cases we would have never thought of sitting at a whiteboard solving for a problem we knew very little about. Once we get this additional information, we begin to understand the problem better, but also the level of complexity necessary to solve it. Then the question becomes balancing the downside of not solving the edge case versus the complexity that solving it brings. Sometimes a basic limitation to the system needs to be acknowledged.
I find that our infrastructure automation is often challenged with small, one-off use cases by a small number of developers. Do I want to make the automation more complex for one or two uses a year? No, because I’ll have to deal with that complexity constantly. If a groundswell of support develops for the new feature or use case, then I’ll re-evaluate it. But complexity can never be reduced once introduced. So make sure it’s worth it.
In the end, the complexity conversation comes down to a question of tradeoffs. Sometimes your application may simply not handle a scenario, because handling it would add complexity that only matters a small percentage of the time. In some cases, it’s perfectly acceptable to have things choke and toss it up to a human to evaluate.
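The “choke and toss it up to a human” approach can be sketched as follows. Everything here is hypothetical: the request shape, the supported sizes, and the in-memory review queue are assumptions for illustration, not a description of Centro’s tooling. The automation handles the common cases and refuses the rare ones loudly instead of growing a code path for each.

```python
class NeedsHumanReview(Exception):
    """Raised when a request falls outside what the automation supports."""


# Assumption: the automation only knows about these standard sizes.
SUPPORTED_SIZES = {"small": 2, "medium": 4, "large": 8}


def provision(request: dict) -> dict:
    """Handle the common case; refuse anything unusual rather than guess."""
    size = request.get("size")
    if size not in SUPPORTED_SIZES or request.get("custom_network"):
        raise NeedsHumanReview(f"unsupported request, route to an operator: {request}")
    return {"name": request["name"], "cpus": SUPPORTED_SIZES[size]}


def handle(request: dict, review_queue: list):
    """Run the automation, parking anything it rejects for a person to evaluate."""
    try:
        return provision(request)
    except NeedsHumanReview as exc:
        review_queue.append(str(exc))
        return None


queue = []
ok = handle({"name": "api", "size": "medium"}, queue)
odd = handle({"name": "batch", "size": "medium", "custom_network": True}, queue)

print(ok)
print(odd, len(queue))
```

If the review queue starts filling up with the same kind of request, that is the groundswell signal that the use case may have earned its complexity.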