Alex Hidalgo is a Site Reliability Engineer at Squarespace, and he’s currently writing a book called Implementing Service Level Objectives for O’Reilly Media. The first three chapters of the book are available now through O’Reilly’s early access program. I had a chance to read those chapters and ask Alex some questions about service level objectives and reliability. Thanks, Alex, for sharing your knowledge.
You talk about The Reliability Stack, which is a term I like a lot. Can you explain what it is?
When people talk about service level objectives, they tend to use the term SLO to encompass the entire process, but really there are three primary components at play. You have service level indicators, or SLIs, which are measurements of the performance of your service from your users’ point of view. SLIs inform service level objectives, or SLOs, which are essentially just targets for how often your SLIs should be in a good state. SLOs in turn power error budgets, which are a way of measuring how you’ve performed against your target over a window of time.
When developing slides for a talk at DevOpsDays NYC 2019, I came up with a little graphic showing this, with SLIs on the bottom, SLOs in the middle, and error budgets at the top. I found myself staring at this diagram for a bit, thinking to myself that we can’t just keep referring to all of this as “doing SLOs,” since SLOs are really just one component. So, I came up with the term The Reliability Stack to refer to how these three components interact.
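To make the three layers concrete, here is a minimal Python sketch of how an SLI, an SLO target, and an error budget relate to one another. It is an illustration only, not code from the book: the request counts, the 99.9% target, and all of the names (`WindowCounts`, `sli`, `error_budget_remaining`) are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counts for one SLO window (hypothetical numbers)."""
    good: int   # requests that were "good" from the user's point of view
    total: int  # all valid requests in the window


def sli(counts: WindowCounts) -> float:
    """SLI: fraction of requests that were good for users."""
    return counts.good / counts.total


def error_budget_total(counts: WindowCounts, slo_target: float) -> float:
    """Number of bad requests the SLO target tolerates over the window."""
    return (1.0 - slo_target) * counts.total


def error_budget_remaining(counts: WindowCounts, slo_target: float) -> float:
    """Fraction of the budget left; a negative value means it is exhausted."""
    bad = counts.total - counts.good
    return 1.0 - bad / error_budget_total(counts, slo_target)


window = WindowCounts(good=999_400, total=1_000_000)
target = 0.999  # assumed SLO target: 99.9% of requests are good

print(f"SLI: {sli(window):.4%}")
print(f"Budget remaining: {error_budget_remaining(window, target):.0%}")
```

With these made-up numbers, 600 of the 1,000 tolerated bad requests have been used, so roughly 40% of the budget is left. The layering mirrors the stack: the SLI is measured, the SLO is a target applied to it, and the error budget is derived from both.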
You say that service level indicators (SLIs) are the most important concept in the book. Why is that? And what makes a good SLI?
SLIs are the most important part of SLO-based approaches to reliability for a few reasons. The first is that they’re explicitly measurements of the performance of your service that should get as close to the user experience as possible. There are tons of metrics and other telemetry you can collect about a service, but they’re not all ones that users care about. For example, a user does not care about the error rate between your service and the database behind it. They care about whether the service itself is returning the data they’re asking for.
I say that SLIs are the most important part of the book because they force you to think about things from your users’ perspective. I often say that “Your users determine your reliability, not you,” and developing meaningful SLIs is the first step in understanding that.
The second reason I find SLIs so important is that they inform the rest of your stack. If your SLIs aren’t meaningfully measuring your users’ experiences, then the SLO targets and error budgets informed by these SLIs will be flawed themselves.
A good SLI is a measurement that mimics your users’ experiences as closely as possible. Think about the humans involved before anything else.
When I first started learning about SRE, Service Level Objectives (SLOs) were the thing that I struggled with most. I understood the concept but had trouble determining what the SLOs for some of the services I supported should be. What would you tell someone who’s trying to figure out SLOs for the first time, or has struggled with them?
Lots of people struggle with SLOs at first, and that’s basically the reason there is going to be an entire book about them. As you said, the concepts are easy to understand, but actually implementing them can be difficult. People need a lot more help than just some definitions to get everything off the ground.
I’d also tell this person they need to remember that SLOs are not SLAs, or service level agreements. There is no formal contract in place with SLOs. Your SLI measurements and SLO targets can (and should!) change over time as you refine things or as the state of the world changes. That’s all fine! Get started, look at the numbers, and use them to have conversations about the state of your service and about whether the measurements you’ve chosen are the correct ones.
I love this quote from the book: “You may never get to the point of having reasonable SLO targets or calculated error budgets that you can use to trigger decision making. But, taking a step back and thinking about your service from your users’ perspective can be a watershed moment for how your team, your organization, or your company thinks about reliability.” We should be thinking about users and aiming for incremental improvement, right?
That’s correct. This is as much of a business decision as anything else. The most important aspect of any computer system is the humans involved. These could be external customers, internal users, company stakeholders, etc. Ultimately we design computer systems to perform some sort of task for these humans, so take a step back and keep them in mind when choosing what to measure. Once you have a better idea about what the experiences of these people are, you have a better idea about what your next reliability improvements might be. And hopefully you can make them before you lose users to a competitor.
You mention that when a service is exceeding its error budget, it’s time to slow down and focus on reliability. In many organizations there’s a lot of pressure to ship new features. Do you have any advice for someone on a team that’s implementing SRE, in terms of getting that org-level buy-in on shifting focus to reliability when needed?
This is one of the most difficult parts of the entire process, if not the most difficult. Using SLO-derived data to make decisions can be counterintuitive for people who have been doing it another way. But, in truth, these kinds of approaches can actually help you ship features quicker.
The best advice I can give is to just get started. Find one team or one product and start measuring things with a target in mind, and drive your decision making and efforts towards that target. Even when there is pressure to ship new features, no one wants their service performing unreliably either. That’s how you get angry customers. SLO-based approaches give you an indicator that tells you when you can ship features, and when you need to focus on reliability.
This same indicator can now tell you when you can move as fast as you want! This should actually make development teams and product teams happier! Have error budget remaining? Ship features all day and all night! Have you exhausted your error budget? Slow down for a bit, pay down some tech debt, whatever it is you need to do to make things reliable again. Once you have, ship features to your heart’s content!
In the book you say that “SLO-based approaches to reliability have become incredibly popular in recent years, which has actually been detrimental in some ways.” In what ways?
I find that the tech industry has a problem where we take popular phrases or approaches and dilute them into nothing. You’ve seen it with the term DevOps, for example. It was originally intended to describe a different approach to how you work — kinda like how SLOs describe a different approach to how you work. Now you have engineers with the title DevOps Engineer. Or organizations named DevOps. What does that even mean? If you’re following the original intended definition, it means as much as saying someone is an SLO Engineer or an Agile Engineer.
I’m afraid we’ll be seeing the term SLO get diluted in the same manner, losing sight of what the entire philosophy actually means. Perhaps we’re already seeing that today. Hopefully the book can help address some of this.
One of my favorite sections of the preview chapters is where you discuss implied agreements and Hyrum’s Law. You also mention that you think teams shouldn’t hide SLOs from users. If we start offering a service that people use, and we don’t set availability expectations, then people will each expect whatever they choose to, right?
Sure. Making SLOs discoverable can be a vulnerable experience, but they’re also an important communications tool. Let your users know what level of reliability you’re looking to achieve, both so they understand and so they have the opportunity to say, “Hey, wait a second. That’s not good enough for my purposes.” Then you have the data you need to fix things before you lose customers without ever knowing why.
Additionally, in my experience people will generally expect the future to look like the past, especially when they don’t know what level of service you’re actually aiming to provide them. This is why you should aim to never operate too far above your SLO target. If people are happy with 99.5% availability, for example, don’t run at 99.99% for too long. The reason is that user expectations might change, and users will start to demand 99.99% when they were truly and honestly happy at 99.5% before. Then you have to spend exponentially more resources keeping things running at 99.99% as opposed to the more reasonable 99.5%.
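To put rough numbers on the gap between those two targets, here is a small Python sketch that converts an availability target into the downtime it tolerates. The 30-day window and the function name `allowed_downtime` are assumptions chosen for illustration; the interview does not specify a window.

```python
from datetime import timedelta


def allowed_downtime(availability_target: float, window: timedelta) -> timedelta:
    """Time a service may be unavailable in `window` and still meet the target."""
    return window * (1.0 - availability_target)


window = timedelta(days=30)  # assumed window length
for target in (0.995, 0.9999):
    downtime = allowed_downtime(target, window)
    print(f"{target:.2%} over 30 days allows {downtime} of downtime")
```

Over 30 days, 99.5% leaves about 3.6 hours of tolerated downtime, while 99.99% leaves a little over 4 minutes. That is a fiftyfold smaller margin, which gives a sense of why holding the higher number costs so much more.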
I’d also be remiss if I didn’t mention something else here. People often conflate availability and reliability when talking about computers. But, reliability is much more than that. Reliability is saying “this system is performing how it is supposed to be performing,” not just that it’s available for use. Being available is certainly an important part of that story, but it’s not everything.
Should on-call teams only be alerting on SLO violations or potential ones? How do we do that and not miss the underlying failures until it’s too late to prevent a breach of the SLO?
In a perfect world, the only page a team receives is when you’re burning through your error budget at a rate that you cannot recover from without human intervention. It takes a lot of maturity and time to get there, but I’ve worked on teams where we successfully got to this point. It’s not impossible!
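One common way to express "burning through your error budget too fast" is a burn rate: the observed error rate divided by the error rate the SLO allows. The Python sketch below shows that idea under stated assumptions. It is not taken from the book, and the 99.9% target, the 10x paging threshold, the request counts, and the function names are all hypothetical.

```python
def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """How fast the error budget is being spent.

    1.0 means the budget would last exactly one SLO window at the
    current error rate; higher values exhaust it sooner.
    """
    if total == 0:
        return 0.0
    observed_error_rate = bad / total
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


def should_page(bad: int, total: int, slo_target: float, threshold: float) -> bool:
    """Page only when the budget is burning at or above the threshold."""
    return burn_rate(bad, total, slo_target) >= threshold


SLO_TARGET = 0.999     # assumed target
PAGE_THRESHOLD = 10.0  # assumed: page at 10x the sustainable burn rate

# Hypothetical last hour: 0.05% errors, below the 0.1% allowance.
print(should_page(bad=50, total=100_000, slo_target=SLO_TARGET, threshold=PAGE_THRESHOLD))
# Hypothetical last hour: 2% errors, about 20x the allowance.
print(should_page(bad=2_000, total=100_000, slo_target=SLO_TARGET, threshold=PAGE_THRESHOLD))
```

In this sketch the first case would not page and the second would. The right threshold and measurement window depend on the service; the point is only that the page is driven by the user-facing SLI rather than by any individual low-level metric.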
The idea is that if your SLI is meaningfully measuring what your users need from your service, why wake people up in the middle of the night for any reason other than this SLI reporting that things are bad for your users too much of the time? Why wake someone up because error rates have spiked if that doesn’t actually impact the user experience in any way?
I’m absolutely not saying you shouldn’t be measuring other things. You should, and these metrics should be easily discoverable by your team. They’re vital to troubleshooting issues. But, I always tell people when they’re trying to figure out what to alert on: “Is this a thing that should be waking you up at 03:00 in the morning on a Sunday?”
A better idea is perhaps to open tickets for underlying issues like you described and just have your team address them during normal working hours. I’m not saying you should abandon other ways of measuring your service, just that maybe you should abandon the things that wake you up or otherwise distract you while you’re not working normal hours.
Has your memory usage suddenly spiked? Sure, you should maybe investigate that. But, you can probably wait to do so at your convenience if users aren’t impacted.
We’ve seen SRE become very popular over the last several years, and different companies are adopting it in different ways. Do you think every company that’s doing SRE or DevOps can benefit from using SLOs?
Yes, I really do. As you stated in one of the previous questions, I think you can get benefits from the process even if you never get to the point of having meaningful targets or error-budget-driven decision making. At the heart of the entire philosophy is the idea that you need to be thinking about all of the humans involved with your service first — from customers to the engineers that support it. That’s the very first step with an SLO-based approach — thinking about people. And, even if that’s the only step you ever take in “using SLOs,” I think that’s a worthy step to take no matter what.
Anything else you’d like to add?
Stop worrying so much about nitty-gritty details and numbers. Measure the stuff that means the most to your users, and everyone involved will end up much happier. That’s the goal here: happier humans.
Thanks so much for your insights, Alex.
Thanks so much, Rich! This was fun.