Site Reliability Engineering (SRE) is a practice for managing the reliability of systems that began at Google in the early 2000s. Ben Treynor Sloss from Google started the first SRE team and coined the name. As he puts it in the book Site Reliability Engineering (published by Google), SRE teams are generally responsible for “the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).” That’s a similar list to the responsibilities of many traditional Operations teams.
SRE was initially a Google innovation, but it’s now spread much more extensively in the industry, mainly due to Google opening up about the practice and evangelizing it. USENIX launched the first SREcon in 2014, and Google published the Google SRE Book in 2016. As more companies have adopted SRE in their own ways, the meaning of the term has evolved. Every SRE team is different, but many themes will apply to SREs across companies.
Next we’ll take a look at the Engineering and Reliability parts of SRE and at SRE team structures, and then talk about some ways you can get started as an SRE.
As Ben Treynor Sloss says in the Google SRE book, “SRE is what happens when you ask a software engineer to design an operations team.” In building the SRE team at Google, he focused on hiring software engineers, as opposed to traditional systems administrators. The idea was to make the SRE team a force multiplier. By automating more of the work they do, the SRE team could manage the rapidly growing systems at Google while scaling the team less.
Toil is an essential concept in SRE. As Vivek Rau points out in the Google SRE book, toil isn’t simply work that the Google SREs don’t want to do. He defines some criteria for what kind of work is considered toil:
- No enduring value
- Scales with service growth
By working to eliminate the tasks we do that are toil, we give ourselves time to focus on more exciting work. Google expects its SREs to spend 50% or more of their time working on engineering projects outside of regular operational work.
It’s important to note that the SRE focus on automation didn’t arise in a vacuum. The early 2000s saw Configuration Management emerge as a practice and Infrastructure as Code. Titles like Operations Engineer and Infrastructure Engineer became more common, as teams realized they could leverage automation to make the tasks they do more repeatable. Tools like Puppet, Chef, Ansible, and later Terraform, have allowed people to provision and maintain large infrastructures through code.
The level of engineering work that happens in an SRE shop outside of Google can vary pretty widely. In some shops, teams will be mainly focused on using tools like Terraform and Kubernetes that are written by someone else, and using APIs to glue things together. Other shops will have SREs that write a lot of code, and even write services that manage things like provisioning or deployments. Some of those tools become open sourced and shared with large communities.
SREs at Google contribute code to the services written by Google’s Software Engineers (SWEs). In other shops, there may be a more explicit boundary between who commits code for those services, and SREs may only be able to land code to their SRE team’s codebases.
If you’re looking at potential SRE roles, it’s important to get an idea of what kind of engineering work the team does and how that fits with your skills and interests.
Reliability is the core job of SREs, and it encompasses many areas (see that list in the first paragraph). We’ll focus here on some key concepts that are unique to SRE: service level indicators, service level objectives, and error budgets.
Service Level Indicators (SLIs)
Let’s start with service level indicators. The chapter in the Google SRE book that covers them (written by Chris Jones, John Wilkes, and Niall Murphy with Cody Smith) defines an SLI as “a carefully defined quantitative measure of some aspect of the level of service that is provided.” Said more simply, an SLI is a metric of some sort that indicates something about the performance or health of the service. Some common examples of SLIs are request latency and error rate, but there can be many others depending on your service. It’s important to note that “the CPU of the host is at 75 percent usage” is not the kind of thing that makes for a good SLI. We want to look at metrics that describe the health of the service to the people that are using it. An end-user doesn’t care about your CPU usage. They only care that they can do what they need to with your application in a reasonable amount of time.
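To make that concrete, here’s a minimal sketch of computing a request-success SLI from raw counters. The counter values are invented for illustration; in practice these would come from your metrics system.

```python
# Sketch: an availability SLI as the fraction of requests served successfully.
# The numbers below are illustrative, not from any real system.

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully (a common SLI)."""
    if total_requests == 0:
        return 1.0  # no traffic: conventionally treat as fully available
    return successful_requests / total_requests

sli = availability_sli(successful_requests=9_985, total_requests=10_000)
print(f"availability SLI: {sli:.2%}")  # 99.85%
```

Note that the SLI is expressed from the user’s perspective (did requests succeed?), not the machine’s (how busy is the CPU?).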
Service Level Objectives (SLOs)
Next are service level objectives (SLOs). The Google SRE book defines an SLO as “a target value or range of values for a service level that is measured by an SLI.” So what does that mean? An SLO is the higher-order measure of the service’s health and performance from the perspective of the user. When you think about SLOs, think about things that impact users. An example of an SLO from the book is, “99% (averaged over 1 minute) of Get RPC calls will complete in less than 100 ms (measured across all the backend servers).” It’s very precise. We can infer that if fewer than 99% of Get RPC calls complete in under 100 milliseconds, users will be unhappy. In the end, that’s what reliability is for: making the people who use our applications happy.
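Here’s a hedged sketch of evaluating the book’s example SLO against a batch of latency measurements. The latencies are made-up sample data, and a real system would compute this over a sliding window in a metrics backend rather than in-process.

```python
# Sketch: does a set of Get RPC latencies meet the example SLO of
# "99% of calls complete in less than 100 ms"? Sample data is invented.

def meets_latency_slo(latencies_ms, threshold_ms=100, target=0.99):
    """True if the fraction of calls under threshold_ms meets the target."""
    if not latencies_ms:
        return True  # no calls in the window: vacuously within SLO
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms) >= target

sample = [12, 48, 95, 99, 101, 30] + [50] * 194  # 200 calls, 1 slow
print(meets_latency_slo(sample))  # True: 199/200 = 99.5% under 100 ms
```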
Error budgets are the last of these three pillars of reliability. Error budgets are covered in another chapter of the Google SRE book, written by Marc Alvidrez. Alvidrez defines an error budget as “a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter.” SLIs and SLOs bubble up into error budgets, and error budgets are the clear, quantitative measure of how our service is doing reliability-wise. Error budgets also give us guardrails for how we operate our services. If your service is well below its error budget for the quarter, that’s a great time to innovate and take risks. If your service is running out of error budget, that’s a time to slow down and be more conservative. With an error budget, we have a number that lets us know whether we’re at risk of making our users unhappy.
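The arithmetic here is simple: the error budget is whatever the SLO leaves over. A sketch, with invented numbers:

```python
# Sketch: deriving an error budget from an SLO target and tracking how
# much of it has been spent this quarter. All numbers are illustrative.

def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the quarter's error budget still unspent."""
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0  # a 100% SLO leaves no budget at all
    return 1 - failed_requests / allowed_failures

# A 99.9% SLO over 1,000,000 requests allows roughly 1,000 failures.
remaining = error_budget_remaining(0.999, 1_000_000, failed_requests=400)
print(f"error budget remaining: {remaining:.0%}")  # 60%
```

With 60% of the budget left, this hypothetical team has room to ship risky changes; at 5%, they’d slow down.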
SLIs, SLOs, and error budgets represent a huge shift in thinking from how traditional sysadmins looked at systems. One of the biggest things this impacts is what teams send alerts on. I recently interviewed Alex Hidalgo, who is writing a book called Implementing Service Level Objectives for O’Reilly, and I asked him about this topic. He said:
“In a perfect world, the only page a team receives is when you’re burning through your error budget at a rate that you cannot recover from without human intervention. It takes a lot of maturity and time to get there, but I’ve worked on teams where we successfully got to this point. It’s not impossible! The idea is that if your SLI is meaningfully measuring what your users need from your service, why wake people up in the middle of the night for any reason beyond this SLI reporting things are bad for your users too much of the time? Why wake someone up if error rates have spiked if that doesn’t actually impact the user experience in any way?”
So instead of PagerDuty or your NOC calling someone at 3 AM to tell them that a CPU is at 75 percent usage, you alert on situations where you’re at risk of running out of error budget.
But what about the lower-level problems? What if a disk is full or a host is down? We want to know about that still, right? Well, we do, but not at 3 AM. Instead, we can create a ticket in our ticketing system to let us know about the problem. If those conditions are impacting users, we’ll get an alert about it indirectly based on the error budget, and we can deal with the problem then. If not, we can clean up that disk or provision a new host during regular working hours.
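The page-versus-ticket decision above can be sketched as a simple triage on the error budget burn rate. The thresholds below are illustrative assumptions, not values from the SRE book; real setups typically alert on burn rates over multiple time windows.

```python
# Sketch: page only when the error budget is burning fast enough to
# threaten the SLO; file a ticket otherwise. Thresholds are assumptions.

def triage(burn_rate: float, page_threshold: float = 10.0) -> str:
    """burn_rate = 1.0 means spending budget exactly as fast as allotted."""
    if burn_rate >= page_threshold:
        return "page"    # humans needed now, budget will be gone soon
    if burn_rate > 1.0:
        return "ticket"  # budget shrinking; handle during working hours
    return "none"        # within budget, no action needed

print(triage(14.4))  # page
print(triage(2.0))   # ticket
print(triage(0.5))   # none
```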
As Alex mentioned, it takes some work to get to that point, but it’s well worth it if you can make it there.
Every SRE team is different and will have its own structure, but some popular patterns for team structures have emerged. At Google, while SREs belong to a team together, they work embedded on teams that are building services. Another common pattern is to have a core team of SREs that support all of the other teams. Let’s compare these two options.
The pattern where SREs are embedded in service teams makes a lot of sense. People who specialize in one service (or a few) are likely to gain a lot of expertise and domain knowledge about how those services work. That can be very handy when troubleshooting, or rolling out new versions of the service. But I think there’s at least the potential that people working in this model end up heads-down, focused on the services they support, and not sharing knowledge across the broader SRE team. I’ve also spoken to several Google SREs who supported different kinds of services, and it became clear to me pretty quickly that their experiences differed quite a bit depending on the service team they were embedded with. Some factors that may come into play are the age of the service, and the risk tolerance of the company when it comes to operating that service.
The core SRE team pattern, where there is a central team of SREs, also has advantages. This team can potentially work closer together and communicate better, but that comes with likely having less expertise in the services, depending on how many services there are. Teams in this pattern often operate more as a higher tier of escalation for problems, and provide infrastructure and tools that the service teams use. A big focus of this kind of team is creating self-service infrastructure. They want to make their infra as easy as possible for other teams to consume. They may manage Kubernetes clusters that all of the service teams run their apps on, for example.
There are other possible team structures, too, like hybrids of these approaches (embedded SREs with a central team to escalate to). Netflix is an interesting example, as it has core SRE and Reliability teams, but it also has many independent service teams. One of Netflix’s core principles is “Freedom and Responsibility,” meaning that you have the freedom to do things the way you want, but you’re also on the hook for them working. A service team at Netflix can use the tools the core teams have provided, but in some cases they have the option to use their own tooling if they prefer.
Of course, the right answer is to use the structure and team that fits best for your team and organization. As an individual SRE you’re not likely to have control over the team structure, but it’s important to think about it.
SRE is a vast topic, and in this piece we’ve just scratched the surface. I’ve covered some ideas that I think are core to SRE, but there are many other things to learn about. A great place to start is the Google SRE books. There are three of them now that Google has made available to read for free online. If you’re new to SRE, start with the Site Reliability Engineering book. It’s the first of them that was written and covers many of the Google SRE practices. The book is an anthology of articles by different authors on different topics, so feel free to skip around; you don’t need to read it cover to cover. There are some other books on SRE as well, like Seeking SRE by David Blank-Edelman. Earlier I mentioned Alex Hidalgo’s upcoming book on SLOs. I’ve read the preview chapters, and I think it will be excellent. Alex’s book will be released in October of 2020, but you can read the preview chapters now if you have an O’Reilly account.
Unless you’re focusing specifically on Windows, core knowledge of Linux is a must-have skill for SREs. There are a lot of books and courses out there. I’ve taken one of the Red Hat certification prep classes at Linux Academy. I wasn’t even planning to take the test for the cert, but that course was a great way to brush up on Linux skills. Linux Academy has many other classes available, and the company is not paying me to say any of this. Another resource I highly recommend is the zines Julia Evans does about Linux and other topics. I have several of them, and she is fantastic at explaining technical topics visually.
Containers and Kubernetes are other things I’d focus on. Kubernetes is in widespread use now, and many SRE teams will be looking for experience with it. Another online class I’ve taken that I can recommend is Bret Fisher’s Kubernetes Mastery class on Udemy. Bret is great at explaining things, and he often puts the course on sale, so watch for that. A free resource is the excellent TGIK streams that the VMware team does, Fridays at 1 PM Pacific time. Besides the live streams, they have a huge archive of past shows. And to plug my work, I host a podcast called Kube Cuddle, where I interview people from the Kubernetes community.
There are many other topics it would help to research as well. Rather than trying to list a bunch of them, I’m going to direct you to the Awesome SRE list on GitHub. It has links to resources on things like being oncall, capacity planning, incident management, and more.
I hope this overview of SRE was helpful to you, and I wish you the best of luck if you are pursuing a new role as an SRE or trying to land one.