What is SRE?
Site Reliability Engineering (SRE) is a practice for managing the reliability of systems. Google originally developed SRE in the early 2000s, when Ben Treynor Sloss started the first SRE team, coined the name, and set the tone for the industry.
Treynor Sloss states in the guide Site Reliability Engineering, “SRE is what happens when you ask a software engineer to design an operations team.” In building the SRE team at Google, he focused on hiring software engineers as opposed to traditional systems administrators. The idea was to make the SRE team a force multiplier. By automating more of their work, the SRE team could manage the rapidly growing systems at Google while scaling the team less.
What does an SRE or SRE Team do?
SRE teams are generally responsible for “the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).”
An SRE’s time is split between operational work, such as on-call duties, and engineering work. That engineering work may include implementing automation, creating new features, or scaling the system to increase site reliability and performance. Additional responsibilities include:
Monitoring and maintaining performance: Keeping track of performance and analyzing any issues to ensure the system processes information correctly and promptly.
Improving performance: Ensuring delivery pipelines are efficient by minimizing latency, or the time it takes for a website or application to respond to a user’s action.
Implementing automation: Looking for ways to enhance and/or automate operational tasks.
Incident response: When a system outage or other issue arises, SREs enact incident response protocols to handle the situation.
Capacity planning: Managing and planning the system’s workload in anticipation of the future.
SREs and Automation
To understand why SREs must spend time on automation, you must understand toil, an essential concept of SRE. As Vivek Rau points out in the Google SRE book, toil isn’t simply the work that SREs don’t want to do. He defines some criteria for what kind of work is considered toil:
Work with no enduring value
If we can eliminate toil, we give ourselves time to focus on more engaging work. Google expects its SREs to spend 50% or more of their time working on engineering projects outside of regular operational work.
It’s important to note that the SRE focus on automation didn’t develop overnight. The early 2000s saw Configuration Management and Infrastructure as Code emerge as practices. Titles like Operations Engineer and Infrastructure Engineer became more common as teams realized they could use automation to make their tasks more repeatable. Tools like Puppet, Chef, Ansible, and later Terraform have allowed people to provision and maintain large infrastructures through code.
The level of engineering work that happens in an SRE organization outside of Google can vary pretty widely. In some organizations, teams will be focused on using tools like Terraform and Kubernetes that are written by someone else and using APIs to glue things together. Other organizations will have SREs that write a lot of code and even write services that manage things like provisioning or deployments. Some of those tools become open-sourced and shared with large communities.
SREs at Google contribute code to the services written by Google’s Software Engineers (SWEs). There may be a more explicit boundary between who commits code for those services in other organizations, and SREs may only be able to land code to their SRE team’s codebases.
If you’re looking at potential SRE roles, it’s important to get an idea of what kind of engineering work the team does and how it fits your skills and interests.
SREs, Reliability, and Incidents
Reliability is the core job of SREs. It encompasses many areas (see the list under “What does an SRE or SRE Team do?”). Some key concepts around reliability are service level indicators, service level objectives, and error budgets. SLIs, SLOs, and error budgets represent a considerable shift in thinking from how traditional sysadmins looked at systems. They are defined as follows:
Service Level Indicators (SLIs): A carefully defined quantitative measure of some aspect of the level of service that is provided
Service Level Objectives (SLOs): A target value or range of values for a service level that is measured by an SLI
Error Budgets: A clear, objective metric that determines how unreliable the service is allowed to be within a single quarter
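To make the three definitions concrete, here is a minimal sketch of the arithmetic for a request-based availability SLI. The function name, the field names, and the request counts are hypothetical and chosen only for illustration; real services often define SLIs over latency or other measures instead.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # measured fraction of good requests in the window
    slo: float               # target fraction of good requests
    budget_total: float      # failed requests the SLO tolerates in the window
    budget_remaining: float  # failed requests still allowed (negative if overspent)


def error_budget_report(good_requests: int, total_requests: int, slo: float) -> SloReport:
    """Compute an availability SLI and the error budget left for one window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")

    sli = good_requests / total_requests
    # The error budget is whatever the SLO leaves over: (1 - SLO) of all requests.
    budget_total = (1.0 - slo) * total_requests
    failed = total_requests - good_requests
    return SloReport(
        sli=sli,
        slo=slo,
        budget_total=budget_total,
        budget_remaining=budget_total - failed,
    )


# Hypothetical quarter: 10 million requests, 5,000 of them failed, 99.9% SLO.
report = error_budget_report(good_requests=9_995_000, total_requests=10_000_000, slo=0.999)
print(f"SLI={report.sli:.4%}, budget left={report.budget_remaining:.0f} requests")
```

With these made-up numbers, a 99.9% SLO tolerates roughly 10,000 failed requests in the window, and about half of that budget has been spent.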
These three concepts have the most significant impact on what counts as an incident and when teams should send alerts. For example, instead of alerting someone to tell them that a CPU is at 75 percent usage, you alert them when you’re at risk of running out of error budget.
But what about the lower-level problems? What if a disk is full or a host is down? We want to know about that still, right? Well, we do, but not at 3 AM. Instead, we can create a ticket in our ticketing system to let us know about the problem. If those conditions impact users, we’ll get an alert about it indirectly, based on the error budget, and deal with the problem then. If not, we can clean up the disk or provision a new host during regular working hours.
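The page-versus-ticket decision described above can be sketched as a small routing function. This is an illustration under stated assumptions, not a recommended alerting policy: the names are hypothetical, and the page threshold of 10 is an arbitrary example value. The burn rate used here is the observed error ratio divided by the ratio the SLO allows, so a value of 1.0 means the budget is being spent exactly on schedule.

```python
from enum import Enum


class Action(Enum):
    PAGE = "page"      # alert a human right now
    TICKET = "ticket"  # file it for regular working hours
    NONE = "none"      # nothing to do


def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return error_ratio / (1.0 - slo)


def route(error_ratio: float, slo: float, low_level_problem: bool,
          page_threshold: float = 10.0) -> Action:
    """Decide how to handle a condition (page_threshold is an assumed example value)."""
    # Users are affected badly enough to threaten the error budget: page.
    if burn_rate(error_ratio, slo) >= page_threshold:
        return Action.PAGE
    # A full disk or a down host that is not hurting users can wait for a ticket.
    if low_level_problem:
        return Action.TICKET
    return Action.NONE


# A full disk while only 0.02% of requests fail against a 99.9% SLO: ticket.
print(route(error_ratio=0.0002, slo=0.999, low_level_problem=True))
# The same full disk while 2% of requests fail: page.
print(route(error_ratio=0.02, slo=0.999, low_level_problem=True))
```

The point of the design is that the low-level condition never pages on its own. Only the user-facing signal, expressed through the error budget, decides whether someone gets woken up.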
The Vast World of SRE
The topics under SRE are wide-ranging and even contentious, depending on who you speak to. While SRE was initially a Google innovation, it has now spread much more extensively in the industry. For example, USENIX launched the first SREcon in 2014, and Google published the Google SRE Book in 2016. As it’s still an emerging practice, you may find that SRE definitions and responsibilities are evolving and vary from business to business. Every SRE team is different, but many themes will apply to SREs across companies.
FireHydrant was founded by SREs who felt existing tools weren’t effective for leveraging SRE best practices. Our platform was built with a framework so that any business interested in adopting SRE culture can jump in and start fighting incidents in a consistent and scalable manner (check out “How it Works” to learn more).
In the meantime, here are some SRE resources for additional learning!
A great place to start is the Google SRE books. There are three available to read for free online. If you’re new to SRE, start with the Site Reliability Engineering book. The book is an anthology of articles by different authors on different topics, so feel free to skip around. You don’t need to read it cover to cover.
Core knowledge of Linux is a must-have skill for SREs. Check out the Red Hat certification prep classes at Linux Academy and read the zines Julia Evans writes about Linux and other topics.
Containers and Kubernetes are other topics to focus on. Try this online class from Bret Fisher called Kubernetes Mastery on Udemy.
Also, watch new and old episodes of TGIK (TGI Kubernetes) from VMware, which is live on Fridays at 1 PM PT.
There are many other resources, so check out this extensive list, Awesome SRE on GitHub, which includes topics such as being on-call, capacity planning, incident management, and more.