What is “On Call”?
In the engineering world, being “on call” means being available to be contacted if an incident or issue arises. The designated engineer or group of engineers may be on call whether that falls during the workday or after regular business hours.
Talk to anyone who has been on call, and you will hear tales of being paged in the middle of the night, or that dreadful page at 5 PM on a Friday, right as their on-call rotation is ending.
Who is On Call, and What is an On-Call Team?
Software engineering organizations are often split between those who build the software and those who keep the software running. The latter will generally have an on-call rotation to kick off an incident response process when a sudden event happens, such as a system outage, bad deployment, or security breach. They’ll also manage the process through investigating, mitigating, communicating, and working towards a resolution.
These teams that ‘keep the system running,’ so to speak, are composed of sysadmins, DevOps engineers, or SREs (Site Reliability Engineers; see our previous post “What is SRE,” where we break it down).
The members of your on-call team depend on your organization’s engineering culture. For example, we designed FireHydrant for flexibility, but we built our standard recommended practice to accommodate SRE practices or organizations seeking to adopt SRE practices.
How to Schedule an On-Call Team for Effective Alert Monitoring
Effective on-call monitoring and on-call incident response require assigning engineers who will tend to this specific task while on their shift. For example, you can use scheduling software to assign the on-call engineer who should respond when an alert fires. It is best practice to have “on-call rotations” so that the weight of incident response is shared and someone is always monitoring all systems for alerts. More than one issue could occur during a single shift, and on-call rotations help to balance the workload if this happens.
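To make the idea of a rotation concrete, here is a minimal sketch in Python of how a weekly rotation could decide who is primary and who is backup at a given moment. The roster names, the start date, and the one-week shift length are all assumptions made up for illustration; real scheduling tools also handle overrides, time zones, and holidays, which this sketch ignores.

```python
from datetime import datetime, timedelta

# Hypothetical roster and shift length, chosen only for illustration.
ENGINEERS = ["alice", "bob", "carol"]
ROTATION_START = datetime(2024, 1, 1, 9, 0)  # first hand-off: a Monday at 09:00
SHIFT_LENGTH = timedelta(days=7)


def on_call_at(moment: datetime) -> tuple[str, str]:
    """Return (primary, backup) for the shift that contains `moment`."""
    if moment < ROTATION_START:
        raise ValueError("moment is before the rotation began")
    # Number of whole shifts that have elapsed since the rotation started.
    shift_index = (moment - ROTATION_START) // SHIFT_LENGTH
    primary = ENGINEERS[shift_index % len(ENGINEERS)]
    # The backup is simply the next engineer in line.
    backup = ENGINEERS[(shift_index + 1) % len(ENGINEERS)]
    return primary, backup


if __name__ == "__main__":
    print(on_call_at(datetime(2024, 1, 10, 3, 30)))
```

Making the backup the next person in the roster is one simple design choice: whoever backs up this week already has context when they become primary the following week.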
What Happens During On Call?
During an on-call rotation, the primary engineer on duty, and in some cases their backups, must always be ready to tend to system issues as they arise. They will need their computer, internet access, and, most importantly, access to whatever channel alerts them to incidents (generally an alerting provider).
When a problem arises, what they do next is the response process. Here are some things that may take place when an issue arises within a system or application:
Alert: Typically, engineering organizations will utilize an alerting tool that’s configured to notify whoever is on-call that there’s an issue that needs their attention.
Incident declaration: Your organization’s incident process will determine your next steps after the alert hand-off, but depending on the severity of the incident, you may need to sound the alarm - that is, declare there is an incident.
Again, depending on your organization and the severity of your incident, this may require you to alert a specific group of stakeholders and subject matter experts (SMEs), create a Slack room and conference bridge to gather, and/or create communications via a status page.
Customer communications: If the problem affects customers, the incident response process may require the on-call engineer to create an external notification via a status page.
Investigate, Mitigate, Resolve, Escalate: The most important part of being on call is your ability to do these four things quickly. Once you’ve gathered your team and notified the right people, you can focus on investigating the causes behind the incident and work on mitigating and resolving it to restore everything to working order. It is just as important to know when you need to escalate to get things back in order.
Notice of restoration: Once service is restored, the on-call team member informs all parties (internally and externally), as necessary, that the incident is resolved.
Retrospective: We must learn from incidents to prevent them in the future. Generally, the primary on-call engineer will have to create a retrospective, also commonly known as a postmortem, and hold a meeting to discuss their findings.
These are some of the actions an on-call team may take to resolve an incident. The number of steps and the time to resolution vary depending on the issue’s severity or complexity. More than one issue could also be in progress at any one time.
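One way to picture the steps above is as a small state machine that an incident moves through. The Python sketch below is only an illustration: the stage names and the allowed transitions are assumptions loosely based on the list above, not the model of any particular tool, and your organization's process may have more or fewer stages.

```python
from enum import Enum


class Stage(Enum):
    ALERTED = "alerted"
    DECLARED = "declared"
    INVESTIGATING = "investigating"
    MITIGATED = "mitigated"
    RESOLVED = "resolved"
    RETROSPECTIVE = "retrospective"


# Allowed moves between stages; an assumption made for this sketch.
TRANSITIONS = {
    Stage.ALERTED: {Stage.DECLARED},
    Stage.DECLARED: {Stage.INVESTIGATING},
    Stage.INVESTIGATING: {Stage.MITIGATED, Stage.RESOLVED},
    # A mitigation may not hold, so investigation can resume.
    Stage.MITIGATED: {Stage.INVESTIGATING, Stage.RESOLVED},
    Stage.RESOLVED: {Stage.RETROSPECTIVE},
    Stage.RETROSPECTIVE: set(),
}


class Incident:
    """Tracks which stage an incident is in and how it got there."""

    def __init__(self, title: str, severity: str):
        self.title = title
        self.severity = severity
        self.stage = Stage.ALERTED
        self.history = [Stage.ALERTED]

    def advance(self, new_stage: Stage) -> None:
        """Move to `new_stage`, rejecting moves the process does not allow."""
        if new_stage not in TRANSITIONS[self.stage]:
            raise ValueError(
                f"cannot move from {self.stage.value} to {new_stage.value}"
            )
        self.stage = new_stage
        self.history.append(new_stage)


if __name__ == "__main__":
    incident = Incident("checkout errors", severity="SEV2")
    for stage in (Stage.DECLARED, Stage.INVESTIGATING, Stage.RESOLVED):
        incident.advance(stage)
    print([s.value for s in incident.history])
```

Keeping a history of stages is useful later, since the retrospective usually needs a timeline of what happened and when.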
Additionally, you can automate the repeatable parts of the process, such as escalation or spinning up a Slack or Zoom bridge. Leveraging an incident management platform that can automate your process to your specifications can go a long way in not only saving you time and manual work but also creating consistency across your incident lifecycle as it scales.
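As a rough illustration of what automating escalation can look like, the Python sketch below picks who should be paged based on how long an alert has gone unacknowledged. The policy, the target names, and the wait times are hypothetical, and the function only decides whom to notify; actually sending the page is left to whatever alerting provider you use.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EscalationStep:
    target: str
    wait_minutes: int  # how long this target has to acknowledge


# Hypothetical policy; the targets and timings are assumptions.
POLICY = [
    EscalationStep("primary on-call", 5),
    EscalationStep("backup on-call", 10),
    EscalationStep("engineering manager", 15),
]


def who_to_page(
    policy: List[EscalationStep], minutes_unacknowledged: int
) -> Optional[str]:
    """Return the target currently responsible for an unacknowledged alert.

    Returns None once every step in the policy has timed out.
    """
    if minutes_unacknowledged < 0:
        raise ValueError("minutes_unacknowledged cannot be negative")
    elapsed = 0
    for step in policy:
        # Each step owns the alert until its own wait window has passed.
        elapsed += step.wait_minutes
        if minutes_unacknowledged < elapsed:
            return step.target
    return None


if __name__ == "__main__":
    for minutes in (0, 7, 20, 45):
        print(minutes, who_to_page(POLICY, minutes))
```

Expressing the policy as data rather than as code means the same function can serve different teams, each with its own escalation order and timings.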