
Captain's Log: A first look at our architecture for Signals

TL;DR: Signals must be resilient, and we're excited about the pattern we've implemented to make it so.

By Robert Ross on 11/8/2023

Welcome to the first Signals Captain's Log! My name is Robert, and I'm a recovering on-call engineer and the CEO of FireHydrant. When we started building Signals, a viable replacement for PagerDuty, OpsGenie, etc., we decided very early that we would tell everyone what makes Signals unique. What better way than to show you how we’re building it (without revealing too much 😉)? Let's jump in.

Resiliency as a feature

Signals must be resilient. Our core Rails application (lovingly named “Laddertruck”) has high reliability standards, but alerting is different: it needs an even higher standard for speed and resiliency. We love Ruby, but we also love Go, so we've deployed a new service written in Go called “Siren” that operates entirely independently of Laddertruck. That means Laddertruck can have an incident, and Siren will still notify your on-call team.

Protocol Buffers as the contract

In our experience, API changes are a common culprit for incidents, and internal APIs are not spared from this reality. To combat this, we've expanded our use of Protocol Buffers as the contract between Laddertruck and Siren. For example, we’ve defined an escalation policy (with some parts removed) as:

                                
message EscalationPolicy {
  message Step {
    string id = 1;
    Target target = 2;
    google.protobuf.Duration timeout = 3;
  }

  string id = 1;
  string organization_id = 2;
  string team_id = 3;
  string name = 4;

  repeated Step steps = 6;
}

Defining messages for all of the config Siren needs means that Laddertruck can serialize a message and store it in a resilient location, such as Google Cloud Storage (GCS).

Google Cloud Storage (or any object storage)

For Siren to send notifications, it needs a few core configurations:

Escalation Policies
Schedules
Notification Preferences
Rules

In our experience as on-call engineers, these configurations rarely change. So, to make Siren resilient to Laddertruck outages, we push the serialized configuration to GCS whenever it is created or updated. Here's an example snippet of us moving an escalation policy from Laddertruck to GCS.

                                
def upload_escalation_policy(policy)
  # The object storage key for this policy
  path = "organizations/#{policy.organization_id}/escalation_policies/#{policy.id}.json"

  # If we've discarded the policy, remove the object from object storage
  if policy.discarded?
    delete_object(path)
    return policy.id
  end

  # Otherwise, upload the policy to object storage
  upload_object(policy.to_proto, path)

  policy.id
end
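
The snippet above relies on two helpers, `upload_object` and `delete_object`, that aren't shown in this post. As a rough sketch for exercising the same flow locally, the example below swaps in hypothetical in-memory versions of those helpers and a stand-in policy object. `OBJECT_STORE`, `FakePolicy`, and the placeholder `to_proto` are assumptions made for illustration, not our production code.

```ruby
# Hypothetical in-memory stand-in for an object storage bucket.
OBJECT_STORE = {}

# Assumed helper: write serialized data to the given key.
def upload_object(data, path)
  OBJECT_STORE[path] = data
end

# Assumed helper: remove the object at the given key, if present.
def delete_object(path)
  OBJECT_STORE.delete(path)
end

# Minimal policy double used only for this sketch.
FakePolicy = Struct.new(:id, :organization_id, :name, :discarded, keyword_init: true) do
  def discarded?
    discarded
  end

  # Placeholder for real protobuf serialization.
  def to_proto
    "proto-bytes-for-#{id}"
  end
end

# Same logic as the snippet above, repeated so this example runs on its own.
def upload_escalation_policy(policy)
  path = "organizations/#{policy.organization_id}/escalation_policies/#{policy.id}.json"

  if policy.discarded?
    delete_object(path)
    return policy.id
  end

  upload_object(policy.to_proto, path)
  policy.id
end

policy = FakePolicy.new(id: "ep-1", organization_id: "org-1", name: "Primary", discarded: false)
upload_escalation_policy(policy)
puts OBJECT_STORE.keys.inspect

# Discarding the policy removes its object from the store.
policy.discarded = true
upload_escalation_policy(policy)
puts OBJECT_STORE.empty?
```

Keeping the storage calls behind two small helpers like this is what would make the upload path easy to test without touching a real bucket.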

Storing these settings in object storage gives us high availability for the most crucial settings Siren needs to operate. When a new object is pushed, we notify Siren that it needs to reload its current state.

When Siren receives a notification that a team's configuration has changed, it pulls the object from object storage and swaps it in as the new in-memory version. If Laddertruck becomes unavailable, Siren continues to use the most recent version of a team's configuration. Additionally, Siren periodically pulls the latest versions from object storage in case it missed a notification, which prevents stale configuration.
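
To make the reload pattern concrete, here is a minimal sketch of the idea. Siren itself is written in Go; this example uses Ruby only to match the snippet earlier in the post. `ConfigCache`, its method names, and the `fetcher` callable are all hypothetical, and keeping the previous version when a fetch fails is an assumption about how such a cache could behave, not a description of Siren's internals.

```ruby
# Hypothetical in-memory cache of configuration, keyed by object path.
class ConfigCache
  # fetcher: any callable that takes a path and returns the stored data,
  # raising an error on failure (an assumed interface for this sketch).
  def initialize(fetcher)
    @fetcher = fetcher
    @configs = {}
    @lock = Mutex.new
  end

  # Called when a change notification arrives for a path.
  # Returns true if the new version was loaded, false if the fetch failed.
  def reload(path)
    fresh = @fetcher.call(path)
    @lock.synchronize { @configs[path] = fresh }
    true
  rescue StandardError
    # On failure, keep serving the last version that loaded successfully.
    false
  end

  # Periodic safety net: re-pull every known path in case a
  # notification was missed.
  def refresh_all
    paths = @lock.synchronize { @configs.keys }
    paths.each { |path| reload(path) }
  end

  # Returns the currently loaded version for a path, or nil.
  def get(path)
    @lock.synchronize { @configs[path] }
  end
end

store = { "org-1/ep-1" => "v1" }
healthy = true
fetcher = ->(path) { healthy ? store.fetch(path) : raise("storage unavailable") }

cache = ConfigCache.new(fetcher)
cache.reload("org-1/ep-1") # notification received, first version loaded

store["org-1/ep-1"] = "v2"
healthy = false
cache.reload("org-1/ep-1") # fetch fails, previous version is kept
puts cache.get("org-1/ep-1")

healthy = true
cache.refresh_all # periodic pull picks up the missed update
puts cache.get("org-1/ep-1")
```

In a long-running service, something like `refresh_all` would run on a timer, so a dropped notification only delays an update rather than losing it.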

Wrapping Up

That's it for this Captain's Log of Signals! There's nothing like seeing a Datadog alert turn into a call to my phone in less than a second. Receiving a Signal and alerting on it is where this functionality really starts to shine.

Teaser

Oh, and remember, Signals is in our Slack integration on Day One, too. Sign up to stay in the know.
