Dan Slimmon has written a great post with the same title, focusing on the tradeoff between the sensitivity and specificity of alarms in your monitoring systems and on how to avoid the base rate fallacy. In the same post, they also draw a perfect analogy between smoke alarms and car alarms to explain the fallacy.
The TL;DR: when a smoke alarm goes off, we immediately get up and act, i.e., check for fire, call 9-1-1, or run out of the house. When a car alarm goes off, we don’t even bother getting up.
Translated to on-call, the analogy means that every team should strive to keep only the minimum appropriate number of smoke alarms for the services it owns and convert most smoke alarms to car alarms when possible.
I completely agree with this analogy and have encouraged every team I have been part of to adopt it. The one aspect of Dan’s post that I would like to address here is what happens when teams fail to do so, i.e., when they have too many smoke alarms or ignore them for too long.
Over the long term, keeping car alarms configured as smoke alarms can lead to severe on-call fatigue and reduced team productivity. In this post, we’ll dive into why it happens and how to fix it in 1-2 months.
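To make the base rate fallacy mentioned above concrete, here is a minimal sketch using Bayes' rule. The function name and all the numbers are hypothetical, chosen only for illustration: an alarm that catches 99% of real problems and stays quiet on 99% of healthy checks, watching a service that is actually broken in 0.1% of checks.

```python
def alert_precision(sensitivity: float, specificity: float, base_rate: float) -> float:
    """Probability that a firing alarm reflects a real problem (Bayes' rule)."""
    true_positives = sensitivity * base_rate
    false_positives = (1.0 - specificity) * (1.0 - base_rate)
    return true_positives / (true_positives + false_positives)


# Hypothetical numbers, not taken from any real monitoring system.
precision = alert_precision(sensitivity=0.99, specificity=0.99, base_rate=0.001)
print(f"{precision:.1%} of pages point at a real problem")
```

With these assumed numbers, only about 9% of pages correspond to a real problem, even though the alarm looks 99% accurate on paper. That is the car alarm effect: because real problems are rare, most firings are false positives.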
Alarms and Fatigue
Has this happened to you? A smoke alarm in the house runs low on battery and starts chirping every few minutes. It’s annoying.
If that happens in the middle of the night, I can guarantee that the following day will be unpleasant. Now imagine that happening day after day and week after week. The fatigue is going to build up, and your productivity is going to plummet.
The simple solution is to replace the battery. Many people take the alarm out entirely instead, because they would have to buy a battery before they could replace it. The risk is that if a fire happens, the alarm is not there to alert you. But the fatigue tends to win out over the risk.
The same thing happens when a service alarm is not tuned correctly and what should be a car alarm stays a smoke alarm for too long. The pages wear the engineers down and can lead to severe on-call fatigue for the team.
But you may be saying, “If it is just one alarm, the team or the on-call can fix it. What’s the big issue?”
When a team owns several services and many pages are firing every night, detecting these false-positive smoke alarms is not that straightforward. As a result, they keep getting handed from on-call to on-call, week over week, i.e., the can gets kicked down the road.
We need a mechanism to detect such nuisance alarms.
How to detect False Positive Smoke Alarms
It starts with an on-call review. Every team with an on-call schedule must have an operational review once every two weeks or, worst case, once every month. Once you have that scheduled, here is a simple process to follow:
Filter the page history down to the pages that fired outside office hours.
Sort them by firing rate.
Look at the top 3 pages and see if any action was taken to resolve them.
If no action was taken, demote the smoke alarm to a car alarm, i.e., lower its sensitivity or adjust its threshold so that it no longer pages.
If an action was taken, convert that to a P0 ticket and prioritize it as part of the next sprint planning.
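The review steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Page` record, its field names, the alarm names, and the 09:00-18:00 weekday office hours are all hypothetical, and in practice the data would come from whatever export your paging tool provides.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Page:
    alarm: str          # alarm name (hypothetical field)
    fired_at: datetime  # local time the page fired
    action_taken: bool  # did the on-call do anything to resolve it?


def is_off_hours(ts: datetime, start_hour: int = 9, end_hour: int = 18) -> bool:
    """Assumed office hours: Monday to Friday, 09:00-18:00 local time."""
    is_weekend = ts.weekday() >= 5
    return is_weekend or not (start_hour <= ts.hour < end_hour)


def review(pages: list[Page], top_n: int = 3) -> list[tuple[str, int, str]]:
    """Return (alarm, off-hours page count, recommendation) for the noisiest alarms."""
    # Step 1: keep only pages that fired outside office hours.
    off_hours = [p for p in pages if is_off_hours(p.fired_at)]
    # Step 2: count how often each alarm fired, noisiest first.
    counts = Counter(p.alarm for p in off_hours)
    # Alarms for which at least one off-hours page led to an action.
    acted_on = {p.alarm for p in off_hours if p.action_taken}
    report = []
    # Step 3: look at the top N and decide what to do with each.
    for alarm, count in counts.most_common(top_n):
        if alarm in acted_on:
            recommendation = "file P0 ticket for next sprint"
        else:
            recommendation = "demote to car alarm"
        report.append((alarm, count, recommendation))
    return report


# Made-up page history for one on-call rotation.
pages = [
    Page("disk-usage-high", datetime(2024, 1, 8, 2, 15), False),
    Page("disk-usage-high", datetime(2024, 1, 9, 3, 40), False),
    Page("disk-usage-high", datetime(2024, 1, 10, 1, 5), False),
    Page("api-5xx-rate", datetime(2024, 1, 9, 23, 30), True),
    Page("api-5xx-rate", datetime(2024, 1, 11, 4, 0), False),
    Page("queue-depth", datetime(2024, 1, 10, 14, 0), True),  # office hours, ignored
]
report = review(pages)
for alarm, count, recommendation in report:
    print(f"{alarm}: {count} off-hours pages -> {recommendation}")
```

In this made-up data, `disk-usage-high` paged three times at night with no action taken, so it is a candidate for demotion, while `api-5xx-rate` needed real work at least once and becomes a ticket instead.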
Within a few weeks, the team will see a massive reduction in on-call fatigue because the on-call sleeps through the night. A second-order effect is a gain in productivity.
Do we need any smoke alarms?
The downside of the above exercise is that engineers get over-excited and downgrade almost every alarm to a car alarm by changing thresholds or removing the alarm entirely.
Every team/service still needs to have smoke alarms, just like every house needs at least one smoke alarm.
The operational reviews are only meant to ensure that unnecessary smoke alarms are caught in time, not to convert valid smoke alarms into useless ones.
Ideally, teams should be downgrading only 1-2 alarms per week from smoke alarm to car alarm. If there are more than 2, the on-call week was terrible. That said, the best way to detect whether your team is falling into the trap of “CONVERT ALL THE SMOKE ALARMS” is to run each alarm that you plan to downgrade through a dry-run scenario. Here’s how to do that:
Assume the alarm fires.
Identify all the reasons it could fire and the actions that someone would have to take to resolve each one.
If, for most cases, there is no action, consider separating the actionable scenarios into a separate smoke alarm.
Convert all the non-actionable scenarios to a car alarm.
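The dry run above can be written down as a small checklist in code. This is a sketch, not a prescribed tool: the `Scenario` record, the plan keys, and the example causes are hypothetical, and the choice to leave a mostly actionable alarm untouched is my assumption, since the steps only cover the mostly non-actionable case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    cause: str   # one reason the alarm could fire
    action: str  # what the on-call would do; empty string means nothing to do


def dry_run(scenarios: list[Scenario]) -> dict[str, list[str]]:
    """Split an alarm's firing scenarios by whether they need a human."""
    actionable = [s.cause for s in scenarios if s.action.strip()]
    non_actionable = [s.cause for s in scenarios if not s.action.strip()]
    if len(non_actionable) > len(actionable):
        # Mostly noise: carve the actionable causes out into their own
        # smoke alarm and demote everything else to a car alarm.
        return {
            "keep_as_smoke_alarm": actionable,
            "convert_to_car_alarm": non_actionable,
        }
    # Assumption: a mostly actionable alarm stays a smoke alarm as-is.
    return {
        "keep_as_smoke_alarm": actionable + non_actionable,
        "convert_to_car_alarm": [],
    }


# Hypothetical dry run for a latency alarm.
plan = dry_run([
    Scenario("nightly batch job saturates the database", ""),
    Scenario("upstream dependency blip that self-recovers", ""),
    Scenario("bad deploy", "roll back the release"),
])
print(plan)
```

Here two of the three causes need no action, so only the bad-deploy case would remain a smoke alarm. Writing the causes down this way also makes it obvious when a team is about to demote an alarm that still has an actionable scenario hiding inside it.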