Minor Gripe

2019-12-02 -- Quick and dirty outage reports

Chris Ertel

Introduction

I’ve been assembling furniture and rearranging my office into a semblance of a standing-desk, so I’m gonna be pick a quick topic for today.

This morning around 0930 I was awakened by a text from a friend who was distressed that a community I sysadmin seemed to have no server active. I immediately sat up in bed, stretched a little, and ~~sprung~~ shuffled, yawning, into action.

I did what I’d consider a bare-minimum outage response and report, and I figure I’ll share that today.

Oh no my server is on fire wat do

It’s 0200 in the morning, or maybe you’re in the middle of stir-frying dinner, or maybe you’re on a going down I-45 near Centerville and there’s not even AM band radio preachers to keep you company but somehow your pager still works. Outages tend to happen at the most inconvenient (to us) of times.

The first thing to do is don’t panic. Whatever is broken is broken, and odds are you won’t make it better by freaking out. In a century or less you’ll be worm food, so in the grand scheme of things an hour of downtime for an employer you’ll likely be leaving in a couple of years is no biggie. Perspective in all things in life is a key thing to happiness, and ops is no exception.

Once you have successfully not panicked, or you have panicked and gotten some cold water and taken a quick breather, it’s time to Work The Problem. You don’t have to be a steely-eyed missile man but you do have to be professional. A proper handling of an outage, of any variety, basically boils down to these steps:

Notify people who may be affected by the outage that something may be wrong.
Confirm the outage.
Update people on the status of the outage–e.g., if it is real or not.
Investigate the outage.
Update people on the progress towards resolution.
Fix the outage, or kludge something into place to keep the money funnel running.
Confirm that the fix works.
Update people on the resolution.
Write an outage report.

I’ll talk about the importance of each of these steps in turn.

Take notes with timestamps at all points during this process.

Step 1: Notify people who could be affected

We do this first because at the very first credible sign or report of something amiss, you don’t want to be seen as being caught off-guard. At best, it makes your team look incompetent, and at worst, it lets panic set in in the rest of the org.

I have had the first report of an outage (ongoing for over a half hour) be from the CEO inquiring sweetly as to what the hell we were doing about things. I have seen customer success teams craft elaborate outage theories in the five minutes it takes to notice an alert and properly triage it.

Nip all that in the bud, tell people you’re looking into it immediately. It might be a false alarm, it might not, but the rest of the org needs to feel assured that Somebody Knows and more importantly that they’re in good hands.

Step 2: Confirm the outage

Make sure that the reported problem is reproducible. There is nothing more annoying than leaping into action when the real problem was that somebody’s computer wasn’t attached to the right network. In one case, we had internal users actually reporting a site outage because Reddit was down. I can’t make this up.

Even worse, especially if you have certain types of automated ops tooling you can fire up without thinking too hard, it can be easy to accidentally create a real outage while leaping into responding to a fake outage–swapping over to a standby, restoring from backups, all sorts of things that can interrupt the money funnel.

Make sure that you can confirm the outage.

Step 3: Update people on the status of the outage

Once you know if the outage is for-reals (or hopefully not!), you need to immediately notify people about its status.

This serves three essential purposes:

It shows signs of life and calm response from the ops org.
It prevents knock-on cries of “outage!” if it’s real.
It prevents confusion from the business if it’s not real.

Step 4: Investigate the outage

Now, you’ve got a little breathing room. Check running processes, tail logs, scan dashboards, skim stacktraces.

The things you’re trying to figure out at this stage are:

Who is affected?
What functionality in the money funnel is affected?
What is the fastest route to getting the money funnel unblocked and how risky is it?
What could go wrong when you enact that fix?
Will the fix trash the data you need to investigate the root causes later?

Step 5: Update people on resolution progress

By this point, you should have a decent knowledge-base for triage, and can make useful statements about what’s wrong and how to fix it.

Share this newfound wisdom with the rest of the business. They may take this opportunity to inform you of additional constraints: “Well, Henderson is back from lunch soon, so hold off on the restart until she can be updated”, “Twitter is blowing up over this, we want to make a public demo of our DR efforts”, etc.

Step 6: Fix the outage

Only now, after the above work, should you actually attempt to fix anything. The business is in the loop, you have a clear picture of what you need to do, and you can do things without tripping yourselves up.

Note that a full fix is often not possible–restoring minimum functionality, passing new instructions to the customer service folks, or even hacking an nginx config to hard-return certain values without hitting the app layer are all kludges that are acceptable if it unplugs the money funnel.

Again, make sure you update your notes on when this happens and what you did. Tools like teleport can make recording and logging these actions much easier.

Step 7: Confirm the fix

Having fixed the outage, you want to make sure that the “fix” really works–especially in the case of kludges. Failure to do this can cause further excitement when it fails after sounding the all-clear. In pathological cases, a broken fix can look like a whole new outage and kick off the process that results in still more broken fixes.

So, make sure you actually have fixed or addressed the problem. Usually, the easiest way to do this is by removing the fix, confirming that the old behavior resumes, and then applying the fix again.

Some systems or business processes make confirming the fix hard to do (say, evening newsletter mailings or something that you can’t just trigger and test on demand). In these cases, I wish you the best of luck.

Step 8: Update people on the resolution

The rest of the business has been waiting for the all-clear. Let them know they can go back to shoving customers into the money funnel.

This is the final sign-of-life to show the business that the ops team has them taken care of, and closes that loop. Don’t forget this step.

If a kludge is in place, this is also the right time to mention its constraints or just generally help promote calm amid the storm.

Step 9: Write an outage report

After doing all this (or, pro-tip, during) you should write an outage report for future generations about the outage. It should include:

Impact on the money funnel (how long was the outage, how does this compare to normal business activity, how much revenue was lost)
Timeline of events (first reported and by who, first response and by who, what observations did the ops team make during investigation, what was done to resolve it, when key communications were made)
Analysis of what caused the outage, if known. Do an RCA–even if the root cause turns out to be “the universe is amoral and entropy preys upon mankind”.
Summary of resolution and its limitations
Steps taken to prevent the outage in the future or to at least get advanced warning

This report needs to be read by your ops team, and be available to help build cases for better processes/more funding/saner business requirements.

Conclusion

For almost all small-to-medium orgs, this is your playbook. This is your North Star. This is the rote pattern you will follow when you’re tired and hungover and your phone is blowing up and Slack is having their own outage and your dog is throwing up in the corner of your home office and basically everything is going wrong. Practice this in every one of your incidents, planned or not or large or small.

If you deal with massive distributed systems, have dozens of people on call, whatever, then feel free to use a different approach. That said, this is a good starting place for the rest of us.

If this becomes second nature, it becomes boring and predictable. There is nothing you want more during incident response than to be bored and unsurprised by how things unfold. Practice this way so you’ll play this way when it counts.

Tags: ops practices