The Darkest Timeline: When to Host a Failure Party and Nuke All That Process


“Roxanne” by The Police 🎵

I was reminded of a meme this week from Community.

Troy’s popular “Darkest Timeline” scene is exactly how bad incidents really happen.

As Abed says, “Just so you know, Jeff, you are now creating 6 different timelines,” and Jeff, the future poor incident manager, disbelieving, replies, “Of course I am, Abed.” Let’s watch this amazing two-minute clip together, which starts off with rolling a 1.

Now for the breakdown…

  1. A bunch of people roll a bunch of dice that lead to today: “the incident.”
  2. An ignored, misunderstood, or unseen issue, born of the complexity of the environment (the rolling ball), has existed for a while and no one noticed.
  3. A change is made to the environment with a bad process for an engineer to operate in (Jeff hitting his head on a poorly placed fan because he is too tall).
  4. A single, visible mistake interacting with that ball starts a cascade of failures (Annie slipping on the ball trying to help Jeff).
  5. A dependent system that had low visibility, or that only one person knew about, fails (Pierce kicking a box and hitting a purse with a gun in it).
  6. That dependent system single-handedly takes out a lead engineer on the bridge (Pierce shooting himself in the leg), and three people have to go fix that one problem in a silo (Abed, Pierce, Annie) because it’s that bad.
  7. Others start seeing the impact and jump on (Shirley).
  8. Someone spins off to bring in SRE to help with comms (Jeff), because it’s now too wild: an actual incident needs to be opened and an incident manager identified.
  9. Someone tries to be an incident manager but doesn’t know how to let go of the problem and focus on just that (Jeff).
  10. Someone jumps on the bridge with another downstream impact, makes a change they think is a good idea, and makes the situation worse (Britta starting a fire).

Finally, a person with experience in large-scale transformation to make incidents better jumps on the bridge, looks at the whole situation, and says that what actually happened was that months ago we rolled a bunch of dice, and that’s why we’re here (Troy walking in with the pizza). We should have listened to the engineer who said “we’re creating 6 different timelines” (complexity), and we should have addressed the Norwegian Troll doll that “stares at us when we sleep” (process).

Fear is How Teams Got Here, But We’re Close

If this whole post were going to be about Community, I would dive into why the story behind the Norwegian Troll Doll and season 3, episode 4, “Remedial Chaos Theory,” is a metaphor for VUCA environments drowning in authoritative leadership, but that would be much better as a freakin’ hilarious argument on a podcast. Let’s focus instead on what VUCA is.

VUCA (click for the quadrant chart from Harvard Business Review) stands for “Volatility, Uncertainty, Complexity, and Ambiguity,” an acronym used to describe the types of environments professionals now, unfortunately, work in.

Whether it’s software engineering, working in a hospital, working on airplanes, or supply chain, in big companies or small, global business or local, we all now work in a world with a lot of tools, a lot of rules, a lot of customers, and a lot of process, all set against time. And time never changed as a coefficient.

If we want to get rid of fear, we have to first embrace that in today’s world there are (1) different types of complexity and thus (2) different types of failure from changes within it. This is best explored in Amy C. Edmondson’s book “The Fearless Organization,” where she identifies three types of failure that deserve different types of responses. Go read it. To encourage you, however, just look at this one chart.

From Amy Edmondson | Understand 3 Types of Failure to Activate Bolder Decision Making & Better Teamwork | Stern Strategy Group

She expands on how this affects other industries, but in software engineering, devops, and infrastructure I can confirm that high-severity incidents are the result of complex failures on top of simple changes, sadly not often intelligent failures, and almost always mistaken for basic “preventable” failures.

That last line makes me grit my teeth to write. Trust me: if you have a high-severity incident where you are losing revenue or data, it was NOT a preventable failure. It’s going to look a whole lot like one, but it’s not. It’s a VUCA complex failure.

The worst part is that after a complex failure, a high-severity incident, software engineering teams tend to recreate fear in those environments by responding to that failure as if it were avoidable and preventable. We add processes like cooldowns and more eyeballs (approvals), and we stop ourselves from getting to production quickly. Research says the opposite is what is needed to pay down risk and improve safety in a VUCA world.

Automation and better CI/CD, gitops, and isolated deployments are one step toward addressing VUCA. But if teams have all of that and still have “Norwegian Troll Doll” processes (while operating in the mindset that failures are preventable), which is to say anything that looks like high-vigilance, triplicated, manual reporting and approvals where data entry requires an engineer to do it by hand, then teams still end up with VUCA failures. Teams need to be willing to delete entire processes and talk about the data those processes were producing, so they can rebuild something better around that metric.

This is not without good intention. It’s not immediately obvious that adding those things creates risk, because now there is too much process and manual entry to remember. I’ve recreated Table 7.5, “Productive Responses to Different Types of Failure” (pg. 180), from Amy’s book, just to show why, again, you should buy it. That context is needed to know where teams struggle the most in cloudops.

When we treat complex failures (The Darkest Timelines, SEV 0s or 1s) like preventable failures, we create fear. We may say we encourage learning, curiosity, and humility, but then our actions respond in a way that is not a productive response to that curiosity.

I’ve highlighted in bold some key ones that happen in that transformation for cloud teams. Preventable failures, where process improvement, training, and clear sanctions are productive responses, are for cases like fraud, crime, or culture deviations that harm companies (e.g., firing an employee for racist comments or for stealing money as a clear sanction). Preventable-failure responses are not for what teams see on incidents today; those are complex failures and, eventually, intelligent failures. After all, a team can’t have a blameless culture if the team addresses incidents with sanctions.

We must ask ourselves: are our teams close to having “Intelligent Failures” (wanted risk), having moved into the right model, but holding on to a culture of preventable-failure responses because they don’t know they can let go of a process, having never sought diverse perspectives or seen it modeled? If so: talk about what you want to delete, and then rip off the bandaid.

Moving from Complex Failure to Intelligent Failure

You can see in the chart above that the responses are completely different.

For example, instead of firing a junior engineer for causing a production outage, you buy them a cake. This is a suggestion I stole from Staff Software Engineer and ex-Amazonian Lee McKeeman at Google, after I responded “You have the SEV 1 Party!!” to a post where an engineer had been fired for breaking production. We both had the same bold and confident response: that firing is not only wrong, but the opposite of what a team should do.

Instead of rewarding teams just for responding to incidents, we can reward them for the worst ones they have, by hosting failure parties, because that’s where the most opportunity is. Many leaders want to take risk and have intelligent failures, but to do that you first have to model the right behaviors toward complex failures and try to get to production faster, which means, as wild as it sounds, removing the manual tracking processes and approvals that make that harder.

You have to want to break production so you can break it less and less every time.

Many teams are already creating complex and intelligent failures, preparing as much as they can for their tests, but they still respond as if a change that goes wrong were a preventable failure. There is unnecessary tracking deemed necessary, and the team hasn’t had the “roundtable” to identify how it’s all being used. They may assume, “well, this engineer didn’t follow the process,” instead of questioning whether the process is right to begin with, whether it is actually too complex and slows getting to production to the point where people just “tick the box,” or end up ticking the box three or four times end to end.

It’s the response to failure that we get wrong.
Because we assume our process is right, working, and saving us money by reducing risk.

I know that a failure can surface today even though the mistake actually happened eight months before the production outage. An engineer can get unlucky running a tf apply: the code decision may not have been their choice, and they may miss it in the tf plan because they don’t know that part of the code and trust the peers who wrote it.
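As one illustrative sketch (the file name, plan text, and policy here are hypothetical, not from any real pipeline): rather than relying on a single engineer to eyeball every tf plan, a pipeline can flag destructive changes automatically before the apply ever runs.

```shell
# Hypothetical pipeline guard: fail before `terraform apply` if the saved
# plan output contains destructive changes, instead of trusting one tired
# engineer to spot them. The plan text below is a stand-in for real output.
cat > plan.txt <<'EOF'
  # aws_s3_bucket.logs will be destroyed
Plan: 1 to add, 0 to change, 1 to destroy.
EOF

# Terraform's summary line reports "0 to destroy" when nothing is deleted,
# so any other destroy count means the plan removes resources.
if grep -q "to destroy" plan.txt && ! grep -q "0 to destroy" plan.txt; then
  echo "BLOCKED: plan destroys resources; require a second reviewer"
else
  echo "OK to apply"
fi
```

Against the sample plan above, this prints the BLOCKED line. The point isn’t this exact grep; it’s that a dumb, automated check carries the vigilance so the engineer’s trust in their peers doesn’t have to.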

So the next time someone breaks production, DM that person and say, “Epic change, my friend.” And the next time someone does an exceptional job on a bridge, staying calm and kind even with fire everywhere, message them and say, “You did a really good job.” If you are a leader (engineer or manager), instead of rewarding only what got done in recognition platforms, write to leadership about how awesome that incident was after it happened, and buy a cake for the engineers on it. Because if we don’t have that mindset, we will absolutely never get there.

Modeling What You Need

I write about what the games industry needs because, in our world, we still have to explain to our community why something as “extreme” as buying a cake is worth doing. Amy didn’t write about games, but I think many companies are close and maybe just haven’t yet put together “a cake budget,” or have forgotten that the first response is to loudly CHEER the engineer who made the change.

My favorite comment is summarized as, “Well, at least we don’t work in hospitals.” What is so fascinating is that if you read Amy’s work, that’s exactly what she writes about: high-performing hospitals. In those worlds, fear has no room to grow, and leaders show up in clown suits even when mistakes cost lives. This in turn results in fewer mistakes and more lives saved. It’s hard to host a SEV 1 Party if you don’t have the budget to buy the cakes, because no one yet knows or believes they should buy them; you’re still educating on the need for cakes.

Her work helped me know this blog was the right call (a blog I started before I read the book): it’s on leaders to model the cultures they want to have, even when they don’t have that culture quite yet, even when it’s scary, even when we don’t have agreement. We have to hope that others have the curiosity and are willing to give their support to try.

This industry has a long way to go but is SO CLOSE. I am so excited by it, and by living in that shift, because it is incredibly hard and takes about two years to fully realize in increments. When I write about it, it’s like all the energy comes back, and I can see hints of teams getting better where they believe it’s truly possible, if we are willing to start over, remove some of the worst parts, and press the gas on the best parts.

What’s Next

I plan to expand on this in another post around the financial reasons we should laugh during and after an incident. I’m really excited about that one because I like saving money. Stay tuned :).

PS. I already had the blog SEV 1 Party before I read Amy’s book, but credit goes to Alan Page, who talks about her book and the ideas in it often. Had I not read his Substack, I may never have read it.