“All You Need to Know – Orchestral Version” by Gryffin, SLANDER, Calle Lehmann, Max Aruj 🎵
I’ve written before about the need to break production faster.
The ’23 DORA State of DevOps report is a culmination of that journey for what we need to evaluate in games. As a reminder…
Breaking production helps with:
- Creating an environment of psychological safety to retain employees
- Handling the problem space where high severity incidents are no longer 100% caused by developers (sometimes customers, vendors, or even QA testing)
- Learning from small outages or non-revenue impacting incidents more often to keep systems in top shape
- Being the reason your butt will be saved in the future because your team continues to be on top of their game
This post is about severity, why we need it alongside the DORA report’s metrics, and a reminder: Please break production. It includes some very dumb math in a SWAG spreadsheet.
What I’m Trying to Accomplish with Dumb Math: Convincing The Games Industry to Deploy Faster
When asked why I want to move faster, the initial assumption is “She hates being inefficient.”
I do.
That’s expensive on its own.
But inefficiency from releasing slowly is not the biggest issue for me. Hidden SEV 0 causes are.
Inefficiency today in software engineering can come from not understanding that continuous deployment means continuous. Barriers to production come from a real place – that real place being 6+ figure outages (I say 6+ figures because I used to work at Amazon, where that’s a low number). We have to first acknowledge that truth so we can face it.
For me, the scarier inefficiency is how challenging it is to manage infrastructure and software in production while moving slowly. I have seen, and know, that it costs more money than breaking production a lot in small increments.
Often teams think it’s a technical solve – more automation, do some chaos engineering – but it’s also a cultural solve that requires letting go of process.
I see the time spent in cycles (or as DORA refers to it – change lead time) as tech debt that never gets paid down and revenue lost to the most offensive forms of breaking quality. The business always moves faster and so do players. They deserve the best. But what is “the best”?
Teams can’t possibly be subject matter experts on everything. There’s too much technology and too many people involved to consume it all. This presents a challenge – how do we continue to meet demand, learn, and break production more to break it less until players never notice?
Break production. Faster. The previous blogs, and this one, are only here to help explain why and convince you to be on mission. Being off mission is no longer a real option. Woohoo!
It’s 2023 and the State of DevOps Report is out. Read It.
The 2023 State of DevOps report released by Google and DORA is excellent because it quantifies how different teams – whether you are an elite performing team or a low performing team – deploy changes against their change failure rate, based on a real survey and research by people smarter than I am.
Click this damn link and read it and then use those KPIs.
What it’s missing for me is “How risky is it to move faster in games?”
God wouldn’t it be great to know?! Wouldn’t it be…GREAT IF WE GOT OUR OWN INDUSTRY METRICS FOR THE (checks most updated metrics) multi-billion dollar industry that’s bigger than music, TV, and film combined?
Wouldn’t it be great to know what our metrics should be based on how often we need to deploy, in a way that works with our many client-server, DLC, and testing danger zones and our shared devops relationships? Oh well. Maybe someday Google will read my blog. 🙂
Special thank you to those in the Modern Testing slack, who shared with me that companies like the Financial Times are doing 30K changes per year with a team of 250 people (that’s over 80 changes a day, up from only 12 changes a year). Charity Majors also debunked the myth that SOX, GDPR, and other compliance-focused requirements are incompatible with modern devops practices, with 15 company examples.
Despite this, there is still a concern that on-demand, continuous deployment is risky in games. I will probably spend 10 years on this mission to see if games can solve our industry-specific challenges, because we definitely have unique workloads.
This all said – Trust me. Continuous deployment for games and all her shared workloads is not nearly as risky as moving slowly.
In the case of DORA metrics, a low performing team has a change failure rate of 64% and deploys between once per week and once per month, while an elite performing team deploys on-demand with a change failure rate of 5%. If a team is tracking that metric manually, and not using git, it misses failures in a meaningful way because engineers don’t manually report what needs to get fixed ASAP.
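To make those rates concrete, here’s some napkin math on failed changes per year. The deploy counts are my assumptions (the report only gives ranges like “between once per week and once per month” and “on-demand”), and the 25-engineer team is the same hypothetical team used later in this post:

```python
# Napkin math: failed changes per year at different DORA tiers.
# Deploy counts below are assumptions, not DORA figures.
deploys_per_year = {
    "low, monthly-ish": 12,         # assume one deploy per month
    "low, weekly-ish": 52,          # assume one deploy per week
    "elite, on-demand": 25 * 250,   # assume 25 engineers x 1 deploy per working day
}
change_failure_rate = {
    "low, monthly-ish": 0.64,
    "low, weekly-ish": 0.64,
    "elite, on-demand": 0.05,
}

for tier, deploys in deploys_per_year.items():
    failed = deploys * change_failure_rate[tier]
    print(f"{tier}: ~{failed:.0f} failed changes out of {deploys} deploys/year")
```

The elite team fails more changes in absolute count, but (as the high performing section below argues) each failure tends to be a small, quickly fixed change rather than a big-batch one.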
Actually don’t trust me.
Let’s hypothesize how much money a fake team could lose.
The Missing Perspective: Revenue Lost
We can’t really do “the financial risk” exercise for failed changes until we spread change failure out along the % of incidents caused, how much revenue was lost, and the severity level of those incidents. This is different per company, per industry, per business, so from that perspective all of my math is finger wavy.
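As a sketch of what “spreading it out” looks like – using the four severity levels and the SEV 0 figure defined just below, with every number a placeholder you should swap for your own historicals:

```python
# A minimal sketch of the SWAG: expected yearly cost of failed changes, split by
# severity. Every number here is a placeholder, ideally replaced with your own
# historicals pulled from git and incident records.

def yearly_cost(deploys_per_year, change_failure_rate, severity_mix,
                revenue_loss_per_incident, people_hours_per_incident,
                loaded_hourly_rate=100):
    """severity_mix: fraction of failed changes landing at each severity.
    revenue_loss_per_incident / people_hours_per_incident: keyed by severity."""
    failed = deploys_per_year * change_failure_rate
    total = 0.0
    for sev, share in severity_mix.items():
        incidents = failed * share
        total += incidents * revenue_loss_per_incident[sev]          # revenue impact
        total += incidents * people_hours_per_incident[sev] * loaded_hourly_rate  # people time
    return total

# Example with made-up numbers: a SEV 0 = one hour of full outage at $1K/min.
cost = yearly_cost(
    deploys_per_year=52, change_failure_rate=0.64,
    severity_mix={"SEV 0": 0.05, "SEV 1": 0.15, "SEV 2": 0.30, "SEV 3": 0.50},
    revenue_loss_per_incident={"SEV 0": 60_000, "SEV 1": 10_000, "SEV 2": 1_000, "SEV 3": 0},
    people_hours_per_incident={"SEV 0": 40, "SEV 1": 16, "SEV 2": 6, "SEV 3": 2},
)
print(f"~${cost:,.0f} per year")
```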
For this exercise, I first made a fake SEV 0 – 3 categorization diagram. Atlassian, for example, has only 3 incident severity levels. Splunk? Has 5. You may have none, or have quantified everything as “on fucking fire” because that’s where you’re at and that’s okay. We all know DevOps is a Lord of the Rings quest. We’ve got to change culture at a rate and pace that doesn’t break the good parts of it.
If a team is looking to move faster at getting code into production, I highly recommend first going back through anything broken and categorizing your past failed changes (or post-mortems) by severity. Otherwise it’s very hard to get perspective on the future.
For the sake of the “How much money am I losing based on deployment frequency” exercise, we’re going with 4 severity levels. I also input fake revenue numbers, which vary wildly per game, company, or collection of centralized workloads. The examples use the revenue numbers input below, and those numbers are not based on any one company or any one game, in order to abstract this – if you are an indie or not heavily backend integrated, you likely don’t lose $60K an hour in a full production outage on all systems. The simple math here was $1K per minute of full hard downtime and 0 players online for a SEV 0.
I absolutely cannot talk about revenue for games I know and have worked on, so this $1K/min number is estimated by taking League of Legends’ ’22 total public reported revenue ($1.8B), estimating down to the minute across 365 days (~$3.4K), and then saying “Well, if I made 1/3 of that and rounded down, it’d be about $1K/min for a total outage” – which may not even be how Riot sees it either. 🙂 $1K/min for me, though, is a simple way to understand why the speed at which a team responds – based on how often they practice (deploy and fail) responding – matters. It’s also an easier number to work with for a SWAG. So our hypothetical company made a game (or games) not as successful as League of Legends, but not flops either.
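If you want to sanity-check that back-of-envelope, it’s just this arithmetic:

```python
# Reproducing the back-of-envelope behind "$1K per minute of full outage".
lol_2022_revenue = 1_800_000_000                   # League of Legends '22 public revenue
per_minute = lol_2022_revenue / (365 * 24 * 60)    # ~$3,425 per minute
one_third_ish = per_minute / 3                     # ~$1,142 -> rounded down to $1K/min
sev0_full_hour = 1_000 * 60                        # $60K for one hour of full, hard downtime
print(round(per_minute), round(one_third_ish), sev0_full_hour)
```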
Some Examples with Dumb Math We Should Replace with Real Financials
I’ve hinted that this is all example math – your mileage will vary. Greatly.
You probably already know a dollars-lost-per-second figure for your company or your team when there is a full outage. If you have microservices, the likelihood that you experience a 100% outage on all services at once today is exceptionally small. It may even be never for some of your workloads, because of how they function and how you are architected.
You would have to make a change that would be truly catastrophic.
A Note About Team Post Activities, Bridge Time, and Unclassified Change Failures
Something I realized going through this exercise is that the more low severity incidents you cause – e.g. incidents that don’t cause revenue impact but that you want to report on to learn from – the more expensive it becomes for you in people time to be transparent.
You have to decide, as you get better, whether you want to do a post-mortem or an incident report, and where the line is at that point for the “failed change” and the post-incident activities for cleanup and improving. To account for this I included an “unclassified” percentage – which is to say, a team of 25 will only do post activities for a certain percentage of truly innovative and lesson-learning experiences – otherwise you lose money in the post-activities.
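To give a feel for that cost (every number here is a hypothetical placeholder, not from any real team):

```python
# A rough sketch of the "transparency tax": what post-incident activities cost
# in people time. All numbers below are hypothetical placeholders.
post_activities_per_year = 100   # failed changes you choose to fully write up
hours_per_activity = 6           # write-up, review meeting, action items
engineers_involved = 4           # out of the 25-engineer team
loaded_hourly_rate = 100         # assumed fully-loaded $/hour per engineer

people_cost = (post_activities_per_year * hours_per_activity
               * engineers_involved * loaded_hourly_rate)
print(f"~${people_cost:,.0f} per year in post-activity people time")  # ~$240,000
```

That’s why the “unclassified” bucket exists – at some point the learning has to be worth more than the meeting time.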
A Note About How Incident Severity & Counts Change Over Time
Something I’m personally fascinated by is whether the number of incidents at each severity changes as you move from 1 change per day per engineer to on-demand or continuous deployment frequency. I don’t know how to capture this, because I don’t know that it’s been analyzed for games workloads.
Does a “Failed client binary build” really count versus a server deployment? I could rarely see a team counting a binary build failure as a SEV 3, for example – game teams do many, many builds a day on-demand, and they fail for all kinds of reasons before hitting production (players). Generally game teams don’t want players to have to download client builds often (or they chunk and patch them on a predictable cadence based more on content and marketing). But a server-side deployment straight to production failing may need to count as a SEV 3. This is why I care more about revenue impacting incidents, and time spent on post-incident activities for non-revenue impacting ones. Also, team size here is 25 engineers, not counting other parties.
Baseline: Stone-Age DevOps or Low Performing Teams
I don’t know a single team in any games company that I could say is a “low performing team” by DORA metrics. I will provide this terrifying math anyway so we can look back and say “We ain’t doing that” which is to say game studios are not deploying once per week to once per month with a change failure rate of 64%.
My guess, though, is that if a team is here, that team likely is not measuring it in a way that would show this math, OR the game is not making enough money / doesn’t have backend services for which a SEV 0 would cause a $60K full outage before it recovered.
Team Medium Performing
Some in games exist in this state. The industry does not really know where we should be (yet) to balance risk against deployments. I say this having asked a bunch of DevOps people in ’22 on Twitter before I joined Zynga (I mention on this blog I aim not to talk about my current work specifically and speak in histories). All of us are listening, leaning in, to try to figure out where we should be between client deployments and server-side deployments for microservices.
You may be starting to have few, if any, SEV 0s, and they are often not caused by your team. They may be partial rather than full outages, and thus your revenue impact may be lower overall, but you likely haven’t tracked it holistically against architecture changes over the last 10 years.
Team High Performing – Dream Team
Some teams are in this state – they don’t have a lot of SEV 0s, if any. They are fully microserviced. They are more focused on paying down SEV 1s and SEV 2s, the individual revenue impact of microservices, and the cost to the team in people time involved in fixes or mitigations. Because they are deploying frequently, only a subset of failed deployments will become post-activities (post-mortems). Fixes are also likely small and may only require pushing forward with the fix.
Do I think that 92% become small push-forward fixes after a failed deployment? Not really. That seems high. But do I think a team of 25 engineers should invest in 834 post-mortems? Hell no. Which is why I now back into this from the perspective that if you’re doing a lot of SEV 3s with no revenue lost, those are investments in learning, and teams need to be choosy about what they are really trying to learn between SEV 3s and tiny hiccups.
Team Elite (Avengers) – On-Demand and Continuous Deployment Adopters
I wish more people were here, but we need to see it first at industry scale. In this model, releases could also be automated image pulls on a routine schedule to keep things fresh – and those can fail too – not necessarily deployed by an engineer. This adds additional complexity and areas to be mitigated through tests and rejection.
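As a minimal, hypothetical sketch of that model – a scheduled refresh with no engineer in the loop, gated by tests and automatic rejection; the function names are stand-ins, not any real pipeline’s API:

```python
# Hypothetical sketch: a scheduled "freshness" deploy gated by tests and rejection.
import datetime

def deploy(image_tag: str) -> None:
    # stand-in for whatever your deploy tooling actually does
    print(f"{datetime.datetime.now():%Y-%m-%d %H:%M}: deploying {image_tag}")

def smoke_tests_pass() -> bool:
    # stand-in for post-deploy health checks / canary analysis
    return True

def rollback() -> None:
    print("rejecting change, rolling back")

def scheduled_refresh(image_tag: str) -> bool:
    """Run by a scheduler; a failure here still counts as a failed change."""
    deploy(image_tag)
    if not smoke_tests_pass():
        rollback()
        return False
    return True
```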
I’ve thought – does people time investment in SEV 3s increase disproportionately to SEV 1s in this new world? Maybe it does. Maybe it doesn’t. I don’t think we understand this area yet – the space between SEV 3s that get post activities and fixes we just make a quick ticket for, both as opportunities to learn from a failed change.
It’s Okay to Disagree with This Math Because it’s a SWAG
I don’t present the numbers above as numbers we should use. They are placeholders.
“Molly your math is wrong!! RAWR. What is this 94% bullshit for 1272 failed changes?”
Games make different amounts of money and have a whole ton of different backends organized in different team topologies across this wild industry. Also, maybe it costs more in losses with more deploys, but that’s okay because you get a 2150% productivity increase? It’s all in the percentages, which is why it would be better to use real historical figures pulled from git.
I present these as the examples I wish I had and wish I could see across this industry. I present them as one framework, or one way of looking at the problem, that speaks to those who look at dollar signs. Many mostly care about customer revenue loss during outages, not low severity learnings – but as you get better, your time goes into low severity learnings as investments, and you may not be losing any revenue at all.
It helps us stop seeing all incidents as “on fire.” In reality, all of us are human, and not everyone has lived on a pager, so they don’t know what it’s like, how often it happens, why, and the spectrum of what alerts really mean. These days “incident” means a lot of things to a lot of people.
I present the above because we could do this today.
We do not really know the real cost of doing nothing while the world moves faster and puts pressure on game systems. That pressure causes high severity incidents that don’t come from a team’s own changes and choices, but from the lack of continuous deployment.
Finally if you want to improve on this or change the revenue metrics to match your game, you can do so by cloning this sheet. Please do it!!
I would welcome any improvements: playing around with the percentages (especially SEV 3), opinions on “We aren’t going to post-mortem this SEV 3 and here is why,” and other frameworks for measuring revenue impact of outages or change failure definitions. It’s a collective mission for everyone in games to provide the best experiences to players by becoming engineers who can break production more to break it a whole lot less until players never notice.
But mostly, if you take away one thing from this, it’s this:
Learning is an investment.
And it probably won’t cost this industry as much as we think to deploy faster against the eventual productivity gain no matter how dumb my very public math is.
—
Other fun posts on Incidents, breaking production, and failure.