In Defense of Change Management: Hope is a Four Letter Word


“Counting Stars” by OneRepublic from the live album “One Night in Malibu” 🎵

I took a two-week break from writing. Gasp.

Good thing this isn’t actually a side hustle.

I do think some would have preferred me to stay quiet.

I need examples for my future daughter so she knows how to stand up for her professional wisdom even when it is hard 😊.

Also, my house needs a jackhammer & sump pump to the foundation in three rooms because water comes into my office when it rains, I’m in 2nd trimester and not hitting REM, and the games industry still lays off people every other week.

If that isn’t enough to make someone scared to blog about spicy topics (that really aren’t that spicy, let’s be real for a second in the context of world events – I’m talking about devops), then you’d understand when I say it is much safer to just shut up.

But leaders know that modeling fear isn’t okay when you are not the only person who is hopeful for change in our industry. Being quiet for too long can be seen as giving up – nothing irks me more than an environment where people no longer feel safe to share what they believe or their professional wisdom. They start to take it easy. The tech industry itself is challenging today with its decisions and layoffs, making many quieter. I’m not going to model that by not writing.

My mission, outside of my job, is to share what I have learned so others aren’t trapped into making the same mistakes I’ve made in the past with regards to believing that process makes us safer and continuous deployment doesn’t apply to games or infrastructure.

And while it is the job of engineers, managers, and employees to live in any process required of them, it is not their job to forget everything they know and their past expertise. It’s their job to share it, transparently. And it is our decision as leaders whether or not we will listen to what is truly possible and has been done by others – champion them, sponsor them, and support those ideas to make sure you have a safe, open and truly transparent culture that wants to learn and innovate.

The title of this blog post was clickbait, but I will give it a fair shot.

As the authors of Accelerate, Nicole Forsgren PhD, Jez Humble, and Gene Kim, say, “Every organization will have some kind of process for making changes to their production environments…in large organizations, we often see change management processes that take days or weeks, requiring each change to be reviewed by a change advisory board (CAB) external to the team in addition to team-level reviews, such as a formal code review process.” (Forsgren, 78).[1]

Oh dear lord.

What is Great about Change Management Tools

I said I would give it a fair shot.

There are a lot of Change Management solutions – some homegrown, some from vendors. ManageEngine’s ServiceDesk Plus for example (where the header image is from) or SolarWinds ServiceDesk. They sell their tools and features with the idea that the more process and features teams have around preventing change, the better.

Some Change Management tools provide centralized visibility into changes across the company, even when those changes don’t all live in the same systems. During an incident, people want a database where they can look up which change broke production when it wasn’t their own, including changes to systems they do not maintain or have access to.

In addition, Change Management tools provide notification features that post to preferred chat or email channels when changes happen. While we can also do these kinds of integrations with Git (or GitLab if that is one’s jam), companies still sometimes want a solution for everywhere they are making changes that isn’t Git/GitLab. Because companies want, or rather vendors want to sell companies on, every kind of change living in one place for safety, solving for only Git isn’t enough.

It’s a mess of a problem to centralize change – Change Management tools can attempt to solve that problem.

The Problem Is Centralized Change with Manual Approvals Doesn’t Prevent Production Failures and Might Cause More of Them

Change Management vendor tools are often designed with additional process due to an older cultural belief that more information, more steps, and more layers of approval make teams safer.

We now know that it is often the complexity and volume of information that leads to the opposite, failure, and that safety requires more autonomy and clearer lines of ownership, with fewer layers, less process, and fewer approvals.

We no longer live in a world where any one individual, no matter their tenure, title, or level of experience, is able to “know everything” to be able to prevent a failure – even the smartest architects may not truly know the impact of their own change until it lives in the real environments it touches.

The research team behind Accelerate dove specifically into this area to understand “Are external approvers necessary or even safer for production changes?” They considered an external approver either a manager (me!) or change advisory board (CAB).

I would extend Accelerate’s definition of external approvers to anyone who is not an individual localized to the engineers pushing the code. I would also go as far as to say double-reviewing in a separate tool adds complexity, which is deadly in production processes, especially if there are multiple “stall” paths where engineers have to wait and leave their code and context. When we talk about flow state, we refer to its importance to engineering productivity as well as production stability. Staying in the same tools is paramount to staying in a flow state, the state of knowing what you are doing confidently enough to make the change in a safe way. The less switching you have to do, and the fewer systems and processes you have to remember to execute a change, the safer the change is going to be.

It is clear to me that the only thing that actually makes us all safer is letting engineers push code to production. This is the way we build architectural guardrails – we break the right things and learn what to fix and in turn make our systems themselves more fault tolerant.

When the authors studied change approval against software delivery performance and safety, they found it was not only useless for performance, but dangerous to add people guardrails outside of peer review even for high-risk changes.

“We found that approval only for high-risk changes was not correlated with software delivery performance. Teams that reported no approval process or used peer review achieved higher software delivery performance. Finally teams that required approval by an external body achieved lower performance.”

Accelerate – Forsgren, 79.

The slowdown in delivery seems obvious, while the danger does not. If we push a PR in a team, team member engineers review it – let’s say, for the sake of this example, 1 to 2 people. If that team then has to submit to an external body, someone who isn’t living in that specific change and its downstream production environment due to their scope, such as a manager, a Tech Director, a change advisory board, or a customer, then that team waits and pushes to production slower. Perhaps the external approver they are waiting on doesn’t actually understand the change at all, because to truly understand it they would have had to contribute years of their life to learning that expertise. Perhaps that entity simply wants to be notified – and there is a difference between those functions. Perhaps the conversation about the change should have happened earlier, outside the code, as education, before the change was made.

Some may argue at that point “We’re safer! An external approver could stop a bad change right when it is about to go out! They could catch failure.” The problem? That isn’t what the rest of the data says.

“We investigated further the case of approval by an external body to see if this practice correlated with stability. We found that external approvers were negatively correlated with lead time, deployment frequency, and restore time, and had no correlation with change fail rate. In short, approval by an external body (such as a manager or CAB) simply doesn’t work to increase the stability of production systems, measured by the time to restore service and change fail rate. However, it certainly slows things down. It is, in fact, worse than having no change approval process at all.”

Accelerate – Forsgren, 79.
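If you want to see where your own team lands on those four measures, here is a minimal sketch in Python. The `Deployment` record, its field names, and the sample data are my own assumptions for illustration; they are not from the book or from any particular tool, and your pipeline will store this data differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was merged
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service came back, if it failed


def delivery_metrics(deploys: list[Deployment], days: int) -> dict:
    """Summarize the four measures over a window of `days` days."""
    failures = [d for d in deploys if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "lead_time": median(d.deployed_at - d.committed_at for d in deploys),
        "deploys_per_day": len(deploys) / days,
        "change_fail_rate": len(failures) / len(deploys),
        "time_to_restore": median(restores) if restores else None,
    }


# Made-up sample: four deployments over two days, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=1)),
    Deployment(t0, t0 + timedelta(hours=2), failed=True,
               restored_at=t0 + timedelta(hours=2, minutes=30)),
    Deployment(t0, t0 + timedelta(hours=3)),
    Deployment(t0, t0 + timedelta(hours=4)),
]
metrics = delivery_metrics(sample, days=2)
print(metrics)
```

Tracking these before and after adding or removing an approval step is one way to check the claim against your own data rather than taking anyone’s word for it.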

The book goes on to argue for lightweight change approval processes for code-based changes, mainly focused on internal team code reviews. Let the people who touch that part of the code, and the servers they are responsible for, be autonomous so they get to production faster. Use any failures found to build the right tests, and automate the rejection of failing changes in the deployment pipelines that protect systems and customers.

These are the foundational cultural requirements of continuous deployment with good continuous integration. That starts with trying a cell-based architecture: automated rollouts into low-, medium-, and high-risk environments from trunk-based, merge-controlled pipelines that don’t care about external change processes. They may notify them, but they don’t depend on them or on others to execute. The only conclusion one can come to after reading the data is that adding more people as approvers, especially those who aren’t in that localized team, makes everyone, including customers, less safe.
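As a sketch of that “notify, but don’t depend on” idea, here is a tiny staged rollout in Python. The function names, the tier names, and the callbacks are all hypothetical stand-ins I invented for this example; a real rollout would call your own deploy tooling and health checks.

```python
from typing import Callable, Iterable


def safe_notify(notify: Callable[[str], None], message: str) -> None:
    """Tell the external system what happened, but never block on it."""
    try:
        notify(message)
    except Exception:
        pass  # notification is informational; a failure here must not stop a rollout


def staged_rollout(version: str, tiers: Iterable[str],
                   deploy: Callable[[str, str], None],
                   is_healthy: Callable[[str], bool],
                   rollback: Callable[[str], None],
                   notify: Callable[[str], None]) -> list[str]:
    """Roll out tier by tier, lowest risk first; return the tiers that kept the version."""
    done = []
    for tier in tiers:
        deploy(tier, version)
        if not is_healthy(tier):
            rollback(tier)  # automated rejection: no approver needed
            safe_notify(notify, f"{version} rolled back in {tier}")
            break
        done.append(tier)
        safe_notify(notify, f"{version} live in {tier}")
    return done


def broken_notify(message: str) -> None:
    # Simulates the change-management tool or chat integration being down.
    raise RuntimeError("notification channel is down")


log = []
result = staged_rollout(
    "v42", ["low", "medium", "high"],
    deploy=lambda tier, version: log.append(("deploy", tier)),
    is_healthy=lambda tier: tier != "high",  # pretend the high-risk cell fails
    rollback=lambda tier: log.append(("rollback", tier)),
    notify=broken_notify,
)
print(result)
print(log)
```

Even with the notification channel down, the rollout proceeds through the low- and medium-risk tiers and rolls back the failing high-risk one on its own, which is the dependency direction the paragraph above argues for.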

Additionally, lowering a company’s deployment frequency puts that company in the path of danger, not only because “they aren’t moving fast enough” as a business, but because they are moving at a pace where they aren’t going to be able to keep up with tech debt that will naturally cause production outages on its own.

Resources will instead be spent on everyone, impossibly, trying to stay on top of an ever-growing ecosystem of languages, open-source dependencies, and security packages by eye, with no time to write tests for failures. We want to protect our friends, but the truth is we cannot possibly be experts in the subject matters of all our friends anymore.

This isn’t to say don’t collaborate with people external to a team. Absolutely do! But often that collaboration, preparedness, questions about architecture, and timeline planning lives and breathes outside of stone-age devops processes like dense Change Management layers, in places more appropriate such as a discussion forum, a design document, a roadmap, or a PR. Everything else is extra mess and risk.

Continuous

It is okay to be scared of doing continuous delivery or deployment. Often we are scared not because of the word deployment, but because of the word continuous. But it is also important in that fear to hear what the data says and have a sense of hope.

The research is now over a decade old with the first annual 2014 State of DevOps Report that led to the book being written (after the 2012 Puppet “State of DevOps” report). We’re uncovering the spectrum between delivery and deployment, trunks in different contexts (client platforms and backend server-side deployments), stability and reviews. We now know it is possible, that by moving faster, we make ourselves safer than if we did the opposite.

Cultural shifts, if they do happen, take at least 2-3 years. There are a lot of opinions. Cultural change is hard while drowning in the day jobs of migrations, maintenance, and new projects. That kind of culture ask, to believe continuous deployment is not only possible but needed, comes on top of what is happening in the games industry and in each of our personal lives, where failure draws attention, and it turns the very definition of safety on its head. It is easy to see why, for many, it is simply safer to say nothing on topics they are passionate about. You might offend someone, somewhere. They might really believe your voice is wrong. Make less of it. It may not land you a job or may put your existing ones at risk. Don’t create noise. Just be quiet.

But you should stand up for your background and your expertise, especially if it is for the right reasons – have patience, be kind, and share what you know so others learn. Never stop writing.

Some changes, both code and cultural, are inevitable. 😊


[1] Forsgren N, Humble J, and Kim G (2018). “Chapter 7: Management Practices for Software.” Accelerate: Building and Scaling High Performing Technology Organizations. Portland: IT Revolution. pp. 78-79.