Yes, I Love Ops: Because We Do Not Fear Production.

…Ready for it? I don’t know what to say since a twist of fate is when it all broke down…so I’ll take my time 🎵

Someone asked this week if I loved ops so I had to say that I FREAKING LOVE OPS and so does my team.

I loved it so much that three years ago at AWS I signed myself up for a bridge I didn’t need to be on, concerned that the stars of people being evil and a launch would align (they did). “Only ping me if this specific thing happens and get me back on that bridge.”

It was 2020. Buzz. My eyes shot open. I re-read the 4:30 AM message. “You were right. There’s a DDoS.” I put on some pants (lol), downed a Coke while hopping on the bridge (no time for coffee, only to ACK) – to smiles? So many smiles. How could something so bad have us all so…happy? Because we knew we’d be okay, and we were about to do something awesome together across three different teams: rescue infrastructure from something no one, neither the game nor the vendors, had caused, and about which everyone had said “That will never happen.” It happened. All of us trusted each other. We then migrated 800 servers together at 5 AM during a DDoS of a worldwide game launch. Anyone I ever met on that bridge I will remember for the rest of my life. If you’re smiling, keep reading.

I am extremely grateful because that and other adventures solidified for me just how much I loved infrastructure, ops, and keeping a cool head. I’m not always perfect at it – but that’s why I love it. I used to suck at having a cool head because I didn’t understand it – I feared it. I had made so many games on the client that getting better at the backend under fire was what I needed to grow as a person. But it was living epically, surrounded by people who were curious and not scared, that taught me not to fear but to love the moments when any part of production is not working as intended – and that, when production is epically down, bringing it back up fast requires having built trust months in advance. They taught me that making the process around code releases, change, infrastructure as code, monitoring, and games better can only be tested and realized in those moments, and that it is measured in trust and speed, paid down through uptime ($$ per minute). I love being decisive in the moment and enabling people to be their best – to quickly understand what their best is if I don’t know them – and just go. I love doing the work of understanding the people in advance so that in those critical moments, they can and will be their best to and with each other, even if I’m not in the room. I love getting rid of us being scared – I used to be so scared when I touched servers! Now? I know exactly what I’m scared of – I’m scared of people being scared of infrastructure.

I love ops and so does my team – some of whom also come from the client to the backend. We care deeply about what we monitor on infrastructure, how we respond to it, and making ourselves better. This post about Kubernetes upgrades and what it takes to train engineers should be a good indicator that I like doing those types of things. This expertise is rare for Kubernetes, my team is rare, and I am so incredibly proud, impressed, and amazed by what they do every day. When we hit challenges, they are typically ones the industry is struggling with, not us alone. That’s incredible. We care a ton about how our systems perform. We simply care a lot about doing our jobs well and learning from it when it doesn’t go right. We care a ton about doing infrastructure well and when we are tested, we excel.

Enlightenment is when teams leave the assumption that change is something we, as developers and those who manage infrastructure, control exclusively.

Growth is when teams realize predictability is the byproduct of operational excellence, and innovation is the byproduct of exceptional team performance in unpredictable conditions they want to have.

I was telling my life mentee (and former intern for Ker-Chunk Games!), Kaitlyn Anderson, what my job is like and what it’s like to be an engineer or architect on our team. She’s incredibly talented, and when she graduates I hope to find her the job of her dreams. And guess what she said? “Wow that sounds really awesome.” (It is!)

How we talk about what we do is the message we send to the people we care about and who they will become

As leaders, if we constantly say “people hate ops,” we tell our people it doesn’t have value. That is likely untrue – it has incredible value to me.

Others love ops too. It is why the Cloud Native Computing Foundation landscape is enormous. There are startups who know there is a business in doing infrastructure maintenance, monitoring, and pipelines for others and enabling enterprises, but they cannot do the core business (infrastructure maintenance and updates) that is often the most dangerous, exciting, and hard. I love it – because in microservices it’s safer and also more complex. I made AWS laugh this week by saying “The complexity keeps us safer!” because it does! Believe me – I can delete a host and tell you with confidence that people will not notice, and why. But that’s because I am surrounded by geniuses who care a ton about that, and I care a lot about enabling them to keep caring about it, not fearing it. I tend to find that people hate ops because they had bad experiences: they made a bad change on infrastructure, brought down production, and were punished for turning the key. They never got to be championed to learn, and that’s the type of stuff I run at – people who have been blocked from advancing because actions injected fear into their careers. These days, by the time a change is made, it’s usually a system of processes that got it there, perhaps too many. Somebody is going to be on the hook for that final binary push if a deployment pipeline is not fully automated yet and teams have not accepted that failure is a blessing in disguise that you gotta embrace with laughter. GitOps is a philosophy that requires teams to want to remove the people and process obstructions that keep them from getting to production faster, not an architectural pattern alone, trust me.

One has to be willing to push changes faster and let the PRs control the future. To be willing to have fire in order to design better firehoses and firefighters – and to take a measured approach and not adopt those things overnight. In fact, if teams try less and not more, they have more risk. I know these things to be true – no amount of unwanted verbosity will ever make them untrue, because I have lived and seen that the more approvals teams add while complexity increases, the more likely they are to fail from process confusion. There is only patience and the willingness to demonstrate the way through to transformation in small steps, where teams need to actually delete entire processes, and that’s okay. It’s a Lord of the Rings adventure – and those are my favorite because they are career-making leadership adventures where your job is on the line.

All around us third parties say – No Ops! Ops is dead! DevOps is dead. But: Somebody has to do it. Somebody has to say “are we doing this well?” and “what should we risk to increase our velocity?” Startups want to do a small part of it because it’s complicated. When people don’t want to do ops – the hardcore tasks like maintaining upgrades, knowing how their EC2 instances, their K8s clusters, their databases, and their storage perform against a use case, and how much memory and CPU their components eat up – that’s a business…I like money and having a job, so I tend to run at businesses others don’t want to do, with complex people problems that make them scared they’ll lose their jobs. At AAA scale what saves money is owning the businesses other people don’t want to do on top of an enterprise discount. As a director, I really love ops. Everything else?

Is a sales play.

If you like that, be on a team that wants to kick players offline in order to kick them offline less and less every time – to get better, to be so good at a job no one knows it’s happening, to work yourself into invisibility, to constantly report transparently on even small issues. Go do infra ops – that’s site reliability engineering, infrastructure management and maintenance, monitoring that infrastructure, building it out, understanding how to deploy it, how to change it, how to delete it, how to evolve it, and making changes safer (by doing more of them and automating) until players don’t notice. Infrastructure changes are some of the most challenging in the world – I know someone who can automate deleting 1000s of S3 buckets and also someone who can create and delete 1000s of accounts – if that scares you, do infrastructure and it will never scare you again. What scares one person, another can find boring (or make them drink battery acid?). Those people? I like to find those people and be the person who tells them how truly incredible and amazing they are at their jobs, and who explains that to the rest of the world, which is not nearly as fearless as they are. I love ops and enabling the people who do it – to the person I told that my team is amazing:

I meant it.

Next week I’m going to write about how categorizing severity makes fearless teams faster by using kindness to create speed. I hope this industry continues to live to tell the tales of what it takes for teams to stay so high performing that we anticipate what can and will happen in a world of increasing complexity where the fundamental requirement is for us to trust each other, make mistakes smaller and smaller, and have speed through kindness first, judgement second.

I regret nothing and yes, I love ops.

Image Credit: NASA on Unsplash – Bright Center Star Cluster