Cell-Based Architecture: Lower the Blast Radius. Continuous Deployment Is Here.

Just Some Volcano

“In My Blood” by The Score 🎵

Several in this network are discussing the business benefits of accelerated DevOps – get to production faster, release faster – because it saves money in people’s time. That’s an argument that can easily be made. DORA is spearheading that conversation.

We – and I include myself in that group – believe in continuous integration with continuous delivery (and deployment), knowing CICD makes our collective industry blast radius smaller. And our lives better.

But –

Others remain adamant that slowing down makes changes safer for infrastructure and saves more money in avoided risk. “Just do build farms first! Just do the customer apps first!” Do everything but infrastructure.

They are worried about angering customers. Players.
Those concerns are valid.
But here is what else is happening. Infrastructure is now applications.

Slowing down engineers scares me more than CICD does. Going slow increases the blast radius of changes because teams architect for slowness, not scale. They build different skill sets – they build tools for retroactive eyeballs. Constant emails hoping everyone will see them. Never miss the news. And they prioritize all of this before automation, or worse, try to do both at the same time, which is hard to staff. “Eventually. We’ll automate.” Or “Make time for it!…But still do all this other process too.” I believe applying so much process actually angers customers more over time by architecturally increasing the blast radius. I argue: start over. Don’t replace. Let go. Find areas to delete all the people gates outside of PR reviews. Commit your company to the mission so everyone understands why. That’s what Accelerate supports.

Intentionally going slow is associated with monolithic architectures and not having enough cell-based architecture. It’s associated with not being architecturally resilient and fault tolerant – or, simply, with being unsafe.

Metaphorically, it’s associated with ships that have no compartments, and thus, when they hit an iceberg, sink because they can’t control the flooding or were too busy to realize the iceberg was there. But more importantly, if you are using microservices, you may not actually be having as many high-severity incidents caused by your own changes as you did 10 years ago.

This post shows the architectural reasons it’s not only possible to continuously deploy changes to production, but why industry leaders should work towards continuously deploying to production.

This includes server-side applications, core infrastructure agent updates and daemonsets, compute changes, automatic image pulls, and even all the way to game client builds, client package managers, and DLC updates. Everything but that final bare-minimum client build going to the distribution platform, where only marketing gates stop a client binary (“the release binary the player has to download from the distributor, without the DLC that gets pulled on launch”). We’re not all here today, but we will be – all on top of each other, continuously deploying.

Teams should design around that problem now – much of that complex landscape is already here, with others deploying their changes first. Assume others are moving faster – everyone from vendors to cloud providers, and especially players.

First, two important definitions:

Continuous Delivery: This is when an application or an infrastructure change is (1) always release-ready but still (2) pushed manually through environment or release stages. In this phase, teams actively work to remove approvers from any part of the process. They have a key focus on that piece of the stack always being ready to go out the door. Most teams are working towards this today.

Continuous Deployment: This is when an application or an infrastructure change is pushed, based on automated processes, through each environment with no manual gates after the PR is approved and merged. We write some code, it gets a PR review, it is merged. It rolls through each environment with unit tests and integration tests (continuous integration), and eventually makes it into a production environment if it passes.
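To make that concrete, here is a minimal sketch of what a no-manual-gates pipeline can look like – a GitHub Actions-style workflow where the job names and deploy scripts (./scripts/deploy.sh and friends) are hypothetical placeholders, not any particular team’s real setup:

# Minimal sketch of continuous deployment: merging to main triggers the whole path.
# Job names and scripts are hypothetical placeholders.
name: continuous-deployment
on:
  push:
    branches: [main]   # runs only after the PR has been reviewed and merged

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/unit-tests.sh          # continuous integration
      - run: ./scripts/integration-tests.sh

  deploy-dev:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh dev          # no human approval gate

  deploy-stage:
    needs: deploy-dev
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh stage

  deploy-prod:
    needs: deploy-stage
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh prod         # only automated checks in between

The only gate in the whole flow is the PR review at the very start; everything after the merge is automated checks.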

When I talk about these two things, I often get a shocked face at the idea of doing continuous deployment to production, especially for infrastructure changes, and I think that comes from not knowing cell-based architecture at its core and not having seen it done well for thousands of accounts, or hundreds of workloads and tens of thousands of servers that all need that change.

Not all changes will be candidates for continuous deployment by nature of what they are, but there are many, many changes made today not only for applications, but for infrastructure that can and should be – and I believe, largely, eventually, most will be.

Some Examples

Today, infrastructure is managed via code, just like backend server-side applications written in Python, PHP, or Go. For example, monitoring agents can be managed via Helm charts. The creation of networking resources can be written in Terraform. The resource configuration for CPU and memory of individual Kubernetes components can also be managed via code.
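As a small illustration of that last point, the CPU and memory configuration for a workload is just a few lines of YAML living in Git and reviewed like any other change – here is a hypothetical game-server Deployment (the names and image are placeholders):

# Hypothetical example: resource configuration managed as code, reviewed like any other change.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: game-server        # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: game-server
  template:
    metadata:
      labels:
        app: game-server
    spec:
      containers:
        - name: game-server
          image: registry.example.com/game-server:1.2.3   # placeholder image
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"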

With this in mind, almost everything we do operationally is turning towards the processes of managing software rather than the processes of managing hardware.

Fundamentally, it also means there are old operational and business patterns that remain at organizationally large institutions from what I call “Stone-age DevOps” or early IT. This includes everything from Change Advisory Boards to change systems that don’t work with Git. That makes this shift extremely hard.

Let’s revisit the definition of continuous deployment though:

Continuous Deployment: We write some code, it gets a PR review, we merge. It is automatically rolled through each environment and eventually makes it into a production environment – or, with cell-based architecture, several separated production environments, growing towards a larger blast radius.
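What “growing towards a larger blast radius” could look like, sketched as configuration – this is purely illustrative and not any specific tool’s schema; the cell names, soak times, and halt conditions are made up:

# Illustrative only: an ordered list of cells a change rolls through automatically,
# smallest customer blast radius first, with a soak and health check between each wave.
rollout:
  waves:
    - cell: stage-small          # tiny staging cell
      soak: 30m
    - cell: prod-cell-01         # smallest production cell, fewest players
      soak: 1h
    - cell: prod-cell-02
      soak: 2h
    - cell: prod-cell-03         # largest production cell, most players
      soak: 4h
  halt_on:
    - failed_health_checks
    - error_rate_above_baseline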

We may hear the words “Canary” and “Blue/Green” thrown around a lot for application deployments. We may even hear them thrown around for Kubernetes clusters. But something we do not throw around nearly enough is “cell-based architecture.”

Kubernetes is Similar to Cell-Based Architecture, but for Microservices

Cell-based architecture means designing architectural patterns with the idea of isolating the blast radius of each part.

Kubernetes is great for applications because each application can run in a container, and many containers can run bundled together in a pod. Many pods can sit in a namespace only accessible by one team. And many namespaces can exist across many different hosts (“servers”). All those hosts? Those can be controlled by another, isolated set of hosts called a “control plane” – think of the central command of a spaceship as a room that can control the escape hatches at the other end of the ship and kick them off the ship if they are malfunctioning at any time. That’s Kubernetes.
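To make those layers concrete, here is a tiny hypothetical sketch of the namespace boundary – one team-owned namespace and one pod scheduled inside it (the names and image are placeholders):

# Hypothetical example of the layering described above:
# a namespace owned by one team, and a pod that lives inside it.
apiVersion: v1
kind: Namespace
metadata:
  name: team-matchmaking       # placeholder team namespace
---
apiVersion: v1
kind: Pod
metadata:
  name: matchmaking-api
  namespace: team-matchmaking
spec:
  containers:
    - name: api                # one container; a pod can bundle several
      image: registry.example.com/matchmaking-api:2.0.0   # placeholder image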

But that’s not cell-based architecture because it doesn’t tell me anything about the customer blast radius.

Let’s say we are not using Kubernetes and we’re speaking more holistically. A lot of teams do cell-based design by designing around an existing application or workload – I have my game server application; it can sit on a compute host (“an instance”). Maybe it can run across a lot of hosts (“a group of nodes”).

Now add another application – do they both sit on the same host? Should they? Maybe they should, to save costs. Maybe they are dependent on each other. Maybe one is a monitoring application for the game server. One is hosting a matchmaking application. Now add that the game has multiple game modes – and each mode runs in its own application, with its own sessions for gamers. Both modes need to use the matchmaking application, and both are using monitoring. Do both of those game modes live on the same hosts? If I make updates to both those applications or to the hosts themselves, do both game modes get affected if the update goes wrong?

This is what drives teams to think about cell-based architecture for infrastructure. It becomes even more complicated if it involves designing multi-tenancy for multiple games. Often the default is “Put every new game in another account,” but as teams build platform engineering teams, they want to build applications that multiple internal customers can use. We end up with multiple interconnected blast radii for changes anyway, to save on operational costs.

We can isolate by pools of nodes, we can isolate by clusters, we can isolate through networking, we can isolate through microservices. Because of this, not all changes committed in code are the same. While the guidance for DevOps is “make small changes,” in infrastructure the changes I see often have impact independent of the length of the code change or even the size of the PR. It’s about knowing and understanding the actual blast radius of a change by working backwards from the environment it is hitting.
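As one example of “isolate by pools of nodes,” a workload can be pinned to its own node pool so a change to that pool doesn’t touch the other game mode’s hosts – the labels, taint key, and names below are hypothetical:

# Hypothetical sketch: pin one game mode to its own node pool using a nodeSelector
# and a toleration, so changes to that pool don't touch the other mode's hosts.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: game-mode-battle       # placeholder name for one game mode
spec:
  replicas: 5
  selector:
    matchLabels:
      app: game-mode-battle
  template:
    metadata:
      labels:
        app: game-mode-battle
    spec:
      nodeSelector:
        pool: battle-pool      # placeholder node-pool label
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "battle-pool"
          effect: "NoSchedule"
      containers:
        - name: server
          image: registry.example.com/game-mode-battle:3.1.0   # placeholder image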

As environments get bigger, and teams have more of them with more co-dependencies on each other, it’s harder to know our blast radius. So we begin to think about cell-based architecture instead as “customer pools.”

Cell-Based Customer Pools

AWS compares cell-based architecture to the bulkheads of a ship – if a ship is breached and water starts entering, the water can only get so far because stops are in place at the end of each cell.

I was actually really happy to see that at re:Invent 2023, Sr. SDM Vipul Sabhaya from Amazon EKS talked about this concept in “Inner Workings of Amazon EKS.” For EKS, each cell holds 1000 clusters, and they minimize the area of impact through progressive deployments. I’ve included these two important slides here.

It’s easy to look at this and think, “That doesn’t affect me – I do not manage 1000s of clusters.” This is true – but think about the types of operational changes Amazon had to make in order to be able to manage 1000s of clusters.

At scale, past a certain number of clusters, teams can’t keep up with the business needs and asks from customers. They have to be willing to accept some risk and continue to press for isolation and cellular architecture – and they absolutely have to be willing to continuously deliver and deploy some changes. A 3-hour window between environments – even if it’s continuous delivery with a manual gate and not continuous deployment – is so small it might as well be a scheduled deployment with a fallback plan for test failures.

I actually think a lot of teams are already doing something similar today, on a small scale and manually. For example, they may manually deploy to development, then manually deploy to stage, then – after a lot of eyeballs – manually deploy to production. The time a change lives in each environment may be completely arbitrary. It may be spidey senses. Or it may need to live there for a bit – but for me, that’s only if those environments are production. Stage tells you very little.

For the most part, many, and I mean many, infrastructure changes cannot be truly tested without real. production. traffic. This is true for new instance types. This is true for understanding resource limits and CPU and memory configurations. Which means those in infrastructure still need to see production traffic on isolated hosts, or they end up having things break and the quality bar for customers lowers. Teams could test a change across 10 staging environments and then deploy to 10 production environments. Or, teams could test a change in one small staging environment and one small production environment. Then 2 medium staging environments and 2 medium production environments…and so on.

Teams have to test in prod.

Cell-based patterns like the above mean teams can try to see this in “smaller” (by customers, not compute) production environments – whether that’s in small pools in the same Kubernetes cluster or in games that are simply smaller games. It means teams can design deployment patterns around how many people are using an environment and start from “fewest” to “most,” just like in the image above – with progressively smaller soak zones. Work backwards from the largest volcano blast radius to the smallest island.

If a team is already operating this way, there is almost no excuse for why, eventually, that team wouldn’t start trying to continuously deploy to at least staging environments weekly and then daily. Because that would mean that (1) they would be well on their way to having good cell-based architecture patterns for continuous deployment and (2) they would also begin to catch issues in stage daily, like missed deprecations, and build better tests. They would know with confidence that when they moved to their smallest production environment, they had already automated testing the bare minimum and knew what acceptance should look like.

I hope we can all in this industry at least agree that we should meet the bare minimum for quality in infrastructure.

PS. This blog now has a puzzle in it. The answer is a famous quote. To guess it, you’ll have to share in an adventure. Same terms as Alan’s – For the first person to guess correctly by end of year, I’ll donate $250 to their preferred US 501c3 charity. If not, I’ll donate to mine :). You can guess by sending me a DM, text, Discord, or a LinkedIn Message/comment.

Header Image by Sylvain Cleymans[1] from Unsplash.

[1] This is the first clue to the puzzle.

Other fun posts on CICD, Incidents, Breaking Production, and Failure.