Cluster Sprawl: How Many K8s Clusters and People To Manage Them Do You Need?

“Mirage – For Assassin’s Creed Mirage” by OneRepublic, Mishaal Tamer 🎵

This question comes up a lot.

“How many clusters should a company have?”

Followed by – “How much headcount do they need to manage that?”

So I went searching to see who else outside my remit has asked this question and found the best Reddit thread from a year ago. Let’s break it down.

The original poster brought up that Mercedes-Benz has 900 clusters. Chick-fil-A has one per restaurant, at 2600+ clusters and 8000+ nodes. That was a huge people investment to roll out, including multiple support teams just for Kubernetes, which they expand on in detail in their post.

Then I found people who looked at all this and said, Huh? We only have 1…

Without disclosing numbers, I can confidently say 1 is not where I am at by any means. Still, this person is a real person with 67 upvotes. Unless teams have specific reasons, 900 to 2600 clusters should probably not be the goal out of the gate. Teams using Kubernetes should resist the urge to add more clusters, because Kubernetes in and of itself is designed for isolation.

Learn the isolation models first as part of having 1.

My favorite comment is from user muson_IT, though it could use some tonal improvement.

I can imagine if I called anyone around me “dumbasses” I would not have the privilege of maintaining my current job… Muson_IT states, “I’m from k8s cost optimization company (CAST.AI), so I have wider perspective how many companies organize k8s. Most competent DevOps, with most good practices usually just grow clusters. 1500+ nodes above this limit I rarely see bigger clusters…Then there are dumbasses which use k8s, but haven’t really let go VMs mentality and don’t really get k8s, would create cluster for single (often weak) reasons and end up with hundred of tiny 2 to 20 nodes clusters. Those teams usually are bad at best practices like providing IAM permissions not to app, but nodes, rarely use requests or any affinities to control distribution. In case of latter seek professional help 🤣”

They aren’t dumbasses – but they are new to it.
I would classify them as early adopters.

If you know me – I refer to anything under 10 nodes as a “baby cluster” because, like this person said, teams tend to treat them more like managing VMs when they are learning concepts. Kubernetes maintainers rolling out consulting and adoption patterns should inspect those clusters and see if Kubernetes itself is being used well and if the apps on it are using its features appropriately. Often it means someone has to step in and teach early horizontally-scalable and fault-tolerant architecture in K8s – everything from what HPAs are, to PodDisruptionBudgets, to namespaces, to node pools. Customers simply do not know. Let’s teach them!

I’d personally like to thank my team for being the ones to help me see that. 🙂

Calculating Staff

I don’t like to use clusters as a KPI to manage a Kubernetes business long-term. Cluster count is something to watch and a great way to staff short term, but it’s not a great way to measure staffing long term. There is a difference between customers and clusters. I prefer to staff based on the full responsibility of what is managed, using the CNCF metric for “Percentage of Total Workloads: Application versus Auxiliary Workloads.” This helps me look at customers, clusters, nodes, applications, and agents as a percentage compared to everyone else’s responsibilities.

My math is:
FTEs needed = (# of application builders × % auxiliary workloads) ÷ 3

Some examples… First, calculate (and measure) your % of auxiliary workloads: what you own as a K8s administrator against what else is on the clusters you are responsible for. You likely already do this, and application owners do too, but maybe you have not compared them.

Big Multi-Tenant Scenario: Let’s say a customer has 3 teams (a made-up scenario). Total customer app engineering is 25 people building 40 applications on a cluster across node pools. They have 2 clusters you manage, and you manage 60% of the components on their clusters plus Kubernetes itself. This is the ’22 CNCF metric for auxiliary workloads (monitoring, observability, sidecars, agents). That 60% metric is likely driven by large node count multi-tenant clusters, which often reside with the same team managing K8s itself and have lots of applications. You would need 5 people (25 × 0.60 ÷ 3 = 5) dedicated to K8s and auxiliary workloads to cover that and keep innovating.

Early Adopter Scenario: But say a customer team is new, an early adopter, with only 2 customer app engineers using a small cluster with only 4 apps. Because they are early stage, your share of the responsibility is 80% (applications aren’t yet migrated to the cluster). That’s only about .5 of an FTE (2 × 0.80 ÷ 3 ≈ 0.53). This is also why you don’t want to stay in baby cluster state: you end up with proportionally more Ops headcount for small clusters than for large clusters, relative to what the consulting needs are.

Early Adopter Grows Up: Let’s take that early adopter and make them a robust adopter. Say they now have 5 people and only 8 apps, but a lot of nodes, and you represent 30% of the responsibility for K8s and auxiliary pieces of the workload (the other end of the CNCF metric for measuring responsibility). They grew and launched and have a big cluster running lots of replicas and deployments of their apps because they are in high demand by users. You still only need .5 of an FTE (5 × 0.30 ÷ 3 = 0.5) because they didn’t add a lot of applications, they just needed more compute.

This is why teams may not want to stay in baby cluster state – it can be more expensive in people. At a base level, auxiliary parts of the workload still cost to maintain, as represented by % of responsibility you own vs people using the cluster.
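The formula and the three scenarios above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in this post: the names `Scenario` and `estimate_fte` are made up for the example, and the divisor of 3 is taken as-is from the formula.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    app_builders: int       # number of customer application engineers
    auxiliary_share: float  # fraction of workloads that are auxiliary (0.0 to 1.0)


def estimate_fte(app_builders: int, auxiliary_share: float, divisor: float = 3) -> float:
    """Return (# of application builders * % auxiliary workloads) / divisor."""
    if app_builders < 0:
        raise ValueError("app_builders must not be negative")
    if not 0.0 <= auxiliary_share <= 1.0:
        raise ValueError("auxiliary_share must be between 0.0 and 1.0")
    return app_builders * auxiliary_share / divisor


# The three made-up scenarios from this post.
scenarios = [
    Scenario("Big multi-tenant", app_builders=25, auxiliary_share=0.60),
    Scenario("Early adopter", app_builders=2, auxiliary_share=0.80),
    Scenario("Early adopter grows up", app_builders=5, auxiliary_share=0.30),
]

for s in scenarios:
    fte = estimate_fte(s.app_builders, s.auxiliary_share)
    print(f"{s.name}: {fte:.2f} FTE")
```

Note that cluster count and node count never appear as inputs, which is the point: the estimate moves with the number of people building applications and the share of the workload you own, not with how many clusters exist.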

I’ve looked at it other ways because there are many types of clusters. I’m interested in how others calculate responsibility costs in their teams at a high level across the spectrum of region-based deployments, multi-tenant monsters, and build farms without breaking down and just retroactively looking at time spent, ticket hellholes and crying. I would love to know how engineers (both application owners and administrators) feel about it too.

I only know doing it by cluster doesn’t work at a certain size because the clusters themselves are too different, customers need different things, and responsibilities grow for auxiliary parts of the workload. This wasn’t something I knew before – but learned by living, listening, and being surrounded by exceptionally smart engineers who do too.

Regions & Proximity as a Factor

I read the words of user comrade-quinn who said “Why do people have so many clusters? Surely, outside of testing, you need one cluster. Namespaces can be used to break that up. Maybe add in one or two more in different regions/DCs, but people quoting, 10s, 100s, 1000s seems odd to me?”

What we ultimately learn from reading this thread is this: if a company is about serving lots of EXTERNAL customers, then it could have hundreds to thousands of clusters, because its customers absolutely cannot know about each other, and shouldn’t, as they are competing lines of business.

But if a company is about serving INTERNAL customers, where sometimes having shared lines of business is helpful and significantly lowers cost (and especially if it’s in the same team), then there are honestly better ways – more isolation through namespaces, node pools, and RBAC.

Answering the “How many clusters?” question depends on how close you need to be to your end user. Many of the responses described cluster counts diverging once teams needed to be closer to end users, because architects can’t release a node pool in another region, only a cluster. With this mental model, it doesn’t make sense to continually add clusters if a team is not aiming for low-latency, region-based deployments as one requirement. Teams that are, for example those running session-based game servers, SHOULD consider region-based cluster deployments for popular games.

We Went Too Far and Have Too Many

Some users realized they had cluster sprawl and were working on consolidation. They had reached a tipping point that was too big for their teams (think in the 50+ cluster range). Scaling headcount probably is not on the table.

This kind of language would land me in really rough spots if I said this to customers, and I would not recommend it. I empathize with the original poster here. Let’s read this exchange of words. User muson_IT clearly does not like having a lot of clusters.

I seriously appreciate that [deleted] worked in the fact that they are low key maintaining some sidecars.

Just some sidecars.
That were built.
That landed on the clusters.
Cause we needed ’em.

This is the reality for teams today.

It is very possible that, in those teams, product and executive leadership do not know that those sidecars are there and have become part of the portfolio, and that you are now running a dealership too.

Now you’re managing application code as part of a portfolio of infrastructure, agents, and various monitoring and observability tooling too, while possibly still being goaled, and the business tracked, on adding more clusters. Does this sound like you?

Scary place to be.

In any case – I see exactly how [deleted] got there, and I’m sure they can consolidate their clusters down. But they will still have the “low key we manage sidecars also” problem until everyone realizes what sidecars are (through internal training) and that agent management, on top of infrastructure management such as Kubernetes control plane upgrades, was also part of their portfolio.

64 clusters for 4 containers apiece though, again, is not the right balance. But perhaps it’s GKE’s design, sitting somewhere between the worlds of EKS and Fargate in abstraction level, that makes it okay. I wouldn’t judge.

People get really snarky about Kubernetes on the internet.

Namespaces and Node Pools

I then found smart people who say it like it is without offending: they keep stage and production namespaces (and pools) in one cluster. This saves teams pain and gets them closer to production when they are in stage because it lowers their networking complexity and blends the two worlds.
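As a rough sketch of what “stage and production namespaces (and pools) in one cluster” can look like, the Python below builds plain Kubernetes manifests as dictionaries and prints them as JSON. Everything named here is an assumption for illustration: the `pool` node label key, the pool names, the `demo-app` application, and the image path are all hypothetical, and real node pool labels are usually provider-specific.

```python
import json

# Hypothetical node label key. Managed Kubernetes offerings typically expose
# their own provider-specific label for node pools, so treat this as a placeholder.
NODE_POOL_LABEL = "pool"


def namespace_manifest(name: str) -> dict:
    """Build a Namespace manifest for one environment."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": name, "labels": {"environment": name}},
    }


def deployment_manifest(app: str, namespace: str, node_pool: str,
                        image: str, replicas: int = 2) -> dict:
    """Build a Deployment that lives in one namespace and is pinned to one node pool."""
    labels = {"app": app}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": app, "namespace": namespace, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # nodeSelector keeps these pods on nodes carrying the pool label.
                    "nodeSelector": {NODE_POOL_LABEL: node_pool},
                    "containers": [{"name": app, "image": image}],
                },
            },
        },
    }


# One cluster, two environments: each gets its own namespace and node pool.
environments = {"stage": "stage-pool", "production": "production-pool"}

items = []
for env, pool in environments.items():
    items.append(namespace_manifest(env))
    items.append(
        deployment_manifest(
            app="demo-app",                              # hypothetical app name
            namespace=env,
            node_pool=pool,
            image="registry.example.com/demo-app:1.0",   # hypothetical image
        )
    )

print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

This only covers placement. RBAC, resource quotas, and network policies would still be needed for real isolation between stage and production, and those are left out to keep the sketch short.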

In any case – I am in camp: Fewer clusters, more node pools, more namespaces. If you can’t give a great reason for why you HaVe So MaNy KuBeRneTeS cLuStERs but have not ever created a stage namespace then STOP!

Please STOP.
STOP ADDING CLUSTERS!

And talk about why.

Or be this person when asked how many clusters they have and use AWS SAM.

Header Image by Arnaud Mariat from Unsplash. M45 or Pleiades Cluster, group of stars in a blue nebula.

SEV 1 Party