Thank You for Playing: How Amazing Teams Scale Kubernetes to Meet the Most Demanding Games

D20

“Elodie’s Maze” by David Fleming from the Netflix film Damsel 🎵

It took me 3 dedicated months to begin to wrap my head around Kubernetes.

When we hear “That game runs on it” or “our games run on it” it may seem like a daunting task to understand how it all works – and it may seem scary not knowing “Will my infrastructure scale for my game?”

As a player it’s even more scary!

“Will it scale?” is a great question. It’s the one platform engineering teams work hardest to take the pain out of so that their customers don’t have to worry about it. I want to explain the high-level concepts to anyone who has a game that touches a Kubernetes system so you can scratch the surface of just how much is possible (and we won’t even get into the real deep stuff).

First, Kubernetes is an open-source container orchestration solution for automating software deployment. While that may sound like another language, think of it like this – we have an application for a game backend, and it can sit in a container, which is seen as an isolated piece of software. Several containers (or one, your call) can sit in a “pod.” This pod can live and die on its own without affecting anything else on that host (computer) if it’s limited in the amount of resources it is allowed to consume.

If we’re coming from the client – the similar mapping of a pod would be the client-side part of the game downloadable on your phone as an isolated construct. If we’re coming from the art world, the similar mapping would be not allowing Photoshop to eat up your entire machine because you loaded a 2398749273498327 GB file.

On the backend, games teams separate their applications into isolated constructs like pods and containers because it’s better for, you guessed it, the topic of the post! Scaling and reliability for players.

Before we get into the details – it’s important to note that this post is written through the lens of using Amazon Web Services (AWS), but it doesn’t have to be. There are a few important concepts we should know about AWS-specific backends if using an internal managed services provider built on AWS.

First, Auto Scaling Groups (ASGs): These are used in almost all architectures on AWS to group hosts (instances – think “computers”) into pools. They dynamically determine how many machines we need at any given time. They are the boundary that prevents automation from using too few or too many machines. This is important if we run a very hungry bot script for testing gameplay, someone added an extra 0, and it makes enough requests to spin up 300 c7i.2xlarge hosts because of a typo. What we actually needed was…10. If an Auto Scaling group is set to a max capacity of 20, we won’t accidentally spend $$$$ from a typo.
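To make that boundary concrete, here is a minimal sketch in Python. It is a toy model and not the AWS API: `clamp_capacity` and its parameter names are made up for illustration, and the real service enforces its limits in its own way. It only shows the idea that a request for 300 hosts cannot get past a maximum of 20.

```python
def clamp_capacity(requested: int, min_size: int, max_size: int) -> int:
    """Return the host count a group would settle on (hypothetical helper)."""
    # Never go below the minimum, never go above the maximum.
    return max(min_size, min(requested, max_size))


# The bot script typo asked for 300 hosts; the group is capped at 20.
print(clamp_capacity(300, min_size=2, max_size=20))  # 20
# A sane request passes through untouched.
print(clamp_capacity(10, min_size=2, max_size=20))   # 10
```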

Second, Availability Zones: These are local to a ‘region’ – think “east coast” or “west coast” of the US. A region can have multiple Availability Zones and it is up to backend developers to decide how many to use – 3 or more is considered a best practice.

“But is that three data centers?”

No, and that’s important. Each Availability Zone (AZ) can be one or more data centers, all located in close proximity, and no two AZs can share data centers. They all also have their own power and cooling. If we are using 3 AZs we are using at least 3 separate data centers at once. This is in case one burns to the ground or an entire set of host racks decides to quit this absolutely wild world and die, which means a game would be 100% down if it picked exactly those racks in only one AZ.
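Here is a small sketch of why spreading matters, assuming a simple round-robin placement. The function names are hypothetical and the zone names are only examples; real placement is handled by AWS and Kubernetes, not by code like this.

```python
from collections import Counter
from itertools import cycle


def spread_hosts(host_count, zones):
    """Place hosts across zones round-robin (illustrative only)."""
    placement = Counter()
    for _, zone in zip(range(host_count), cycle(zones)):
        placement[zone] += 1
    return placement


def surviving_hosts(placement, failed_zone):
    """Count the hosts left if one whole zone goes dark."""
    return sum(n for zone, n in placement.items() if zone != failed_zone)


three_zones = spread_hosts(9, ["us-east-1a", "us-east-1b", "us-east-1c"])
one_zone = spread_hosts(9, ["us-east-1a"])

print(surviving_hosts(three_zones, "us-east-1a"))  # 6
print(surviving_hosts(one_zone, "us-east-1a"))     # 0
```

With three zones, losing one still leaves two thirds of the hosts serving players. With one zone, the game is down.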

Inside those AZs sits Amazon Elastic Kubernetes Service (EKS). Amazon EKS is a managed version of Kubernetes, the open-source platform. If we are using Amazon EKS – Amazon manages the “control plane.”

We can compare this to the central command room in the Star Trek Enterprise. From here we can control everything, including deleting (terminating) applications on a host (computer) – we can control the availability of our applications. Since Amazon manages this for developers, what they are managing is the high availability and reliability of the control plane, our command center. We need it to stay up, but we don’t control much about it – only what version it is on and when we upgrade that.

I won’t get into too many details but a lot of important pieces of the software run on the control plane and when we use a managed service, we don’t get a lot of say into how many duplicates of that software there are in case one of those pieces stops working or the rack it sits on, again, decides to bite it.

Because of this Amazon notes that “This control plane consists of at least two API server instances and three etcd instances that run across three Availability Zones within an AWS Region.” in their documentation. Awesome – that’s helpful for scaling and reliability of our control plane because Amazon EKS has redundancy (2s and 3s of things!). Amazon also manages and scales the load on the control plane for users if it starts to get more noise from all the worker nodes – and the worker nodes are where all those fun game backend applications reside.

Platform Teams Manage Worker Node Scaling & Build Tools for Ease of Deployments

In the image above you can see that “API Server” sits in the control plane and these cute blue octagons called kube-proxy and kubelet talk to it. In Kubernetes, multiple containers are housed in a “pod” and multiple “pods” can live on a host. We want this because we want several copies of our applications on the backend running at once for redundancy.

We can spread traffic across all of those pods – and we can spread all of those pods across many hosts. Kubelet runs on each host and registers that host as a node with the API Server. Essentially, when a new host (instance or “computer”) comes online, it says “hi” to the command center to let it know it can start taking pods (applications). Kube-proxy (which is optional – as my team has taught me that Cilium is the future) is used to manage networking – it can do TCP, UDP, and SCTP stream forwarding – simple or round robin.

But how do you know how many minimum hosts you need or maximum hosts you need? And how do you know how many resources on a machine your application is allowed to have? These are the questions platform engineering teams get all the time and monitor to make sure the cluster itself can scale by adding more machines and the application deployments scale along with them.

There are two high level ways to scale a game backend in Kubernetes (there are also many other features that impact scale but we won’t go into those here): Cluster Autoscaler and Horizontal Pod Autoscaler.

Cluster Autoscaler: If you aren’t using Karpenter (and not everyone is!!), Cluster Autoscaler automatically scales the number of nodes a cluster needs. It adds nodes to a specific grouping of hosts when pods fail to be scheduled, staying within the minimum and maximum capacity settings provided by platform engineers and adjusting the desired capacity in between. On AWS it is built on top of Auto Scaling groups. Auto Scaling groups can either span Availability Zones or be per Availability Zone (recommended). Either way, it is the Cluster Autoscaler that determines how many machines we get and when, within the boundaries mentioned above for those ASGs.

Horizontal Pod Autoscaler: If what we need is more pods, not more nodes, because there is enough compute available and it’s not in use, the Horizontal Pod Autoscaler (HPA) for an application determines how many we get. When a game starts to make a lot of requests to a specific service or application, it’s ONLY that service that needs to scale (not necessarily the number of machines), and for that we scale the number of replicas in its deployment.
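For the curious, the Kubernetes documentation describes the core of the HPA calculation as: desired replicas = ceil(current replicas × current metric ÷ target metric). Below is a minimal sketch of that ratio in Python. The function name and the example numbers are mine, and the sketch skips things the real HPA also does, such as tolerances and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas):
    """Scale replicas by the ratio of observed metric to target metric."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # The team-provided floor and ceiling always win.
    return max(min_replicas, min(raw, max_replicas))


# Matchmaker pods average 90% CPU against a 60% target: scale 4 -> 6.
print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=10))   # 6
# Quiet night at 30% CPU: scale 4 -> 2.
print(desired_replicas(4, 30, 60, min_replicas=2, max_replicas=10))   # 2
# A huge spike would want 40 pods, but the ceiling holds it at 10.
print(desired_replicas(4, 600, 60, min_replicas=2, max_replicas=10))  # 10
```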

Think of it like – if a bunch of players suddenly need a matchmaking feature, only the matchmaker on the backend needs to scale, but not any of the other applications for monitoring, chat, achievements, or store functionality. Now, if all of those started seeing increased demand and pods could no longer be scheduled, that’s when Cluster Autoscaler kicks in to give everyone more resources and scale up, up, and up – until it hits its maximum threshold as set by a team for cost reasons. That threshold, with planning, can be increased before a launch – but even in extreme moments of popularity – it is our exceptional teams that get all hands on deck, go in, and make the change to release the kraken. Usually there are warning flares well before this happens.
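The hand-off from “more pods” to “more machines” can be sketched like this. It is a toy first-fit scheduler, not how Cluster Autoscaler is actually implemented: `nodes_needed`, the CPU numbers (in millicores), and the node sizes are all assumptions for illustration.

```python
def nodes_needed(pod_cpu_requests, node_cpu, current_nodes, max_nodes):
    """First-fit pods onto nodes, adding nodes up to max_nodes.

    Returns (node_count, pods_still_pending).
    """
    free = [node_cpu] * current_nodes  # free CPU left on each node
    pending = 0
    for request in pod_cpu_requests:
        for i, room in enumerate(free):
            if room >= request:
                free[i] = room - request
                break
        else:
            # No existing node has room: add one if the ceiling allows it.
            if len(free) < max_nodes and request <= node_cpu:
                free.append(node_cpu - request)
            else:
                pending += 1
    return len(free), pending


# 12 pods of 1000m on 4000m nodes: 2 nodes are not enough, a 3rd is added.
print(nodes_needed([1000] * 12, node_cpu=4000, current_nodes=2, max_nodes=3))  # (3, 0)
# 16 pods hit the ceiling: 4 pods wait until a human raises the maximum.
print(nodes_needed([1000] * 16, node_cpu=4000, current_nodes=2, max_nodes=3))  # (3, 4)
```

That second result is the moment the team gets all hands on deck and raises the maximum.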

Having seen it done to the scale of 800 servers in an hour across four regions, I can share it is an extremely amazing experience to bring a game back from the brink and get ahead of an upcoming demand. All developers appreciate any patience, hate to keep players waiting, and are grateful when they can do it safely and prepared.

Wow! That’s…a Lot And I Only Half Understood It.

Girl. Same.

This isn’t where it stops. Kubernetes goes much deeper than this with per-application configuration – we can even be choosy about which applications get priority and which ones don’t if we’re under fire. We can make the call on whether the gameplay stays live and all monitoring bites it as an order of priority. We get to control the thundering herd that comes from a bunch of applications needing resources during an extreme player uptick, as long as we practice.
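Kubernetes expresses this with priority classes and pod preemption. The sketch below only imitates the ordering idea in plain Python: the workload names, priority numbers, and CPU sizes are invented for the example.

```python
def admit_by_priority(workloads, capacity):
    """Keep the highest-priority workloads that fit in the capacity.

    workloads: list of (name, priority, cpu) tuples, higher priority wins.
    """
    running = []
    for name, _priority, cpu in sorted(workloads, key=lambda w: w[1], reverse=True):
        if cpu <= capacity:
            running.append(name)
            capacity -= cpu
    return running


workloads = [
    ("monitoring", 10, 4),
    ("gameplay", 1000, 6),
    ("chat", 100, 3),
]

# Only 10 CPUs left under fire: gameplay and chat stay, monitoring bites it.
print(admit_by_priority(workloads, capacity=10))  # ['gameplay', 'chat']
```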

I am extremely grateful to my team who makes this a WHOLE LOT SIMPLER for me and for others every day. Truthfully I over-analyzed this post to make sure everything I said in it was accurate, because I don’t want to embarrass them – without them I would make a lot of mistakes :).

I and those I work with are supremely proud to get to do our jobs for players. It was International Women’s Day on Friday, and usually every year I list more women than the character limit allows in a LinkedIn post – but I decided instead this time to recognize players with simply “Thank you for playing!” Without them I would not still be here.

It is extremely important to me and to the teams that I work with to provide stability and high availability of games for players – many of whom are women, like I once was, whose careers started by playing.

As hard as it is to simplify, I hope this post has opened a window, for those who don’t develop on the backend, onto how fascinating this job is and not only how challenging it is. I hope it inspired others, mentored[6] others, in the hopes that someday they will want to feel the thrill of scaling up a game under fire or, better, without having a fire at all, having practiced.

I hope it inspired another young girl, a player, to someday know this so well that she is as prepared as those around me who I am so lucky to know. I hope she too scales games for all her players who make up her diverse player base. I hope she influences them – I hope her teams influence her so they become the reason the game stayed online for us all – speed at their fingertips, no handlebars.

Header Image by Timothy Dykes from Unsplash.

[6] This is the sixth clue to the puzzle.