Calculating Uptime for a Platform in K8sNaaS and K8sCaaS Business Models

Doing Some Math

“Give You More” by HIGHSOCIETY 🎵

I wrote two years ago about Kubernetes uptime being the combination of all its dependencies.

That message gets lost in the desire for simplicity.

I get it.

I want uptime to be an easy button too.

However: true uptime means measuring availability across all the dependencies of a distributed system to arrive at a theoretical maximum availability.

In that post, I stole this image from AWS.

Distributed Uptime

I bring this up because I believe understanding the foundational theories behind platform engineering is critical. As we strive toward ambitious goals, it’s essential to pause and reflect on where we’ve been. Grasping the principles of uptime isn’t just academic—it’s the bedrock for scaling sustainably and achieving lasting success.

Let me share a story: Before my current role, I interviewed for a position managing a 40-person Kubernetes team. I asked the interviewer “What do YOU want from this role?” The answer? “Please build us dashboards.” My immediate thought was, “How many dashboards does the team already have, and what problems are they not solving?” In my experience, no one builds a 40-person infra team without engineers already creating dashboards. Ask any infrastructure engineer for uptime, and they’ll give you 30 dashboards they already built.

So, what do business leaders really want? It may be that they hope to distill a single uptime metric, or a handful of uptime metrics, that reflect the overall health of the business. I’ll use Kubernetes and Amazon EKS as examples, but the theory applies to any team structured around a set of services offered together as one unified product.

The key takeaway? Understanding the “why” behind the systems we build isn’t just useful—it’s essential for taking meaningful action toward the future we envision.

Why Some Dislike the Control Plane Uptime Metric

Cluster (Control Plane) uptime can be a weak metric for innovation and operations and can easily become a vanity metric. Losing the Kubernetes control plane in Amazon EKS is rare as AWS is responsible for its availability and scalability. If you’re running Kubernetes on-prem, are a cloud provider, or contributing to the K8s open source repo, then control plane uptime is your problem—but as a cloud user creating a platform on top of it, it’s a distraction unless you’re debating multi-region or active-active cluster disaster recovery (an expensive choice few make).

The real risk? Accidental or malicious deletion, or an incident so bad you need a new cluster and control plane in real time. No uptime SLA will save leadership if a critical production cluster disappears. I’d be more fearful of the thundering herd that comes from trying to switch to a brand-new cluster: no matter how psychologically safe I am, I’d know I wouldn’t have a job the next day as madness ensues trying to get hundreds of services back up in the right order, if that has never been practiced with the service owners.

In fact, control plane uptime is such a useless metric for innovation that most teams don’t even track its availability in 9s. Any availability metric that shows up as “100%” is suspect. I’d rather spend the time making sure absolutely no one can delete a production cluster, OR practicing what happens if someone does with the thousands of service owners using them. That nightmare scenario could take hours for the initial recovery, then days of troubleshooting co-dependencies in the new environment while every interconnected piece of infrastructure backs up past the point of no return. My point being – the moment your uptime metric hits 99.0%, you will no longer care about a vanity metric and will deeply care so much more about everything and everyone else.

And finally, in the same way an incident can happen because someone ignored a response in a Jira ticket, disasters can start outside Kubernetes—networking, security, or scaling issues can break services while control plane or “cluster” uptime still shows 99.9%. A DDoS attack, for example, might overload services, leaving the cluster itself “available” but services unable to keep up. True reliability isn’t just about the control plane—it’s about ensuring services actually function. Prioritize resilience over illusions of perfection.

So, What about Node Uptimes for the Data Plane?

Tracking node uptime is useful, but not for measuring business impact. Third-party tools make it easy to monitor node availability, but does that mean it’s valuable? Maybe.

Node metrics help troubleshoot individual hosts, but in a well-designed cluster, a failing node may have zero service impact if workloads are properly distributed and using Kubernetes constructs like ReplicaSets. Kubelet availability can be misleading—it might just reflect spot instance churn or expected scaling behavior.

In fact, you want nodes to rotate during Kubernetes upgrades and normal operations. Node uptime is not a very helpful metric in aggregate and doesn’t tell me much about whether my customers’ services are running. It also doesn’t tell me whether the sidecars, pods, and agents – the auxiliary tools teams are responsible for – are healthy. Worse, when multiple nodes fail, you’ll get flooded with alerts without clear insights.

So, should you track node metrics? Absolutely—but with nuance. Rather than focusing on simple up/down status, prioritize CPU, memory, and workload performance to catch real issues before they escalate. In other words, node data isn’t great as a binary, singular business metric; it’s much more valuable in detail.
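If it helps to make that concrete, here is a minimal sketch using the official Kubernetes Python client (my tooling assumption, not a prescription) that reads node conditions like MemoryPressure and DiskPressure instead of a bare up/down signal:

```python
# Minimal sketch: inspect node conditions rather than a binary up/down status.
# Assumes the official `kubernetes` Python client and a working kubeconfig.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in a pod
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    conditions = {c.type: c.status for c in node.status.conditions}
    # "Ready" alone hides the pressure conditions that usually precede real incidents.
    print(
        node.metadata.name,
        "Ready=" + conditions.get("Ready", "Unknown"),
        "MemoryPressure=" + conditions.get("MemoryPressure", "Unknown"),
        "DiskPressure=" + conditions.get("DiskPressure", "Unknown"),
        "PIDPressure=" + conditions.get("PIDPressure", "Unknown"),
    )
```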

Remember, when we want uptime for a business, we want “theoretical maximum availability” based on the health of the aggregate of all dependencies, because that is what true uptime is.

The Best Uptime Metric: Availability Measured Through Combined SLOs (Service Level Objectives)

Let’s revisit the image above—because capturing high-level availability in a distributed system serving thousands of services and customers is incredibly difficult. Saying, “Our business has been up 99.999% of the time,” sounds great—until you realize that allows just 5.26 minutes of downtime per year. At 99.9%, that jumps to 8.76 hours.
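If you want to sanity-check those numbers, a few lines of Python (purely illustrative) convert an availability target into a downtime budget:

```python
# Illustrative only: convert an availability target into a downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget(availability: float, period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Minutes of allowed downtime for a given availability over a period."""
    return period_minutes * (1 - availability)

print(round(downtime_budget(0.99999), 2))     # ~5.26 minutes per year
print(round(downtime_budget(0.999) / 60, 2))  # ~8.76 hours per year
```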

I love when cloud providers break down uptime for a single service they own. Outwardly, it may just show as “degraded,” while internally, it’s a SEV 1, 2, or 3. But the truth is, uptime is complex—and oversimplifying it does no one any favors. Availability isn’t only about a number—it’s about understanding what uptime truly means for your systems, your customers, and your business and calculating it as a distributed system.

In fact, AWS’s Ryan Reynolds (what a name!) wrote about this in 2020 in the blog post “Achieving ‘Five Nines’ in the Cloud for Justice and Public Safety”:

The theoretical availability is computed as 100% minus the product of the component failure rates (100% minus availability). For example, if a system uses two independent components, each with an availability of 99.9%, the resulting system availability is > 99.99%:

Calculating 9s
From “Achieving ‘Five Nines’ in the Cloud for Justice and Public Safety” by Ryan Reynolds
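As a quick sketch of the calculation Reynolds describes (redundant, independent components, where the system fails only if every component fails at once):

```python
# Redundant, independent components: system availability is
# 1 minus the product of the component failure rates.
from functools import reduce

def redundant_availability(availabilities):
    combined_failure = reduce(lambda acc, a: acc * (1 - a), availabilities, 1.0)
    return 1 - combined_failure

# Two independent components at 99.9% each -> 99.9999%, comfortably above 99.99%.
print(redundant_availability([0.999, 0.999]))
```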

Your customer’s question is simple: “Is it available enough to trust with millions?” Outwardly, you measure incidents and degraded time, backed by health checks. Internally, it’s far more complex—every component needs an SLO. That means tracking every agent, monitoring tool, addon, and customer-facing service in Kubernetes, then aggregating them into a meaningful uptime metric using the equation above to get to one value.

The challenge? Dependencies. If services aren’t independent, calculating real availability gets tricky. But what’s worse? Companies blindly aiming for 99.99% uptime without understanding what it actually means. Before chasing a number, ask: How many minutes of downtime per month does this allow? Do all services need the same SLO? Too often, “99.99%” is just a slide deck buzzword—real availability comes from intentional, data-driven SLOs that actually serve your business.

But let’s start with the main ones – let’s say, as is true for many platform engineering teams, your team serves internal customers. Perhaps you draw the line at your responsibility being “only the things you manage.” This means you must get the theoretical maximum availability, in aggregate, for only the dependencies in your system for which you are responsible. That’s fair. But it’s still a giant list. Many Kubernetes stacks look like this:

  • The Cloud Provider: EKS, AKS, GKE etc. This includes the kube-apiserver, kube-scheduler, kube-controller-manager etc.
  • Data Plane: Individual worker hosts (e.g., EC2), kubelet availability on those nodes, kube-proxy (or Cilium)
  • AWS Adjunct – AWS Load Balancer Controller, EBS CSI Driver, Karpenter
  • Networking – CoreDNS, Istio, CNIs (e.g., AWS VPC CNI, Cilium, Calico), Ingress Controllers, External DNS, NodeLocal DNSCache
  • Monitoring & Logging – Prometheus, Grafana Loki, Grafana Tempo, Honeycomb, Datadog, Fluentd, Fluentbit, Logstash, Splunk, Dynatrace, node-problem-detector, OpenTelemetry Collector etc.
  • Purpose Built Addons – Cert-manager, External Secrets Operator, Descheduler, Reflector, Reloader, Velero, OpenCost/Kubecost Collector, Cloudhealth Collector
  • Deployment Addons – ArgoCD, FluxCD, Terraform, Helm, Kustomize, Skaffold
  • Security Agents / Scanners & Policy Enforcers
  • And so on…

Any single one of these things could fail and some can affect each other if not configured properly. This is just within Kubernetes. This does NOT include the foundational elements these are then built on top of – for example if AWS IAM fails, not only do your services stop being able to talk to each other, but people can’t exactly triage it either.
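To make the aggregate math concrete: if you treat every item on that list as a hard, independent dependency, where any one of them failing counts against the platform, the theoretical maximum availability is the product of the component availabilities. A minimal sketch, with SLO numbers that are entirely made up for illustration:

```python
# Made-up SLOs for illustration only; substitute the components and SLOs your team actually owns.
import math

component_slos = {
    "EKS control plane": 0.9995,
    "Data plane (nodes / kubelet)": 0.999,
    "CoreDNS": 0.999,
    "Ingress controller": 0.999,
    "cert-manager": 0.9995,
    "ArgoCD": 0.999,
    "Monitoring agents": 0.999,
}

# Each component is treated as a hard, independent dependency, so the
# theoretical maximum availability of the platform is the product.
platform_availability = math.prod(component_slos.values())
print(f"Theoretical maximum availability: {platform_availability:.4%}")  # ~99.40%
```

In practice these components are not fully independent, which is exactly the dependency problem noted above, but the product is still a useful upper bound: a stack of individually respectable 99.9% components caps the platform well below any single component’s own number.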

Per the requirements above, for platform teams who offer either Namespace-as-a-Service (K8sNaaS) or Cluster-as-a-Service (K8sCaaS) business models, this would NOT include:

  • Applications your customers are running in their namespaces (and all their SLOs)

That is because customers are responsible for their applications. But technically, if you wanted uptime of the entire system, you would include those as well, since they can have co-dependencies with things you are responsible for and you can break each other.

This is why I am extremely happy to see that Datadog has “Service Checks” as part of some of its common Kubernetes integrations. For example, externaldns has its own service checks you can build SLOs off of, once you decide what you believe the service level indicator should be. They don’t have every addon teams need yet, but someone with a brain is definitely trying and deserves a raise.
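Once a component exposes a check like that, the mechanics of an SLO are not exotic. Here is a tool-agnostic sketch (the check results are hypothetical, not Datadog’s API) that turns check results into an SLI and compares it against a target:

```python
# Hypothetical minutely service-check results for one day (True = check passed).
check_results = [True] * 1438 + [False] * 2

sli = sum(check_results) / len(check_results)  # service level indicator: fraction of passing checks
slo = 0.999                                    # the objective you chose for this component

print(f"SLI: {sli:.4%}  SLO: {slo:.1%}  {'met' if sli >= slo else 'breached'}")
```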

If I had an easy service check for every single component in the stack, and the CNCF stopped making more tools, I could get the theoretical maximum availability for a combined platform offering. The challenge is – it isn’t easy. This is why, when teams say “I want an uptime dashboard,” they end up building 30 dashboards, figuring out how to slice and dice by individual customer, on top of or inside existing monitoring tools, and still not covering all the bases. Meanwhile, poor account teams for third-party monitoring tools have to go around and ask, “What are you trying to solve for?!?!” Some teams cut corners and pick one or two key services that, if down, would really wreck their business, and report on those to tick a box…but still have incidents for other microservices anyway because, at the end of the day, they still own them.

The larger effort – the one that would be really damn cool, the aggregate measure – requires time, often more than people think, to get to theoretical maximum availability.

The Hard Way

All that said, trying to get a small set of simplified uptimes that tell you something valuable about your business at a glance is a good thing to aim for. I laugh because I’ve been down this road, and that’s okay – it’s a fun one to drive on with your friends, if they are willing to stick out the incredibly long journey of deciding whether cutting things down has value or whether you just end up with even more dashboards.

Wisdom comes from repetition and innovation comes from practice.

I do wonder if there is value in a single, simple, true Kubernetes / platform uptime metric. I think it could be an awesome metric to have when done well and thoughtfully in aggregate. Knowing the complexity of the AWS services under the hood, and knowing they did it (“Is EKS down or degraded?”), I believe it is possible with dedication and exceptionally clear dependency mapping. But don’t cheapen it – it’s tempting to rush to the final result without going deep on the problem or doing the math across distributed SLOs.

Many teams today are creating SLOs for their services, and it could be enough to list out all the ones you still need, then get as many of them done as you can. Eventually you start to discuss which views each persona wants:

  • Is the “company” as a whole up? (DownDetector.com view)
  • Is the game up?
  • Is the internal service up?
  • Is the auxiliary service up?
  • Is the security agent up?
  • Is the monitoring agent up?
  • Is the node up?
  • Is the control plane up?
  • Is the network up?

And so on…

Header Image by This is Engineering from Unsplash.