In Pursuit of Hospitality in Platform Engineering

Hospitality

“Test Drive (Emotional Version)” by Mathias Fritsche as homage to the original “Test Drive” by John Powell from “How to Train Your Dragon”🎵

It’s been a while. I’d give a million reasons, but I am not trying to build a side hustle so no one needs to know. That said, I had a kid – she is the light of my life.

I’ve finally found a moment. There has been so much noise in DevOps that I began to reflect on the hype around platform engineering, self-service, and abstraction. At the same time Refactoring, an excellent Substack to follow if you are collecting them, had a guest article from Camille Fournier called “Creating a Platform Engineering Team” (Camille’s got a book out on the topic) where she wrote about overly operations-focused teams:

The code this team actively develops is mostly automation, templating, and one-off tools. They aren’t doing much to build better platform abstractions to manage complexity, or working on a better architecture to solve operational problems for good. Faced with the flaws of a system they can’t change, they reach for rules and processes, often cataloged in meticulous wikis.

My immediate response was both agreement and a mental “F*ck!” since I can no longer curse out loud in my own home without lifelong consequences for my daughter. You see, I believe and live in platform engineering teams. The challenge? I don’t know if I completely agree with the mantras surrounding it anymore.

We’ve grown this utter disdain at the idea of being support teams while simultaneously seeing each other as blockers – platform engineering was supposed to be the solution and yet it still isn’t. Many industry teams are stuck between cheapening the experience of learning from hard architecture in defense of abstraction and simultaneously still tackling only low-hanging fruit when it comes to customer onboarding. We take a process and make it a form the customer can fill out without talking to a person so we can call it self-service and platform engineering. Is your org suffering from true work now hidden behind centralized ticketing systems, while under the hood none of it is actually abstracted and the architecture problems are still there? Meanwhile, can customers no longer easily find or talk to a person on that team? Which part is worse – the fact that customers can’t talk to a person first, or the fact that when a team gets a customer request, it is still a template the team executes instead of the customer? What about when a net new customer needs to adopt a platform’s applications and it takes a person who has been at the company for 12 years airdropping into their team, even though tons of documentation has been written?

Is this mantra of platform engineering even really possible?

This idea that: Shouldn’t your tool, infrastructure, abstraction, technology just work without…people? Shouldn’t we have less support? Isn’t self-service the same as abstraction and better architecture?

Maybe we should pause – Why are we obsessed with getting rid of touchpoints with each other and are we sure that is always the end goal?

Someone is going to read this and think I’m an idiot and that profit motivation never occurred to me except that it occurs to me all the time how much we love to talk about efficiency by deleting people in a market where we have absolutely obliterated people’s jobs for the last 24 months.

I began to realize that somewhere deep in me, I had a growing distaste for the words “Platform Engineering.” I was beginning to have a sour feeling on my tongue about the word “self-service.” These words – ones I loved two years ago…Why? Why was this all beginning to rub me so wrong?

I suppose it is because I quite enjoy the sort of hospitality that comes with providing a service to other people. And I was starting to realize that self-service didn’t actually mean better architecture or even abstraction. You can have both of these things without self-service and still have a better platform than a team who simply created a Jira ticket system / wiki process, called it self-service, and wiped their hands clean.

Yet so much anger has been generated (not excluding myself in that) by the feelings that we can’t, often, do things ourselves when it comes to software. We don’t want to block each other nor be blocked by each other in a market that pressures us to move fast or die trying and we’ve translated that into “abstract everything and remove all the people” assuming that on its own makes a better platform experience or is even possible. If it was, AWS wouldn’t have so much abstraction to begin with yet still need dedicated account teams, go to market strategies, more documentation than you can read in your lifetime, labs and training certifications to adopt. It is, after all, a platform. Why do we think as users of it, we won’t need the same when we build our platform engineering teams?

In our effort to chase profits, pay down efficiency, and catalogue opex under a microscope, teams have forgotten that good business, really great business, is self-servicing the cheap and indexing like a 5 star hotel into the consulting engagements that matter. We keep trying to outsource and automate the parts that make us human, take shortcuts, and regret what were once discussions that make us better understand architecture in the name of “it should be self-service.” We keep trying to solve for self-service and deleting people instead of solving for service and building the Michelin star restaurant of customer experiences. Only then will we make the abstractions people actually want and need.

Solving for Service instead of Self

I think about how vendors have account teams and sales engineers, and how users have centralized teams that provide consulting support to their internal customers. So many of these teams are being pressured to lower support, and I’m guilty of having done that. It comes from a consulting background where agencies used to close support SOWs after the project was done. We have generated this kind of “I won’t do it for you” or “It’s their problem” attitude on top of abstracted spaces, and teams end up in conflict: customers think they have done all the work themselves, alone, having no idea what’s under the abstraction, and platform engineers don’t want to take on any more than they already have.

Customers may insist a tool or platform provide “all or nothing” instead of appreciating what is maintained for them. Some are unwilling to make the deep investment to understand what was already built, or even to take the time to learn it, until they have a money problem. You can’t self-service your way out of needing to adopt a new AWS architecture. Many technical adoptions take an actual business catastrophe to drive first – that should be celebrated in pursuit of a solution, but alas, more often than not, it starts off as conflict because people wish the past had been different. In the name of the platform, teams stop embracing learning.

It takes ages to walk through and create the foundation for trust in abstracted, self-service environments with both new and existing customers. The time commitment to appreciate someone else’s internal platform is so often incompatible with the career growth requirements that ask people to adapt and move around in companies, further compounding a vicious cycle where nothing is ever end-to-end self-service and the line for abstraction is blurry as hell. Launching and maintaining anything without long-term support is a pipe dream that we keep aiming for, despite it creating a vitriolic feeling in customers and platform engineers alike due to the complexity of hosting anything in production.

I used to be this person who constantly wanted to lower support requirements, and now I’m seriously questioning whether that’s the KPI or metric we should be using to measure platform engineering and self-service at all.

Measuring Self vs Hospitality

After I realized reducing support maybe isn’t the goal, I began to compare the challenges of creating OKRs based on hypotheses where you assume you know what the goal is (“Building more self-service workflows”) versus hypotheses where the goal is actually customer satisfaction.

For example – let’s take an OKR and try to define it for self-service with the assumption that reducing support means the solution is a self-service workflow. This is an extremely common assumption and a mistake I’ve made myself. You may write something like this:

Objective: Build a better customer experience through self-service workflows that empower customers to need less technical enablement.
Key Result #1: Reduce the number of support hours in the platform team by 20% per engineer as measured through time logged on tickets.
Key Result #2: Reduce the number of support inquiries & Slack questions that require team response by 5 issues/week.

The issue with measuring time back too broadly is that making mundane issues fast, repeatable, and “self-service” doesn’t necessarily reduce the amount of support a team has. It assumes support is a negative that must be paid down – a thought I no longer believe, as I watched support rise with adoption no matter what we did to “pay it down.” Measuring this metric through time logged on tickets by an engineer is also not as customer-centric as measuring how long a ticket sat in a decision flow, which is what the customer ultimately cares about and could be much longer than the time an engineer spent on it.
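To make the difference between those two measurements concrete, here is a minimal sketch in Python. The `Ticket` fields and the sample data are assumptions made up for illustration, not an export format from any real ticketing system. It computes the engineer-centric number (time logged) next to the customer-centric one (how long the ticket sat open).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Ticket:
    # Hypothetical fields; a real ticketing export will look different.
    opened_at: datetime
    resolved_at: datetime
    engineer_time_logged: timedelta  # hands-on time an engineer recorded


def engineer_hours(tickets):
    """Total hours engineers logged: what the first key result measures."""
    return sum(t.engineer_time_logged.total_seconds() for t in tickets) / 3600


def customer_wait_hours(tickets):
    """Total hours tickets sat open: what the customer actually experienced."""
    return sum((t.resolved_at - t.opened_at).total_seconds() for t in tickets) / 3600


# Made-up sample: one ticket waits two days for an hour of work,
# another is resolved the same morning after two hours of work.
tickets = [
    Ticket(datetime(2024, 1, 1, 9), datetime(2024, 1, 3, 9), timedelta(hours=1)),
    Ticket(datetime(2024, 1, 2, 9), datetime(2024, 1, 2, 13), timedelta(hours=2)),
]

print(f"engineer hours logged: {engineer_hours(tickets)}")
print(f"customer hours waiting: {customer_wait_hours(tickets)}")
```

In this invented sample the team logged 3 hours while the customers waited 52, so a key result built on logged time could look healthy while the experience on the other side of the ticket does not.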

In fact, the opposite may occur with this approach: novel issues that take longer to resolve increase as teams begin to tackle more complexity and new architecture circumstances once self-service workflows cover the mundane. Teams struggle to separate training customers who have no experience with the platform from supporting experienced customers who do. And none of it measures the quality and happiness of what the customer sees. When customers are happy, so is your team, regardless of the amount of support required.

Reducing support hours, tickets, and intervention can really only be measured for single, repeat, known issues that annoy all of us and are usually the “cheapest” to solve. These self-service workflows then lack the ROI most people hope for when they set out to build them.

The key results are often designed with the idea of getting time back to pursue goals elsewhere (“more time for innovation in my team or myself”) instead of in service to make the loop of the existing customer experience deeper and more profound.

The actual solution for efficiency and cost reduction by way of support may not be a robot AI response or Backstage flow at all. Take updating the generations of instance types – the solution may be centralizing the decision to transition all instance types for a set of customers instead of each customer chaotically determining when they will make the swap with finance. Except if you don’t own their budget, it isn’t ultimately your choice to make that decision across all customers – you can only kindly influence them to agree. Self-service, or rather it being the customer’s choice when you can upgrade generations on the stack, ends up becoming less efficient and more expensive because you can’t do it for everyone in tandem. This means that what makes a repeat task like generation swaps more efficient isn’t self-service at all. It is deciding who truly owns the decision and whether you will instead enforce it in a central manner.

And finally in many cases, decisions in architecture are not repeatable, making them less ideal for self-service design. 

The ROI and efforts to be more efficient can’t be resolved by measuring for self and starting with an assumption that “self-service is the only way.” But perhaps they at least start the conversation of what matters.

What if we re-focused on measuring hospitality instead as platform engineering teams? Many teams run a CSAT on their customers (or similar surveys) as touchpoints, but don’t necessarily bake these same ideas into their OKRs or KPIs. When re-evaluating this with Camille’s words in my brain, I began to say “What if self-service isn’t always the end goal?” and instead focused on the types of questions I would want to know about customers. These questions were the ones in my mind. 

  • Were you able to accomplish your goals on your own? (Did you want to?)
  • Did you find the documentation enabled you to get started quickly?
  • Were we able to handle your issue in a timely manner?
  • Did you learn anything new from this experience?
  • Has adopting the platform improved your team’s quality of life?
  • Did you feel supported when adopting the platform?
  • Where do you wish it had gone faster?
  • Name one thing you wish you had been able to accomplish yourself without our team stepping in

With this in mind, as an exercise, I wrote for fun an objective and three key results that are specific yet do not assume “self-service” or “abstraction” is the solution, or that the platform even has all the bells and whistles the customer needs yet. Self-service is one way to reach a key result, but those key results could be achieved through other methods.

Objective: Build a better and more efficient platform customer experience.
Key Result #1: Increase the average response to “My issue was handled in a timely manner” in the post-ticket survey to 4 out of 5 (5 being the most satisfied).
Key Result #2: Increase the onboarding speed by 50% for net-new adopters by reducing the friction in team namespace access and role creation from initial onboarding request to first confirmed authentication.
Key Result #3: Reduce the time it takes for teams to adopt new generations of instance types to 1 year from 5 years by putting all customers on the same schedule for generational upgrades.
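As a rough sketch of how the first two key results above could be tracked, here is some Python. The function names and sample numbers are assumptions for illustration, and “increase onboarding speed by 50%” is read here as “cut the elapsed time from request to first authentication in half,” which is only one possible interpretation.

```python
from datetime import datetime
from statistics import mean


def average_csat(scores):
    """Mean of 1-5 answers to "My issue was handled in a timely manner"."""
    return mean(scores)


def onboarding_days(requested_at, first_auth_at):
    """Days from initial onboarding request to first confirmed authentication."""
    return (first_auth_at - requested_at).total_seconds() / 86400


def time_reduction_percent(baseline_days, current_days):
    """Percent reduction in onboarding time relative to a baseline (assumed reading of KR #2)."""
    return (baseline_days - current_days) / baseline_days * 100


# Made-up survey answers and onboarding timestamps.
survey_scores = [5, 4, 3, 4]
current = onboarding_days(datetime(2024, 3, 1), datetime(2024, 3, 6))

print(f"average CSAT: {average_csat(survey_scores)}")
print(f"onboarding time reduced by: {time_reduction_percent(10, current)}%")
```

Neither number says how the improvement was achieved. A self-service workflow, a better runbook, or a person picking up the phone faster would all move them, which is the point of measuring the outcome instead of the “how.”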

Measuring for hospitality is perhaps the direction we should go because it addresses what creates stronger trust, reduces support in areas that don’t need it without risking the quality that comes from human touchpoints, and is harder for external factors to influence. Often teams try to put the “how” as their key result instead of the metric that has value.

At the end of the day, we’re all trying to save time and money – self-service is one way to do that but sometimes it isn’t the right way to achieve cost savings and efficiency, and sometimes it isn’t even what the customer really needs or wants even on your platform.

It’s more valuable to live in the pain.

I think about how often the journey of support, and even the journey of success, when it comes to infrastructure operations, doesn’t begin when the infrastructure spins up. It doesn’t begin when a migration is finished. It begins afterward, when all parties realize that being successful in what we do comes with living in production together for the next 12, 24, 36 months.

We want to live in each other’s pain less and less because we are rewarded for leaving that pain with others in an effort to do new things. I hope we have not lost what it means to appreciate the hospitality in what we do, and I caution against over-indexing on pushing the complicated onto the customer in an effort to rush or ignore the novel because it frustrates us instead of inspiring us.

I hope for Platform Engineering to mean reward comes from having hospitality, not trying to condense the experience of what it is that we deliver. Respect the commitment that comes with the decision a customer makes to deploy anything at all on your platform – and that not all costs are saved through self-service, not all efficiency comes from customer autonomy, and learning comes from living in the pain of adoption together.

–

Header Image by Alex Talki from Unsplash.

PS. Shoutout to Alan Page who finally revealed how to solve his puzzle. I lost it as I didn’t realize it started with a 0 index. But the main shoutout is for him actually donating to charities doing something about women’s rights in these incredible times.