How to Estimate Infrastructure Cost & Opportunity Instead of Fake A** Numbers

Estimate Infrastructure Costs

“The Bank” by Paolo Buonvino from “I Medici (Original Soundtrack)” 🎵

Apologies for the wait. I had to accept that the universe wanted me to write about what I like to call:

Fake Ass Numbers.

I hate them.

It’s so easy to spot a fake number from the authenticity of the person presenting it and from what drives them, and I’m lucky that I do not witness that in my job today.

I am also disappointed when engineers avoid doing true cost predictions or opportunity estimates because they don’t want to present a fake number and are paralyzed by the fear of being wrong. And finally, I can see when people do the “what will it cost” and “what will we save” to tick a box instead of because they (1) want to get better at it or (2) just want their fix in with the groan of pushing it through. So how do we enable teams of engineers to get to a good spot? To feel safe to dart throw a number we can all get behind and say “This is worth it!”

I LOVE dart throwing and seeing how accurate I can be on reflection. And I’ve had people see me dart throw, not believe me, and then look back and say “How did you know?” I’ve also been swindled and I’ve also been wrong – that’s how I know.

The transition to IPv6 from IPv4 is going to be expensive despite the fact that it’s been unavoidable for a while (THE INTERNET IS TOO SMALL). There’s no way around it. I started to think about really big numbers the moment this post dropped from AWS. So did Y Combinator. My favorite part of the Y Combinator thread? Everyone estimating how expensive this IPv4 announcement will be before it blows up the internet next week.

You can use their numbers to go estimate all your own bullshit alongside Amazon VPC IP Address Manager (IPAM) to find all the static IPs you made to avoid NAT gateways…which, as a tool to monitor your IP usage, also has a cost…

My second favorite part? “Assuming AWS has 50% utilization on IPs they’ve assigned for EC2, this is a $1.28 billion/yr fee they created.” This made me smile. That was a really big number dart to throw. It felt possible but needs more clarification. It could be way off.
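One way to pressure-test a dart like that is to work it backwards. The sketch below is only an illustration: it assumes the publicly announced charge of $0.005 per public IPv4 address per hour (the quote does not state the rate, so verify it against current AWS pricing) and 8,760 hours in a year. The function names are made up for this example.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760; ignores leap years

# Assumption: hourly charge per public IPv4 address from AWS's public
# announcement. Verify against current pricing before reusing.
RATE_PER_IP_HOUR = 0.005


def annual_cost(ip_count: float, rate_per_hour: float = RATE_PER_IP_HOUR) -> float:
    """Yearly charge for a number of billed public IPv4 addresses."""
    return ip_count * rate_per_hour * HOURS_PER_YEAR


def implied_ip_count(annual_fee: float, rate_per_hour: float = RATE_PER_IP_HOUR) -> float:
    """Work a quoted yearly fee backwards into the number of billed addresses."""
    return annual_fee / (rate_per_hour * HOURS_PER_YEAR)


quoted_fee = 1.28e9   # the "$1.28 billion/yr" dart from the thread
utilization = 0.50    # the thread's stated assumption

billed_ips = implied_ip_count(quoted_fee)
assigned_ips = billed_ips / utilization

print(f"Billed addresses implied:   {billed_ips:,.0f}")
print(f"Assigned addresses implied: {assigned_ips:,.0f}")
print(f"Cost of 200 static IPs/yr:  ${annual_cost(200):,.2f}")
```

Under those assumptions the quote implies about 29 million billed addresses. Whether that count is plausible is exactly the clarification the dart still needs, and the same two functions work for your own static IP count.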

I also thought about fake numbers. I thought about how much they annoy me. How often they come from those who have failed forward in life and instead should, early on, have fallen under the clear sanctions in Amy Edmondson’s The Fearless Organization – not because their numbers were wrong but because their numbers were not transparent and were selfish.

If you kept reading and made it this far, you’ve seen “Fake Ass Numbers.” You may have also lived through the pain of people so scared about their budgets, because of lack of experience, that they waste 10x time in people costs going back and forth with those who know better on tiny numbers. They wasted the time of those who have had perspective. Both annoy me equally because both really hurt businesses and are hard to catch in systems. People who obsess over tiny numbers need to have someone who has lived come around and say “Did you know you are costing us about $1M in people costs with your approach and perhaps didn’t mean to? Maybe we can do this with a top down estimate – and then give leaders parts of the budget to own and trust them to do so by holding them accountable and providing a buffer.”

It is time to write about how to be an engineer empowered by financial math, one who comes up with numbers that really matter to everyone around us and calls out fake ass numbers.

But First: How to Recognize a Bullshit Number

We need everyone to believe in our math in order to prioritize our work and we also need to believe in it ourselves. The goal? Show just how accurate you can be without spending a ton of time (money). Put your estimate on the line with confidence to the people who absolutely love the exercise of dart throwing and getting better at it – watch yourself improve over your lifetime. Try not to waste time in the estimate stage, so that it doesn’t end up counted as time in a future bottoms up estimate.

Practice throwing the cost estimate at invisible targets until you get really good at seeing where they are going to be. Don’t avoid throwing the dart by saying “We don’t know. We just have to build it.”

Be excited by trying.

Then throw, and watch how those you report to are happier that you threw the damn dart and tried than if you had avoided throwing it or wasted tons of people time (money) trying to get the perfect estimate.

If you quote me anything over $10K in a sales scenario or in an engineering scenario, I will ask you how you got to that number. If you quote me anything over $100K in any scenario, I will ask you how you got to that number. If you quote me over $1 million, I better have worked with you before. And finally, if you quote me over $100M in most scenarios and we didn’t work really hard to get to that number as a team, it’s probably not true. Every single time I’ve called a bullshit number it was exactly that. Not real. Not a real number. Not a real deal. Because the right people weren’t really invested in trying and you could see it.

Bullshit numbers all have one thing in common – they cannot be worked backwards from in a functional way and the right parties weren’t involved in producing them.

Often they are a number the wrong party wanted because of a goal they needed to hit, not a number that came from a real opportunity driven by champions who were galvanizing the right people, people invested because of their own need. Even small signals will tell you if someone isn’t ready to get to the same table. For example, an MNDA is not a company wide agreement. It’s an MNDA signed by the company to let it talk to others. One mistake I’ve seen is when junior account managers in sales or in production imply that a signed MNDA is working towards a larger deal, a larger opportunity, a larger publishing agreement, a larger investment. Their title may say enterprise, lead, head, VP – but this one mistake gives it away. It’s not even close. An MNDA is a requirement to start – but to actually get to a larger deal, you have to have a champion who can go around and is goaled on doing that. I know this because I’ve been that champion many times, and I know that what I need in order to morph into that champion doesn’t start with an MNDA. It starts with really small numbers sprinkled around several projects that tell a story and really strong relationship building with stakeholders who then knight you because you quantified and justified the opportunity – and they were ready to hear it and made space for your argument. You have to have all of those things – which means the pain point or opportunity needs to be really big to make it worth spending time to galvanize on top of doing the work.

There are usually two ways to estimate an opportunity outside of the more transactional ways: Top down and Bottom up.

Top Down Estimate (Low Level of Detail, Broad Perspective): From the perspective of engineering, this is usually a hand-wavy number based on a single hypothesis – you start at a high level and chunk out the problem. For example, let’s say you want to estimate the migration of 10 games from EC2 and RDS monolithic architecture to instead be containerized on EKS and use a different database. You’re trying to estimate “what is the long-term value proposition of modernization for this company” as an engineer. You need to estimate both the people cost of the migration and the infrastructure cost.

You have only one data point at your disposal: the current cost of one game’s backend in a specific genre was $3.5M a year. It was on EC2 and RDS, had a specific number of CCUs, and that number was expected to stay steady for another year. You want it to be cheaper and easier to maintain. You could use that as one number to inform what that game would cost when moved to EKS and then multiply that exact number by 10. The problem? You still need to factor in the people cost of that migration and the tooling you could build to make it easier for the other games, and to understand whether the other games have different CCUs and usage patterns that impact their costs. You may have heard that when people do this it takes 6 months and 3 people, and that their infrastructure cost savings are a certain percentage afterwards. That number multiplied by 10 is probably not accurate. It’s certainly more accurate than a bullshit number because at least it was based on something.
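The top down math described above can be sketched in a few lines. Only the $3.5M a year, the 10 games, the 6 months, and the 3 people come from the example; the 20% savings figure and the $300K fully loaded cost per engineer are placeholders I made up so the arithmetic has something to run on. Swap in whatever you have actually heard or measured.

```python
def top_down_estimate(
    current_annual_infra_cost: float,
    game_count: int,
    expected_savings_pct: float,
    engineers: int,
    months: float,
    loaded_annual_cost_per_engineer: float,
) -> dict:
    """Scale one known data point across a portfolio, then net out people cost.

    Deliberately naive: it assumes every game looks like the one we have
    numbers for, which is the main weakness of a top down estimate.
    """
    people_cost_per_game = engineers * (months / 12) * loaded_annual_cost_per_engineer
    annual_savings_per_game = current_annual_infra_cost * expected_savings_pct

    total_people_cost = people_cost_per_game * game_count
    total_annual_savings = annual_savings_per_game * game_count

    if total_annual_savings > 0:
        payback_months = total_people_cost / (total_annual_savings / 12)
    else:
        payback_months = float("inf")  # the migration never pays for itself

    return {
        "total_people_cost": total_people_cost,
        "total_annual_savings": total_annual_savings,
        # One full year of savings minus the one-time migration cost.
        "net_after_one_year_of_savings": total_annual_savings - total_people_cost,
        "payback_months": payback_months,
    }


estimate = top_down_estimate(
    current_annual_infra_cost=3_500_000,      # from the example
    game_count=10,                            # from the example
    expected_savings_pct=0.20,                # hypothetical placeholder
    engineers=3,                              # from the example
    months=6,                                 # from the example
    loaded_annual_cost_per_engineer=300_000,  # hypothetical placeholder
)

for name, value in estimate.items():
    print(f"{name}: {value:,.1f}")
```

The useful part is less the output than the parameter list: every argument is a question someone can challenge, which is what makes the number possible to work backwards from.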

Bottoms Up Estimate (High Level of Detail, Small Perspective): A bottoms up estimate would require you to get data about each game and some historical migrations to estimate your number. You’re starting from the detail and working up. You’d estimate the cost of what a specific tool would be. You may do a few POCs. Bottoms up estimates take an exceptionally long time. You’d pass this information to others to see if you could get discounts and savings plans or any pilot trials for any new services you had not used. You’d then use this to form a real estimate.

Because I like speed, I prefer to do a top down estimate and then ONE very small bottoms up estimate through a POC or a historical cost exploration exercise, apply a plus/minus 5-10% degree of accuracy value to both, and then average the two resulting ranges to get my final range. I like to cheat too and fold external blogs and case studies into that math on top of that POC so I don’t have to invest in a bunch of POCs. I absolutely LOVE when people include case studies from other companies in their own estimates.
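Here is one reading of that blending step as a sketch: put the same plus/minus tolerance around each estimate, then average the two low ends and the two high ends. The dollar figures are hypothetical, and the averaging rule is my interpretation of the approach rather than a formal method.

```python
def with_tolerance(estimate: float, tolerance: float) -> tuple[float, float]:
    """Return (low, high) for an estimate with a plus/minus tolerance."""
    return estimate * (1 - tolerance), estimate * (1 + tolerance)


def blended_range(top_down: float, bottom_up: float, tolerance: float = 0.10) -> tuple[float, float]:
    """Average the low ends and the high ends of the two toleranced estimates."""
    td_low, td_high = with_tolerance(top_down, tolerance)
    bu_low, bu_high = with_tolerance(bottom_up, tolerance)
    return (td_low + bu_low) / 2, (td_high + bu_high) / 2


# Hypothetical annual cost estimates for the same migration.
top_down = 2_800_000   # scaled from one known game
bottom_up = 3_100_000  # extrapolated from one small POC

low, high = blended_range(top_down, bottom_up, tolerance=0.10)
print(f"Estimated range: ${low:,.0f} to ${high:,.0f}")
```

If the two estimates land far apart, the blended range hides that disagreement, so it is worth printing both inputs next to the result.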

Other Conditions

There is tons of reading on the still-kinda-IPv4-ish internet about top down and bottom up estimates in sales, which is why I used the above as an example for engineering. One other way to look at top down is using the number of potential customers to estimate market size (we want 10% of all real estate transactions, and there is an estimated total number of real estate transactions possible per year based on the number of current home buyers), with bottom up then being the actual customers who have already signed up or the historical growth of a platform.

I’ve seen everything from “Do you know how easy this migration will be and thus how much help or credits they will need?” to “Do you know how long it could take this team based on the experience of their engineers?” “Long” or “short,” “expensive” or “cheap,” and “huge cost savings” are just not acceptable answers because they have no value in enabling prioritization, closing a project or a deal, or doing a migration. Everything is relative to time. I wish it was acceptable to be so vague and not dart throw – it would mean I could be supremely lazy in the jobs I have had and there wouldn’t be so much risk on the line. I recently said something could save costs and I realized when I said it that I hadn’t yet enabled the person I had quoted to do the quantification, and that bothered me – the wave had moved faster than I had gotten to do my approach. I was behind the message. It wasn’t authentic. I hope to encourage others to tell me something will save cost while showing the estimate approach at the same time, not only the technical how, understanding what I’m looking for and where we all sit in our lifetime of doing this.

Baby’s First Cost Estimate: Using Cost Calculators

If we’ve never estimated “how much is this new tool or service going to cost us,” our first stab should be trying to find some kind of cost calculator for it. If it doesn’t exist, make one. Literally make a spreadsheet. Good vendors (and I know some are going to disagree with me) wear their price on their sleeve at least at a basic level, because they know that if they don’t they will lose part of their funnel, and then have some kind of negotiable enterprise discount they may or may not disclose. Look for a pricing tab on a vendor’s website. If we aren’t willing to sit down and do the exercise (or the vendor isn’t) then neither side is invested enough and that’s a signal.

If it’s AWS services, look to see if what you’re trying to use is in the AWS calculator ( https://calculator.aws/ ). I say this is your first stab because I’ve been you before the cost calculator existed or was really usable. It will get you part of the way there. My numbers with this approach were off because those prices are almost always negotiable at any scale – if it’s a new service, new startup, you may even be able to get some part of it for free. For example, even at AWS most services have some kind of free tier (the full list is here). Most startups need you. Literally: Ask. Can I use this for free for a certain amount of time as part of a pilot or POC and see what happens? I’m not saying this in a way to use a vendor – I’m saying that it’s a transaction. If something is new, you’re accepting that they may be trying to find product market fit and you may not be that fit and they could stop supporting you as a customer. You are taking a risk and so are they. And after you’ve done all of this, put together a 3-6 month spreadsheet of what each piece costs and why and ship it to the people around you who are going to use that tool. That’s baby’s first bottoms-up estimate.
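If a spreadsheet feels like too much ceremony, the same 3-6 month sheet can be generated from a short script. This is a minimal sketch using only the Python standard library; the line items, prices, and free pilot months are invented for illustration and should be replaced with numbers from the calculator or the vendor’s pricing page.

```python
import csv
import io

# Hypothetical line items:
# (name, monthly list price in USD, free pilot months, why we need it)
LINE_ITEMS = [
    ("managed-database", 1200.0, 0, "primary datastore for the service"),
    ("new-observability-tool", 800.0, 2, "vendor agreed to a 2-month pilot"),
    ("object-storage", 150.0, 0, "logs and build artifacts"),
]


def build_sheet(line_items, months=6):
    """Return rows of [item, why, month 1..N, total], ending with a TOTAL row."""
    rows = []
    monthly_totals = [0.0] * months
    for name, monthly_price, free_months, why in line_items:
        # Months covered by a free tier or pilot cost nothing.
        costs = [0.0 if m < free_months else monthly_price for m in range(months)]
        for m, cost in enumerate(costs):
            monthly_totals[m] += cost
        rows.append([name, why, *costs, sum(costs)])
    rows.append(["TOTAL", "", *monthly_totals, sum(monthly_totals)])
    return rows


def to_csv(rows, months=6):
    """Render the sheet as CSV text that opens in any spreadsheet tool."""
    header = ["item", "why", *[f"month_{m + 1}" for m in range(months)], "total"]
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


sheet = build_sheet(LINE_ITEMS, months=6)
print(to_csv(sheet, months=6))
```

The "why" column matters as much as the prices: it is what lets the people receiving the sheet argue with a specific line instead of with the total.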

Teenage Phase: Having The Talk

I’m not sure I’ve ever cussed so much in a blog post but because it’s financial math I’ve just seen so much shit financial math from people who want to own the world and from the perspective of the universe it feels right. Let’s talk about what the pre-teen and teenage phase of estimates looks like. What information did you forget when you did the estimate? What didn’t you know that no one told you when you first started as an engineer? It’s usually “The talk.”

Did your company already sign some sort of discount plan and you didn’t know because you were too junior to have access to that? That’s not your fault. Did you work for a company that isn’t internally transparent with their employees about their budgets, revenue/earnings? Not your fault. Make sure you solve the root cause there (work for companies who are transparent about those things so you can learn). Go get access to the discount plan and read it. Understand the numbers. Did someone else already sign an agreement? Can you get a better one by combining? When should you do that? Do you have FinOps? Do you have Savings Plans you need to know were bought? What are the terms? Read the docs. Learn. Get those who understand or worked on them to help you update your estimate. Now share it with your peers. See how it got better. Because you went through that exercise.
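To see how much “the talk” can move a number, here is a rough sketch of re-running a list-price estimate once you know the agreements in place. Every percentage below is hypothetical, and the model is a simplification: it assumes an enterprise discount applies first and that a fixed share of the remaining spend is covered by a commitment at a further discount. Real discount programs and Savings Plans have their own terms, so read the actual agreement before trusting this shape.

```python
def adjusted_monthly_cost(
    list_price_cost: float,
    enterprise_discount_pct: float,
    commitment_coverage_pct: float,
    commitment_discount_pct: float,
) -> float:
    """Apply a negotiated discount, then a commitment discount on the covered share.

    Assumption (not from any specific contract): the two discounts stack,
    and spend outside the commitment is billed at the enterprise rate.
    """
    after_enterprise = list_price_cost * (1 - enterprise_discount_pct)
    covered = after_enterprise * commitment_coverage_pct
    uncovered = after_enterprise - covered
    return covered * (1 - commitment_discount_pct) + uncovered


# Hypothetical inputs for a single month.
list_price = 100_000.0
updated = adjusted_monthly_cost(
    list_price_cost=list_price,
    enterprise_discount_pct=0.10,  # hypothetical negotiated discount
    commitment_coverage_pct=0.60,  # hypothetical share covered by commitments
    commitment_discount_pct=0.30,  # hypothetical commitment discount
)

print(f"List-price estimate: ${list_price:,.0f}")
print(f"Updated estimate:    ${updated:,.0f}")
print(f"Difference:          {1 - updated / list_price:.1%}")
```

A gap that large between the first estimate and the updated one is the argument for getting access to the agreements early.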

College: Discovering Price-Performance

It’s extremely hard to estimate changes like adopting new instance types on top of workloads that often also change as part of that adoption. Just as valuable as the above is estimating, instead, price-performance. This means you need two numbers: the cost of the instance type and a value for performance, which could be something like “requests per second per CPU core” in order to understand your exact dollar cost per request per core, but it could also be “payload per request.” It could even be time if time is your KPI value (and in many cases it is). Price-performance as a dollar per x truly depends on whether you are measuring compute, storage, or time. For example, you could end with $X/processing minute and the percentage delta improvement as your target because the length of time something took was that painful and business impacting.

You can do this after the fact if you have the data for those things in tools, but I think a better approach is to measure a price-performance for today, find a published percentage for how much others have increased price-performance for a similar change, and then use that percentage to dart throw where you think your application or workload might be after you make your change. At the end of the day it will let you see if you were right and if your model for estimates is something you can reuse on similar changes across workloads.

Similarly, if you are going bigger, let’s say from estimating the cost benefit of an instance change to something like “If we swap vendors what will happen?”, you still want to look at price-performance. At the end of the day, if a vendor is meeting 80% of your use cases and does not have many outages that cost you revenue, swapping that vendor may cost 5x more in people time than the vendor’s actual cost. That’s the truth. The issue becomes if you identify that the vendor actually only meets 50% of your truly needed use cases. The reason this is a price-performance exercise is that the performance piece is use cases. If you are in this scenario, what you really needed was a product manager. Performance from a feature perspective can be 0 – and you can’t divide by 0…because it’s undefined 🙂 .
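A small sketch of that measure, project, and check loop follows. It assumes “X% better price-performance” means X% more work per dollar, so dollars per unit of work are divided by (1 + X). The instance price, core count, request rate, and 20% published improvement are all hypothetical.

```python
SECONDS_PER_HOUR = 3600


def dollars_per_million_requests(
    hourly_instance_cost: float,
    cores: int,
    requests_per_second_per_core: float,
) -> float:
    """Price-performance expressed as dollars per one million requests served."""
    requests_per_hour = cores * requests_per_second_per_core * SECONDS_PER_HOUR
    if requests_per_hour <= 0:
        raise ValueError("performance must be positive to compute price-performance")
    return hourly_instance_cost / requests_per_hour * 1_000_000


def projected_cost(current_cost_per_unit: float, published_improvement_pct: float) -> float:
    """Dart throw: apply someone else's published improvement to today's number.

    Assumes the published figure means more work per dollar, so the cost
    per unit of work shrinks by a factor of (1 + improvement).
    """
    return current_cost_per_unit / (1 + published_improvement_pct)


def miss_pct(predicted: float, actual: float) -> float:
    """How far off the dart was, as a fraction of the prediction."""
    return (actual - predicted) / predicted


# Hypothetical measurements for today's instance type.
today = dollars_per_million_requests(
    hourly_instance_cost=0.34, cores=4, requests_per_second_per_core=250
)
target = projected_cost(today, published_improvement_pct=0.20)

# Later, after the change ships, measure again and compare (hypothetical value).
measured_after = 0.082
print(f"Today:     ${today:.4f} per million requests")
print(f"Predicted: ${target:.4f} per million requests")
print(f"Miss:      {miss_pct(target, measured_after):+.1%}")
```

Keeping the miss percentage around is what lets you judge whether the same model is reusable for the next workload.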
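The vendor version of price-performance can be sketched the same way, with use cases as the performance number. The annual vendor cost and use-case counts below are hypothetical, and the 5x people-cost multiplier is the rule of thumb from the paragraph above, exposed as a parameter rather than treated as a constant.

```python
from typing import Optional


def coverage(use_cases_needed: int, use_cases_met: int) -> float:
    """Share of the truly needed use cases the vendor actually meets."""
    if use_cases_needed <= 0:
        raise ValueError("need at least one required use case")
    return use_cases_met / use_cases_needed


def cost_per_met_use_case(annual_vendor_cost: float, use_cases_met: int) -> Optional[float]:
    """Dollars per use case met; None when nothing is met (undefined, not zero)."""
    if use_cases_met == 0:
        return None
    return annual_vendor_cost / use_cases_met


def estimated_swap_people_cost(annual_vendor_cost: float, multiple: float = 5.0) -> float:
    """Rule-of-thumb people cost of swapping, as a multiple of the vendor's cost."""
    return annual_vendor_cost * multiple


# Hypothetical vendor: $200K a year, 10 use cases we truly need.
annual_cost = 200_000.0
for met in (8, 5, 0):
    per_use_case = cost_per_met_use_case(annual_cost, met)
    label = "undefined" if per_use_case is None else f"${per_use_case:,.0f}"
    print(f"{coverage(10, met):.0%} coverage -> {label} per use case met")

print(f"Swap people cost (5x): ${estimated_swap_people_cost(annual_cost):,.0f}")
```

Returning None for zero coverage is deliberate: a vendor that meets none of the needed use cases has no meaningful price-performance, which is a product question before it is a cost question.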

Adulthood: Getting Price from All the People (Optional)

The final stage of estimates, and often one that takes many months on top of years of experience and the support of a lot of people, is recognizing when you’ve gotten enough small estimates, enough POCs, that you need to actually galvanize a bunch of people across divisions so you all get the same deal all at once and it’s not terrible.

As an engineer, you don’t want to press for this too early or you won’t be successful. You also don’t want to “own” it if it’s already being owned by another party who was ahead of you and intends to spend more. That is the absolute truth – if that’s your situation, just get on the train.

Owners need to make it easy for others and have their own mini “internal go to market strategy” so that this all happens at the right time. It’s significantly time consuming. The worst part is when people don’t know they need to be roped into a larger plan and you recognize what I call “The wave” which is when everyone around you is talking about the same thing all at once because some vendor decided to start the wave for you and hammered everyone’s doors. Trust me – you’ll know when you’re in a wave because you’ll be riding it and there won’t be an owner (or you have to figure out who it is). Once it’s clear who the owner is, confirm it, get behind them, figure out from them when they are making the call for everyone, let them own the deal, and get them all the information they ask for from you as an engineer. That’s how you get your real, true, final estimate and repeat this cycle all over again.

Throw the Dart

In some cases we may never make it to adulthood of estimates. The goal of adulthood is to make all our choices as cheap as humanly possible while still getting what we want at the quality we need.

Some estimates will never grow up to find out where the dart would have really landed. We may find in the POC that there were other reasons not to keep going, like, the vendor itself shutting down the product.

Those things used to worry me and now they make me laugh. Once I embraced that I can’t control every future I stopped trying because I realized how much time, happiness, and in some cases my own money I wasted trying to control for the future. Now I mostly help others throw darts, teach them to throw, and help them get better at it by seeing the journey as one worth going on.

If I can leave one takeaway of wisdom – if you are confident in your number and you did the work – make people laugh. Compare your estimate to something tangible. I used to estimate really expensive instance costs using gold bars thrown into toilets if I felt they were unnecessary from a compute perspective. Help people galvanize for you by making it easy to remember what you were saying. And when in doubt, compare it to the cost of people relative to you times 5 based on where the people are located who are doing the work (including the customers). “How many people is this change over the time of the change as the delta of sunk cost vs cost saved?” is always something anyone will care about. A $1M YoY cost saving that actually cost 1 US FTE Sr Eng, 0.5 US FTE Junior Eng, and a 10% management cost (FinOps, Production, Manager, Customers adopting) to implement end to end from idea over 6 months isn’t $1M in savings – it’s more like $700K, which translates to a delta of 2 to 4 US Eng salaries depending on the company you work for, the location of each team member, and their level of experience. That’s decently good ROI if that cost is truly YoY.

If you’ve saved 5x the people cost of a junior engineer with a technical change in a 6 month window using the work of only one person, that’s significant. For a manager I look for 10x, or for a director 20x, divided by the number of people they are responsible for (in my eyes). This means I’m trying to look for a bunch of 5x people costs every time a change is proposed no matter where I work: If it’s not that, it’s a sunk cost and I haven’t gotten that message across clearly. The easier someone makes it for me to see that, the easier it is for me to galvanize it at scale and say yes regardless of if it’s my team or another team.
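The $1M versus roughly $700K arithmetic above can be sketched as follows. The FTE split, the 6 months, and the 10% management cost come from the example; the fully loaded annual costs per engineer are hypothetical placeholders, and I am assuming the 10% is applied on top of the engineering cost.

```python
def implementation_cost(
    fte_allocations: list[tuple[float, float]],
    months: float,
    management_overhead_pct: float,
) -> float:
    """People cost of a change.

    fte_allocations: (fraction of an FTE, fully loaded annual cost) per person.
    management_overhead_pct: assumed to be a percentage on top of engineering
    cost (FinOps, production, managers, customers adopting).
    """
    annualized = sum(fte * annual_cost for fte, annual_cost in fte_allocations)
    engineering_cost = annualized * months / 12
    return engineering_cost * (1 + management_overhead_pct)


def net_first_year_savings(gross_annual_savings: float, cost_to_implement: float) -> float:
    """Savings left after paying for the people who made the change."""
    return gross_annual_savings - cost_to_implement


cost = implementation_cost(
    fte_allocations=[
        (1.0, 400_000),  # 1 US FTE senior engineer; loaded cost is hypothetical
        (0.5, 250_000),  # 0.5 US FTE junior engineer; loaded cost is hypothetical
    ],
    months=6,
    management_overhead_pct=0.10,
)
net = net_first_year_savings(1_000_000, cost)

print(f"Cost to implement: ${cost:,.0f}")
print(f"Net savings:       ${net:,.0f}")
```

With these placeholder salaries the net lands a little above $700K. Different locations and levels will move it, which is the point of showing the inputs.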
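As a quick check on a proposal, the multiple can be computed directly. This sketch only compares savings against people cost using the 5x, 10x, and 20x bars mentioned above; it does not attempt the per-report division for managers and directors, and the dollar amounts are hypothetical.

```python
# Target multiples from the text; how they are applied here is a simplification.
TARGET_MULTIPLES = {"engineer": 5.0, "manager": 10.0, "director": 20.0}


def savings_multiple(savings: float, people_cost: float) -> float:
    """Savings expressed as a multiple of the people cost that produced them."""
    if people_cost <= 0:
        raise ValueError("people cost must be positive")
    return savings / people_cost


def clears_bar(savings: float, people_cost: float, role: str = "engineer") -> bool:
    """True when the savings meet or exceed the target multiple for the role."""
    return savings_multiple(savings, people_cost) >= TARGET_MULTIPLES[role]


# Hypothetical: six months of one junior engineer costs $100K fully loaded.
six_month_people_cost = 100_000.0

for proposed_savings in (550_000.0, 400_000.0):
    multiple = savings_multiple(proposed_savings, six_month_people_cost)
    verdict = "clears 5x" if clears_bar(proposed_savings, six_month_people_cost) else "below 5x"
    print(f"${proposed_savings:,.0f} saved -> {multiple:.1f}x people cost ({verdict})")
```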

If the whole company operates that way? The easier it is to make the important things happen and achieve selflessness. The easier it is to actively look for those types of savings. If not – it just results in numbers driven by self-fulfilling fantasies easily seen as such through the final P&L.

Header Image by Jingming Pan from Unsplash.