Security is Job 0: Finding & Solving Shared Single Points of Failure (SPOFs)

Finding the gaps to find the SPOF

“Major Tom” by Shiny Toy Guns, a redux of “Major Tom (Völlig losgelöst)” by Peter Schilling…a followup to David Bowie’s “Space Oddity” 🎵

Adaptive challenges require relentless willingness to seek other voices.

Talk to anyone in cloud and they will offer technical solutions to today’s weirdest security challenges. That is one of those voices. It’s great, but sadly it is never enough. The most common recommendation?

“Make sure you have great RBAC!!!!”

Of course.

Do it anyway, and revisit your role-based access controls yearly too, but don’t let that be the only security idea you recommend: your teammates already know it and are likely just struggling to live and operate within it. Sure, follow the principle of least privilege. Make sure everyone has an HSM (hardware security module). Start using policy as code. Don’t let everyone and their World of Warcraft alts be able to sudo. Try to do all of this while people re-org and change companies. This is the foundation, and also a full-time job for a minimum of 10 people at any large company. These are surface-level solves, technical solves – and they are already hard enough.

On top of that though, if teams want to find their darkest vulns, they need another assumption:

Some vulns aren’t theirs. 🙂

That requires much more than technical solutions.

We may be forced to live with the deepest single points of failure, unless we call out the gap by sharing with the planets in the galaxies that have it – traversing space time and getting all those people to work together.

That requires an adaptive approach where we assume we can’t do it alone.

Predicting the Future…

Good RBAC isn’t going to prevent someone from hitting a publicly accessible API endpoint that a team needs to be publicly accessible. This is a natural “SPOF” or “single point of failure” that may not be avoidable by design, and only vigilance will help in the event of a DoS or DDoS. This part you can solve technically.

For example, we can design architecture to be highly available, resilient, and fault tolerant (and somewhat cost tolerant) to a DDoS or DoS by (1) prevention, (2) absorption, or (3) a combination of both. If a team sets up AWS Web Application Firewall (AWS WAF) on top of an Amazon API Gateway endpoint, they can watch multiple resources like a hawk and restrict anomalous traffic to those resources with pre-written managed rule sets that help prevent SQL injection and cross-site scripting (XSS) attacks. WAF rules can also block or allow specific IP addresses, or requests from specific CIDR blocks. If a team uses AWS Shield Advanced, they get someone else, with access to even more data and visibility, to watch infrastructure for them. That includes monitoring metrics for specific Amazon ARNs and detecting whether a DDoS is underway, while also looking at global threats that affect everyone.
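To make the "prevention" half concrete, here is a minimal sketch of the kind of decision an IP allow/block list plus a rate-based rule makes on each request. It is not AWS WAF and does not call any AWS API; the CIDR ranges, the limit of 3 requests per 60 seconds, and the `decide` function are all made up for illustration.

```python
import ipaddress
from collections import defaultdict, deque

# Hypothetical rule data. In a real setup these would live in WAF
# configuration, not in application code.
BLOCKED_CIDRS = [ipaddress.ip_network("203.0.113.0/24")]
ALLOWED_CIDRS = [ipaddress.ip_network("198.51.100.0/24")]
RATE_LIMIT = 3        # max requests per source IP per window (illustrative)
WINDOW_SECONDS = 60

_recent = defaultdict(deque)  # source IP -> timestamps of recent requests


def decide(source_ip: str, now: float) -> str:
    """Return "allow" or "block" for a single request."""
    addr = ipaddress.ip_address(source_ip)

    # Prevention: an explicit block list wins first.
    if any(addr in net for net in BLOCKED_CIDRS):
        return "block"

    # Trusted ranges skip rate limiting in this sketch.
    if any(addr in net for net in ALLOWED_CIDRS):
        return "allow"

    # Rate-based rule: forget timestamps outside the window, then count.
    window = _recent[source_ip]
    while window and now - window[0] >= WINDOW_SECONDS:
        window.popleft()
    if len(window) >= RATE_LIMIT:
        return "block"
    window.append(now)
    return "allow"
```

Calling `decide("192.0.2.1", t)` for several requests inside one window allows the first three and blocks the rest until older requests age out. Notice what the sketch assumes: you already know which endpoint to put the rule in front of.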

If a team does get DDoSed on a Shield Advanced protected resource (ARN) and decides to eat it (mitigation can’t solve for everything), they can get credits back because they paid for AWS Shield Advanced, but only if they had already added that known SPOF to it.
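The "only if they had already added it" part is easy to check for the endpoints you know about. A rough sketch, assuming you keep an inventory of public-facing ARNs and can get the list of protected ARNs from somewhere (in practice that might come from the Shield ListProtections API; here both lists are hard-coded, and every ARN and the function name are hypothetical):

```python
def unprotected_endpoints(public_arns, protected_arns):
    """Return public-facing ARNs with no protection registered, sorted."""
    return sorted(set(public_arns) - set(protected_arns))


# Hypothetical inventory of things the team knows are public.
public = [
    "arn:aws:cloudfront::123456789012:distribution/EXAMPLE1",
    "arn:aws:cloudfront::123456789012:distribution/EXAMPLE2",
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/example/0123456789abcdef",
]

# Hypothetical list of resources already added to the protection service.
protected = [
    "arn:aws:cloudfront::123456789012:distribution/EXAMPLE1",
]

gaps = unprotected_endpoints(public, protected)
for arn in gaps:
    print("known public endpoint without protection:", arn)
```

This only finds gaps in the inventory a team already has. A SPOF nobody wrote down never makes it into `public` in the first place.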

This is an important point because it implies a team knows everything they possibly can about their own stuff and did something about it – and that is the problem with technical solutions to adaptive challenges.

If the path is already there to mitigate an attack, I have to ask, isn’t that still staying on the dance floor and not really getting a balcony perspective?

We can do a better job of stopping the worst of the worst in the future:

…By Getting Perspective From Analyzing the Past

The most interesting SPOFs I’ve seen in enterprise architecture have come from (1) fundamental operating (not architecture) patterns that need to change and (2) co-dependencies with people outside the organization.

Getting ahead of those requires adaptive solutions over long periods of time, plus lots of transparency, trust, and sharing privately. It requires “unlocking” who knows about the SPOF by incrementally sharing what you’ve found with the right parties (for example by sharing a post-mortem, doing an audit in a very specific area, or looking at repo changes) and making the right choices at the right time.

It requires first believing that there is one and that your team, your company, might not be able to solve it alone. It requires hope that the other parties see it too and are prepared and willing to let something else go to prioritize it.

The deepest SPOFs are adaptive challenges where solving requires teams to: Relisten to the music from a new point of view and get on the freakin’ balcony. 😉

From Ronald Heifetz and Marty Linsky’s Leadership on the Line: getting on the balcony during a dance (or rather a situation) “means taking yourself out of the dance, in your mind, even if only for a moment. The only way you can gain both a clearer view of reality and some perspective on the bigger picture is by distancing yourself from the fray. Otherwise you are likely to misperceive the situation and make the wrong diagnosis, leading you to misguided decisions about whether or how to intervene” (Heifetz & Linsky, 53).

While the book itself is a deep take on managing leadership and diplomacy in high profile political, business, and community circumstances, it absolutely applies to architecture. Namely, I can think of a few specific SPOFs I’ve gotten ahead of by having to do this and they were all terrifying, multi-company, time sensitive SPOFs where talking to each other across companies was more important than trying to fix it right there with a technical solution.

I’m not going to write about SPOFs I’ve found in detail because they are related to security at places I’ve lived, but I want to share the two adaptive approaches that were taken to find and solve for them so you too can do this to protect your own architecture:

Study the Choices of Others Over Time
I unfortunately once found a low-level networking and data center dependency SPOF (not at AWS) using the Wayback Machine across three company websites acquired over a 10-year period. This would have been fine had nothing happened to that SPOF, but I assumed something could, prepared, and planned accordingly. And it did. To find the SPOF? I had to get on the balcony and look at things from a very broad perspective across time, people, and multiple companies. I had to relentlessly ask: Why did they make these choices and use these naming conventions? Why are they still the same? Everyone has skeletons in their closet.

It is very time consuming to protect businesses this way so you need to develop your own awareness and understanding about what architects and engineers will do (and not do) when under business pressure from cost or moving fast. Engineers and architects must be naturally curious enough to look at the choices of others, their past impact, their current impact, and their future impact in a way that is not self-serving or vision-locked into their goals at all.
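The looking-across-time-and-companies part can be sketched in a few lines once you have notes to work from. This is not the process I actually used; it is a small illustration that assumes you have already written down, by hand, which dependency names showed up in archived pages per company per year. The company names, hostnames, years, and the `shared_long_lived` function are all hypothetical.

```python
def shared_long_lived(observations, min_years=5):
    """observations: {company: {year: set of dependency names}}.

    Returns {dependency: (first_year, last_year)} for dependencies that
    were seen at every company and whose overall span is >= min_years.
    """
    seen_by = {}  # dependency -> companies where it appeared
    years = {}    # dependency -> years in which it appeared
    for company, by_year in observations.items():
        for year, deps in by_year.items():
            for dep in deps:
                seen_by.setdefault(dep, set()).add(company)
                years.setdefault(dep, []).append(year)

    all_companies = set(observations)
    result = {}
    for dep, companies in seen_by.items():
        first, last = min(years[dep]), max(years[dep])
        if companies == all_companies and last - first >= min_years:
            result[dep] = (first, last)
    return result


# Hypothetical notes taken from archived snapshots of three sites.
observations = {
    "company_a": {2013: {"dc1.example.net", "cdn.example.org"},
                  2018: {"dc1.example.net"}},
    "company_b": {2015: {"dc1.example.net"},
                  2022: {"dc1.example.net", "mail.example.com"}},
    "company_c": {2020: {"dc1.example.net", "cdn.example.org"}},
}

candidates = shared_long_lived(observations)
```

With the notes above, `candidates` contains only `dc1.example.net`, the one name that every company shares and that never went away. The code just surfaces a candidate. Asking why it is still there, and who owns it, is still the human part.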

Look for the Gaps: Then instead of Solving It – Find the Gap Owner
The other thing I reviewed is choices in documentation over time. In public technical docs, labs, training, of the tools you use – what are authors saying NOT to do? What are they saying to do? What are they saying that isn’t well explained or doesn’t make sense? What looks not fleshed out? What is causing your team pain?

What caused us to make a mistake? Then ask the two key followups: Why does the author want you to do it that way (or not want you to)?

And…

Do you trust that the choice you were required to operate around will be addressed by the party who put you in that situation, and that it’s a high priority for them based on their leadership principles? Because if that’s the case – don’t try to solve it technically. Reach out and try to understand if they are trying to solve it and missing your perspective. You may unlock something you never knew existed, and to solve it well, both parties need to talk to each other really deeply, transparently, and openly.

Security is Job 0 Getting Other Views Is Job 0

I would not advise talking shit about RBAC.

However, for the most impactful security changes across shared organizations and communities, technical solves are not the only ones we need. We also need adaptive solutions that require a willingness to stop at nothing to get (and give) perspective, incrementally, in the safest ways we can.

This requires incredibly talented, clever engineers, on all sides, who value seeking other voices over being right to work towards getting in the same room, on the same balcony before going back to the dance floor.

– Molz (aka SPOF)

Image Credit: ESA/Hubble & NASA, R. Tully | Text Credit: European Space Agency (ESA) “Jun 30. 2023 Hubble Checks in on a Galactic Neighbor: The program to capture all of our neighboring galaxies was designed to use the 2-3% of Hubble time available between observations. It’s inefficient for Hubble to make back-to-back observations of objects that are in opposite parts of the sky. Observing programs like the one that captured ESO 174-1 fill the gaps between other observations. This way the telescope can move gradually from one observation to another, while still collecting data. These fill-in observing programs make the most out of every last minute of Hubble’s observing time.”