I Destroyed This Site: For 5 Minutes, Then 2, Then 0 Seconds

I had a realization last week that I needed to upgrade WordPress (again). I’ve been shamelessly upgrading it in a way that I like to call “Yolo” upgrades, aka just pressing the upgrade button. There’s a lot wrong with this – for one, it meant I believed in my snapshots without ever testing them, while putting a ton of faith in WordPress and the co-dependencies of my plugins. So yesterday, I sat down on my couch, turned on the fireplace, opened up my laptop, logged into AWS and, like a completely normal individual, said “I am going to bring down SEV 1 Party on purpose.”

I got more excited to break it than to prepare for breaking it.

I rolled up the sleeves of my Xbox One hoodie: How long do I want to bring it down for? What’s my recovery time objective? 2 Hours? 1 Hour?

Molly, game developers at Take-Two trust you with Kubernetes – you don’t get that luxury with your personal blog. People are going to think it’s running on some ungodly fault tolerant semi (It’s the opposite). Fine. 5 minutes. I am allowed to be down, For 5 minutes. In Production (Reminds herself she actually has no development environment for her personal blog).

Step 1: I checked my snapshots & decided in what AZ I would create a new site.

Currently I have Amazon Lightsail making daily snapshots. I ended up creating a new instance from the most recent daily because I hadn’t yet written this post nor made any changes for a week. This meant my Recovery Point Objective (RPO) was exactly 1 day. One of the reasons I like being lazy with my blog is if I do decide to SEV 1 screw it up I know what my own RPO is by a measurement of my laziness. I said I’d be consistent, mentors/mentees. I did not say I would not be lazy. The truth is, if I start posting more or updating it more, my RPO gets shorter. And then I have to care.
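To make the "RPO by laziness" idea concrete, here is a minimal sketch of the arithmetic: the data you would lose is whatever happened between the last good snapshot and the moment things broke. The function names and timestamps are hypothetical, made up for illustration; nothing here talks to Lightsail.

```python
from datetime import datetime, timedelta, timezone


def data_loss_window(last_snapshot_at: datetime, failure_at: datetime) -> timedelta:
    """Time between the last good snapshot and the failure (what would be lost)."""
    if failure_at < last_snapshot_at:
        raise ValueError("failure time is earlier than the snapshot time")
    return failure_at - last_snapshot_at


def meets_rpo(last_snapshot_at: datetime, failure_at: datetime, rpo: timedelta) -> bool:
    """True if restoring from the snapshot keeps data loss within the RPO."""
    return data_loss_window(last_snapshot_at, failure_at) <= rpo


# Hypothetical times: a daily snapshot at 03:00 UTC, a failure that evening.
snapshot_time = datetime(2024, 1, 10, 3, 0, tzinfo=timezone.utc)
failure_time = datetime(2024, 1, 10, 21, 30, tzinfo=timezone.utc)
window = data_loss_window(snapshot_time, failure_time)
within_one_day = meets_rpo(snapshot_time, failure_time, timedelta(days=1))
```

With daily snapshots the window can never exceed roughly a day, which is why a blog that changes once a week gets away with it.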

Step 2: I got so excited to be destructive I forgot to create a second Static IP. If this website had actual architectural thought behind it, that wouldn’t have been such a big deal, but because it doesn’t, it was.

I checked to make sure the instance was up and the public IP worked – it did. Time to bring down my other site! Given that I had never done this on this website, here’s what you should know – when one does not practice disaster recovery or chaos engineering, they are guaranteed to mess it up. Practicing change prevents disaster because you cause it more often. This is multiplied by the complexity of the system you manage and people – in this case, not complex. You see, the A record for my website needed to point to the new static IP and I had forgotten it wasn’t pointing to something wise like a load balancer or CloudFront because generally, I’m not used to working anymore on infrastructure that’s hyper abstracted.

But having forgotten that the new instance had only a public IP and no static IP, I stopped my website and within 60 seconds, the TTL for mollysheets.com, I visited and realized – my site was fully hard down. And I laughed really hard remembering I had effectively done nothing to make this blog resilient. I started a 5 minute mental timer, which started with me putting on an orchestral version of Eye of the Tiger by Joseph William Morgan (in homage to the original Survivor masterpiece). I also still had to wait for the instances to come back online due to order of operations failure, which is another thing that will happen if you don’t prepare.

Step 3: Thanks, Jetpack!

Jetpack kicked in and, about 2 minutes in, shot me the email version of “Molly you dipsh—-” but way nicer. I smiled, “Yay! My website really does tell me when it’s down without me having to set up any complicated third party monitoring or CloudWatch alarms.” I had enough time waiting for instances to come back online to put a soundtrack to my <actual disaster> I created while making a real life DR plan <in production> and also plenty of time to create a static IP for the instance I should have had fully up as a cloned environment. 2 minutes left. I attached a static IP to the other instance so I could re-test my swap. I was also, now, back fully online.

Let’s go again. After attaching the static IP to my DR instance I repeated my exercise – change the A record (Yup, you read that correctly). All of this could have gone faster had I set my TTL to less than 60 seconds while I was doing my, quite frankly, “Yolo DR strategy.” Given that I wanted fewer steps in my process, I kept the TTL in Route 53 as is. I tested swapping between both active-active instances with my A records – awesome! I stopped my instance again on the main site, swapped, then swapped back. Took a mental note for some future ideas. Now I had two identical active-active hosts, either of which I could upgrade.
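The swap above was done by hand in the console. For anyone who would rather script it, here is a rough sketch of the same A record change, assuming boto3 is installed and AWS credentials are configured. The domain, IP address, and hosted zone ID are placeholders (the IP comes from a documentation-only range), and the helper names are invented for this example.

```python
import ipaddress


def build_a_record_swap(domain: str, new_ip: str, ttl: int = 60) -> dict:
    """Build a Route 53 change batch that points `domain`'s A record at `new_ip`."""
    ipaddress.IPv4Address(new_ip)  # raises ValueError if new_ip is not valid IPv4
    return {
        "Comment": "DR swap: point the A record at the standby instance",
        "Changes": [
            {
                # UPSERT creates the record if missing, otherwise overwrites it.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": domain,
                    "Type": "A",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": new_ip}],
                },
            }
        ],
    }


def apply_swap(hosted_zone_id: str, change_batch: dict) -> dict:
    """Send the change batch to Route 53. Needs boto3 and AWS credentials."""
    import boto3  # imported here so the builder above works without boto3

    client = boto3.client("route53")
    return client.change_resource_record_sets(
        HostedZoneId=hosted_zone_id, ChangeBatch=change_batch
    )


# Placeholder values for illustration only.
batch = build_a_record_swap("example.com.", "203.0.113.10", ttl=60)
# apply_swap("ZEXAMPLEZONEID", batch)  # uncomment to actually change DNS
```

Swapping back is the same call with the original static IP, which is what makes rehearsing it cheap.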

Step 4: Upgrade WordPress

I upgraded WordPress on one host while my other active instance was still live. The typical process to follow for WordPress upgrades is to upgrade WordPress core if you have an update, then themes and plugins. If you’re wondering if I’ve done it in a different order before, the answer is yes, and I’ve broken the crap out of WordPress – always with a way to roll back (longer), hotswap / canary to an active environment, or upgrade the copy and move over (the smart way). The upgrade went smoothly because I don’t actually have a ton of plugins and keep it upgraded regularly. The theme I use is also not complicated – but now I feel like I have a practiced method to upgrade WordPress that isn’t Yolo upgrading and generally know my snapshots work and how to quickly get a copy of my site up anywhere in Lightsail, a fully managed service,…which I used in a hybrid-managed way. The dream would be to upgrade while keeping the hotswap live and then canary to the new version, but that requires a more complicated approach and not being lazy. It also requires spending money that I’d rather spend on truffle cheeses and French wine.

I now know that if I lowered the TTL I could effectively be down for the speed of my fingers.
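A back-of-the-envelope way to see why: if the old instance is stopped first and the A record is changed afterwards, a resolver that cached the old record just before the change can keep serving it for one more full TTL. The sketch below is only that arithmetic; the function name and the 20 second "finger speed" are assumptions for illustration, and real resolvers do not always honor TTLs exactly.

```python
def worst_case_downtime_seconds(ttl_seconds: float, swap_seconds: float) -> float:
    """Upper bound on visible downtime when the old host is stopped BEFORE the swap.

    swap_seconds: time from stopping the old host to saving the new A record.
    ttl_seconds:  how long a resolver may keep serving the old cached record.
    """
    if ttl_seconds < 0 or swap_seconds < 0:
        raise ValueError("durations must be non-negative")
    return swap_seconds + ttl_seconds


# Assumed 20 seconds of clicking, with a 60 second TTL versus a 5 second TTL.
with_60s_ttl = worst_case_downtime_seconds(ttl_seconds=60, swap_seconds=20)
with_5s_ttl = worst_case_downtime_seconds(ttl_seconds=5, swap_seconds=20)
```

If both hosts are up and the record is swapped before the old one is stopped, visitors land on a working host either way, so the visible downtime can approach zero.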

Reasonable people may be wondering “Molly, why don’t you use a load balancer…You know you could have done it seamlessly? Are you planning on keeping active-active instances? Why don’t you have a dev environment for your blog? You’re editing IN THE CONSOLE? Weren’t you using Pulumi last year?”

How can anyone trust you with Kubernetes!?
These are all great questions.

There’s a point where you should look at your infrastructure and ask “Should I be doing it this way?” and the answer is – “How cheap are you and would you rather spend your money on The Garden Room?” (Yes) and also “Who actually reads your blog?”

My current setup tests me. It tests me a lot and I love that. It’s also incredibly cheap and makes people laugh.

However, after this exercise I did set up CloudFront (Thanks to this blog by Guillermo Quiros with specific WordPress cache settings that are abstracted if you use a Lightsail distribution), which has been on my “to do.” Every addition to a stack – whether it’s a load balancer, autoscaling, more instances, more process – is complexity. You then have to pay down complexity with knowledge, documentation, practice. Often teams do not keep up the delta of practicing disaster with the constant change they inject into stacks. It becomes harder to respond and often that response requires subject matter expertise. My website humbles me and reminds me that simplicity is speed.

In Conclusion

I now know after this exercise that my snapshots work. I learned that if I don’t practice destruction I will make incredibly dumb mistakes as someone who knows better, but due to the simplicity, those dumb mistakes can be figured out in less than 5 minutes.

I read absolutely 0 instructions. Was it perfect? Absolutely not.

But was it fun? Definitely.

Image by Brett Jordan on Unsplash