Sustainable Infrastructure (C3-P-O series)

I like acronyms. More specifically, I like being able to capture concepts with acronyms that help practitioners remember and internalize them. You may have noticed this in my posts about TILT, PRO, and ACID. One acronym that goes back quite a few years in my career conveys my guiding principles of reliable and sustainable infrastructure: C3-P-O. It has an interesting origin story.

It was around 1:30am when the text came. I don’t think I heard it until the second time it alerted, but when it did register and I realized what was happening, that familiar gnawing feeling crawled into my gut. We were having an outage, but I didn’t know what, where, why, or how bad yet. I rolled out of bed, threw on pants and a shirt, and made my way to my home office to crank up my laptop. By the time I got up and running and dialed in, the bridge was already in full swing. Several critical parts of our ERP suite were offline, and the next day’s supply-chain order flow was at risk. No idea of root cause. Folks were looking everywhere for clues.

Trying to bring some order to the chaos, I began to get everyone oriented to a stepwise troubleshooting flow, starting with the simplest things first. The first reported symptom was loss of access to our ERP portal, so let’s start there. DNS was good, we could resolve the VIP hostname for the portal. Can we ping the VIP? No. Are the servers in the VIP pool failing health checks, causing the VIP to be down? Um, hold on…we can’t connect to the admin console of the load balancer. Even the LB’s OOB management IP isn’t reachable. Wait, what? Is the LB completely down? Why didn’t it fail over to the standby? Can someone check the standby? The standby is unreachable too? What the heck is happening?

That scenario played out in mid-2015. I had recently inherited the network engineering teams, and while the people were smart, skilled, and motivated, the network environment was suffering from many years of rapid organic growth and uneven management. We were also battling the “if it works, don’t change it” philosophy of some of the previous leaders in the space. As a result, we had incredibly inconsistent architecture, a crazy mix of tools and equipment, and scores of devices that hadn’t seen a code rev since they were installed. We were having regular issues and outages like the one described above, leading to frequent and significant business disruption.

In the case of the ERP outage, the standby load balancer had failed sometime in the previous months and had never been fixed. When the engineer couldn’t make progress on getting it to work again, he simply powered it down and made a mental note (which he promptly forgot) to circle back to it later. When the primary load balancer subsequently failed from a hardware issue, we had no way to recover quickly, and no one knew why the standby was offline. It took us until well into the daylight hours to get things back up and running, and we missed critical deadlines for our global supply chain along the way. That was a tough conversation with my CTO, to say the least.

As we began the process of unwinding the chaos and restoring sustainable health to the ecosystem, I wanted to make sure that the team had a good foundational understanding of the characteristics of a well-managed environment. To help with this, I came up with the acronym C3-P-O and began using it to reinforce my five baseline principles: Consistency, Currency, Continuity, Predictability, and Observability. While I will never claim that these are the only things that matter for good design and infrastructure health, I’ll assert that if you’re taking care of those five things, you’re well on your way to building reliable and operationally sustainable infrastructure. Grounding the team in these principles gave them something to measure their work against and a mechanism to validate decisions along the way.

Over the next five posts in this series, I’ll cover each of the five principles of C3-P-O in turn. I hope you find them useful for your own infrastructure journey!

Copyright

CC BY-NC-ND 4.0