As I described in the introduction to this series, I’ve used the acronym C3-P-O as a tool to help the teams I lead remember some key principles for maintaining a healthy infrastructure environment. This post is about the “P” in the acronym - Predictability.
Before we dig into the details of this part of the acronym, I want to clarify the meaning and intent of using the term Predictability. First, let’s see how Merriam-Webster defines it:

I'm specifically using Predictability in the sense of "behaving in a way that is expected." I'm looking for an evidence-based expectation, rather than a hunch based on nothing more than a vague feeling of what one thinks will happen. Being able to predict implies inference derived from experience, facts, and/or testing.
As to the intent of this portion of the acronym: we declare that all services we build must have known and predictable performance, scaling, failure, and management characteristics. Achieving this with 100% certainty is of course not possible, but that's not the measure of success. Outside of extraordinary circumstances, we should be able to predict with high certainty how a function or service will act for a given set of inputs - e.g., increased load, component failure, traffic pattern changes, configuration changes, upgrade processes, etcetera. If the expected outcomes of these things are not known, we must do the work to gain that knowledge. This could be through lab testing, chaos testing, or even through virtual simulation tools.
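To make that concrete, here's a minimal sketch of turning an expectation into evidence: ramp load against a service and compare observed latency to what you predicted for each load level. The endpoint, concurrency levels, and latency thresholds are all hypothetical placeholders, not values from any real system.

```python
# A minimal sketch of an evidence-based load test (hypothetical endpoint and
# thresholds): ramp up concurrency and compare observed p95 latency against
# the envelope we *predicted* for each load level, rather than guessing.
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SERVICE_URL = "http://service.example.internal/healthz"  # hypothetical endpoint
PREDICTED_P95_MS = {1: 50, 10: 80, 50: 200}               # our stated expectations

def timed_request(_):
    start = time.monotonic()
    urllib.request.urlopen(SERVICE_URL, timeout=5).read()
    return (time.monotonic() - start) * 1000.0            # latency in milliseconds

for concurrency, expected_ms in PREDICTED_P95_MS.items():
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_request, range(concurrency * 20)))
    p95 = statistics.quantiles(latencies, n=20)[-1]        # 95th percentile
    verdict = "OK" if p95 <= expected_ms else "INVESTIGATE"
    print(f"concurrency={concurrency}: p95={p95:.0f}ms "
          f"(predicted <= {expected_ms}ms) -> {verdict}")
```

If the observed behavior doesn't match the prediction, either the prediction or the system needs work - and either way, you've learned something you can now state with evidence.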
As a simple but concrete example, let's examine redundant power distribution capabilities to infrastructure hardware. In a typical data center you'll have distinct electrical circuits supplying A and B power to your racks, connected to A and B Power Distribution Units (PDUs) in the racks, which in turn supply A and B power to N+1 power supplies in your equipment. This is to ensure that an interruption of power delivery on one circuit feed does not take all of your equipment down (and also to allow for non-disruptive upstream power equipment maintenance as needed). You pay for this level of resilience and redundancy, but can you say with high certainty that you'll not suffer an outage if someone were to open the "A" breaker on your Remote Power Panel (RPP)? If you've done the work to verify resiliency through at minimum a hand-over-hand connectivity check, you could say yes. Even better is a regular cadence of physically exercising the resiliency by opening the breakers at the RPP. If you've never done this type of test before, you should - you may be surprised by what you discover!
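One way to make that hand-over-hand check repeatable is to capture what you traced as data and validate it. The sketch below uses hypothetical inventory records and simply asserts that every device draws from both the A and B feeds via distinct PDUs, so opening one RPP breaker can't take it down.

```python
# A sketch of a "hand-over-hand" power-path check expressed as data validation
# (hypothetical inventory): every device should draw from both the A and B
# feeds via distinct PDUs.
INVENTORY = {  # device -> list of (psu, pdu, feed) records from the walk-through
    "switch-01": [("psu0", "pdu-a1", "A"), ("psu1", "pdu-b1", "B")],
    "server-07": [("psu0", "pdu-a1", "A"), ("psu1", "pdu-a1", "A")],  # miscabled!
}

for device, paths in INVENTORY.items():
    feeds = {feed for _, _, feed in paths}
    pdus = {pdu for _, pdu, _ in paths}
    if feeds >= {"A", "B"} and len(pdus) >= 2:
        print(f"{device}: redundant power paths verified")
    else:
        print(f"{device}: AT RISK - feeds={sorted(feeds)}, pdus={sorted(pdus)}")
```

A check like this catches the quiet miscabling that creeps in between physical breaker tests; it complements the real exercise at the RPP, it doesn't replace it.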
Prudent organizations integrate this type of Chaos Engineering into their operations, specifically so that when an unforeseen event does occur, the outcome isn't a complete surprise. To quote Seneca, "luck is when preparation meets opportunity." You can't test for all eventualities, but if an unforeseen event causes an outage in infrastructure that I'm responsible for, I would much rather be in a position to say that testing had never uncovered the gap than to say we'd never tested failure scenarios at all. Make a list of your base assumptions regarding the resiliency of systems you operate, and think about the last time you actually validated those assumptions. You probably have some work to do.
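A lightweight way to start that list is to treat each assumption as data with a last-validated date attached, so stale assumptions surface on their own. The entries and review cadence below are purely illustrative.

```python
# A sketch of a resiliency-assumption register (hypothetical entries): record
# when each assumption was last exercised, and flag anything that has gone stale.
from datetime import date, timedelta

MAX_AGE = timedelta(days=180)  # assumed review cadence of roughly six months

ASSUMPTIONS = [
    ("Rack power survives loss of the A feed", date(2023, 11, 2)),
    ("Database fails over within 60 seconds",  date(2021, 6, 15)),
    ("Backups restore cleanly to a new host",  None),  # never validated
]

today = date.today()
for claim, last_validated in ASSUMPTIONS:
    if last_validated is None or today - last_validated > MAX_AGE:
        print(f"STALE: {claim} (last validated: {last_validated or 'never'})")
    else:
        print(f"ok:    {claim} (last validated: {last_validated})")
```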
Next up, Observability!