As I described in the introduction to this series, I’ve used the acronym C3-P-O as a tool to help the teams I lead remember some key principles for maintaining a healthy infrastructure environment. This post is about the first “C” in the acronym - Consistency.
High levels of operational overhead can often be traced directly back to inconsistencies in the environment. Inconsistency in hardware, configuration, software, or firmware between like components can lead to differences in behavior, slow issue remediation, and present significant barriers to the implementation of automation.
Managing and maintaining an infrastructure environment of any non-trivial size is an exercise in trade-offs. You need to trade the notion of “perfection for purpose” for the more mundane characteristic of efficient operability. When you have only one or two of something to manage and maintain, those things can be exquisitely tailored and optimized to a task. At scale, however, customization at the component level becomes overwhelmingly difficult to manage.
As part of the C3-P-O acronym, Consistency refers to the practice of aggressively driving as much “sameness” into the environment as is feasible. At the most basic level, this starts with minimizing variation in the hardware footprint. It’s typically infeasible to get down to just one supplier for most gear, but no more than two should be the target (on an exception basis, three is acceptable while an incumbent is being replaced). This applies to servers, storage arrays, routers/switches, firewalls, load balancers, and the like. Asking engineers to be experts on gear from multiple providers imposes a huge additional cognitive burden. Just imagine troubleshooting a major network routing issue while half-awake at 2am, having to context-switch between three different command sets across devices, and you get the idea.
Stepping up a level, devices of like type should all be running the same revisions of firmware and/or software. The one exception is while a code update is rolling out across the fleet; in that case, complete the fleet’s update before taking up yet another version. As with hardware, the critical count here is no more than two (current and next). We’ll talk more about how frequently you should update code in the post about Currency.
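To make that “no more than two versions” rule enforceable rather than aspirational, it helps to audit the fleet against an approved-version list on a schedule. Here is a minimal sketch of such a check; the inventory structure, device classes, and version strings are hypothetical stand-ins for whatever your CMDB or monitoring system actually provides.

```python
# Sketch of a fleet code-currency audit. Per device class, only the
# "current" and "next" versions are acceptable; anything else is flagged.

APPROVED_VERSIONS = {
    # device class: {current, next}
    "tor-switch": {"9.3.10", "9.3.12"},
    "edge-firewall": {"10.2.4"},          # no upgrade in flight for this class
}

# Hypothetical inventory; in practice this comes from your source of truth.
inventory = [
    {"name": "tor-sw-01", "class": "tor-switch", "version": "9.3.10"},
    {"name": "tor-sw-02", "class": "tor-switch", "version": "9.3.8"},   # stale
    {"name": "fw-edge-01", "class": "edge-firewall", "version": "10.2.4"},
]

def audit(inventory, approved):
    """Return devices whose running version is outside the approved set."""
    return [
        dev for dev in inventory
        if dev["version"] not in approved.get(dev["class"], set())
    ]

if __name__ == "__main__":
    for dev in audit(inventory, APPROVED_VERSIONS):
        print(f"{dev['name']}: version {dev['version']} is not approved "
              f"for class {dev['class']}")
```

Run regularly, a report like this makes drift visible long before it turns into a 2am troubleshooting surprise.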
When it comes to configurations, invest time in two critical aspects: documenting your standards for naming, structure, defaults, etc., so that everyone builds configurations the same way; and establishing off-device storage of configurations, to ensure that you can always get a last-known-good configuration of any device at any time without relying on the device itself as the source of truth. Spend time following up on these items on a regular basis to ensure that you catch drift in process as early as possible.
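The off-device storage piece can be as simple as a scheduled job that pulls each device’s running configuration and commits it to version control. The sketch below assumes a hypothetical device list, an existing git repository at a made-up path, and a placeholder fetch_running_config() function standing in for whatever your tooling provides (Netmiko, vendor APIs, RANCID, Oxidized, and so on); it is an illustration of the pattern, not a drop-in implementation.

```python
# Sketch of off-device configuration storage: fetch each device's running
# config and commit it to git, so a last-known-good copy always exists
# outside the device itself.

import subprocess
from pathlib import Path

CONFIG_REPO = Path("/srv/config-backups")   # assumed to be an existing git repo
DEVICES = ["tor-sw-01", "tor-sw-02", "fw-edge-01"]   # hypothetical inventory

def fetch_running_config(device: str) -> str:
    """Placeholder: retrieve the running config via your device access tooling."""
    raise NotImplementedError("wire this up to Netmiko, a vendor API, etc.")

def backup_configs(devices):
    for device in devices:
        config = fetch_running_config(device)
        (CONFIG_REPO / f"{device}.cfg").write_text(config)

    # Stage everything, but only commit when something actually changed,
    # so the history reflects real configuration drift.
    subprocess.run(["git", "add", "-A"], cwd=CONFIG_REPO, check=True)
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=CONFIG_REPO, capture_output=True, text=True, check=True,
    )
    if status.stdout.strip():
        subprocess.run(
            ["git", "commit", "-m", "Scheduled config backup"],
            cwd=CONFIG_REPO, check=True,
        )

if __name__ == "__main__":
    backup_configs(DEVICES)
```

A useful side effect of keeping configs in version control is that the commit history itself becomes a drift detector: an unexpected diff is an early signal that someone has strayed from the documented standards.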
At a physical level, invest time in establishing consistent standards for as many things as possible - cabling colors and cable management, power cord A/B colors, patch panel layouts, rack elevation specs, you name it. The more you can pre-determine and document a standard for, the less engineers have to think on the fly, potentially making decisions that could come back to haunt them in the future.
Lastly, consistency should also apply to the processes that teams use in day-to-day operations and engineering work. For example, establish a pattern for how change activity gets planned, prepared, and executed - work plan, peer review, pre-checks, change communications, execution, post-checks, and follow up. Get the team into a rhythm and don’t let it slide if someone skips a step.
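One lightweight way to keep that rhythm honest is to encode the steps somewhere a tool can check them, rather than relying on memory. The sketch below is purely illustrative; the Change record and its fields are hypothetical and not tied to any particular change-management product.

```python
# Sketch of enforcing a consistent change process: every change carries the
# same required steps, and skipped steps are surfaced rather than ignored.

from dataclasses import dataclass, field

REQUIRED_STEPS = [
    "work_plan",
    "peer_review",
    "pre_checks",
    "change_communications",
    "execution",
    "post_checks",
    "follow_up",
]

@dataclass
class Change:
    summary: str
    completed_steps: list = field(default_factory=list)

    def missing_steps(self):
        """Return required steps that have not been completed, in order."""
        return [s for s in REQUIRED_STEPS if s not in self.completed_steps]

change = Change(
    summary="Upgrade tor-sw-01 to 9.3.12",
    completed_steps=["work_plan", "pre_checks", "execution"],
)
print("Skipped steps:", change.missing_steps())
# -> Skipped steps: ['peer_review', 'change_communications', 'post_checks', 'follow_up']
```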
As you can see, the overarching intent here is to remove as much cognitive load and ad hoc decision making as possible. When things are consistent and engineers are used to that consistency, you’d be amazed at how quickly someone spots something that is out of the ordinary or wrong. Humans are fantastic pattern-matchers, and we notice when things aren’t the way they should be. With consistent technology infrastructure, teams can minimize downtime, reduce the risk of failures, and improve collaboration and communication between teams, since everyone is working with similar systems and processes.
Next up, Currency!