Continuity (C3-P-O series)

As I described in the introduction to this series, I’ve used the acronym C3-P-O as a tool to help the teams I lead remember some key principles for maintaining a healthy infrastructure environment. This post is about the third “C” in the acronym - Continuity.

For infrastructure services in particular, continuity of operation is critical. The saying used to be that infrastructure services had to be like the dial tone - always there when you need it (for those of you who remember what “dial tone” is). To achieve this, infrastructure teams must design and implement services in a way that ensures continuity of the function when faced with failure scenarios. For the purposes of the C3-P-O acronym, there are degrees to continuity. I generally think about continuity capabilities as being in one of three categories: Immediate, Rapid, or Prompted. Each of these continuity approaches has appropriate use cases. Business needs, budget, and operational overhead must be evaluated during the selection process.

Immediate Continuity

Immediate Continuity is the pinnacle of continuity approaches. It’s the most expensive, usually the most complex, and potentially the most operationally intense. This approach is used for infrastructure functions that simply must never go down unexpectedly - power, the network core, SAN connections, critical internet and WAN connections, and foundational services like Active Directory or DNS. Recovery of the function in the event of a component failure must be automatic and occur quickly enough to avoid disruption of dependent services. To achieve Immediate Continuity, the disruption of a failure should be unnoticeable above Layer 4 of the OSI model. In practice, even Immediate Continuity isn’t always immediate in the strictest sense. The recovery lag can range from milliseconds to a few seconds and still be considered immediate. This works because transport and application layers are resilient to transient failures and will retry. As long as the disruption doesn’t exceed retry limits, systems carry on as if nothing has happened and may not even log an error.
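To make the retry point concrete, here’s a minimal sketch (names and timings are illustrative, not from any particular product) of how a client-side retry loop masks a sub-second disruption so the caller never sees an error:

```python
import time

def call_with_retries(op, attempts=3, delay=0.2):
    """Retry a transiently failing operation. As long as the outage is
    shorter than the retry window, the caller never sees the blip."""
    last_exc = None
    for _ in range(attempts):
        try:
            return op()
        except ConnectionError as exc:  # transient, retryable failure
            last_exc = exc
            time.sleep(delay)  # brief back-off, then try again
    raise last_exc  # only surfaces if the outage outlasts the retries

# Simulate a backend that is down for its first two calls (a short
# failover window), then healthy again.
state = {"calls": 0}
def flaky_backend():
    state["calls"] += 1
    if state["calls"] <= 2:
        raise ConnectionError("backend failing over")
    return "ok"

result = call_with_retries(flaky_backend)
```

The caller gets `"ok"` and never learns a failover happened - which is exactly the property that lets a few seconds of recovery lag still count as “immediate.”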

The methods to achieve Immediate Continuity are too numerous to go into in detail for this summarization, but they generally take the form of Active/Active[/Active] designs where two or more components participate in handling the work of the function at all times. When a failure occurs, work transparently moves to the remaining healthy component(s). Immediate Continuity can also be achieved with N+1 Active/Standby designs when the components can detect and recover from failure in timespans short enough to prevent service disruption.
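The Active/Active idea can be sketched in a few lines. This is a toy model, not any vendor’s implementation: all members handle live traffic, and when one fails, its share of the work transparently shifts to the survivors:

```python
import random

class ActiveActivePool:
    """Toy Active/Active pool: every member serves traffic at all times;
    a failed member is simply skipped until no healthy members remain."""
    def __init__(self, backends):
        self.backends = list(backends)

    def dispatch(self, request):
        candidates = list(self.backends)
        random.shuffle(candidates)  # spread work across all members
        for backend in candidates:
            try:
                return backend(request)
            except ConnectionError:
                continue  # this member is down; work moves transparently
        raise RuntimeError("no healthy backends remain")

def healthy(request):
    return f"handled:{request}"

def failed(request):
    raise ConnectionError("member down")

# One of the two members is dead, yet every request still succeeds.
pool = ActiveActivePool([healthy, failed])
responses = [pool.dispatch(i) for i in range(5)]
```

Real implementations add health checks so dead members are removed from rotation rather than probed per-request, but the continuity property is the same: the failure never surfaces to the caller.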

Rapid Continuity

The Rapid Continuity approach makes up the bulk of continuity implementations, largely due to lower complexity and sometimes lower overall cost. Operational overhead for Rapid Continuity can still be high, but there is typically less complexity to deal with than in Immediate Continuity designs. In a Rapid Continuity approach, recovery of the function must still be automatic, but the timeframe for recovery can be slightly longer. The recovery timeframe in some cases may be long enough that dependent systems will see a short disruption of services. Depending on the resilience of the dependent services, some may require a reset or restart to return to normal operation (e.g., a database or NFS connection).

Examples of designs that fall into the Rapid Continuity bucket are Active/Standby implementations of components such as WAN links, load balancers, and databases. Most virtual server hosting services use Rapid Continuity, where a virtual machine residing on a host server will automatically restart on a new server if the host goes down. This can take anywhere from seconds to minutes, depending on the type and size of VM and what functions are running on it.
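The VM-restart behavior can be modeled with a toy supervisor (host names and structure here are illustrative, not any hypervisor’s actual API): when a host fails, its VMs are automatically re-placed on a surviving host, with the boot time showing up as the short disruption described above:

```python
class Hypervisor:
    """Toy rapid-continuity supervisor: a VM on a failed host is
    automatically restarted on a surviving host."""
    def __init__(self, hosts):
        self.hosts = set(hosts)
        self.placement = {}  # vm name -> host name

    def start(self, vm):
        self.placement[vm] = next(iter(self.hosts))

    def host_failed(self, host):
        self.hosts.discard(host)
        for vm, placed_on in list(self.placement.items()):
            if placed_on == host:
                if not self.hosts:
                    raise RuntimeError("no surviving hosts to restart on")
                # Automatic restart elsewhere; dependent services see a
                # short gap while the VM boots on its new host.
                self.placement[vm] = next(iter(self.hosts))

hv = Hypervisor(["host-a", "host-b"])
hv.start("vm-1")
failed_host = hv.placement["vm-1"]
hv.host_failed(failed_host)
survivor = hv.placement["vm-1"]
```

The recovery is automatic - no operator action - but unlike Immediate Continuity, the gap between `host_failed` and the VM being back in service is long enough for dependent systems to notice.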

Prompted Continuity

Last but not least, we have the Prompted Continuity approach. I use this term to describe designs where continuity of service is achievable, but requires some sort of manual intervention to realize. This approach is the simplest, the cheapest, and has the least operational overhead, but has the major downside of a highly variable recovery time. This type of design should be reserved for cases where recoverability is needed/desired, but the cost of Immediate or Rapid Continuity exceeds the cost of the business impact from an extended outage. This could be true for components such as a WAN router in a regional office, a printer gateway, hot-swap hard drives, or network switches. The key is that the recovery must be possible without dependency on outside entities.

The methods to achieve Prompted Continuity generally involve “cold” or “warm” spares (as opposed to “hot” spares) that can be used to replace a failed component. In the event of a failure, the bad component is removed and replaced with the spare component, which may or may not be pre-configured. I don’t consider situations where a service contract exists to provide spares or repair services to be Prompted Continuity, as that approach depends on a third party to ensure that spares are available in a timely manner.

As teams design infrastructure services, I encourage them to explicitly declare the continuity approach of each component in the design. By doing so, we can review the design and pressure-test the methods for how continuity will be achieved. More than once we’ve uncovered significant issues stemming from bad assumptions - it’s far better to find and fix those problems at the design stage than when you have unhappy customers banging on your door because something has failed in an unexpected way.

Which leads us to the next part of the C3-P-O acronym…Predictability!

Copyright

CC BY-NC-ND 4.0