Note: This is part one of a two-part series discussing how we think about Disaster Recovery in the technology world.
In conversations with other technologists, I still frequently hear people talk about “Disaster Recovery” (DR) even when their applications are hosted in the public cloud. Coincidentally, the last few months have seen some high-profile issues and outages at more than one CSP, and there has been a flurry of blog posts, news stories, and breathless tweets about “reconsidering” using provider A or B. I think that many enterprise technologists are missing the point. We really need to shift the narrative around DR, especially for applications built and operated in public clouds. Too often, the “disaster” in DR is framed around events that are too abstract and too unlikely. How often do we need to react to a real disaster? I humbly propose that we think instead about Disruption Recovery, which I hope strikes a more urgent tone.
In the days before public cloud, when hardware was very expensive, delivery times were long, and capacity-planning horizons stretched years into the future, the concept of Disaster Recovery was apt. Frameworks that clearly defined RTOs, RPOs, and recovery priorities, alongside the practice of building DR sites or consuming DR as a Service (remember Sungard?), made sense. In an emergency you simply couldn’t quickly move your application from A to B with minimal or no data loss unless you had spent piles of money preparing for it. And even when you did build the needed capabilities into your systems, if you didn’t practice executing the failover plan on a regular basis, it was just a bunch of words on paper. With few exceptions, this overall approach should be relegated to the trash heap.
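To make that framework concrete, here is a minimal sketch of the basic RTO/RPO arithmetic those plans enforced. It is my own illustration with made-up figures, not any particular methodology: a strategy only counts if its worst-case data lag fits inside the RPO and its recovery time fits inside the RTO, which is exactly why meeting both was so expensive.

```python
# Minimal sketch of classic DR arithmetic. All figures are hypothetical.
# RPO: maximum tolerable data loss; RTO: maximum tolerable downtime (minutes).
rpo_minutes = 60
rto_minutes = 240

strategies = {
    # name: (worst-case data lag, time to restore service), in minutes
    "Nightly tape backup, restore at DR site": (1440, 2880),
    "Async replication to warm standby site": (5, 60),
}

for name, (data_lag, recovery_time) in strategies.items():
    meets = data_lag <= rpo_minutes and recovery_time <= rto_minutes
    print(f"{name}: meets RPO/RTO = {meets}")
```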
In Risk Management parlance, traditional DR planning and architecture is mitigation against High Impact/Low Likelihood events. If we begin to think instead in terms of Disruption Recovery, the likelihood component of the equation implicitly increases, driving us toward different system design and implementation approaches. I think this is a good thing, especially since the last year alone has given us ample evidence that disruptions happen more frequently than we’d like.
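As a rough illustration of that shift (my own sketch with made-up numbers, not a formal risk methodology), scoring events as impact × likelihood shows how raising the assumed likelihood pushes disruption-class events to the top of the priority list:

```python
# Hypothetical impact x likelihood scoring; event names and values are illustrative only.
events = [
    # (event, impact 1-5, likelihood 1-5)
    ("Entire region destroyed by disaster", 5, 1),     # classic "disaster" framing
    ("Single-AZ or single-service CSP outage", 4, 3),  # "disruption" framing
    ("Third-party dependency brownout", 3, 4),
]

for event, impact, likelihood in sorted(events, key=lambda e: e[1] * e[2], reverse=True):
    print(f"priority={impact * likelihood:2d}  {event}")
```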
Before taking this line of thought further, I need to mention the importance of resiliency in system architectures. Quite a few services suffered complete outages during the recent AWS and Azure disruptions because they exist in only one Region or Availability Zone (AZ). For cloud-deployed services this is malpractice and should be plainly called out as such. In short, these services were not sufficiently resilient to failure. There’s simply no acceptable reason for a critical cloud-deployed service to live in a single AZ or Region. Multi-site active architectures take more work to design and implement than primary/standby designs, but if there is a reason your critical system can’t run active multi-site, it should at least have a secondary site built and ready to take load (or ready to scale and take load). My concept of Disruption Recovery actually helps with this problem.
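To show what “ready to take load” can look like at its very simplest, here is a deliberately simplified Python sketch of client-side failover from a primary Region to a secondary. The endpoint URLs and timeout are hypothetical, and a real multi-site design would rely on health-checked DNS or a global load balancer rather than a retry loop in the client; the point is the mindset, namely that the secondary Region is something you exercise routinely rather than a disaster-only artifact.

```python
import urllib.error
import urllib.request

# Hypothetical per-Region endpoints for the same service.
REGION_ENDPOINTS = [
    "https://service.us-east-1.example.com/health",  # primary
    "https://service.us-west-2.example.com/health",  # secondary, ready to take load
]

def fetch_with_failover(endpoints, timeout=2):
    """Return the first successful response body, trying Regions in order."""
    last_error = None
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return resp.read()
        except (urllib.error.URLError, OSError) as exc:
            last_error = exc  # this Region is disrupted; try the next one
    raise RuntimeError(f"All Regions failed; last error: {last_error}")

if __name__ == "__main__":
    print(fetch_with_failover(REGION_ENDPOINTS))
```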
In part two of this series I’ll expand on Disruption Recovery, and on how changing the assumptions of traditional DR impacts systems design and operations.