I’ve been involved with infrastructure - building it, running it, or leading teams that do - for quite a long time. Much has changed over the years to be sure, but one thing has remained constant: the customers of our infrastructure want us to do everything in our power to make it never fail. This is an impossible task, obviously. We can get pretty close in a private, on-premises environment, with tools at our disposal such as active/standby or active/active(/active/…) networking devices, highly redundant storage arrays, and virtualization capabilities that mask all but the worst underlying component failures. But even with all of that, things still break, and unfortunately people still sometimes make mistakes. It can be tough to explain to a hosting customer why some random component failed in a way that neither the hardware nor the software design anticipated, which then led to an outage. “But wasn’t it redundant?” they ask. “Why didn’t your monitoring detect it?” they inquire. At this stage of my career I’m convinced that expecting to somehow prevent these kinds of questions, or to build infrastructure in a way that keeps them from ever being asked, is a fool’s errand. There is a better way, and it’s one that the cloud providers already embrace: build PRO infrastructure, PRO being an acronym for Programmable, Reliable, and Observable.
Cloud providers got it right early on by telling their customers to expect failures. In the words of Werner Vogels, “everything fails, all the time.” It was a brilliant act of establishing low expectations. If you tell your customers that of course stuff is going to break, and you actually guarantee that it will, they get a lot less mad when it does break. The problem we’ve always had with private infrastructure is that we’ve historically let our customers dictate our success criteria. I’ve actually had leaders of customer teams tell me with a straight face that they expect 100% infrastructure uptime, even accounting for required maintenance. Impossible. But with the traditional approach of project teams footing the bill for the infrastructure build, they felt they had the right to demand it (though they never really want to pay what it takes to make it happen). Amazon flipped that on its head and told customers up front that shit will break. They essentially canonized the declaration that an application’s availability is directly proportional to the effort its dev team puts into making it robust and tolerant of infrastructure failures. Hallelujah.
But back to PRO. Establishing the notion that infrastructure will always break is great. I love it, and it’s a good expectation to set with customers. But that isn’t the end of the story. If you think you can be successful as an infrastructure leader by just telling your customers “stuff will break, it’s your job to deal with it”, well, sorry, that’s not happening. Setting the expectation is only half of the equation, if that. The other part is following through with PRO infrastructure, which I describe below. As a point of clarity, I’m not sharing this to advocate that any company should keep their private data centers and avoid using cloud providers. But if you do have private data centers and your hosting strategy includes maintaining that capacity for any reasonably long time (3+ years), then you need to get on board with this or your dev teams will make your career short.
What does it take to build PRO infrastructure?
Programmable

Programmability of your infrastructure is the first item in the acronym not just because I didn’t want it to be “ORP” or “ROP” (or worse, “RPO”), but because the other two aspects are meaningless if your infrastructure isn’t programmable, at least to a basic degree. By programmable, I mean that anything and everything your customers want or need to do with your infrastructure needs to be accessible programmatically. That could be via direct APIs, SDKs, Terraform providers, or even triggered automation through CI/CD-style pipelines. The key success criterion here is the customer’s ability to change the state of their infrastructure without human interaction. If your customers can’t do this at all, you’re failing. Really. And they hate you for it, even if they don’t say so. This capability isn’t easy. In fact, it’s really hard, and I don’t know of any enterprises outside of cloud providers or full-on tech companies that have pulled it off in totality (maybe Goldman, but they’re a bit of a special case). Regardless, you need to try. For servers, you need to cover at least the basics: create, destroy, reboot. For load balancing VIPs, you’re doing OK if you can expose add/remove reals and control VIP/real up/down state. If you can cover those two areas you’re way ahead of the game compared to most. For sure there is WAY more to expose to your customers for programmability, but start small with these, and always work on adding more. Remember that cloud providers have been doing all of this and more for over 15 years. They didn’t have a choice.
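To make that concrete, here’s a minimal sketch of what covering “the basics” might look like as a thin Python client over a hypothetical internal REST API. Every endpoint, path, and payload field below is an assumption for the sake of illustration; the point is the shape of the surface, not the specifics.

```python
# Sketch of a minimal programmable-infrastructure client.
# All endpoints and payloads are hypothetical illustrations.
import requests

class InfraClient:
    def __init__(self, base_url: str, token: str):
        self.base_url = base_url.rstrip("/")
        self.session = requests.Session()
        self.session.headers["Authorization"] = f"Bearer {token}"

    def _call(self, method: str, path: str, **kwargs):
        resp = self.session.request(method, f"{self.base_url}{path}", **kwargs)
        resp.raise_for_status()  # fail loudly and explicitly
        return resp.json()

    # --- Servers: the bare minimum is create, destroy, reboot ---
    def create_server(self, name: str, flavor: str, image: str):
        return self._call("POST", "/v1/servers",
                          json={"name": name, "flavor": flavor, "image": image})

    def destroy_server(self, server_id: str):
        return self._call("DELETE", f"/v1/servers/{server_id}")

    def reboot_server(self, server_id: str):
        return self._call("POST", f"/v1/servers/{server_id}/reboot")

    # --- Load balancing: add/remove reals, control up/down state ---
    def add_real(self, vip_id: str, real_addr: str, port: int):
        return self._call("POST", f"/v1/vips/{vip_id}/reals",
                          json={"address": real_addr, "port": port})

    def remove_real(self, vip_id: str, real_id: str):
        return self._call("DELETE", f"/v1/vips/{vip_id}/reals/{real_id}")

    def set_real_state(self, vip_id: str, real_id: str, up: bool):
        return self._call("PATCH", f"/v1/vips/{vip_id}/reals/{real_id}",
                          json={"state": "up" if up else "down"})
```

A customer script, a Terraform provider, or a CI/CD pipeline step can all sit on top of an interface like this, which is the real test: the customer changes the state of their infrastructure without ever filing a ticket.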
Reliable

Reliability in this context is intentionally a very broad concept. It isn’t intended to indicate high availability or immunity to outages. The core criterion here is that the infrastructure does what it’s supposed to do, every time, in the same way, with the same result, given the same inputs. This is similar to the notions of consistency and repeatability, but you can have consistency and repeatability and still be wrong. Reliability in this case means delivering the result that the customer expects, every time, and if you can’t, being explicit about the failure. Nothing will cause frustration faster than getting different results from executing the exact same action. Ambiguous or inconsistent results absolutely kill customer confidence, and once you’ve lost it, it’s really hard to win back. This concept covers timeliness as well as correctness. If something normally takes 30 seconds to happen, but every now and then it takes far longer, you need to treat that edge case as a defect and put it in your backlog to fix. Establish reasonable standard deviations of execution time and manage within those bounds.
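As a sketch of what “manage within those bounds” could look like in practice, the snippet below tracks the duration of an operation and flags anything that drifts beyond an established deviation threshold. The three-sigma cutoff, the warm-up sample count, and the window size are all assumptions to illustrate the idea, not a prescription.

```python
# Sketch: flag operations whose duration drifts outside established bounds.
# The 3-sigma threshold, 30-sample baseline, and window size are
# illustrative assumptions.
import statistics

class DurationTracker:
    def __init__(self, max_samples: int = 500, sigmas: float = 3.0):
        self.samples: list[float] = []
        self.max_samples = max_samples
        self.sigmas = sigmas

    def record(self, seconds: float) -> bool:
        """Record a duration; return True if it's an outlier to investigate."""
        outlier = False
        if len(self.samples) >= 30:  # need a baseline before judging
            mean = statistics.mean(self.samples)
            stdev = statistics.stdev(self.samples)
            outlier = abs(seconds - mean) > self.sigmas * stdev
        self.samples.append(seconds)
        if len(self.samples) > self.max_samples:
            self.samples.pop(0)  # keep a rolling window of recent runs
        return outlier

# Usage: wrap each provisioning call and treat outliers as defects.
tracker = DurationTracker()
if tracker.record(42.7):
    print("Provisioning time outside normal bounds; file it in the backlog.")
```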
Observable

A fantastic way to build trust with your customers is to make everything visible. Don’t hide anything if you can avoid it (security considerations still apply). I’ve often heard an engineer say “I can’t expose that telemetry data to a customer, they’ll read it wrong!” The complexity of metrics and activity data is regularly used as an excuse for keeping it hidden. I get it - if you show an RTT graph to someone who isn’t a network engineer, the chances are pretty high that they’ll see some spikiness and assume something is wrong. Trust me, it’s worth the risk. If you’re hiding everything about the inner workings of your infrastructure from your customers, and they have to constantly engage you to ask “is the network OK?” or “are you seeing any storage latency right now?”, you’re doing them a disservice. Expose the data, make the logs visible, educate your customers about what the dashboards mean (better yet, put the description right in the dashboard!), and empower your customers to investigate on their own. It will take time, but they’ll learn how to read the data and they’ll start to gain confidence in what you’ve built. This has the added benefit of creating a norm of transparency across the broader organization. Before long, other leaders will be asking why they can’t see the app team’s telemetry data too!
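As one possible pattern for this, the sketch below exposes a latency metric using the Python prometheus_client library, with the customer-facing explanation attached directly to the metric so it follows the data into any dashboard. The metric name, bucket boundaries, and the probe loop are hypothetical stand-ins.

```python
# Sketch: expose infrastructure telemetry with the explanation built in.
# Metric name and buckets are hypothetical; prometheus_client is the
# official Prometheus Python library.
from prometheus_client import Histogram, start_http_server
import random, time

# The help string travels with the metric, so any dashboard that charts
# it can show customers what "normal" looks like.
RTT = Histogram(
    "network_rtt_seconds",
    "Round-trip time between data center pods. Occasional spikes are "
    "expected and are not, by themselves, a sign of a problem.",
    buckets=(0.001, 0.005, 0.01, 0.05, 0.1, 0.5),
)

if __name__ == "__main__":
    start_http_server(9100)  # scrape endpoint served at :9100/metrics
    while True:
        RTT.observe(random.uniform(0.001, 0.02))  # stand-in for real probes
        time.sleep(1)
```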
If you choose to build your infrastructure around the PRO characteristics, your customers will appreciate it. It will have the added benefit of being a good training ground for app teams that aren’t cloud-savvy yet, and it will also help orient your infrastructure teams toward cloud-relevant technology. It will take time, training, money, and new skills, but it will be worth it, I promise.