I Unplugged What?

The lessons here are broader than just a simple "Don't do that."

Dear KV,

I'm sure by now you've read about the latest large systems failure, and I wondered if you'd share your thoughts on how such a large company—famous for having so many smart people at its disposal—can fail so miserably at infrastructure. I'm probably lobbing a softball, but how is it possible that these large and pervasive failures happen?

Making Popcorn While Waiting for Your Reply

Dear Popcorn,

Some people would say that it's unfair to pick on any company after such a spectacular failure, and that it's not nice to kick people, or companies, when they're down. KV, of course, is not one of those people.

Like the rest of the world, I watched in amusement as one of the wealthiest companies on earth seemingly shot itself in the foot with a configuration error. Some watched in horror, and yes, KV watched in amusement. It goes without saying that I know only what I've read in the news and on various "feeds," but some of the failure seems fairly well documented externally.

The lessons here are broader than just a simple "Don't do that," and there are numerous examples of companies doing similar things to themselves before this most recent incident. The real root cause that made this all so catastrophic isn't just the pushing of a bad configuration; the actual cause is something that concerns nearly all modern computing infrastructure, and it has to do with cake.

Modern computing is no longer done on just one computer, or even one small set of computers, but is carried out on thousands of machines spread across the globe. The way this infrastructure has been built up, both logically and physically, often resembles a layer cake, but one where the icing isn't sweet. In the best case the icing is merely bitter; more often, the icing between the layers has gone rancid. Furthermore, each layer has been baked by a different chef and then slapped on top of whatever rancid icing is currently at the top of the cake. Oh, and the chefs don't communicate because that would violate cake layering or something.

The pièce de résistance in this most recent catastrophe was the notably rookie error of apparently hooking everything to the same network, such that a single failure brought down not only the externally visible site but also its internal tools, and even locked personnel out of their conference rooms and data centers. Supposedly, the only way to start resetting the systems was to use an angle grinder to get access to equipment in a locked rack.

Two intertwined issues made the failure far worse than it had to be. The first is the untracked coupling of systems without sufficient forethought to what happens when one of the icing layers is rancid. The other is putting all the layers on one cake.
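For those who prefer code to cake, here is a minimal sketch of what tracking that coupling might look like. The system names and the dependency map are entirely hypothetical, invented for illustration and not taken from any real incident report; the point is only that once dependencies are written down, asking "what else dies when this dies?" becomes a simple graph walk.

```python
from collections import defaultdict, deque

# Hypothetical dependency map: each system lists what it needs in order to work.
# None of these names come from a real deployment.
DEPENDS_ON = {
    "public_site": ["backbone_network"],
    "internal_tools": ["backbone_network"],
    "door_badges": ["internal_tools"],
    "remote_console": ["backbone_network"],
    "backbone_network": [],
}


def blast_radius(depends_on, failed):
    """Return every system that stops working, directly or transitively,
    when `failed` goes down."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = defaultdict(set)
    for system, needs in depends_on.items():
        for need in needs:
            dependents[need].add(system)

    # Breadth-first walk outward from the failed system.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for system in dependents[current]:
            if system not in affected:
                affected.add(system)
                queue.append(system)
    return affected


if __name__ == "__main__":
    print(sorted(blast_radius(DEPENDS_ON, "backbone_network")))
```

In this toy map, losing the backbone takes out everything else, including the door badges, which depend on it only indirectly. That indirect edge is exactly the sort of coupling nobody remembers until the angle grinder comes out.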

Putting all the layers on one cake is simply foolish, and honestly, probably the more shocking revelation. KV cannot think of anyone who would knowingly put the control network for any sort of physical infrastructure on the same network as the one that serves pictures of cats. All distributed systems need to be constructed while bearing in mind the concept of separation of concerns, which may result in making several cakes instead of one.
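A rough sketch of how one might audit for the one-cake problem follows. The inventory, the role labels ("serving" and "control"), and the network names are all assumptions made up for this example; a real inventory would be far larger and messier, but the check itself can be this simple.

```python
# Hypothetical inventory: the role of each system and the network segments
# it is attached to. Names and roles are illustrative only.
SYSTEMS = {
    "cat_pictures": {"role": "serving", "networks": {"prod"}},
    "door_badges": {"role": "control", "networks": {"prod"}},
    "rack_console": {"role": "control", "networks": {"oob"}},
}


def shared_fate(systems):
    """Return sorted (control_system, network) pairs where a control system
    sits on a network that is also used by a serving system."""
    # Collect every network that carries serving traffic.
    serving_networks = set()
    for info in systems.values():
        if info["role"] == "serving":
            serving_networks |= info["networks"]

    # Flag each control system that shares one of those networks.
    return sorted(
        (name, net)
        for name, info in systems.items()
        if info["role"] == "control"
        for net in info["networks"] & serving_networks
    )


if __name__ == "__main__":
    for name, net in shared_fate(SYSTEMS):
        print(f"{name} shares network '{net}' with serving traffic")
```

Here the door badges are flagged because they share a network with the cat pictures, while the rack console, sitting on its own out-of-band segment, is not. An empty result is what separate cakes look like.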

The other failure, not tracking how the cake is layered, is, alas, all too common. Modern systems seem to be less designed and more an accretion of systems and functionality over time. Given how often people in technology change jobs, retaining clear institutional knowledge of how the cake was made is a nontrivial exercise. Documentation, like code, rots if it's not maintained, and this is the biggest risk in building up a large system. Quarterly system reviews that bring together people from multiple disciplines within a company (including DevOps, NetOps, BizOps, SecOps, FooOps, or whatever the group name du jour happens to be) are probably one of the best ways to make sure the icing has not gone off and that the chefs all know where their layers are supposed to go.

These types of failures are always failures of communication: first at the human layer and then, eventually, at the technological layer.

Kode Vicious, known to mere mortals as George V. Neville-Neil, works on networking and operating-system code for fun and profit. He also teaches courses on various subjects related to programming. His areas of interest are code spelunking, operating systems, and rewriting your bad code (OK, maybe not that last one). He earned his bachelor's degree in computer science at Northeastern University in Boston, Massachusetts, and is a member of ACM, the Usenix Association, and IEEE. Neville-Neil is the co-author with Marshall Kirk McKusick and Robert N. M. Watson of The Design and Implementation of the FreeBSD Operating System (second edition). He is an avid bicyclist and traveler who currently lives in New York City.

Kode Vicious
Too Big to Fail
Visibility leads to debuggability.
https://queue.acm.org/detail.cfm?id=2693195

Automating Software Failure Reporting
We can only fix those bugs we know about.
Brendan Murphy
https://queue.acm.org/detail.cfm?id=1036498

Resilience Engineering: Learning to Embrace Failure
A discussion with Jesse Robbins, Kripa Krishnan, John Allspaw, and Tom Limoncelli
GameDay Exercises Case Study
https://queue.acm.org/detail.cfm?id=2371297

Originally published in Queue vol. 19, no. 5