The Kollected Kode Vicious

Kode Vicious - @kode_vicious


Kode Vicious

I Unplugged What?

The lessons here are broader than just a simple "Don't do that."

Dear KV,

I'm sure by now you've read about the latest large systems failure, and I wondered if you'd share your thoughts on how such a large company—famous for having so many smart people at its disposal—can fail so miserably at infrastructure. I'm probably lobbing a softball, but how is it possible that these large and pervasive failures happen?

Making Popcorn While Waiting for Your Reply

 

Dear Popcorn,

Some people would say that it's unfair to pick on any company after such a spectacular failure, and that it's not nice to kick people, or companies, when they're down. KV, of course, is not one of those people.

Like the rest of the world, I watched in amusement as one of the wealthiest companies on earth seemingly shot itself in the foot with a configuration error. Some watched in horror, and yes, KV watched in amusement. It goes without saying that I know only what I've read in the news and on various "feeds," but some of the failure seems fairly well externally documented.

The lessons here are broader than just a simple "Don't do that," and there are numerous examples of companies doing similar things to themselves before this most recent incident. The real root cause that made this all so catastrophic isn't just the pushing of a bad configuration; the actual cause is something that concerns nearly all modern computing infrastructure, and it has to do with cake.

Modern computing is no longer done on just one computer, or even one small set of computers, but is carried out on thousands of machines spread across the globe. The way this infrastructure has been built up, both logically and physically, often resembles a layer cake, but one where the icing isn't sweet. In fact, in the best case, not only is the icing bitter, but often the icing between the layers becomes rancid. Furthermore, each layer has been baked by a different chef and then slapped on top of whatever rancid icing is currently at the top of the cake. Oh, and the chefs don't communicate because that would violate cake layering or something.

The pièce de résistance in this most recent catastrophe was the notably rookie error of seemingly hooking everything to the same network such that a single failure brought down not only the externally visible site, but also its internal tools, and even locked personnel out of their conference rooms and data centers. Supposedly, the only way to start resetting the systems was to use an angle grinder to get access to equipment in a locked rack.

Two intertwined issues made the failure far worse than it had to be. The first is the untracked coupling of systems without sufficient forethought to what happens when one of the icing layers is rancid. The other is putting all the layers on one cake.

Putting all the layers on one cake is simply foolish, and honestly, probably the more shocking revelation. KV cannot think of anyone who would knowingly put the control network for any sort of physical infrastructure on the same network as the one that serves pictures of cats. All distributed systems need to be constructed while bearing in mind the concept of separation of concerns, which may result in making several cakes instead of one.

The other failure (not tracking how the cake is layered) is, alas, all too common. Modern systems seem to be less designed and more like an accretion of systems and functionality over time. Given how often people in technology change jobs, retaining clear institutional knowledge of how the cake was made is a nontrivial exercise. Documentation, like code, rots if it's not maintained, and this is the biggest risk in building up a large system. Quarterly system reviews that bring together people from multiple disciplines within a company (including DevOps, NetOps, BizOps, SecOps, FooOps, or whatever the vogue du jour group name is) are probably one of the best ways to make sure the icing has not gone off and that the chefs all know where their layers are supposed to go.
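One way to keep the layering from living only in people's heads is to write it down as data and ask it questions. What follows is a minimal sketch, not a description of how any particular company tracks its systems: the system names, the `DEPENDS_ON` map, and both functions are hypothetical and invented for illustration. It walks a dependency map to find everything a single failure takes down, and then checks whether the path you would use to fix the failure is itself among the casualties.

```python
from collections import deque

# Hypothetical dependency map: each system lists what it needs in order to work.
# The names are illustrative only and are not taken from any incident report.
DEPENDS_ON = {
    "backbone_network": [],
    "public_site": ["backbone_network"],
    "internal_tools": ["backbone_network"],
    "badge_readers": ["backbone_network"],
    "datacenter_access": ["badge_readers"],
    "network_recovery": ["internal_tools", "datacenter_access"],
}


def blast_radius(failed, depends_on):
    """Return every system that stops working, directly or transitively,
    when `failed` goes down."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    # Breadth-first walk outward from the failed system.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child != failed and child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


def recovery_is_trapped(failed, recovery_path, depends_on):
    """True if the thing needed to repair `failed` is taken out by the failure."""
    return recovery_path in blast_radius(failed, depends_on)


if __name__ == "__main__":
    print(sorted(blast_radius("backbone_network", DEPENDS_ON)))
    print(recovery_is_trapped("backbone_network", "network_recovery", DEPENDS_ON))
```

In this made-up map, losing `backbone_network` takes out every other system, including `network_recovery`, which is the one-cake problem in miniature. A map like this is only as good as its upkeep, so it is best treated as an input to the quarterly review rather than a replacement for it.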

These types of failures are always failures of communication: first at the human layer and then, eventually, at the technological layer.

KV

 

Kode Vicious, known to mere mortals as George V. Neville-Neil, works on networking and operating-system code for fun and profit. He also teaches courses on various subjects related to programming. His areas of interest are code spelunking, operating systems, and rewriting your bad code (OK, maybe not that last one). He earned his bachelor's degree in computer science at Northeastern University in Boston, Massachusetts, and is a member of ACM, the Usenix Association, and IEEE. Neville-Neil is the co-author with Marshall Kirk McKusick and Robert N. M. Watson of The Design and Implementation of the FreeBSD Operating System (second edition). He is an avid bicyclist and traveler who currently lives in New York City.

 

Related articles

Kode Vicious
Too Big to Fail
Visibility leads to debuggability.
https://queue.acm.org/detail.cfm?id=2693195

Automating Software Failure Reporting
We can only fix those bugs we know about.
Brendan Murphy
https://queue.acm.org/detail.cfm?id=1036498

Resilience Engineering: Learning to Embrace Failure
A discussion with Jesse Robbins, Kripa Krishnan, John Allspaw, and Tom Limoncelli
GameDay Exercises Case Study
https://queue.acm.org/detail.cfm?id=2371297

 

Copyright © 2021 held by owner/author. Publication rights licensed to ACM.

acmqueue

Originally published in Queue vol. 19, no. 5







© 2021 ACM, Inc. All Rights Reserved.