The Kollected Kode Vicious

Kode Vicious - @kode_vicious

Building on Shaky Ground

We owe it to the world to make systems work safely and reliably.

Dear KV, By now, I'm sure you've seen the CrowdStrike news, and I can't help but imagine you have opinions about what really went wrong to cause so many problems. The news, both tech and nontech, is full of explanations of what went wrong: poor testing, relying too much on one company for critical infrastructure, allowing a critical module failure to prevent further remote updates... the list goes on. With all that's already been said about this topic, it seems a lot of it is just finger-pointing, and I wonder if anyone has gotten to the heart of the matter, or if they never will and we'll just have to live with these sorts of outages—like mini-Y2Ks, only worse.

Missing the Heart

Dear Heartless, Like anyone else who was not living under a rock, I did see the CrowdStrike issue hit the news, and I'm glad my flight made it out before the systems shut down because it was a lot easier to point and laugh after the flight than it was for others who were stuck on the ground. The flight interruptions were only the most visible component of the failure, because huge queues at airports make for great news coverage. But CrowdStrike didn't just screw up air travel for days; it also affected hospitals and doctors' offices, banks and ATMs, as well as many other systems that people use daily.

For people who work in computer security, it was actually the day we had all been waiting for: a clear example of "I told you so!" that's explainable to nontechnical as well as technical folk. You don't need to understand NULL pointer exceptions to get that when your computer doesn't work, the world doesn't either. And this is probably the best part (if there is a best part) of this disaster: As far as we know, no one died from this, which is good, and everyone now knows that the world we've built rests upon pretty shaky ground. It's the wake-up call a lot of us have been waiting for. The question now is: Will we answer the phone?

A lot of ink has been spilled about how all this came about, from the low-level explanation of the NULL pointer exception, to the way that testing missed the issue, to the suggestion that, maybe, a remote software push should never be able to leave a system unable to boot without manual intervention.

The questions to ask now are not "How do we better lock things down in the current state?" or "How do we adopt better development practices so we don't push a NULL pointer bug into the lowest level of OS code?" These are good questions, but they aren't the heart of the matter.

That has to do with how systems are built—systems software in particular—with unsafe languages on unsafe hardware that's connected to a network on which nobody can be trusted.

Let's start at the hardware layer and work our way up. Current computer hardware is a wildly complex beast, composed of a diverse set of interconnected elements that, overall, trust each other.

What do I mean by trust? Take the issue of computer memory, from whence our NULL pointer errors arrive. The majority of in-memory software vulnerabilities come from the fact that anyone can do pointer arithmetic. Take a memory address, add or subtract a number, and—voila!—you have another possibly valid memory address.
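
To make that concrete, here is a minimal C sketch of the trick; the offset is invented, and whether the final read would return garbage, some neighboring variable, or a crash depends entirely on what happens to live at the computed address.

```c
#include <stdio.h>

int main(void)
{
    int x = 42;
    int *p = &x;

    /* Pointer arithmetic is just integer arithmetic on an address:
     * take a valid pointer, add a number, and you have another
     * address the hardware will happily try to use. */
    int *q = p + 1000;

    printf("p = %p, q = %p\n", (void *)p, (void *)q);

    /* Dereferencing q is undefined behavior in C, but nothing on a
     * conventional machine stops the attempt; uncomment this line
     * and you may read stack garbage, another variable, or crash. */
    /* printf("*q = %d\n", *q); */

    return 0;
}
```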

It is game-playing of this type that is at the heart of most computer viruses and has inspired many types of security protections that have been attempted over the years, such as ASLR (address space layout randomization), no-execute bits in hardware, W^X (write xor execute) permissions, and many others. These protections have had a checkered history, sometimes working for short periods only to be overcome in the course of the arms race that is computer security. The heart of the matter for hardware is that we continue to pretend we're working with a minicomputer from the 1970s and that transistors are at a premium, which they are not.
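
W^X, at least, is easy to see from user space. Here's a small sketch, assuming a POSIX system, that asks the kernel for a page that is both writable and executable; a kernel enforcing strict W^X (OpenBSD, for example) can refuse the request outright, while a more permissive one hands over exactly the kind of page that makes injected code so easy to run.

```c
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);

    /* Ask for an anonymous page that is writable AND executable.
     * Strict W^X forbids the combination; permissive kernels allow it. */
    void *page = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (page == MAP_FAILED) {
        perror("mmap(W|X)");    /* strict W^X: request refused */
        return 1;
    }

    printf("got a writable, executable page at %p\n", page);
    munmap(page, len);
    return 0;
}
```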

There are solutions to the pointer arithmetic problem, but they require dedicating more resources to building a safer computer architecture. I'm referring here to capability machines, such as the one built at the University of Cambridge in the 1970s. In a capability machine there are no bare addresses, only capabilities, and their integrity is protected by the hardware so they cannot be forged. You can't simply add an integer to a capability to get a different, valid one, because the hardware doesn't allow that kind of math to produce a usable pointer. But capabilities require more bits, doubling the size of a native pointer in some cases, which has an effect on memory, the TLB (translation lookaside buffer), and other parts of the computer. In the 1970s, this cost was prohibitive, but now it should not be, certainly not if the benefit is a more reliable and secure system.

Prototypes of such systems have been developed and are part of active research. It's time now to get these into production, especially around critical infrastructure. But then, what isn't critical infrastructure these days? Who knew that you could destroy the world's check-in kiosks with one bad push? When capabilities were first proposed, the world was not run on computers.

Today, it's a different story, and it's time to change our calculus. There surely are also other ways, even beyond capability machines, to use the embarrassment of riches that Moore's law has bestowed upon us to upset the balance in favor of security. It's high time we discovered what those ways are.

One option with an emphasis on security is CHERI (Capability Hardware Enhanced RISC Instructions). For an overview, see "An Introduction to CHERI," by Robert N. M. Watson, Simon W. Moore, Peter Sewell, and Peter G. Neumann (https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-941.pdf).
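
To give a flavor of what that buys you, here is a hedged sketch of how the earlier pointer trick plays out on such a machine. This assumes something like a CHERI pure-capability C environment (CheriBSD, say), where every pointer carries bounds and a validity tag; it is meant as an illustration of the behavior, not as exact CHERI usage.

```c
#include <stdio.h>

int main(void)
{
    int buf[4] = {0, 1, 2, 3};
    int *p = buf;

    /* The arithmetic itself still compiles and runs... */
    int *q = p + 1000;

    /* ...but on a capability machine q inherits buf's bounds, so the
     * read below does not quietly fetch whatever lives at that
     * address: the hardware faults, because q is outside the bounds
     * of the capability it was derived from. On conventional
     * hardware, the same line is just another undefined-behavior
     * read that may or may not get noticed. */
    printf("%d\n", *q);

    return 0;
}
```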

Moving up to software, we confront two main problems. There is now a ton of low-level software, built in C, that's based on a very old understanding of the hardware—models of computing that were appropriate in the 1970s. C is an unsafe language, basically assembler with for loops, but it has the advantage of producing efficient code for modern processors. That fact, along with its long history of use, first in Unix and then in other systems software, has meant that the lowest levels of nearly every computing system—from Windows, to Linux and Android, to macOS, to the BSDs, as well as nearly every realtime or embedded operating system—are written in C. When a packet arrives at a computer from the Internet, it is almost always processed by code written in C.

And this, coupled with the aforementioned hardware problems, poses a huge security challenge. When C was written, there was no Internet and all the nodes of the ARPANET (for those who even remember that name) could be written down on a dinner napkin. It simply isn't appropriate to write code that will be connected to the Internet in an unsafe language like C. We've tried to make this work, and we can see the results.
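
For a taste of what that code looks like, here is a hypothetical sketch of a packet parser at the bottom of the stack; the wire format and names are invented, but the pattern, trusting a length field controlled by whoever is on the other end of the wire, is the classic mistake, and the bounds check is the one a C programmer has to remember to write by hand, every single time.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Parse an untrusted "hello" message: one length byte, then a name.
 * (The format is invented for illustration.) */
static int parse_hello(const uint8_t *pkt, size_t pkt_len,
                       char *out, size_t out_len)
{
    if (pkt_len < 1)
        return -1;

    uint8_t name_len = pkt[0];

    /* The unsafe version simply trusts the attacker-supplied length:
     *
     *     memcpy(out, pkt + 1, name_len);
     *
     * Nothing in the language or the hardware objects if name_len
     * runs past the end of pkt or of out. The check below is the one
     * that must be written, by hand, for every field of every message. */
    if ((size_t)name_len > pkt_len - 1 || (size_t)name_len >= out_len)
        return -1;

    memcpy(out, pkt + 1, name_len);
    out[name_len] = '\0';
    return 0;
}

int main(void)
{
    const uint8_t wire[] = { 5, 'h', 'e', 'l', 'l', 'o' };
    char name[16];

    if (parse_hello(wire, sizeof(wire), name, sizeof(name)) == 0)
        printf("parsed name: %s\n", name);
    return 0;
}
```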

The second problem is the systems software itself. Unix was considered a great win over Multics because Unix was simpler, introducing only two domains: the kernel and user space. User-space programs are protected from each other by virtual memory. But the kernel—any kernel—is a huge blob of shared state with millions of lines of code in it. This is true for any operating system now in day-to-day use—Windows, Unix, or whatever tiny embedded OS is running on your WiFi-connected light switches.

What makes up this huge blob? Device drivers! Device drivers of varying—and some might say questionable—quality, any one of which can poke and prod any part of the system for which it can manufacture a valid memory address. Once something breaches the operating system's kernel boundary, the game is over, because the operating system is "shared everything."

A modern approach to systems software suggests that we not only write all new systems in type-safe languages, such as Rust, but also rewrite what we already have in the same way. But this isn't economically practical. Imagine how much it would cost to rewrite any major OS in a new language, test it, deploy it, etc. I mean, KV would take 10 percent of that cost gladly to run the project, but the drinks bill would be stunning.

A multipronged approach is the only way out of the current morass, one in which we leverage type-safe languages such as Rust when possible and decide which hardware is actually critical and must be replaced. (Do you wanna bet some dam-control software somewhere runs Windows? I'll lay real money that it does.)

The whole CrowdStrike catastrophe happened because of architectural issues in hardware and in systems software.

We should be building systems that make writing a virus difficult, not child's play. But that's an expensive proposition now.

We conned humanity into using computers for everything. Now we owe it to the world to make those systems work safely and reliably.

KV

 

George V. Neville-Neil works on networking and operating-system code for fun and profit. He also teaches courses on various subjects related to programming. His areas of interest are computer security, operating systems, networking, time protocols, and the care and feeding of large codebases. He is the author of The Kollected Kode Vicious and co-author with Marshall Kirk McKusick and Robert N. M. Watson of The Design and Implementation of the FreeBSD Operating System. For nearly 20 years, he has been the columnist better known as Kode Vicious. Since 2014, he has been an industrial visitor at the University of Cambridge, where he is involved in several projects relating to computer security. He earned his bachelor's degree in computer science at Northeastern University in Boston, Massachusetts, and is a member of ACM, the Usenix Association, and IEEE. His software not only runs on Earth, but also has been deployed as part of VxWorks in NASA's missions to Mars. He is an avid bicyclist and traveler who currently lives in New York City.

Copyright © 2024 held by owner/author. Publication rights licensed to ACM.

 

Originally published in Queue vol. 22, no. 5