Linear Address Spaces

Unsafe at any speed

Poul-Henning Kamp

I recently bought an Apple computer with the new M1 CPU to supplement the beastiarium known as Varnish Cache's continuous integration cluster. I am a bit impressed that it goes head to head with the s390x virtual machine borrowed from IBM, while never drawing more than 25 watts, but other than that: Meh...

This is one disadvantage of being a systems programmer: You see up close how each successive generation in an architecture has been inflicted with yet another "extension," "accelerator," "cache," "look-aside buffer," or some other kind of "marchitecture," to the point where the once-nice and orthogonal architecture is almost obscured by the "improvements" that followed. It seems almost like a law of nature: Any successful computer architecture, under immense pressure to "improve" while "remaining 100 percent compatible," will become a complicated mess.

Show me somebody who calls the IBM S/360 a RISC design, and I will show you somebody who works with the s390 instruction set today.

Or how about: The first rule of ARM is, "We don't talk about Thumb-2."

A very special instance of this law happened when AMD created the x86-64 instruction set to keep the x86 architecture alive after Intel, the nominal owner of that architecture, had all but abandoned it and gone full Second Systems Syndrome with the ill-fated Itanium architecture.

Fundamentally, we now have both kinds of CPUs—ARM and x64—and they both suffer from the same architectural problems. Take, for example, the translation from linear virtual to linear physical addresses. Not only have page table trees grown to a full handful of levels, but there are also multiple page sizes, each of which comes with its own handful of footnotes limiting usability and generality.

Why do we even have linear physical and virtual addresses in the first place, when pretty much everything today is object-oriented?

Linear virtual addresses were made to be backwards-compatible with tiny computers with linear physical addresses but without virtual memory. There are still linear virtual addresses that are backwards-compatible with computers that were backwards-compatible with computers that were...

Apart from the smallest microcontrollers, nobody sane uses linear address spaces anymore, neither physical nor virtual. The very first thing any realtime nucleus or operating system kernel does is implement an abstract object store on top of the linear space. If the machine has virtual memory support, it then tries to map from virtual to physical as best it can, given five levels of page tables and all that drags in with it.

Translating from linear virtual addresses to linear physical addresses is slow and complicated, because 64-bit can address a lot of memory.

Having a single linear map would be prohibitively expensive in terms of memory for the map itself, so translations use a truncated tree structure, but that adds a whole slew of new possible exceptions: What if the page entry for the page directory entry for the page entry for the exception handler for missing page entries is itself empty?

This is the land where "double fault exceptions" and "F00F workarounds" originate. And with five levels of page tables, in the ultimate worst case, it takes five memory accesses before the CPU even knows where the data is.

It doesn't have to be that way.

One of my volunteer activities for datamuseum.dk is writing a software emulation of a unique and obscure computer: the Rational R1000/s400. (It's OK; you can look it up. I'll wait, because until one stood on our doorstep, I had never heard of it either.)

The R1000 has many interesting aspects, not the least of which is that it was created by some very brilliant and experienced people who were not afraid to boldly go. And they sure went: The instruction set is Ada primitives, it operates on bit fields of any alignment, the data bus is 128 bits wide: 64-bit for the data and 64-bit for data's type. They also made it a four-CPU system, with all CPUs operating in the same 64-bit global address space. It also needed a good 1,000 amperes at 5 volts delivered to the backplane through a dozen welding cables.

The global 64-bit address space is not linear; it is an object cache addressed with an (object + offset) tuple, and if that page of the object is not cached, a microcode trap will bring it in from disk.

In marketing materials, the object cache was sold as "memory boards," but in hardware, they contained a four-way associative cache, which brilliantly hid the tag-RAM lookup during the RAS (row address strobe) part of the DRAM memory cycle, so that it is precisely as fast as a normal linear DRAM memory would have been.

State-of-the-art CPUs today can still address only approximately 57 bits of address space, using five levels of page tables, each level successively and slowly sorting out another 10 bits of the address.

The R1000 addresses 64 bits of address space instantly in every single memory access. And before you tell me this is impossible: The computer is in the next room, built with 74xx-TTL (transistor-transistor logic) chips in the late 1980s. It worked back then, and it still works today.

The R1000 was a solid commercial success, or I guess I should say military success, because most customers were in the defense industry, and the price was an eye-watering, $1 million a pop. It also came with the first fully semantic IDE, but that is a story for another day.

Given Ada's reputation for strong typing, handling the type information in hardware rather than in the compiler may sound strange, but there is a catch: Ada's variant data structures can make it a pretty daunting task to figure out how big an object is and where to find things in it. Handling data+type as a unit in hardware makes that fast.

Not that type-checking in hardware is a bad idea. Quite the contrary: The recent announcement by ARM that it has prototyped silicon with its Morello implementation of Cambridge University's CHERI architecture gives me great hope for better software quality in the future.

Cut to the bone, CHERI makes pointers a different data type than integers in hardware and prevents conversion between the two types. Under CHERI, new valid pointers can be created only by derivation from existing valid pointers, either by restricting the permissible range or by restricting the permissions. If you try to create or modify a pointer by any other means, it will no longer be a pointer, because the hardware will have cleared the special bit in memory that marked it as a valid pointer.

According to Microsoft Research, CHERI would have deterministically detected and prevented a full 43 percent of the security problems reported to the company in 2019. To put that number in perspective: The National Highway Traffic Safety Administration reports that 47 percent of the people killed in traffic accidents were not wearing seat belts.

Like mandatory seat belts, some people argue that there would be no need for CHERI if everyone "just used type-safe languages," or they will claim that the extra bits for "capabilities" CHERI pointers carry make their programs look fat.

I'm not having any of it.

The linear address space as a concept is unsafe at any speed, and it badly needs mandatory CHERI seat belts. But even better would be to get rid of linear address spaces entirely and go back to the future, as successfully implemented in the Rational R1000 computer 30-plus years ago.

Poul-Henning Kamp ([email protected]) spent more than a decade as one of the primary developers of the FreeBSD operating system before creating the Varnish HTTP Cache software, which around a fifth of all web traffic goes through at some point. He lives in his native Denmark, where he makes a living as an independent contractor, specializing in making computers do weird stuff. One of his most recent projects was a supercomputer cluster to stop the stars twinkling in the mirrors of ESO's (European Southern Observatory's) new ELT (extremely large telescope).

Originally published in Queue vol. 20, no. 2—
Comment on this article in the ACM Digital Library

More related articles:

Nicole Forsgren, Eirini Kalliamvakou, Abi Noda, Michaela Greiler, Brian Houck, Margaret-Anne Storey - DevEx in Action
DevEx (developer experience) is garnering increased attention at many software organizations as leaders seek to optimize software delivery amid the backdrop of fiscal tightening and transformational technologies such as AI. Intuitively, there is acceptance among technical leaders that good developer experience enables more effective software delivery and developer happiness. Yet, at many organizations, proposed initiatives and investments to improve DevEx struggle to get buy-in as business stakeholders question the value proposition of improvements.

João Varajão, António Trigo, Miguel Almeida - Low-code Development Productivity
This article aims to provide new insights on the subject by presenting the results of laboratory experiments carried out with code-based, low-code, and extreme low-code technologies to study differences in productivity. Low-code technologies have clearly shown higher levels of productivity, providing strong arguments for low-code to dominate the software development mainstream in the short/medium term. The article reports the procedure and protocols, results, limitations, and opportunities for future research.

Ivar Jacobson, Alistair Cockburn - Use Cases are Essential
While the software industry is a fast-paced and exciting world in which new tools, technologies, and techniques are constantly being developed to serve business and society, it is also forgetful. In its haste for fast-forward motion, it is subject to the whims of fashion and can forget or ignore proven solutions to some of the eternal problems that it faces. Use cases, first introduced in 1986 and popularized later, are one of those proven solutions.

Jorge A. Navas, Ashish Gehani - OCCAM-v2: Combining Static and Dynamic Analysis for Effective and Efficient Whole-program Specialization
OCCAM-v2 leverages scalable pointer analysis, value analysis, and dynamic analysis to create an effective and efficient tool for specializing LLVM bitcode. The extent of the code-size reduction achieved depends on the specific deployment configuration. Each application that is to be specialized is accompanied by a manifest that specifies concrete arguments that are known a priori, as well as a count of residual arguments that will be provided at runtime. The best case for partial evaluation occurs when the arguments are completely concretely specified. OCCAM-v2 uses a pointer analysis to devirtualize calls, allowing it to eliminate the entire body of functions that are not reachable by any direct calls.