Research for Practice


The Point is Addressing

A brief tour of efforts to reimagine programming in a world of changing memories

Daniel Bittman
with introduction by Peter Alvaro

Research for Practice combines the resources of the ACM Digital Library, the largest collection of computer science research in the world, with the expertise of the ACM membership. Research for Practice columns have a common goal of sharing the joy and utility of reading about computer science research exchanged between academics and their counterparts in industry.

 

In his 2023 Ph.D. dissertation, "Operating Systems for Far Out Memories," Daniel Bittman argues that a recent convergence of hardware trends—including increased memory heterogeneity, faster interconnects that allow loads and stores to escape the confines of individual hosts, and fine-grained persistence—has pushed traditional approaches for how to organize, manage, and share memory to a breaking point. "Retrofitting existing interfaces is insufficient," he argues. "Instead, the correct approach to reimagining programming in a world of changing memories is to rebuild the operating system from the ground up."

As one of Daniel's advisors at UC Santa Cruz, I was impressed by the audacity of his vision and influenced considerably by his thinking. So, I too have taken to using "far out memories" as a blanket label for emerging technologies (or systems designed as reactions to them) that challenge traditional assumptions about the memory hierarchy—including disaggregated, far, nonvolatile, device-attached, and distributed shared memories. To restart Research for Practice after a six-month hiatus, I asked Daniel to curate a collection of papers about "anything related to far-out memories." It is a very broad topic from which to choose a handful of representative publications, but I knew Daniel would not only pick a rich selection but also employ an appropriately constrained lens through which to study it.

His RfP installment, which follows here, takes us through more than 30 years of research, from single-address-space operating systems, to software-based distributed shared memory, to far memory offload, to single-level stores for persistent memory. The featured papers challenge assumptions about isolation, sharing and locality, transparency, and movement of memory and computation. The thread that ties all these selections together in Daniel's analysis is the topic of addressing, or how data references data. As we see this through Daniel's eyes, implementation details regarding addressing often reveal fundamental design constraints that reflect how the systems can and should be used. For example, Sinfonia and Twizzler choose addressing schemes that permit (or force?) programmers to reason about locality; systems such as Carbink instead prioritize transparent data movement even if this introduces overheads.

—Peter Alvaro

 

The Point is Addressing

Large memory systems that process huge amounts of data are widespread these days, driven in part by the recent explosion of application demands on memory. We have long surpassed the point where we could reasonably fit all the data on a single node, and so instead we are grappling with the challenges that come from disaggregating memory across nodes, racks, and datacenters. Recent additions to Linux, like transparent page placement, are among the freshest efforts in a wave of attempts to assist programmers with accessing "far" memory (i.e., memory that exists on another node). Hardware advances such as CXL (Compute Express Link) and, perhaps, UALink are also efforts to make widespread memory sharing possible. Given all this activity, the time is ripe to evaluate how data is organized and accessed within such systems.

Sharing memory across computers has a rich and storied history, filled with challenges such as handling failure, maintaining consistency, and dealing with locality, among others. Equally fundamental is how an application addresses (or refers to) data. Constructing in-memory data structures involves pointing to data, typically via a virtual address. As we'll see, this limits how data can be shared, since virtual addresses are (traditionally) tied to the context of the creating process, resulting in friction when you want to move data to and from far memory.
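To make that concrete, here is a minimal C++ sketch (the struct and scenario are mine, not drawn from any of the systems below) showing why a raw pointer is meaningful only inside the address space that created it:

    #include <cstdint>
    #include <iostream>

    // A node in a pointer-rich structure: "next" is a raw virtual address.
    struct Node {
        uint64_t value;
        Node*    next;   // meaningful only in this process's address space
    };

    int main() {
        Node tail{2, nullptr};
        Node head{1, &tail};

        // Within the creating process, following the pointer works fine.
        std::cout << head.next->value << "\n";   // prints 2

        // But &tail is just a number tied to this address space. Copy these
        // bytes to another process (or another machine's memory) and "next"
        // refers to whatever happens to live at that address there, which is
        // exactly the serialization/swizzling friction described above.
        std::cout << reinterpret_cast<uintptr_t>(head.next) << "\n";
        return 0;
    }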

To gain some perspective, let's walk through some papers about what is loosely categorized as DSM (distributed shared memory); each presents interesting and unique ways of addressing data. As we discuss how these papers adopt different addressing schemes to solve this sharing problem, we'll see tradeoffs in a variety of design spaces and observe how these particular design choices affect things like scaling, locality, and transparency.

 

Opal

Jeffrey S. Chase, Henry M. Levy, Michael J. Feeley, and Edward D. Lazowska.
1994.
Sharing and protection in a single-address-space operating system.
ACM Transactions on Computer Systems 12(4).
https://dl.acm.org/doi/pdf/10.1145/195792.195795

Opal came about in the early '90s when CPU manufacturers started making processors that could support "wide" (64-bit) memory addresses. The key insight was to consider wide-address architectures as an opportunity for a new paradigm, shifting the way programmers organize data for the purpose of sharing. Instead of running each program in a separate virtual address space (as the traditional Unix process model would), Opal runs a mostly traditional threading model across a single virtual address space while still using the memory-management hardware to enforce isolation. This means that in-memory data is allocated and kept in a single location in virtual memory, such that any running program can operate on that data—including following pointers in pointer-rich data structures (e.g., a linked list).

Essentially, by changing how the operating system manages the virtual address space, Opal eliminates the overhead of serializing and copying data between fully isolated processes, while still allowing different programs to easily share in-memory data structures.
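One way to get a feel for the model on a stock POSIX system (purely as an analogy; Opal's actual interfaces and kernel are quite different, and the fixed address below is a made-up placement) is to map a shared region at the same virtual address in every cooperating process, so that raw pointers into the region remain valid everywhere:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstdio>

    // Hypothetical placement; Opal coordinates allocation within the single
    // address space itself rather than relying on a hard-coded hint like this.
    static void* const kRegionAddr = reinterpret_cast<void*>(0x200000000000);
    constexpr size_t kRegionSize = 1 << 20;

    struct Node { long value; Node* next; };   // raw pointers stay meaningful

    int main() {
        int fd = shm_open("/shared_heap", O_CREAT | O_RDWR, 0600);
        if (fd < 0 || ftruncate(fd, kRegionSize) != 0) { perror("shm"); return 1; }

        // Every cooperating process maps the region at the same address, so a
        // Node* built by one process can be followed directly by another.
        void* base = mmap(kRegionAddr, kRegionSize, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_FIXED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        Node* head = static_cast<Node*>(base);
        head->value = 42;
        head->next  = nullptr;   // any mapper can dereference this pointer
        printf("%ld\n", head->value);

        munmap(base, kRegionSize);
        close(fd);
        return 0;
    }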

This approach has some great advantages, such as enabling mostly isolated programs to share key data structures directly, without opening sockets or sending bytes down pipes between processes. The paper describes how the authors were able to build a computer-aided design software package atop their system for Boeing as a way of proving the potential for single-address-space operating systems. (Perhaps not the best advertisement these days.)

The choice of 64-bit virtual addresses as data references was largely influenced by the novel hardware of the time, but you could ask whether the size of the address space is sufficient for all the data across all running programs. The paper discusses this using one of my favorite kinds of "napkin math," estimating how long a single machine could continuously allocate memory before running out (about 500 years!). The paper also acknowledges a limit on scaling beyond that: while a small cluster is viable, many independent machines would quickly exhaust the address space.
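To reproduce a figure of that order with my own numbers (which need not match the paper's exact assumptions): 2^64 bytes is roughly 1.8 × 10^19 bytes, so allocating a sustained 1 GB (10^9 bytes) per second would take about 1.8 × 10^10 seconds, or a bit under 600 years, to exhaust the space. Put a few hundred machines on the job at that rate, though, and the horizon shrinks to a couple of years.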

 

Sinfonia

Marcos K. Aguilera, Arif Merchant, Mehul Shah, Alistair Veitch, and Christos Karamanolis.
2007.
Sinfonia: a new paradigm for building scalable distributed systems.
Proceedings of the 21st ACM Symposium on Operating Systems Principles.
https://www.cs.princeton.edu/courses/archive/fall08/cos597B/papers/sinfonia.pdf

Sinfonia presents a way of manipulating memory shared among cooperating nodes in a cluster, with a focus on consistency guarantees and fault tolerance. Applications submit "mini-transactions" to the system, each containing a collection of memory operations across a variety of memory locations within the system. These mini-transactions provide ACID (atomicity, consistency, isolation, and durability) guarantees while significantly reducing the number of network roundtrips needed to commit. In this way, mini-transactions act as a powerful primitive that enables programmers to build complex infrastructure-level applications, as the authors demonstrate by building a distributed file server.

Sinfonia's memory addresses are tuples, each comprising a node ID and an offset within a per-node linear address space. We can see how this scales further than Opal's single 64-bit address space, since the per-node spaces together provide far more addressable memory, and thus more data to which we can refer. There is also no longer any need to coordinate a single address space across all machines. One downside is that the data model is tightly coupled to hosts, since node IDs are explicit; this can make moving data or altering the cluster topology more difficult.

Beyond providing more room, the per-node linear address spaces allow programmers to express locality when laying out data. Operations (transactional ones in particular) on a collection of data are much faster if all the data is local (i.e., in one place). Since Sinfonia makes locality explicit in the address scheme, the programmer has an opportunity to use program semantics to inform data layout while keeping node locality in mind.
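To give a flavor of programming against such an interface, here is a hypothetical C++ sketch (the types, method names, and the rename example are invented for illustration and do not match the paper's actual API): addresses are (node ID, offset) pairs, and a mini-transaction bundles compare, read, and write items that the system commits atomically.

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Sinfonia-style address: which memory node, and where within it.
    struct Addr { uint32_t node_id; uint64_t offset; };

    // One item in a mini-transaction: an address range plus a payload.
    struct Item { Addr addr; size_t len; std::string data; };

    // Hypothetical mini-transaction builder: all items are shipped together
    // and committed atomically, amortizing network round trips.
    class MiniTransaction {
    public:
        void cmp(Addr a, size_t len, std::string expect) { cmps_.push_back({a, len, std::move(expect)}); }
        void read(Addr a, size_t len)                    { reads_.push_back({a, len, {}}); }
        void write(Addr a, std::string bytes)            { writes_.push_back({a, bytes.size(), std::move(bytes)}); }

        // Succeeds only if every cmp item matches; the real system ships the
        // items to the participating memory nodes and runs a commit protocol.
        bool exec_and_commit();
    private:
        std::vector<Item> cmps_, reads_, writes_;
    };

    // Example: atomically move a 5-byte directory entry between two nodes.
    // Note how the programmer sees, and can exploit, which node each byte
    // lives on.
    void rename_entry(MiniTransaction& tx) {
        Addr src{/*node_id=*/3, /*offset=*/0x1000};
        Addr dst{/*node_id=*/7, /*offset=*/0x2000};
        tx.cmp(src, 5, "entry");              // proceed only if still present
        tx.write(src, std::string(5, '\0'));  // clear the old location
        tx.write(dst, "entry");               // install at the new location
    }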

 

Carbink

Yang Zhou, Hassan M. G. Wassel, Sihang Liu, Jiaqi Gao, James Mickens, Minlan Yu, Chris Kennelly, Paul Turner, David E. Culler, Henry M. Levy, and Amin Vahdat.
2022.
Carbink: fault-tolerant far memory.
Proceedings of the 16th Usenix Symposium on Operating Systems Design and Implementation.
https://www.usenix.org/system/files/osdi22-zhou-yang.pdf

So far, the papers referred to here have chosen a more "fixed" addressing scheme, where data addresses do not change over time. (Or, if they do, the data is manually copied and the application must manually update pointers.) Carbink, by contrast, uses a notion of "remotable" pointers, where—instead of maintaining a large enough address space for all the data—the system rewrites pointers whenever data is moved to "far" memory.

This is done in a particularly clever way, leveraging C++ smart pointers and reverse pointers so that data that might move can find all of its associated pointers and update them. Pointers, then, act as virtual addresses when the pointed-to data is local and as remote data identifiers when it is far. As a result, the scheme comes with some overhead—for example, moving data to a remote host could require a fair bit of pointer manipulation, and shared pointers need some extra space. The paper discusses this overhead, along with some additional optimizations that can be used to reduce the costs (e.g., using RCU [read-copy update] for synchronizing pointer updates).
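Here is a much-simplified sketch of the general remotable-pointer technique (my illustration, not Carbink's actual implementation or API): each pointer is either a live local pointer or a remote identifier, and a runtime tracks back references so it can rewrite every affected pointer when data moves.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // A "remotable" pointer: either a live local pointer or an ID naming the
    // data's location in far memory. The runtime flips it between the two.
    template <typename T>
    struct RemPtr {
        bool     is_local;
        T*       local;       // valid only when is_local
        uint64_t remote_id;   // valid only when !is_local
    };

    // Toy runtime: tracks every RemPtr that refers to each object (the
    // reverse pointers), so eviction can rewrite them all in place.
    template <typename T>
    class FarMemoryRuntime {
    public:
        void track(T* obj, RemPtr<T>* p) { back_refs_[obj].push_back(p); }

        // Evict an object to far memory: copy it out (elided here), then
        // rewrite every pointer that referred to it.
        void evict(T* obj, uint64_t new_remote_id) {
            for (RemPtr<T>* p : back_refs_[obj]) {
                p->is_local  = false;
                p->local     = nullptr;
                p->remote_id = new_remote_id;
            }
            back_refs_.erase(obj);
        }

        // Dereference: fetch the data back if it has gone far. A real system
        // would swap it in, re-register back references, and synchronize with
        // concurrent evictions (Carbink uses RCU for the latter).
        T* deref(RemPtr<T>& p) {
            return p.is_local ? p.local : fetch_from_far(p.remote_id);
        }
    private:
        T* fetch_from_far(uint64_t id);   // RDMA/RPC fetch, not shown
        std::unordered_map<T*, std::vector<RemPtr<T>*>> back_refs_;
    };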

While Carbink is not the first system to use such a scheme, it applies some clever tricks. Since the system can update pointers, it can transparently move data structures. This lets Carbink automatically group data into hot and cold tiers so as to better select candidates for eviction to far memory. Contrast this with Sinfonia's addressing choice, which followed directly from the need to take advantage of node locality and a desire to give the programmer direct control over layout. Carbink instead prioritizes transparency, which significantly reduces program complexity but relies on heuristics to infer application access patterns.

 

Twizzler

Daniel Bittman.
2023.
Operating systems for far out memories.
Ph.D. dissertation.
https://dbittman.com/papers/dbittman-dissertation.pdf

Twizzler is a data-centric operating system designed to enable fluid movement of data as well as shared access to far memory, with the goal of reducing coordination and allowing pointer-rich data to move without modification between nodes. Twizzler's approach is more along the lines of Opal's: an operating system reimagined for evolving memory hierarchies and access capabilities. It defines a single address space for all data, organized into collections of bytes called objects, each uniquely identified by a 128-bit ID.

Twizzler's pointer scheme uses invariant pointers, which, like Sinfonia's addresses, consist of a tuple—in this case, an object ID and an offset into the (possibly large) object. The key idea is that invariant pointers let you build in-memory data structures that are fast to access yet can move freely, without modification, between nodes in the system, because the pointers are invariant—that is, they can be understood regardless of which host or process is looking at the data. Unlike Sinfonia, Twizzler splits the tuple, storing IDs separately from offsets in a per-object table of "outgoing references," where each entry can be used by multiple pointers. The result is that in-memory data structures can point to data across objects in a huge address space without increasing pointer size beyond the native representation (64 bits).
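A simplified sketch of what that split can look like (the field widths, names, and index-zero convention here are my assumptions for illustration, not the dissertation's exact layout):

    #include <cstdint>
    #include <vector>

    // 128-bit object ID.
    struct ObjectId { uint64_t hi, lo; };

    // Per-object table of outgoing references: each entry names an external
    // object by ID, and many pointers to that object can share one entry.
    struct ForeignObjectTable {
        std::vector<ObjectId> entries;
    };

    // An invariant pointer still fits in 64 bits: a small index into the
    // containing object's table, plus an offset within the target object.
    struct InvariantPtr {
        uint32_t fot_index;   // 0 => points within the containing object itself
        uint32_t offset;
    };

    struct Resolved { ObjectId target; uint64_t offset; };

    // Resolution is context-free: any host or process holding the containing
    // object and its table computes the same (object ID, offset) pair, so the
    // data can move between nodes without rewriting its pointers.
    Resolved resolve(const ObjectId& containing, const ForeignObjectTable& fot,
                     InvariantPtr p) {
        if (p.fot_index == 0)
            return {containing, p.offset};
        return {fot.entries[p.fot_index - 1], p.offset};
    }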

Invariant pointers with 128-bit IDs improve scaling by providing a large address space without tying the data model to hosts and by allowing multiple machines to operate on shared data without expensive clusterwide address-space coordination. Objects in Twizzler act as an abstraction of memory regions, allowing programmers to build data structures independent of the underlying hardware topology while still expressing locality that the infrastructure can exploit. This kind of logical locality lets placement optimizations make good decisions informed by application semantics, while decoupling data from hosts for flexibility. The invariant pointer scheme exploits data locality for its own purposes as well: pointers within an object don't take up table entries, and many pointers to the same external object can share a single entry.

When starting Twizzler, I focused on non-volatile memory, but the ideas apply equally well to far memory. For the very curious reader, more details can be found in the dissertation linked above.

 

Conclusion

The papers discussed here have far more gems in them than could be covered in this admittedly narrow focus on addressing. Both Sinfonia and Carbink tackle challenges of fault tolerance and network overhead, and I would be remiss if I failed to mention that these challenges also affect the systems' choices for addressing. The Opal paper includes a fascinating discussion of how single-address-space designs affect linkers and how the prototype was implemented on Mach. Of course, there is also a wealth of additional work (e.g., Twizzler TOS) in this area, and the topic is a timely one, since application memory demands continue to strain the capabilities of our infrastructure.

One thing that becomes clear from reading these papers is that even something as innocent as addressing comes from a rich design space filled with tradeoffs among important considerations such as scaling, transparency, overhead, and programmer control. These tradeoffs are just a few examples of the many challenges facing programmers today, especially as we drive our applications to larger scales. The way we refer to and address data matters, for reasons ranging from speed to complexity to consistency, and can have unexpected effects down the line if we do not carefully consider how we talk about and refer to data at large. The papers presented here indicate a clear desire and need to directly share data at larger scales, and I can't wait to see where the community goes next.

 

Peter Alvaro is an associate professor of computer science at the University of California Santa Cruz, where he leads the Disorderly Labs research group (disorderlylabs.github.io). His research focuses on using data-centric languages and analysis techniques to build and reason about data-intensive distributed systems in order to make them scalable, predictable, and robust to the failures and nondeterminism endemic to large-scale distribution. He earned his Ph.D. at UC Berkeley, where he studied with Joseph M. Hellerstein. He is a recipient of the National Science Foundation CAREER Award, Facebook Research Award, Usenix ATC 2020 Best Presentation Award, SIGMOD 2021 Distinguished PC Award, and UCSC Excellence in Teaching Award.

Daniel Bittman leads the Twizzler project, developing and building new OS abstractions for far memory, persistence, distribution, and security. He received his Ph.D. from UC Santa Cruz in 2023, studying under Peter Alvaro and Ethan Miller. He is currently CTO of Elephance Memory, a startup bringing programmable far pointers to disaggregated memory systems using Twizzler. More at https://dbittman.com.

Copyright © 2025 held by owner/author. Publication rights licensed to ACM.


Originally published in Queue vol. 23, no. 2