
Intermediate Representations for the Datacenter Computer

Lowering the Burden of Robust and Performant Distributed Systems

Achilles Benetopoulos

We have reached a point where distributed computing is ubiquitous. In-memory application data size is outstripping the capacity of individual machines, necessitating its partitioning over clusters of them; online services have high availability requirements, which can be met only by deploying systems as collections of multiple redundant components; high durability requirements can be satisfied only through data replication, sometimes across vast geographical distances. While it has arguably never been easier to procure the necessary hardware—thanks to the prevalence of public clouds—and to deploy distributed applications with a variety of tools from cluster orchestrators such as Kubernetes to newer paradigms such as functions-as-a-service, building correct and efficient distributed solutions largely remains an individual exercise.

There is a baseline set of concerns that must be addressed for any distributed program to behave correctly. These concerns pertain to precisely the aspects of distributed computing that depart most significantly from the single-node programs application programmers are comfortable writing, and they can be the hardest to address correctly. Getting some of them wrong (communication and the coordination of distributed action in the presence of partial failure, or access and modification of shared data at a distance) can compromise system correctness; getting others wrong (appropriate data partitioning and placement) will not affect the validity of a program's results but can have catastrophic effects on system performance.

It seems desirable to be able to isolate these issues from user applications, and to address them at the infrastructure level so that diverse programs running on top of a distributed computing substrate can all reap the benefits. Arguably, big-data frameworks such as Hadoop MapReduce and Apache Spark are the most successful attempts to this end, enabling programmers with an appetite for big data to leverage the power of clusters of machines to speed up large-scale computations. Their domain-specific programming abstractions isolate the application programmer from the mechanisms of load-balancing computation across hosts and moving data between them, as well as from managing host failures. Their domain specificity, however, makes them unsuitable for a broad class of applications, such as long-term data storage or web-application back ends.

More typically, in practice, distributed systems arise as a composition of various subsystems, as illustrated by modern service-oriented architectures. Similar to the Unix model, functionality is composed at the granularity of a "process"—large-scale programs are built up by taking smaller programs (data stores like Cassandra or etcd, message brokers like RabbitMQ, etc.) that are designed to solve a specific problem in isolation and composing them by introducing appropriate plumbing for an application to consume them and coordinate across them. This paradigm presents a major tradeoff for the programmer: In return for having much more control over the details of how the system works, they are now exposed to the full breadth of distributed-systems pitfalls.

Some recent systems strike a balance between these two extremes: they aim to provide the ease of use of big-data frameworks without unnecessarily restricting the kinds of programs that can reasonably be expressed, and they do so by focusing on common patterns of distributed computation. One such example is Temporal,9 a distributed execution platform in which user programs are expressed as workflows of individual steps that interact using the familiar async/await paradigm. Programmers express their workflows using Temporal's APIs and are guaranteed that these workflows will run correctly to completion in the presence of any kind of failure.

At the core of Temporal is its durable workflow abstraction, which captures the sequence of steps to be taken in response to a request for computation. The workflow abstraction is in some sense the target against which the user writes their program, but it effectively isolates them from execution details, which are determined entirely by the Temporal platform. Temporal itself doesn't care about the exact code running in each step of the workflow; the only detail it needs to be aware of is what it is allowed to do when it encounters a failure.
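To make the workflow model concrete, here is a minimal sketch in the style of Temporal's Python SDK (the temporalio package); the order-processing scenario and all names in it are illustrative, not taken from Temporal's documentation:

```python
from datetime import timedelta
from temporalio import activity, workflow

@activity.defn
async def charge_card(order_id: str) -> str:
    # Ordinary application code; Temporal retries it according to policy
    # if the worker running it fails.
    return f"charged:{order_id}"

@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order_id: str) -> str:
        # Each awaited step is durably recorded; after a crash, the
        # workflow resumes here rather than re-running completed steps.
        return await workflow.execute_activity(
            charge_card,
            order_id,
            start_to_close_timeout=timedelta(seconds=30),
        )
```

The workflow body reads like straight-line async code; durability, retries, and the placement of each step are left entirely to the platform.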

 

Decoupling Application Logic and Distributed Coordination

As demonstrated by some of these systems, the right choice of abstractions, such as Spark's resilient distributed datasets10 or Temporal's durable workflows, can allow the clean decoupling of business logic from the orchestration of distributed outcomes. By taking care of orchestration and communication details on the users' behalf, systems like Temporal can lower the barrier to entry and allow application programmers to leverage fleets of computers without becoming experts in distributed computing. At the same time, this separation of concerns is only the first (necessary) step in assisting programmers who navigate the domain of distributed computing. We should use this decoupling as a basis for taking more than just coordination concerns (e.g., shared data consistency) off the programmer's plate.

To this end, the focus should be on providing language-independent, reusable abstractions for facilitating correct distributed outcomes with minimal user involvement. Language independence is a key requirement, as it maximizes the utility that can be derived from this imagined piece of infrastructure. This requirement leads us to look to compiler research for inspiration.

Over the past 15 years, there has been renewed interest in exploring new programming languages and paradigms, thanks largely to the existence of projects like LLVM, and more specifically the ability to compile drastically different languages down to a unified IR (intermediate representation) that the rest of the compiler toolchain operates on, agnostic to whichever front-end language produced it. This IR fully captures the details of user programs but does so for an idealized, nonexistent machine; the compiler is free to apply sometimes radical transformations and optimizations in this closed representation according to an internal rule set and can subsequently translate the optimized program for this idealized virtual machine into concrete assembly code for a target architecture.

If the datacenter really is a computer,1 then a question naturally follows: Is there an idealized IR for it? This IR would necessarily be distinct from that of a traditional compiler, because the "fundamental" operations of the (idealized) distributed computer are different from those of a single CPU. For example, whereas single-machine programs revolve around loads and stores to and from memory, cluster-level programs revolve around when and how cross-machine communication should happen.

As a basis for this IR, we can look to distributed dataflow, in which large-scale programs are expressed as directed acyclic graphs where each node represents a single-node piece of functionality (e.g., a function call), called an operator, and edges represent data dependencies between those fragments of compute. In this way, all distribution-related concerns are isolated at the edges between operators. Core business logic, which is what the application programmer is a domain expert on, is distribution-unaware; what happens along the edges can be the purview of an orchestration and distributed execution layer.
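The following sketch illustrates this shape in Python. Everything here is hypothetical (it is not the API of Hydro, Magpie, or any real system): operators hold distribution-unaware functions, edges are just references between operators, and a trivial local evaluator stands in for the distributed runtime that would decide placement and communication:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Operator:
    name: str
    fn: Callable                     # single-node logic; no networking inside
    inputs: list["Operator"] = field(default_factory=list)

def evaluate(op: Operator, cache: dict | None = None):
    # A real runtime would decide *where* each operator runs and how each
    # edge is realized (local call, RPC, queue); here we just run locally.
    cache = {} if cache is None else cache
    if op.name not in cache:
        args = [evaluate(dep, cache) for dep in op.inputs]
        cache[op.name] = op.fn(*args)
    return cache[op.name]

# orders -> totals -> report: the graph is the program; placement is not.
orders = Operator("orders", lambda: [("a", 3), ("b", 5)])
totals = Operator("totals", lambda rows: sum(n for _, n in rows), [orders])
report = Operator("report", lambda t: f"total={t}", [totals])
print(evaluate(report))              # total=8
```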

This separation between local and distributed code encourages us to imagine infrastructure that translates the distributed interactions embedded in user programs into an intermediate distributed dataflow representation that is subsequently "compiled" down to a concrete orchestration of distributed computation. The program's representation in this IR captures the programmer's intent (how data flows through the system and the transformations it is subject to) while giving the runtime leeway to make concrete decisions around placement and network interactions.

Components written in distinct languages can reference and use each other, with their interactions modifying and extending the underlying dataflow graph. The existence of such an IR would also enable a unified runtime as the computation substrate for all clusterwide computing. This article highlights some of the benefits anticipated under this model by focusing on two experimental systems in this line of research.

 

Two Academic Examples

This section looks at two active academic research projects that aim to isolate application programmers from mechanistic distribution details through high-level representations of distributed computation. The first such project is Hydro,7 a language stack developed at UC Berkeley with the goal of providing a common execution substrate to distributed programs written in a multitude of programming languages. The second one is Magpie, a new distributed single-level store and runtime being developed at UC Santa Cruz.

 

Hydro

The Hydro project consists of a compiler stack that aims to translate distributed programs written in a multitude of languages into a unified, distribution-aware IR that can be executed by the Hydro runtime. In the future envisioned by the project, programs will be written in a language of the programmer's choice, leveraging Hydro's library and APIs; at the core of Hydro's compilation process are multiple levels of intermediate representations, each with a different purpose.

The first-level IR, Hydrologic, is a declarative language that captures a high-level logical plan of the target distributed computation; Hydrologic programs represent the programmers' intent with respect to how application data is meant to be transformed and propagated through the system but omit concrete runtime details such as communication between hosts or data placement and replication strategies.

Hydrologic programs are then transformed into a second-level IR called Hydroflow. Hydroflow programs are semantically equivalent to the Hydrologic programs they originate from, in the sense that they produce identical outcomes, but they are single-threaded, single-node programs that can be executed by Hydro's runtime. This base Hydroflow representation is then consumed by the final step of the compilation process, whose job is to translate and optimize the single-node Hydroflow programs into potentially multihost asynchronous dataflow programs. This final configuration and optimization step takes into account high-level objectives specified declaratively by users (e.g., the maximum number of host failures the program should tolerate, or per-endpoint latency service-level objectives), which guide the discovery of an appropriate configuration of the user program that can then be deployed on a cluster and executed by Hydro's runtime.

In a 2023 paper, Hellerstein et al.6 illustrate the power of separating application logic from distributed orchestration logic by walking the reader through a Hydro implementation of the infamous "shopping cart" example from the Dynamo paper.4 As part of the example, the authors show how the same high-level program can be translated into a variety of concrete distributed programs; more specifically, different decisions on the placement of edges that represent network communication in the final dataflow graph can make the difference between client-side batching and server-side batching of intermediate shopping-cart states. The choice of variant depends on the tradeoffs the application programmer seeks to navigate. The authors argue that if the programmer's primary concern is tolerating client failures, then server-side batching is the preferable deployment configuration; if the programmer instead operates under strict requirements around user privacy and data ownership, then the right choice is the variant that isolates intermediate state on the user's client and reveals information to a server only when the user chooses.
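The following sketch conveys the flavor of the two variants; it is illustrative Python, not Hydro's actual syntax or APIs. The business logic is identical in both; only the placement of the network edge (represented here by the send callback) differs:

```python
def add_item(cart: list, item: str) -> list:
    # Pure business logic, identical in both variants.
    return cart + [item]

def client_side_batching(items, send):
    cart = []
    for item in items:       # batch on the client; one network crossing
        cart = add_item(cart, item)
    send(cart)

def server_side_batching(items, send):
    for item in items:       # one crossing per item; the server batches,
        send(item)           # so a client failure loses less state

sent = []
client_side_batching(["apple", "bread"], sent.append)  # sent: [["apple", "bread"]]
server_side_batching(["apple", "bread"], sent.append)  # sent: ..., "apple", "bread"
```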

The ability to arrive at distinct implementations with significantly different communication characteristics, without any implementation-level rewrites, is a direct corollary of decoupling the "logical plan" of a distributed program (which encodes the programmer's intent for the computation) from its concrete deployment, which is augmented with mechanisms that accomplish that intent at execution time under the realities of large-scale distribution. Today, all such tradeoffs have to be considered at application-design time, as moving from one variant to the other would necessitate potentially complex application rewrites.

While Hydro is still relatively nascent, the results that have come out of this research highlight the power of lifting distributed-systems interactions into a high-level "logical plan" that is amenable to rewrites and optimizations, and they point the way toward some powerful futures. Because distributed programs are translated into a common representation, the components that compose into the end-to-end execution of a distributed program become, in essence, parts of the same program. We can therefore imagine a future of whole-program, cross-component optimizations through rewrites and manipulations of the underlying dataflow graph, similar to the 2024 work by Chu et al. on optimizing Paxos through rule-driven rewrites.3

 

Magpie

Magpie is a distributed single-level store and runtime in which application data is packed into objects (arena-like typed regions of memory) that reside in a global address space. Programs are compositions of functions that operate over that data; these compositions form a dataflow graph that is dynamically expanded at runtime. This article focuses on how Magpie distributes user programs without programmer intervention or guidance. (The implementation of Magpie and its execution model will be described in more detail in a future publication.)

Importantly, both data and code in Magpie are mobile. Objects are structured in such a way as to be transparently relocatable between hosts.2 Functions are location-transparent, in the sense that there is nothing in their implementation that relies on specific placement decisions, and distribution-unaware, in the sense that they are not allowed to engage in any form of distributed coordination. They are queries and transformations over local data, guaranteed to execute from start to finish on a single machine without preemption. Furthermore, each function is executed under ACID (atomicity, consistency, isolation, durability) semantics, subject to the user's requirements around data consistency. The Magpie runtime itself fills the role of both the control and data planes for a given application.
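The following is a hypothetical sketch of what this might look like (Magpie's actual interfaces have not yet been published): a decorator registers each function call as an operator in a dynamically growing graph, and the function body itself is a plain, local, coordination-free transformation:

```python
GRAPH = []   # the dynamically expanded dataflow graph, as (op, args) records

def op(f):
    # Interpose at the call edge: the runtime observes f's data
    # dependencies here, before deciding where (and when) f should run.
    def wrapper(*args):
        GRAPH.append((f.__name__, args))   # dynamic graph expansion
        return f(*args)                    # in this sketch: run locally
    return wrapper

@op
def deposit(account: dict, amount: int) -> dict:
    # Distribution-unaware: a plain transformation over a local object,
    # executed start-to-finish on one machine under ACID semantics.
    account["balance"] += amount
    return account

acct = {"balance": 100}
deposit(acct, 25)   # the runtime would have co-located acct with this call
```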

Code and data mobility enables Magpie to take an extreme late-binding approach to executing user code. The static task graph encodes the data and control dependencies that need to be respected, but it omits concrete execution details such as activation sites, communication details, or even the potential parallelism available to the computation. As the static graph is expanded at runtime and the computation's real set of data dependencies is discovered, the runtime orchestrates data and code movement so that it can execute each function as guaranteed, trying to take full advantage of the parallelism available in the cluster (subject to the structure of the user's program) while minimizing the need for expensive distributed coordination such as distributed commit. The Magpie runtime can play such an effective role because it performs lightweight interposition at the edges of the dataflow graph, introspecting on each function's data dependencies and on the operational state of the cluster, while having full control over the application's data plane.

Lifting arbitrary programs into the dataflow representation means concerns that were previously thought to be exclusively the programmer's responsibility can now be subsumed and addressed entirely in the infrastructure. User programs can be transparently scaled up through automatic parallelization, enabled by the ability to introspect on their functional dependencies, coupled with the isolation properties of the transactional execution model. Visibility into application data access patterns is also critical in scaling out computation over a cluster of machines, as it allows the placement of compute and data to be readjusted on the fly, in an affinity-respecting manner, so as to minimize distributed coordination and network interactions between components. This amortizes network costs and takes advantage of temporal working sets in applications, while also effectively balancing observed load.
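As a sketch of the parallelization idea (illustrative names, and assuming an acyclic dependency graph): once the runtime knows each task's prerequisites, every "wave" of tasks whose prerequisites are already satisfied is mutually independent and can be dispatched concurrently:

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(tasks: dict, deps: dict) -> dict:
    """tasks: name -> zero-arg callable; deps: name -> set of prerequisites.
    Assumes the dependency graph is acyclic."""
    done, results = set(), {}
    with ThreadPoolExecutor() as pool:
        while len(done) < len(tasks):
            # Tasks whose prerequisites are all complete are independent
            # of one another, so the whole wave can run concurrently.
            wave = [n for n in tasks if n not in done and deps[n] <= done]
            futures = {n: pool.submit(tasks[n]) for n in wave}
            for name, fut in futures.items():
                results[name] = fut.result()
            done |= set(wave)
    return results

results = run_parallel(
    {"a": lambda: 1, "b": lambda: 2, "c": lambda: 3},
    {"a": set(), "b": set(), "c": {"a", "b"}},  # a and b run together
)
```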

The task-graph abstractions also enable a powerful composition model, moving beyond the Unix-like model of service composition that was sketched in the introduction. In this model, infrastructure components such as key/value stores or pub/sub systems are implemented as user-level libraries, which can then be consumed by higher-level programs through functional interfaces they expose.

This model enables the development and reuse of established infrastructure—something like the microservice model—but the separation between components, critical for organizing large software artifacts, is manifest only in application-code structure; at runtime, the functionality of all components is translated to well-defined compositions of "operators" in the language of dataflow. In turn, this allows the runtime to potentially isolate the execution of individual workflows or programs entirely on a single machine, providing extremely fast distribution-free execution to user programs that, using more traditional approaches such as RPC (remote procedure call), might involve gratuitous communication over the network in the application's critical path.8
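Continuing the hypothetical @op sketch from above, an "infrastructure" component such as a key/value store reduces to a library of operators; its calls land in the same graph as the application's, so a request that touches only co-located data need never cross the network:

```python
@op
def kv_put(store: dict, key: str, value) -> dict:
    store[key] = value
    return store

@op
def kv_get(store: dict, key: str):
    return store.get(key)

store = {}
kv_put(store, "user:1", {"name": "ada"})
profile = kv_get(store, "user:1")   # composes with application operators
```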

The benefits of ownership of the data and control planes extend beyond application performance. For any operation that the runtime executes, there is a complete causal record of its observed effects, allowing us to easily answer questions about how a computation unfolded over time in response to a user request, or about what data changed and what the modifications were. This record can serve as the basis for powerful debugging tools. For example, when an application encounters a bug, instead of synthesizing signals from a variety of tools such as application logs and distributed traces to work backwards toward the root cause of the issue, the programmer could gather all log records for the period in which the problem appeared and feed them to a single-node deterministic cluster simulator that replays the logs and supports time-travel debugging to isolate the root cause of the violation.
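A sketch of the idea (an illustrative record structure, not Magpie's actual log format): if every operator invocation is logged with its inputs and effects in causal order, a single-node simulator can step through the log and reconstruct the state of the world after any prefix:

```python
LOG = []   # append-only causal record of operator invocations

def record(op_name: str, inputs: dict, effects: dict):
    LOG.append({"op": op_name, "in": inputs, "out": effects})

def replay(log, up_to: int | None = None) -> dict:
    # "Time travel": replaying a prefix of the log reconstructs the
    # state as of that step.
    state = {}
    for entry in log[:up_to]:
        state.update(entry["out"])
    return state

record("deposit", {"account": "a", "amount": 25}, {"a.balance": 125})
record("deposit", {"account": "a", "amount": 10}, {"a.balance": 135})
assert replay(LOG, up_to=1) == {"a.balance": 125}
```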

 

Distinct IRs

Hydro and Magpie differ significantly in their value propositions. For example, Hydro builds on recent research in properties such as monotonicity5 and coordination avoidance and seeks to exploit these properties as opportunities for optimizing user programs at compile time. Magpie provides correctness and performance guarantees for arbitrary user programs that have been translated into its programming model but does not try to automatically optimize them by changing their structure; rather, it assumes that the optimal deployment configuration of a user program cannot be known ahead of time and therefore seeks optimization opportunities at runtime.

Despite these differences, both Hydro and Magpie represent a clean cut from existing paradigms that seek to augment traditional, single-node programming with constructs for the coordination of distributed activity; instead, they embrace new programming models for the development of robust-by-construction, high-performance distributed programs. Critically, they illustrate both the viability of deriving an IR for the datacenter computer (both have arrived at a form of distributed dataflow as the main abstraction for modeling arbitrary distributed computation) and the potential impact of such a fundamental abstraction. This shared abstraction is what enables both systems to eliminate concrete distribution concerns from user applications and to facilitate their correct and efficient execution.

 

Future Opportunities

Many questions remain to be answered before making definitive claims about the viability of rewriting existing applications for the experimental systems highlighted here. In closing, though, let me sketch a path for the incremental adoption and evaluation of these systems, and describe how this incremental path can help address an increasingly relevant problem: easily taking advantage of heterogeneous compute in the cloud.

Both Hydro and Magpie present programmers with their own data models: They control the persistence of updates and offer libraries of persistent collections, as well as abstractions for users to implement their own data structures using the runtime's primitives. While this choice enables low-latency, in-memory compute with fine-grained control over the persistence of updates, it means that neither system can directly exchange data with external systems; each must instead rely on some ETL (extract, transform, load) layer. In the future, both systems would likely benefit from supporting data formats such as Apache Arrow, which is fast becoming a de facto standard both for in-memory computing and, perhaps more critically, for data exchange. Support for external data formats would let Hydro and Magpie be deployed as subcomponents alongside already-running systems, and their impact be evaluated in an incremental fashion.
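As a small illustration of what Arrow buys at this boundary, the sketch below uses the real pyarrow package (the table contents are made up): the same columnar buffer can be written once and consumed by any Arrow-speaking system without a bespoke ETL step:

```python
import pyarrow as pa
import pyarrow.ipc as ipc

table = pa.table({"user": ["a", "b"], "total": [3, 5]})

# Serialize to the Arrow IPC stream format, intelligible to query engines,
# dataframe libraries, and other runtimes without re-encoding.
sink = pa.BufferOutputStream()
with ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# A receiving component reads the columns back without copying them.
received = ipc.open_stream(buf).read_all()
assert received.equals(table)
```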

With the raised level of abstraction afforded by these systems' dataflow models, coupled with support for universal data formats, I anticipate new avenues for easily leveraging heterogeneous computing targets. As long as the infrastructure can mediate the exchange of data between distinct computing targets, potentially applying appropriate transformations, user programs will be free to mix and match on-CPU compute with accelerator offloading. Crucially, this kind of use case can be expected to illustrate the power of the IRs employed by Hydro and Magpie. Specific compute-target requirements become additional constraints for the compiler stack or runtime to consider when producing a system configuration or concrete execution plan, but an operation that involves an accelerator can still be treated as just another operator in the dataflow graph, with semantics similar to those applied to CPU-oriented programs. User programs can therefore spill over to any kind of available accelerator without requiring any modification to the underlying high-level dataflow representations of these systems.
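Extending the earlier hypothetical Operator sketch, a compute-target requirement could be expressed as one more field on an operator; the planner reads it alongside placement and fault-tolerance constraints, while the graph's shape and semantics stay unchanged:

```python
from dataclasses import dataclass

@dataclass
class PlacedOperator(Operator):    # Operator from the earlier dataflow sketch
    target: str = "cpu"            # e.g., "cpu", "gpu"; a planner constraint

# To the dataflow IR this is just another node; only the scheduler cares
# that it must land on a host with a GPU.
embed = PlacedOperator("embed", fn=lambda rows: rows,
                       inputs=[orders], target="gpu")
```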

 

References

1. Barroso, L. A., Clidaras, J., Hölzle, U. 2013. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, second edition. Morgan & Claypool Publishers.

2. Bittman, D., et al. 2020. Twizzler: a data-centric OS for non-volatile memory. In Proceedings of the Usenix Annual Technical Conference; https://www.usenix.org/system/files/atc20-bittman.pdf.

3. Chu, D. C. Y., et al. 2024. Optimizing distributed protocols with query rewrites. Proceedings of the ACM on Management of Data 2(1); https://dl.acm.org/doi/pdf/10.1145/3639257.

4. DeCandia, G., et al. 2007. Dynamo: Amazon's highly available key-value store. ACM SIGOPS Operating Systems Review 41(6), 205-220; https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf.

5. Hellerstein, J. M., Alvaro, P. 2020. Keeping CALM: when distributed consistency is easy. Communications of the ACM 63(9), 72-81; https://dl.acm.org/doi/pdf/10.1145/3369736.

6. Hellerstein, J. M., et al. 2023. Invited paper: initial steps toward a compiler for distributed programs. In the Fifth Workshop for Advanced Tools, Programming Languages, and Platforms for Implementing and Evaluating Algorithms for Distributed Systems; https://hydro.run/papers/joe-applied-2023.pdf.

7. Hydro Project homepage; https://hydro.run/.

8. Seemakhupt, K., et al. 2023. A cloud-scale characterization of remote procedure calls. In Proceedings of the 29th Symposium on Operating Systems Principles; https://dl.acm.org/doi/10.1145/3600006.3613156.

9. Temporal homepage; https://temporal.io/.

10. Zaharia, M., et al. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. Proceedings of the Ninth Usenix Conference on Networked Systems Design and Implementation; https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf.

 

Achilles Benetopoulos is a Ph.D. student at UC Santa Cruz, working at the intersection of distributed systems, databases, and programming languages with Peter Alvaro, associate professor of computer science at UC Santa Cruz. Previously, he spent a few years working as a software engineer up and down the stack at a variety of companies.

Copyright © 2024 held by owner/author. Publication rights licensed to ACM.


Originally published in Queue vol. 22, no. 6