Research for Practice combines the resources of the ACM Digital Library, the largest collection of computer science research in the world, with the expertise of the ACM membership. Research for Practice columns share a common goal: spreading the joy and utility of reading computer science research between academics and their counterparts in industry.
This summer's installment of Research for Practice covers a topic that, despite its maturity, continues to produce cutting-edge research: deterministic record-and-replay. Deterministic record-and-replay technologies enable a faithful re-execution (replay) of a program that ran in the past (and perhaps encountered a rare bug, a performance anomaly, or an intrusion by an adversary). But accomplishing this requires that any nondeterministic inputs to the program be logged (recorded) during execution.
The selection of techniques presented here is curated by Andrew Quinn, assistant professor of computer science and engineering at UC Santa Cruz. We chose the topic because of its growing relevance to our audience of practitioners, and we chose Professor Quinn because of his expertise in the area. His selections here all come from recent publications in top CS conferences, and they cover a range of topics from core technologies providing record-and-replay to somewhat more exotic applications. They reveal both how broadly applicable this technique is across domains and just how HOT record-and-replay remains even though it has been an active area of research for a long time.
Professor Quinn's summaries are detailed, so I will keep mine short. His first selection, which is called RR, provides a fantastic opportunity to get your feet wet with a system that "maximizes deployability" and works right out of the box by taking advantage of a mix of hardware and software features available on any modern system.
The second paper (which just came out in 2024) uses record-and-replay techniques not to re-execute a program per se, but instead to address the problem of lightweight auditing—for example, to ensure a cloud provider is indeed running the program you submitted without needing to resort to the massive compute resources offered by the provider.
Finally, he shares a paper that creatively applies record-and-replay ideas to a seemingly unrelated domain: securing GPU computations on mobile devices by treating each execution on new inputs as a "replay" of a canonical execution.
I cannot think of a better way to spend a weekend afternoon than reading these three papers, along with Professor Quinn's expert summary.
—Peter Alvaro
Andrew Quinn
Deterministic record-and-replay is a technique that allows a user to record a program execution and then replay the exact same execution at a later time. Think of it as being like TiVo, only for processes that execute on a computer. Researchers have created hundreds of systems that use deterministic record-and-replay across many domains. For example, security forensic systems (e.g., Backtracker5) use record-and-replay to determine what data was exfiltrated during a security breach, while replay-based debuggers (e.g., Instant Replay6) use record-and-replay to debug rare production issues and heisenbugs (bugs that go away once you observe them).
The key challenge with deterministic record-and-replay is that processes produce a lot of execution state, far too much to capture and store. To illustrate, let's do a back-of-the-envelope calculation for the size of a process's instruction sequence, which is a subset of the information that must be encoded for record-and-replay. Assume that each instruction pointer is eight bytes, that the process is single-threaded and executes for 10 minutes, and that the system uses a 2GHz processor. In this scenario, the instruction sequence would require roughly 9.6TB of storage. Imagine the storage demands of doing this for all your workloads, which execute longer, have more threads, and use faster processors.
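For reference, the arithmetic under those assumptions (one instruction retired per cycle) is:

\[
2\times10^{9}\,\tfrac{\text{instructions}}{\text{second}} \times 600\ \text{seconds} \times 8\,\tfrac{\text{bytes}}{\text{instruction}} = 9.6\times10^{12}\ \text{bytes} \approx 9.6\ \text{TB}.
\]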
The key insight enabling deterministic record-and-replay is that most of a process's actions are deterministic, meaning their behavior depends entirely on the current state of the process. Thus, a deterministic record-and-replay system can eschew storing most execution state and instead store only information about the nondeterministic actions of the process. During recording, such systems store the nondeterministic inputs that were passed to the process (e.g., the bytes read from the network) in a log; during replay, they re-execute the process with the values in the log to re-create the execution state.
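To make that division of labor concrete, here is a minimal sketch in C, assuming a toy program whose only nondeterministic input is a single read from stdin. The logged_read wrapper and the nondet.log file name are invented for illustration; none of the systems below work at anything like this level of simplicity.

```c
/* Minimal record/replay sketch: log nondeterministic inputs while recording,
 * then re-inject them from the log while replaying. Hypothetical example;
 * real systems interpose on far more sources of nondeterminism. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static FILE *log_file;
static int replaying;   /* 0 = record mode, 1 = replay mode */

/* Wrapper around read(), the only nondeterministic action in this toy program. */
static ssize_t logged_read(int fd, void *buf, size_t count)
{
    ssize_t n = 0;
    if (replaying) {
        /* Replay: take the result and bytes from the log, not the real fd. */
        if (fread(&n, sizeof n, 1, log_file) == 1 && n > 0)
            fread(buf, 1, (size_t)n, log_file);
    } else {
        /* Record: perform the real read and append its result to the log. */
        n = read(fd, buf, count);
        fwrite(&n, sizeof n, 1, log_file);
        if (n > 0)
            fwrite(buf, 1, (size_t)n, log_file);
    }
    return n;
}

int main(int argc, char **argv)
{
    replaying = (argc > 1 && strcmp(argv[1], "--replay") == 0);
    log_file = fopen("nondet.log", replaying ? "rb" : "wb");
    if (!log_file) { perror("fopen"); return 1; }

    /* Everything below is deterministic given the bytes logged_read returns,
     * so a replayed run reproduces the recorded run exactly. */
    char buf[128];
    ssize_t n = logged_read(STDIN_FILENO, buf, sizeof buf - 1);
    if (n < 0) n = 0;
    buf[n] = '\0';
    printf("processed %ld bytes: %s\n", (long)n, buf);

    fclose(log_file);
    return 0;
}
```

Running the program once normally writes nondet.log; running it again with --replay reproduces the same output without consulting stdin at all.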
This column describes three recent research results that use deterministic record-and-replay, focusing on diverse use cases of the technique.
Robert O'Callahan, Chris Jones, Nathan Froyd, Kyle Huey, Albert Noll, and Nimrod Partush.
Engineering record and replay for deployability.
Proceedings of the USENIX Annual Technical Conference, 2017.
https://www.usenix.org/conference/atc17/technical-sessions/presentation/ocallahan
This system, called RR, is something you could probably add to your engineering workflow right now, since it's a record-and-replay debugger that's easy to deploy. The system's key novelty is that it is an unprivileged, user-space implementation that supports unmodified user-space applications on a stock Linux kernel, compiler, and language runtime, and on standard x86/x86-64 CPUs. RR even supports programs that have bugs, such as data races, which have been a major source of issues for past record-and-replay systems. RR is also open source, so you should build it and tinker with it if you haven't had the chance already.
RR works by building on software and hardware features that were originally intended for traditional debugging and performance profiling. It uses ptrace, which was designed for implementing software debuggers (e.g., gdb), to capture system-call results and signals. RR accelerates ptrace by introducing in-process system-call interception, a technique that uses seccomp-bpf (secure computing with Berkeley Packet Filter) to selectively suppress ptrace traps and hence minimize context switches. RR executes only one thread at a time to avoid nondeterministic data races, and it uses CPU hardware performance counters (originally designed for performance profiling) to measure application progress so it can deliver signals and perform context switches at the right point in the program. The paper's evaluation indicates that the system's compute, memory, and storage requirements are reasonable for regular use as a debugger.
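To give a flavor of the ptrace machinery RR starts from, here is a minimal sketch of a ptrace-based tracer for x86-64 Linux. It stops the traced child at every system-call entry and exit and prints the syscall number and result register, which is the kind of information a recorder would log. It is only an illustration: RR's seccomp-bpf acceleration, thread scheduling, and performance-counter bookkeeping are all omitted, as is error handling.

```c
/* Minimal ptrace-based system-call tracer (x86-64 Linux), for illustration.
 * Usage: ./tracer <command> [args...] */
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }

    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);  /* ask to be traced by the parent */
        execvp(argv[1], &argv[1]);
        perror("execvp");
        _exit(1);
    }

    int status;
    waitpid(child, &status, 0);                 /* child stops before running */

    while (1) {
        /* Resume the child until its next system-call entry or exit. */
        ptrace(PTRACE_SYSCALL, child, NULL, NULL);
        waitpid(child, &status, 0);
        if (WIFEXITED(status))
            break;

        /* Each system call produces two stops: one at entry, one at exit.
         * orig_rax holds the syscall number; rax holds the result at exit. */
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, child, NULL, &regs);
        fprintf(stderr, "syscall %llu, rax = %lld\n",
                (unsigned long long)regs.orig_rax, (long long)regs.rax);
    }
    return 0;
}
```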
Ioanna Tzialla, Jeffery Wang, Jingyi Zhu, Aurojit Panda, and Michael Walfish.
Efficient auditing of event-driven web applications.
Proceedings of the 19th European Conference on Computer Systems (EuroSys), 2024.
https://dl.acm.org/doi/10.1145/3627703.3650089
Suppose that you deploy a web server on a remote cloud machine. How can you be sure that the machine is actually executing your web server? This brings us to the second system of interest: Karousos, which uses a variant of record-and-replay to answer that question, a concern these researchers refer to as "execution integrity." Specifically, Karousos gives a principal (you, that is) confidence that running a given program (your web server) on given inputs (user requests) actually produces the alleged outputs (user responses) observed from a third-party machine (the remote cloud machine).
Karousos adopts the model of Orochi7: a trusted collector machine captures all inputs to and outputs from the program, while a verifier re-executes the inputs in the trace to check whether the re-executed outputs match the trace. The model accelerates this process by re-executing requests in batches, guided by untrusted advice, produced by the third-party machine, that records some of the nondeterministic actions taken by the program.
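As a rough illustration of the verification step, here is a minimal sketch in C that re-executes logged requests against a hypothetical deterministic handler and compares the results with the responses in the trace. The handler and trace are invented for illustration, and the sketch omits the batching and untrusted advice that make Karousos efficient.

```c
/* Minimal sketch of trace-based re-execution checking, in the spirit of the
 * Orochi/Karousos model. The handler and trace are hypothetical placeholders. */
#include <stdio.h>
#include <string.h>

struct trace_entry {
    const char *request;    /* input captured by the trusted collector        */
    const char *response;   /* output the third-party machine claims it sent  */
};

/* Stand-in for the audited program's request handler (deterministic). */
static const char *handle(const char *request)
{
    static char out[64];
    snprintf(out, sizeof out, "echo:%s", request);
    return out;
}

int main(void)
{
    struct trace_entry trace[] = {
        { "hello", "echo:hello" },
        { "bye",   "echo:bye"   },
    };
    size_t n = sizeof trace / sizeof trace[0];

    for (size_t i = 0; i < n; i++) {
        /* Re-execute the request and compare with the alleged response. */
        if (strcmp(handle(trace[i].request), trace[i].response) != 0) {
            printf("mismatch on request %zu: execution integrity violated\n", i);
            return 1;
        }
    }
    printf("all %zu responses match: trace accepted\n", n);
    return 0;
}
```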
Karousos specifically targets event-driven web applications, for which prior work provides poor support. The system's key contributions center on techniques that balance the tradeoff between re-execution throughput and the size of the advice. The techniques and analysis are deep and beyond the scope of this column, so I encourage interested readers to dive into the paper. (I promise you'll learn something!)
In the end, Karousos's re-execution is essentially deterministic record-and-replay, with two key caveats. First, its re-executions aim to require less computation than the recorded execution; otherwise, it would not make sense to use the cloud in the first place. Second, deterministic record-and-replay assumes a correct log of nondeterministic events, whereas Karousos targets scenarios where the third-party adversarially constructs the advice.
Heejin Park and Felix Xiaozhu Lin.
GPUReplay: a 50-KB GPU stack for client ML.
Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022.
https://dl.acm.org/doi/10.1145/3503222.3507754
Let's say you design and maintain a mobile application that uses a GPU to accelerate a machine-learning (ML) component. Moreover, let's say this application executes on a GPU stack consisting of an ML framework (e.g., TensorFlow), a runtime (e.g., OpenCL), and a GPU driver. Together, these layers convert a high-level definition of your program into low-level GPU commands and submit them to the GPU. Unfortunately, such sophisticated stacks come with three main concerns: weak security, deployment difficulty, and slow startup.
GPUReplay uses deterministic record-and-replay to improve this state of affairs. At development time, it executes your ML application and records the GPU stack's interactions with the GPU, which encode how the stack initializes the GPU and launches each computation. At deployment time, GPUReplay runs your ML application on new input data by replaying those recorded interactions, without the full GPU stack present. In doing so, it must deal with nondeterminism that may arise between the recorded executions and the replayed ones, which it does by tolerating nondeterminism that does not affect program output (e.g., timing variations), preventing the nondeterminism from occurring in the first place, or detecting nondeterminism and retrying the execution. The authors show that GPUReplay keeps a large number of high-impact CVEs (Common Vulnerabilities and Exposures) out of deployments, and they validate that the replayed executions are indeed correct.
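For intuition only, here is a small C sketch of what replaying a recorded GPU-initialization sequence could look like, assuming a hypothetical trace of memory-mapped register writes. GPUReplay's actual recordings capture far more of the stack's interaction with the GPU, so treat this purely as a cartoon of the idea.

```c
/* Cartoon sketch of replaying a recorded sequence of GPU register writes.
 * The register offsets, values, and in-memory "mmio" array are hypothetical. */
#include <stdint.h>
#include <stdio.h>

struct reg_write {
    uint32_t offset;   /* offset into the GPU's register space            */
    uint32_t value;    /* value the full GPU stack wrote during recording */
};

/* Stand-in for memory-mapped GPU registers. */
static volatile uint32_t mmio[4096];

/* "Recording": writes captured at development time while the full stack ran. */
static const struct reg_write trace[] = {
    { 0x0010, 0x00000001 },   /* e.g., power on a (hypothetical) engine     */
    { 0x0040, 0x00ab0000 },   /* e.g., program a (hypothetical) base address */
    { 0x0100, 0x00000001 },   /* e.g., kick off execution                   */
};

/* "Replay": reissue the recorded writes at deployment time, no stack needed. */
static void replay(const struct reg_write *t, size_t n)
{
    for (size_t i = 0; i < n; i++)
        mmio[t[i].offset] = t[i].value;
}

int main(void)
{
    size_t n = sizeof trace / sizeof trace[0];
    replay(trace, n);
    printf("replayed %zu recorded register writes\n", n);
    return 0;
}
```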
This column describes three recent research advances related to deterministic record-and-replay, with the goal of showing both classical use cases (replay-based debugging) and emerging use cases (execution integrity and GPU acceleration). A growing number of systems use a weaker form of deterministic record-and-replay (which I call "mostly deterministic" record-and-replay). Essentially, these systems exploit the determinism that exists across many program executions but intentionally allow some nondeterminism for performance reasons. This trend is exemplified in GPUReplay in particular, but also in systems such as ShortCut4 and Dora.8
For more information on deterministic record-and-replay, I suggest any of the detailed surveys on the topic.1,2,3
1. Chen, Y., Zhang, S., Guo, Q., Li, L., Wu, R., Chen, T. 2015. Deterministic replay: a survey. ACM Computing Surveys 48(2); https://dl.acm.org/doi/10.1145/2790077.
2. Cornelis, F., Georges, A., Christiaens, M., Ronsse, M., Ghesquiere, T., De Bosschere, K. 2003. A taxonomy of execution replay systems. In International Conference on Advances in Infrastructure for Electronic Business, Education, Science, Medicine, and Mobile Technologies on the Internet; https://www.researchgate.net/publication/244149898_A_Taxonomy_of_Execution_Replay_Systems.
3. Dionne, C., Feeley, M., Desbiens, J. 1996. A taxonomy of distributed debuggers based on execution replay. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques (PDPTA); http://www.iro.umontreal.ca/~feeley/papers/DionneFeeleyDesbiensPDPTA96.pdf.
4. Dou, X., Chen, P. M., Flinn, J. 2019. ShortCut: accelerating mostly deterministic code regions. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, 570–585; https://dl.acm.org/doi/10.1145/3341301.3359659.
5. King, S. T., Chen, P. M. 2003. Backtracking intrusions. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, 223–236; https://dl.acm.org/doi/10.1145/945445.945467.
6. LeBlanc, T. J., Mellor-Crummey, J. M. 1987. Debugging parallel programs with Instant Replay. IEEE Transactions on Computers C-36 (4), 471–482; https://dl.acm.org/doi/abs/10.1109/TC.1987.1676929.
7. Tan, C., Yu, L., Leners, J. B., Walfish, M. 2017. The efficient server audit problem, deduplicated re-execution, and the web. In Proceedings of the 26th Symposium on Operating Systems Principles, 546–564; https://dl.acm.org/doi/10.1145/3132747.3132760.
8. Viennot, N., Nair, S., Nieh, J. 2013. Transparent mutable replay for multicore debugging and patch validation. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS); https://dl.acm.org/doi/10.1145/2499368.2451130.
Peter Alvaro is an associate professor of computer science at the University of California Santa Cruz, where he leads the Disorderly Labs research group (disorderlylabs.github.io). His research focuses on using data-centric languages and analysis techniques to build and reason about data-intensive distributed systems in order to make them scalable, predictable, and robust to the failures and nondeterminism endemic to large-scale distribution. He earned his Ph.D. at UC Berkeley, where he studied with Joseph M. Hellerstein. He is a recipient of the National Science Foundation Career Award, Facebook Research Award, Usenix ATC 2020 Best Presentation Award, SIGMOD 2021 Distinguished PC Award, and UCSC Excellence in Teaching Award.
Andrew Quinn is an assistant professor at UC Santa Cruz. They work at the intersection of operating systems, programming languages, and computer architecture. Their work has been recognized with an ASPLOS best-paper award, an IEEE Micro Top Picks honorable mention, a Microsoft fellowship, and an NSF GRFP fellowship, and has appeared at top-tier venues across these areas.
Copyright © 2024 held by owner/author. Publication rights licensed to ACM.
Originally published in Queue vol. 22, no. 4.