Research for Practice

The Fun in Fuzzing

The debugging technique comes into its own.

Stefan Nagy, with introduction by Peter Alvaro

Research for Practice combines the resources of the ACM Digital Library, the largest collection of computer science research in the world, with the expertise of the ACM membership. Research for Practice columns have a common goal of sharing the joy and utility of reading computer science research between academics and their counterparts in industry.

For this edition of Research for Practice, we enlisted the help of Stefan Nagy, an assistant professor in the Kahlert School of Computing at the University of Utah. We would like to thank John Regehr—who has written for RfP before—for making this introduction.

Stefan takes us on a tour of recent research in software fuzzing, or the systematic testing of programs via the generation of novel or unexpected inputs. The first paper he discusses extends the state of the art in coverage-guided fuzzing (which measures the testing progress in terms of program syntax) with the semantic notion of "likely invariants," inferred via techniques from property-based testing. The second explores encoding domain-specific knowledge about certain bug classes (e.g., use-after-free errors) into test-case generation. His last selection takes us through the looking glass, randomly generating entire C programs and using differential analysis to compare traces of optimized and unoptimized executions, in order to find bugs in the compiler and debugger toolchains themselves.

I learned a lot from reading these papers and hope that you will too.

—Peter Alvaro

Research for Practice: The Fun in Fuzzing

Stefan Nagy

Fuzzing (shorthand for fuzz-testing) represents perhaps the most historically successful technique for automated software bug and vulnerability discovery. Fuzzing's core workflow is simple: (1) generate lots of test cases; (2) run each on your target program and observe what happens; and then (3) group test cases by their observed execution behaviors (e.g., crashes, code coverage, timeouts), keeping those that exhibit some notion of "interesting" behavior (although what "interesting" means here is entirely up to you).
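
To make the workflow concrete, here is a minimal sketch of that generate/run/observe loop, written in C under a few assumptions of my own: a hypothetical command-line target (./target) that reads a single input file, purely random test-case generation, and "interesting" defined simply as "crashed." Real fuzzers layer mutation strategies and coverage feedback on top of this skeleton.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    srand(42);
    for (int i = 0; i < 1000; i++) {
        /* (1) Generate a test case -- here, just random bytes. */
        char buf[256];
        int len = 1 + rand() % 255;
        for (int j = 0; j < len; j++)
            buf[j] = (char)(rand() & 0xff);

        FILE *f = fopen("input.bin", "wb");
        if (!f) return 1;
        fwrite(buf, 1, (size_t)len, f);
        fclose(f);

        /* (2) Run the target on the test case and observe what happens. */
        pid_t pid = fork();
        if (pid == 0) {
            execl("./target", "./target", "input.bin", (char *)NULL);
            _exit(127);  /* exec failed */
        }
        int status = 0;
        waitpid(pid, &status, 0);

        /* (3) Keep the "interesting" test cases -- here, any run that was
         *     killed by a signal such as SIGSEGV or SIGABRT. */
        if (WIFSIGNALED(status)) {
            char name[64];
            snprintf(name, sizeof(name), "crash-%d.bin", i);
            rename("input.bin", name);
        }
    }
    return 0;
}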

Over the past several decades, fuzzing has seen a significant evolution: from its origins as a crude, brute-force testing strategy to becoming one of today's most sophisticated—and essential—techniques for uncovering critical security vulnerabilities at scale. While there are by now countless amazing fuzzing papers—with more published each year—this edition of Research for Practice looks at three particularly innovative and inspiring papers from the past two years.

 

Moving on from coverage-only fuzzing

Andrea Fioraldi, Daniele Cono D'Elia, and Davide Balzarotti. "The Use of Likely Invariants as Feedback for Fuzzers." Proceedings of the 30th Usenix Security Symposium, 2021.

https://www.usenix.org/conference/usenixsecurity21/presentation/fioraldi

Most real-world fuzzing today adopts a code-coverage-maximizing feedback strategy: saving just the inputs that cover previously unseen target code and mutating them, ideally to create new coverage-increasing inputs. You can think of this like horse racing, where the goal is to breed future generations of winners—coverage-increasing inputs—from prior winners. Yet, because coverage-guided fuzzers' only focus is expanding their frontier of code coverage, they will often miss the many bugs hidden in already-covered code.
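
As a rough sketch of that feedback decision, assume the target has been instrumented to fill a per-execution edge-coverage bitmap, the mechanism AFL-style fuzzers use; the map size and function name below are illustrative rather than any particular fuzzer's API. An input is kept and bred further only if its run lights up an edge no prior winner has hit:

#include <stddef.h>
#include <stdint.h>

#define MAP_SIZE 65536

/* Edges covered by any input kept so far across the whole campaign. */
static uint8_t global_bits[MAP_SIZE];

/* Returns 1 ("keep and mutate this input") if the run's bitmap contains an
 * edge never seen before, folding that new coverage into the global map. */
int has_new_coverage(const uint8_t run_bits[MAP_SIZE]) {
    int new_edges = 0;
    for (size_t i = 0; i < MAP_SIZE; i++) {
        if (run_bits[i] & ~global_bits[i]) {
            new_edges = 1;
            global_bits[i] |= run_bits[i];
        }
    }
    return new_edges;
}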

To overcome coverage-only fuzzing's exploration limitations, this 2021 Usenix paper augments fuzzing with property-based testing's semantic invariants: a set of "rules" about what states a program's variables are expected to take on. For example, the following assertion statement

assert(x < 5)

mandates that variable x be less than 5; should this invariant be violated, the program will signal that the variable has taken on an unusual state. Automatically mining invariants, however, often brings significant false positives and noise. Naively tracking every possible invariant will both overwhelm a fuzzer with too much feedback and degrade its throughput from all of the per-invariant instrumentation overhead.

To mitigate this, the authors replay previously generated fuzzing inputs to infer likely invariants—a great way to leverage ordinarily discarded fuzzing data—and introduce a set of heuristics to prune impossible-to-violate, redundant, and otherwise erroneous invariants. Each surviving invariant is then instrumented with a check that signals any violation to the fuzzer.
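
A minimal sketch of what one such instrumented check might look like, assuming a hypothetical runtime helper (fuzzer_report_violation) that records the event in the fuzzer's feedback map rather than aborting the program the way a plain assert would; the function name and invariant id are invented for illustration:

/* Hypothetical hook into the fuzzer's feedback channel. */
extern void fuzzer_report_violation(int invariant_id);

void process(int x) {
    /* Likely invariant mined from replayed inputs: x < 5 at this point. */
    if (!(x < 5))
        fuzzer_report_violation(17);  /* 17: illustrative invariant id */

    /* ... original program logic continues unchanged ... */
}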

By combining invariant and coverage feedback, the authors' approach helps fuzzers uncover many more program states than code coverage alone—revealing 34-56 percent more bugs at a negligible 8 percent overhead over coverage-only fuzzing.

With coverage-only fuzzing missing so many bugs, invariant-guided fuzzing is likely soon to become a first-class mode in popular fuzzing frameworks such as AFL++ and libFuzzer. Although the authors' implementation supports only source-available targets, advancements in binary lifting, translation, and rewriting leave me hopeful that invariant-guided fuzzing techniques will soon be extended to harder-to-fuzz closed-source software and systems.

 

Tailoring fuzzing to specific vulnerabilities

Manh-Dung Nguyen, Sébastien Bardin, Richard Bonichon, Roland Groz, and Matthieu Lemerre. "Binary-level Directed Fuzzing for Use-After-Free Vulnerabilities." Proceedings of the 23rd International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2020).

https://www.usenix.org/conference/raid2020/presentation/nguyen

A major challenge in fuzzer design is balancing bug-finding speed and breadth. Although coverage-guided fuzzing is quick to expose low-hanging-fruit bugs such as buffer overflows, its coverage-maximizing search is ineffective at triggering bugs that require complex program states (e.g., temporal memory-safety violations). Prior attempts to improve general-purpose bug discovery explored finer-grained levels of code coverage (e.g., n-gram control-flow edges) but often deteriorated fuzzing speed because of their higher instrumentation and coverage bookkeeping costs. Thus, improving software security requires not only strong general-purpose fuzzers, but also those specialized to certain classes of vulnerabilities.

This RAID 2020 paper introduces a directed fuzzing technique tailored to use-after-free vulnerabilities—a dangerous class of temporal memory-safety errors that occurs when a program attempts to access a previously freed region of memory. General-purpose fuzzers often struggle to synthesize the series of program events needed to trigger use-after-frees: (1) an allocation; (2) a free; and (3) an access. To overcome this, the authors devise a directed fuzzing strategy to hit these function calls in sequence—effectively incorporating temporal awareness to guide fuzzing's exploration toward these vulnerabilities.
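
For intuition, consider a toy target, invented here for illustration, in which the bug fires only when an input drives allocation, free, and use in that exact order on the same buffer:

#include <stdlib.h>

int main(int argc, char **argv) {
    char *buf = NULL;
    if (argc < 2) return 0;
    for (char *cmd = argv[1]; *cmd; cmd++) {
        switch (*cmd) {
        case 'a': buf = malloc(16); break;       /* (1) allocation                      */
        case 'f': free(buf); break;              /* (2) free: buf now dangles           */
        case 'u': if (buf) buf[0] = 'X'; break;  /* (3) use: UAF only if 'f' came first */
        }
    }
    return 0;
}

A coverage-guided fuzzer covers each case with inputs such as "a", "f", and "u" in isolation and then sees nothing new, yet only an input containing 'a', then 'f', then 'u' actually triggers the use-after-free; the sequence-directed strategy rewards progress toward that ordering rather than toward new code alone.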

Unsurprisingly, the authors' approach sees great success on use-after-free vulnerabilities: finding known use-after-frees two times faster than existing temporally agnostic directed fuzzers, while also revealing 11 new use-after-frees in well-fuzzed programs such as Perl and MuPDF.

Although the authors' prototype targets binary executables, sequence-directed fuzzing readily extends to source-available fuzzing targets, as well as to other temporal bugs such as double-frees. I anticipate there will be many more opportunities to improve fuzzing by combining general-purpose and bug-tailored methodologies.

 

Applying fuzzing to other problems

Giuseppe Antonio Di Luna, Davide Italiano, Luca Massarelli, Sebastian Österlund, Cristiano Giuffrida, and Leonardo Querzoni. "Who's Debugging the Debuggers? Exposing Debug Information Bugs in Optimized Binaries." Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2021).

https://dl.acm.org/doi/10.1145/3445814.3446695

Many post-deployment quality-assurance tasks rely on debug information (e.g., DWARF): a compiler-emitted mapping of an executable's code to its higher-level source representation (e.g., variables, types, and instructions). Discrepancies in debug information—missing or irrelevant artifacts—are often caused by subtle bugs in development toolchains and exacerbated by compilers' aggressive optimizations. Preventing developers from being led astray by mishandled debug information demands effective, automated approaches for uncovering these hidden bugs in modern development toolchains.

This ASPLOS 2021 paper introduces a debug-information fuzzing framework based on differential analysis: randomly generating C programs and comparing their optimized and unoptimized versions' debugging traces for discrepancies. For their differential analysis, the authors formalize a set of trace invariants, which stipulate that an optimized binary maintain consistency with the debugging information (e.g., variables, functions, and source lines) present in its unoptimized version's debugging trace at any point during execution.
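
The sketch below conveys the flavor of such a check under a simplifying assumption of my own: each debugging trace is reduced to a list of (source line, variable name) observations gathered while single-stepping, and any observation the optimized binary's trace makes that never appears in the unoptimized trace is flagged for triage. The struct and function names are illustrative, not the paper's formalization.

#include <stdio.h>
#include <string.h>

struct observation {
    int line;              /* source line the debugger reports         */
    const char *variable;  /* a variable reported visible at that line */
};

/* Returns the index of the first optimized-trace observation absent from the
 * unoptimized ("ground truth") trace, or -1 if the invariant holds. */
int first_discrepancy(const struct observation *opt, int n_opt,
                      const struct observation *unopt, int n_unopt) {
    for (int i = 0; i < n_opt; i++) {
        int found = 0;
        for (int j = 0; j < n_unopt && !found; j++)
            found = opt[i].line == unopt[j].line &&
                    strcmp(opt[i].variable, unopt[j].variable) == 0;
        if (!found)
            return i;  /* candidate compiler or debugger bug to triage */
    }
    return -1;
}

int main(void) {
    struct observation unopt[] = { {3, "x"}, {4, "x"}, {4, "y"} };
    struct observation opt[]   = { {4, "x"}, {4, "z"} };  /* "z" never reported at -O0 */
    int i = first_discrepancy(opt, 2, unopt, 3);
    if (i >= 0)
        printf("discrepancy: line %d, variable %s\n", opt[i].line, opt[i].variable);
    return 0;
}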

Following this, the authors triage discrepancy-inducing test cases to identify where bugs occur—either in the compiler's emitting of debug information or in the debugger's parsing and interpretation of it. The authors apply their automated fuzzing technique to a variety of widely used software development toolchains.

Overall, the authors' differential fuzzing approach demonstrates impressive effectiveness: uncovering 34 debug information errors across the LLVM, GNU, and Rust compiler/debugger toolchains—with more than a dozen reported issues already confirmed as fixed. As vetting the correctness of debug information is still a fairly new application of fuzzing, I expect that future fuzzers will improve upon the techniques in this paper toward even greater scrutiny of development toolchains. Moreover, I am optimistic that the power of differential fuzzing will yield useful strategies for other nontrivial classes of bugs, software, and systems.

Happy fuzzing!

 

Peter Alvaro is an associate professor of computer science at the University of California Santa Cruz, where he leads the Disorderly Labs research group (disorderlylabs.github.io). His research focuses on using data-centric languages and analysis techniques to build and reason about data-intensive distributed systems in order to make them scalable, predictable, and robust to the failures and nondeterminism endemic to large-scale distribution. He earned his Ph.D. at UC Berkeley, where he studied with Joseph M. Hellerstein. He is a recipient of the National Science Foundation Career Award, Facebook Research Award, Usenix ATC 2020 Best Presentation Award, SIGMOD 2021 Distinguished PC Award, and UCSC Excellence in Teaching Award.

Stefan Nagy (https://www.cs.utah.edu/~snagy/) is an assistant professor in the Kahlert School of Computing at the University of Utah. He received his Ph.D. in computer science from Virginia Tech in 2022, and his B.S. in computer science from the University of Illinois at Urbana-Champaign in 2016. His research in software and systems security broadly aims to make efficient and effective quality assurance possible for opaque and otherwise challenging domains in computing.

Copyright © 2022 held by owner/author. Publication rights licensed to ACM.

Originally published in Queue vol. 20, no. 6