Research for Practice


Automatically Testing Database Systems

DBMS testing with test oracles, transaction history, and fuzzing

Manuel Rigger with Introduction by Peter Alvaro

Research for Practice combines the resources of the ACM Digital Library, the largest collection of computer science research in the world, with the expertise of the ACM membership. Research for Practice columns have a common goal of sharing the joy and utility of reading computer science research between academics and their counterparts in industry.


Over the last several years, there has been an explosion of new research on methodologies for testing data management systems. Some of the most exciting work has been pioneered by Manuel Rigger (now at the National University of Singapore), and we are very lucky to have him as curator for this installment of Research for Practice.

After dividing the idea of automatic database testing into the problems of input generation and test oracles, Dr. Rigger presents three selections that highlight the breadth of the field. The first paper focuses on the problem of automatically synthesizing sophisticated test oracles that check whether the outputs of a database history are correct, leaving aside the problem of input generation. The last paper factors the problem in the opposite way, covering a diverse input space of SQL dialects while relying on the "built-in" test oracle of system crashes. In the middle is a paper that co-designs the input generation and test oracle in order to focus strictly on bugs in concurrency control.

Whether you are new to the area or an old hand, you are sure to learn something from this selection of papers and from Dr. Rigger's expert analysis.

—Peter Alvaro


Research for Practice:
Automatically Testing Database Systems

Manuel Rigger

A DBMS (database management system) is expected to produce correct results efficiently. For efficiency, query optimizers apply many semantics-preserving optimizations when translating a query written in a declarative language such as SQL into an execution plan. In addition, a DBMS supports the concurrent execution of multiple transactions while maintaining guarantees about which effects, if any, transactions see of one another, referred to as isolation levels. For correctness, DBMS developers have long relied on unit tests and general-purpose fuzzers (e.g., American Fuzzy Lop, or AFL); beyond these, the past few years have seen a plethora of practical works that have significantly advanced the state of the art in automatically testing DBMSs.

Automatic testing techniques validate a DBMS without user interaction. They need to tackle two main challenges. First, they must generate test inputs. Depending on the goal, a test input for a DBMS can be a series of SQL statements that generates a database, together with a query, to test the query optimizer; or a series of transactions, to test transaction handling.

Second, a so-called test oracle is required, which takes a test input and the DBMS's output, and checks whether the output is expected. Perhaps the simplest test oracle is to check whether a process crashes. To design test oracles that detect logic bugs or performance issues, domain-specific insights are typically required.
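As a concrete illustration, here is a minimal sketch of such a crash-check oracle in Python, assuming the sqlite3 command-line shell is installed; the helper name and setup are illustrative only:

    import subprocess

    def crashes(sql: str, timeout: int = 10) -> bool:
        # Feed the SQL test input to an in-memory SQLite instance.
        # On POSIX, a negative return code means the process was killed
        # by a signal -- a crash -- as opposed to an ordinary SQL error,
        # which exits cleanly with a positive nonzero code.
        proc = subprocess.run(["sqlite3", ":memory:"], input=sql.encode(),
                              capture_output=True, timeout=timeout)
        return proc.returncode < 0

    # An ordinary error is not a crash: this prints False.
    print(crashes("SELECT * FROM missing_table;"))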

Each of the three works discussed here is representative of a research strand in this area.

• First, various approaches find logic bugs in query optimizers using black-box test oracles (i.e., oracles that operate at the SQL level). These approaches typically create pairs of queries or series of statements that are expected to compute the same result; any discrepancy indicates a logic bug.

• Second, to test transaction handling, approaches based on lightweight formal methods techniques inspect the effects of a transaction history and infer which isolation level the DBMS supports.

• Finally, various fuzzing approaches find crash bugs, which can be exploitable, to harden a DBMS against potential attackers. One challenge is to find such bugs in a variety of DBMSs while minimizing the effort to adapt the fuzzer to new systems (e.g., account for different SQL dialects). In addition, such fuzzers should generate mostly valid, diverse, and interesting inputs to exercise the DBMS.


Using Test Oracles

Manuel Rigger and Zhendong Su.
Finding Bugs in Database Systems via Query Partitioning.
Proceedings of the ACM on Programming Languages 4 (OOPSLA), 2020.
https://dl.acm.org/doi/abs/10.1145/3428279.

TLP (ternary logic partitioning) tackles the test-oracle problem by generating pairs of queries that are expected to compute the same results; any discrepancy indicates a bug in the DBMS.

One challenge is to generate pairs of queries that vary in complexity to stress the DBMS under test. It would be straightforward to come up with specific rules for equivalent queries. For example, for any predicate p, you could generate two equivalent queries "SELECT * FROM orders WHERE p" and "SELECT * FROM orders WHERE NOT (NOT p)" and check that they compute the same result. However, such rules are too specific to find a wide range of logic bugs.

The TLP test oracle uses the following key insight: Given a Boolean predicate p, for every row in the database, p must evaluate to true, false, or NULL. In other words, the database or any intermediate result can be partitioned into three parts, depending on the value to which a predicate in a filter evaluates. Consequently, a query such as "SELECT * FROM orders" should compute the same result as the union of three queries "SELECT * FROM orders WHERE p", where p is instantiated with "num_items > 5", "NOT (num_items > 5)", and "(num_items > 5) IS NULL". This insight can be generalized and enables generating test cases using a range of different SQL features, including aggregate functions. Given that the three follow-up queries are more complex, they are likely to trigger more interesting functionality of the DBMS and, thus, potential bugs.
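The following minimal sketch, written against SQLite with Python's built-in sqlite3 module, illustrates the check; the table, predicate, and helper are illustrative simplifications, not SQLancer's actual implementation:

    import sqlite3

    def tlp_check(conn, table, predicate):
        # The original query must return the same multiset of rows as the
        # union of the three partitions in which the predicate evaluates
        # to TRUE, FALSE, and NULL; any discrepancy indicates a bug.
        original = conn.execute(f"SELECT * FROM {table}").fetchall()
        parts = conn.execute(
            f"SELECT * FROM {table} WHERE {predicate} "
            f"UNION ALL SELECT * FROM {table} WHERE NOT ({predicate}) "
            f"UNION ALL SELECT * FROM {table} WHERE ({predicate}) IS NULL"
        ).fetchall()
        # Compare as multisets; key=repr tolerates NULLs (Python None).
        return sorted(original, key=repr) == sorted(parts, key=repr)

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (num_items INT)")
    conn.executemany("INSERT INTO orders VALUES (?)", [(3,), (7,), (None,)])
    print(tlp_check(conn, "orders", "num_items > 5"))  # expect True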

TLP has found many bugs in mature DBMSs. It is a key test oracle in SQLancer, a widely used DBMS-testing tool developed by the paper's authors. Because of its simplicity, companies have also implemented the test oracle in their own testing frameworks.


Using Transaction History

Kyle Kingsbury and Peter Alvaro.
Elle: Inferring Isolation Anomalies from Experimental Observations.
Proceedings of the VLDB Endowment 14 (3), 2020.
https://dl.acm.org/doi/10.14778/3430915.3430918.

Elle checks a transaction history and, based on that, infers whether the DBMS implements the isolation level it claims to support. In other words, it attempts to find witnesses of so-called anomalies, which show that the DBMS supports a lower isolation level than claimed. It is a combined approach that tackles both the test-oracle and test-case problems.

A key challenge for isolation-level checkers is to reconstruct the so-called version history of an object stored in the database. Typically, reading an object from a database reveals little about its past. For example, assume that transactions concurrently write the values "1", "2", and "3" to an object. By inspecting the database, you might see that the object holds the value "2", but you fail to learn whether "1" was written before "3", whether "3" was written before "1", or whether one or both of those writes were lost.

The key insight of Elle is to encode writes so that they retain the object's history and are unique. Essentially, each write operation appends to a list, so that the object contains a list of values such as "3, 1, 2", which unambiguously captures the order of writes. The version order indicated by the lists can then be used to infer dependencies between transactions, which are used to detect potential anomalies. In the formalism that is used, checking for anomalies corresponds to checking for cycles in the dependency graph, which can be solved in linear time, making the approach efficient.
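A heavily simplified sketch of this idea in Python (Elle itself handles many more operation and dependency types; the history below is illustrative): it recovers each object's version order from list-append reads, derives write-write dependencies between transactions, and reports whether the dependency graph contains a cycle.

    def version_order(reads):
        # With list appends, every read is a prefix of the object's full
        # history, so the longest observed list gives the version order.
        longest = max(reads, key=len)
        assert all(r == longest[:len(r)] for r in reads), "lost/reordered write"
        return longest

    def ww_edges(order, writer):
        # The transaction that appended version i must precede the one
        # that appended version i+1 (a write-write dependency).
        return [(writer[a], writer[b]) for a, b in zip(order, order[1:])
                if writer[a] != writer[b]]

    def has_cycle(edges):
        # A cycle in the dependency graph witnesses an anomaly.
        graph = {}
        for u, v in edges:
            graph.setdefault(u, set()).add(v)
            graph.setdefault(v, set())
        done, path = set(), set()
        def dfs(n):
            path.add(n)
            cyclic = any(m in path or (m not in done and dfs(m))
                         for m in graph[n])
            path.discard(n)
            done.add(n)
            return cyclic
        return any(dfs(n) for n in graph if n not in done)

    # Reads of keys x and y imply T1 -> T2 on x but T2 -> T1 on y.
    writer = {1: "T1", 2: "T2", 20: "T2", 10: "T1"}
    edges = (ww_edges(version_order([[1], [1, 2]]), writer) +
             ww_edges(version_order([[20], [20, 10]]), writer))
    print(has_cycle(edges))  # True: a serializability violation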

Elle has found many critical bugs in mature systems and demonstrated that many DBMSs fail to support the isolation levels they claim to implement. It is part of Jepsen, a highly popular open-source project for distributed-systems testing.


The Fuzzing Approach

Jingzhou Fu, Jie Liang, Zhiyong Wu, Mingzhe Wang, and Yu Jiang.
Griffin: Grammar-Free DBMS Fuzzing.
Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, 2022.
https://dl.acm.org/doi/abs/10.1145/3551349.3560431.

Griffin tackles the test-case generation problem by aiming to generate valid and diverse test inputs for a variety of DBMSs without requiring knowledge of each DBMS's SQL dialect. It achieves this by using a mutation-based fuzzing approach, specifically by shuffling SQL statements while ensuring that they still refer to valid objects.

A key challenge for generating valid SQL statements is that most DBMSs have their own SQL dialects, which differ widely in syntax and semantics. Tools such as SQLancer tackle this by using DBMS-specific SQL generators based on each DBMS's grammar, which requires a high implementation effort.

The key idea of Griffin is to combine statements from existing test cases in a dialect-agnostic way. Existing test cases contain valid SQL sequences, so Griffin does not need to understand the specific SQL dialect to generate valid statements. Because many statements refer to database objects (e.g., an INSERT typically references an existing table) that might no longer exist after reshuffling, Griffin proposes a lightweight approach to determine and repair dependencies between statements; these are accounted for when generating new test inputs.
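The following toy sketch conveys the flavor of this idea in Python; Griffin's actual implementation builds a richer metadata graph from real test suites, and the tiny corpus and regex-based reference patching here are illustrative simplifications:

    import random
    import re

    CORPUS = [  # statements harvested from existing test cases
        "CREATE TABLE t0 (c0 INT);",
        "CREATE TABLE t1 (c0 INT);",
        "INSERT INTO t0 VALUES (1);",
        "UPDATE t1 SET c0 = 2;",
        "SELECT * FROM t0;",
    ]
    TABLE_REF = re.compile(r"\b(INTO|FROM|UPDATE|TABLE)\s+(\w+)", re.I)

    def mutate(corpus, n=5, rng=random):
        # Recombine statements from the corpus in a new order, then patch
        # each table reference to point at a table that was actually
        # created earlier in the new sequence; statements with no valid
        # target are dropped, so the output stays mostly valid.
        stmts = [rng.choice(corpus) for _ in range(n)]
        rng.shuffle(stmts)
        created, out = [], []
        for s in stmts:
            if s.upper().startswith("CREATE TABLE"):
                name = TABLE_REF.search(s).group(2)
                if name not in created:  # avoid duplicate CREATEs
                    created.append(name)
                    out.append(s)
            elif created:
                t = rng.choice(created)
                out.append(TABLE_REF.sub(lambda m: m.group(1) + " " + t, s))
        return out

    print("\n".join(mutate(CORPUS, rng=random.Random(0))))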

Griffin has found more than 50 bugs in a short testing campaign.


Conclusion

The automated testing of DBMSs is an exciting, interdisciplinary effort that has seen many innovations in recent years. The examples addressed here represent different perspectives on this topic, reflecting research strands from the software-engineering, (database-)systems, and security communities. They give only a glimpse into these strands, as many additional interesting and effective works have been proposed. Various approaches generate pairs of related tests to find both logic bugs and performance issues in a DBMS. Similarly, other isolation-level testing approaches have been proposed. Finally, various fuzzing approaches use different strategies to generate mostly valid and interesting test inputs and to extract various kinds of feedback from the DBMS.


Peter Alvaro is an associate professor of computer science at the University of California Santa Cruz, where he leads the Disorderly Labs research group (disorderlylabs.github.io). His research focuses on using data-centric languages and analysis techniques to build and reason about data-intensive distributed systems in order to make them scalable, predictable, and robust to the failures and nondeterminism endemic to large-scale distribution. He earned his Ph.D. at UC Berkeley, where he studied with Joseph M. Hellerstein. He is a recipient of the National Science Foundation Career Award, Facebook Research Award, Usenix ATC 2020 Best Presentation Award, SIGMOD 2021 Distinguished PC Award, and UCSC Excellence in Teaching Award.

Manuel Rigger is an assistant professor in the School of Computing at the National University of Singapore, where he leads the TEST (Trustworthy Engineering of Software Technologies) lab (https://nus-test.github.io/). His research focuses on making systems—in particular, data-centric ones—more reliable, efficient, and secure. Previously, he was a postdoctoral researcher at ETH Zurich, where he was mentored by Zhendong Su. He obtained his Ph.D. at the Johannes Kepler University Linz, mentored by Hanspeter Mössenböck.

Copyright © 2023 held by owner/author. Publication rights licensed to ACM.


Originally published in Queue vol. 21, no. 6