AWS (Amazon Web Services) strives to deliver reliable services that customers can trust completely. This demands maintaining the highest standards of security, durability, integrity, and availability—with systems correctness serving as the cornerstone for achieving these priorities. An April 2015 paper published in Communications of the ACM, titled "How Amazon Web Services Uses Formal Methods," described the approach used to ensure the correctness of critical services that have since become among the most widely used by AWS customers.21
Central to this approach was TLA+,14 a formal specification language developed by Leslie Lamport. Our experience at AWS with TLA+ revealed two significant advantages of applying formal methods in practice. First, we could identify and eliminate subtle bugs early in development—bugs that would have eluded traditional approaches like testing. Second, we gained the deep understanding and confidence needed to implement aggressive performance optimizations while maintaining systems correctness.
Fifteen years ago, AWS's software testing practice relied primarily on build-time unit testing, often against mocks, and on limited deployment-time integration testing. Since then, we have significantly evolved our correctness practices, integrating both formal and semi-formal approaches into the development process. As AWS has grown, formal methods have become increasingly valuable—not only for ensuring correctness but also for performance improvements, particularly in verifying the correctness of both low- and high-level optimizations. This systematic approach toward systems correctness has become a force multiplier at AWS's scale, enabling faster development cycles through improved developer velocity while delivering more cost-effective services to customers.
This article surveys the portfolio of formal methods used across AWS to deliver complex services with high confidence in their correctness. We use an umbrella definition of formal methods that encompasses a range of rigorous techniques, from traditional formal approaches such as theorem proving,7,10 deductive verification,18 and model checking8,14 to more lightweight semi-formal approaches such as property-based testing,6,19 fuzzing,9 and runtime monitoring.11
As the use of formal methods expanded beyond the initial teams at AWS in the early 2010s, we discovered that many engineers struggled to learn and become productive with TLA+. This difficulty seemed to stem from TLA+'s defining feature: It is a high-level, abstract language that more closely resembles mathematics than the imperative programming languages most developers are familiar with. While this mathematical nature is a significant strength of TLA+, and we continue to agree with Lamport's views on the benefits of mathematical thinking,15 we also sought a language that would allow us to model check (and later prove) key aspects of system designs while being more approachable to programmers.
We found this balance in the P programming language.8 P is a state-machine-based language for modeling and analysis of distributed systems. Using P, developers model their system designs as communicating state machines, a mental model familiar to Amazon's developer population, most of whom develop systems based on microservices and SOAs (service-oriented architectures). P has been developed at AWS since 2019 and is maintained as a strategic open-source project.22 Teams across AWS that build some of its flagship products—from storage (e.g., Amazon S3, EBS), to databases (e.g., Amazon DynamoDB, MemoryDB, Aurora), to compute (e.g., EC2, IoT)—have been using P to reason about the correctness of their system designs.
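To give a feel for that mental model (this is plain Python, not P, and the machines, events, and property below are hypothetical), the following sketch models a toy lock server and two clients as communicating state machines and checks a simple mutual-exclusion property after every delivered message. A model checker such as P's would explore many message-delivery orders; this sketch runs just one schedule.

```python
# A minimal sketch of the communicating-state-machines mental model.
# Plain Python, not P syntax; the protocol and property are illustrative only.
from collections import deque

class LockServer:
    def __init__(self):
        self.inbox, self.holder, self.waiters = deque(), None, deque()
    def handle(self, event, sender, net):
        if event == "acquire":
            if self.holder is None:
                self.holder = sender
                net.send(sender, "granted", "server")
            else:
                self.waiters.append(sender)
        elif event == "release":
            self.holder = self.waiters.popleft() if self.waiters else None
            if self.holder is not None:
                net.send(self.holder, "granted", "server")

class Client:
    def __init__(self, name):
        self.inbox, self.name, self.holding = deque(), name, False
    def handle(self, event, sender, net):
        if event == "granted":
            self.holding = True
            net.send(self.name, "work_done", self.name)  # local step, modeled as a message
        elif event == "work_done":
            self.holding = False
            net.send("server", "release", self.name)

class Network:
    def __init__(self, machines):
        self.machines = machines
    def send(self, target, event, sender):
        self.machines[target].inbox.append((event, sender))
    def run(self):
        progressed = True
        while progressed:
            progressed = False
            for machine in self.machines.values():   # one fixed schedule; a model
                if machine.inbox:                    # checker would explore many
                    progressed = True
                    event, sender = machine.inbox.popleft()
                    machine.handle(event, sender, self)
                    holders = [m for m in self.machines.values()
                               if isinstance(m, Client) and m.holding]
                    assert len(holders) <= 1, "mutual exclusion violated"

machines = {"server": LockServer(), "a": Client("a"), "b": Client("b")}
net = Network(machines)
net.send("server", "acquire", "a")
net.send("server", "acquire", "b")
net.run()
print("no violation on this schedule")
```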
For example, P was used in migrating S3 (Simple Storage Service) from eventual to strong read-after-write consistency.1 A key component of S3 is its index subsystem, an object metadata store that enables fast data lookups. To achieve strong consistency, the S3 team had to make several nontrivial changes to the S3 index protocol stack.25 Because these changes were difficult to get right at S3 scale, and the team wanted to deliver strong consistency with high confidence in correctness, they used P to formally model and validate the protocol design. P helped eliminate several design-level bugs early in the development process and allowed the team to deliver risky optimizations with confidence, as they could be validated using model checking.
In 2023, the P team at AWS built PObserve, a new tool for validating the correctness of distributed systems both during testing and in production. PObserve takes structured logs from the execution of a distributed system and validates post hoc that they match behaviors allowed by the formal P specification of that system. This bridges the gap between the P specification of the system design and the production implementation (typically in languages like Rust or Java). While verifying protocols at design time yields significant benefits, runtime monitoring of the same properties against the implementation makes the investment in formal specification much more valuable and addresses a classic concern with deploying formal methods in practice: connecting design-time validation with the system implementation.
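As a rough illustration of the idea (not PObserve itself, and with hypothetical event names and states), the following Python sketch replays structured log events against a small hand-written state-machine specification and flags the first event the specification does not allow:

```python
# A minimal sketch of post-hoc trace validation against a specification.
# The per-request lifecycle below is hypothetical, not an AWS protocol.
ALLOWED = {
    # (current state, observed event) -> next state
    ("idle", "request_received"): "pending",
    ("pending", "write_durable"): "committed",
    ("pending", "request_rejected"): "idle",
    ("committed", "ack_sent"): "idle",
}

def check_trace(events):
    """events: iterable of dicts like {"request": "r1", "event": "write_durable"}."""
    state = {}                                   # spec state, tracked per request
    for i, e in enumerate(events):
        req, name = e["request"], e["event"]
        current = state.get(req, "idle")
        nxt = ALLOWED.get((current, name))
        if nxt is None:
            return f"event {i}: '{name}' not allowed for {req} in state '{current}'"
        state[req] = nxt
    return None                                  # trace conforms to the specification

# An ack logged before the write became durable is caught:
trace = [{"request": "r1", "event": "request_received"},
         {"request": "r1", "event": "ack_sent"}]
print(check_trace(trace))
```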
Another way that AWS has brought formal methods closer to its engineering teams is through the adoption of lightweight formal methods.
The most notable example of leveraging lightweight formal methods is Amazon S3's ShardStore, where the team used property-based testing throughout the development cycle both to test correctness and to speed up development (described in detail by Bornholt et al.4). The key idea in their approach was combining property-based testing with developer-provided correctness specifications, coverage-guided fuzzing (an approach where the distribution of inputs is guided by code-coverage metrics), failure injection (where hardware and other system failures are simulated during testing), and minimization (where counterexamples are automatically reduced to aid human-guided debugging).
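ShardStore and its harness are written in Rust, but the core idea fits in a few lines of Python using the Hypothesis library19 (the store below is a hypothetical stand-in, not ShardStore): a developer-provided reference model acts as the correctness specification, Hypothesis generates operation sequences, and failing sequences are automatically minimized.

```python
# A minimal sketch of reference-model property-based testing with Hypothesis.
from hypothesis import given, strategies as st

class TinyStore:
    """Hypothetical key-value store under test."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)
    def delete(self, key):
        self._data.pop(key, None)

operations = st.lists(st.tuples(st.sampled_from(["put", "get", "delete"]),
                                st.integers(min_value=0, max_value=7),  # small key space
                                st.integers()))

@given(operations)
def test_matches_reference_model(ops):
    store, model = TinyStore(), {}               # the dict is the developer-provided spec
    for op, key, value in ops:
        if op == "put":
            store.put(key, value); model[key] = value
        elif op == "delete":
            store.delete(key); model.pop(key, None)
        else:
            assert store.get(key) == model.get(key)

test_matches_reference_model()   # Hypothesis generates, shrinks, and replays cases
```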
Another lightweight method widely used at AWS is deterministic simulation testing, in which a distributed system is executed on a single-threaded simulator with control over all sources of randomness, such as thread scheduling, timing, and message delivery order. Tests are then written for particular failure or success scenarios, such as the failure of a participant at a particular stage in a distributed protocol. The nondeterminism in the system is controlled by the test framework, allowing developers to specify orderings that they believe are interesting (such as ones that have caused bugs in the past). The scheduler in the testing framework can also be extended to fuzz orderings or to exhaustively explore all possible orderings.
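The sketch below shows the shape of the technique in plain Python, with a deliberately broken one-message replication "protocol" as a hypothetical system under test: all nondeterminism flows through a single seeded random source, so any interleaving that produces a divergence can be replayed exactly from its seed, and sweeping seeds fuzzes the orderings.

```python
# A minimal sketch of deterministic simulation: one seeded source of randomness
# drives scheduling and fault injection, so failures are exactly reproducible.
import random

def simulate(seed, drop_messages=False):
    rng = random.Random(seed)                    # the single source of nondeterminism
    primary, replica = {"x": 0}, {"x": 0}
    primary["x"] = 1
    pending = [("replicate", "x", 1)]            # messages "in flight"
    while pending:
        i = rng.randrange(len(pending))          # scheduler picks what to deliver next
        kind, key, value = pending.pop(i)
        if kind == "replicate":
            if drop_messages and rng.random() < 0.5:
                continue                         # injected failure: message lost, no retry
            replica[key] = value
            pending.append(("ack", key, value))  # acks are delivered and absorbed
    return primary, replica

for seed in range(1000):                         # sweep seeds to fuzz orderings and faults
    primary, replica = simulate(seed, drop_messages=True)
    if replica != primary:
        print(f"divergence found; reproducible with seed={seed}")
        break
```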
Deterministic simulation testing moves testing of system properties, like behavior under delay and failure, closer to build time instead of integration testing. This accelerates development and provides for more complete behavioral coverage during testing. Some of the work done at AWS on build-time testing of thread ordering and systems failures has been open-sourced as part of the shuttle (https://github.com/awslabs/shuttle) and turmoil (https://github.com/tokio-rs/turmoil) projects.
Continuous fuzzing, especially coverage-guided scalable test-input generation, is also effective for testing systems correctness at integration time. During the development of Amazon Aurora's data-sharding feature (Aurora Limitless Database3), for example, we made extensive use of fuzzing to test two key properties of the system. First, by fuzzing SQL queries (and entire transactions), we validated that the logic partitioning SQL execution over shards is correct. Large volumes of random SQL schemas, datasets, and queries were synthesized and run through the engines under test, and the results were compared with an oracle based on the nonsharded version of the engine (as well as with other approaches to validation, like those pioneered by SQLancer23).
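The sketch below shows the differential-oracle pattern at a deliberately tiny scale (hypothetical engines and a single aggregate in place of SQL, not the Aurora harness): the same randomly generated input runs through a "sharded" execution path and a single-node reference, and any mismatch points at a bug in the partitioning logic.

```python
# A minimal sketch of differential fuzzing against a nonsharded oracle.
# Both "engines" are hypothetical stand-ins.
import random

def reference_sum(rows, threshold):
    """Single-node oracle: filter, then sum."""
    return sum(value for _, value in rows if value >= threshold)

def sharded_sum(rows, threshold, shards=4):
    """System under test: route rows to shards, aggregate partial results."""
    partials = [0] * shards
    for key, value in rows:
        if value >= threshold:
            partials[hash(key) % shards] += value
    return sum(partials)

rng = random.Random(0)
for trial in range(10_000):
    rows = [(rng.randrange(100), rng.randrange(-50, 50))
            for _ in range(rng.randrange(20))]
    threshold = rng.randrange(-50, 50)
    assert sharded_sum(rows, threshold) == reference_sum(rows, threshold), \
        f"mismatch on trial {trial}: {rows!r}, threshold {threshold}"
```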
Second, fuzzing combined with fault-injection testing is useful for testing other aspects of database correctness, such as atomicity, consistency, and isolation. In database testing, transactions are automatically generated, their correct behavior is defined using a formally specified correctness oracle, and then all possible interleavings of transactions (and of statements within each transaction) are executed against the system under test. We also use post-hoc validation of properties like isolation (following approaches such as Elle13).
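A minimal version of that recipe fits in a short Python sketch: two toy read/write transactions, every interleaving that preserves per-transaction order, and a serializability oracle defined as the outcomes of the two serial executions. The toy store below has no concurrency control, so the harness flags a non-serializable outcome; real harnesses drive the actual engine with far richer workloads and oracles.

```python
# A minimal sketch of interleaving enumeration checked against a serial oracle.
# Transactions, store, and oracle are illustrative only.
from itertools import permutations

T1 = [("write", "x", 1), ("write", "y", 1)]
T2 = [("read", "x"), ("write", "y", 2)]

def execute(schedule):
    """Run a schedule on a toy store with no concurrency control."""
    db, observed_reads = {"x": 0, "y": 0}, []
    for txn, op in schedule:
        if op[0] == "write":
            db[op[1]] = op[2]
        else:
            observed_reads.append((txn, op[1], db[op[1]]))
    return (tuple(sorted(db.items())), tuple(observed_reads))

def interleavings(t1, t2):
    slots = ["T1"] * len(t1) + ["T2"] * len(t2)
    for order in set(permutations(slots)):       # which transaction takes each step
        ops = {"T1": iter(t1), "T2": iter(t2)}   # per-transaction order is preserved
        yield [(txn, next(ops[txn])) for txn in order]

# Oracle: the outcomes allowed under serializability are those of the serial orders.
serial_outcomes = {execute([("T1", op) for op in T1] + [("T2", op) for op in T2]),
                   execute([("T2", op) for op in T2] + [("T1", op) for op in T1])}

for schedule in interleavings(T1, T2):
    if execute(schedule) not in serial_outcomes:
        print("non-serializable outcome:", schedule)
        break
```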
In early 2021 AWS launched FIS (Fault Injection Service)2 with the goal of making testing based on fault injection accessible to a wide range of AWS customers. FIS allows customers to inject simulated faults, from API errors to I/O pauses and failed instances, into test or production deployments of their infrastructure on AWS. Injecting faults allows customers to validate that the resiliency mechanisms they have built into their architectures (such as failovers and health checks) actually work to improve availability and do not introduce correctness problems. Fault-injection testing based on FIS is widely used by AWS customers, and internally within Amazon. For example, Amazon.com ran 733 FIS-based fault-injection experiments in preparation for Prime Day 2024.
In 2014, Yuan et al. found that 92 percent of catastrophic failures in the distributed systems they studied were triggered by incorrect handling of nonfatal errors. Many distributed-systems practitioners who were told about this research were surprised the percentage wasn't higher. Happy-case catastrophic failures are rare simply because the happy path is executed often, is tested better (both implicitly and explicitly), and is significantly simpler than the error paths. Fault-injection testing and FIS make it much easier for practitioners to test the behavior of their systems under faults and failures, closing the gap between happy-case and error-case bug density.
While fault injection is not considered a formal method, it can be combined with formal specifications. Defining the expected behavior using a formal specification, and then comparing results during and after fault injection to the specified behavior, allows for catching a lot more bugs than simply checking for errors in metrics and logs (or having a human look and say, "Yup, that looks about right").
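The sketch below shows the difference with a hypothetical flaky counter in plain Python: rather than merely confirming that injected faults surface as errors, the test checks the final state against what the specification allows, namely that every acknowledged increment was applied and nothing was applied more than once.

```python
# A minimal sketch of fault injection checked against a specified oracle.
# The service and fault model are hypothetical.
import random

class FlakyCounter:
    """Counter whose increments may fail before or after the update is applied."""
    def __init__(self, rng):
        self.value, self.rng = 0, rng
    def increment(self):
        if self.rng.random() < 0.2:
            raise TimeoutError("injected fault before apply")
        self.value += 1
        if self.rng.random() < 0.2:
            raise TimeoutError("injected fault after apply")  # applied, but caller sees an error

rng = random.Random(42)
counter, acked, attempts = FlakyCounter(rng), 0, 1000
for _ in range(attempts):
    try:
        counter.increment()
        acked += 1                               # the caller saw a successful response
    except TimeoutError:
        pass                                     # the caller saw a fault

# Specified oracle: every acknowledged increment is applied, and none is applied
# twice, so acked <= value <= attempts. Checking only "no unexpected errors"
# would miss violations of either bound.
assert acked <= counter.value <= attempts, (acked, counter.value, attempts)
```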
Over the past decade, there has been an emerging interest in a particular class of systems failure: those where some triggering event (like an overload or a cache emptying) causes a distributed system to enter a state from which it doesn't recover without intervention (such as reducing load to below normal). This class of failures, dubbed metastable failures,5 is one of the most important contributors to unavailability in cloud systems. Figure 1 (adapted from Bronson et al.5) illustrates a common type of metastable behavior: Load increases on the system are initially met with increasing goodput, followed by saturation, followed by congestion and goodput dropping to zero (or near zero). From there, the system cannot return to a healthy state by slightly reducing load. Instead, it must follow the dotted line and may not recover until load is significantly reduced. This type of behavior is present in even simple systems. For example, it can be triggered in most systems with timeout-and-retry client logic.
Traditional formal approaches to modeling distributed systems typically focus on safety (nothing bad happens) and liveness (something good eventually happens), but metastable failures remind us that systems have a variety of behaviors that cannot be neatly categorized this way. We have increasingly turned to discrete-event simulation to understand the emergent behavior of systems, investing both in custom-built system simulations and in tooling that allows existing system models (built in languages like TLA+ and P) to be used to simulate system behavior. Extending exhaustive model checkers, like TLA+'s TLC, with probabilistic simulation also allows for the generation of statistical results, such as posterior latency distributions, making model checking useful for tasks like understanding the achievability of latency SLAs (service-level agreements).
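The sketch below uses a crude fluid approximation (not a full discrete-event simulator, and not AWS tooling) to reproduce the behavior of figure 1 for a hypothetical service: once queueing delay exceeds the client timeout, a single retry per request doubles the effective arrival rate, and goodput stays at zero even after offered load drops back below capacity.

```python
# A minimal sketch of simulating retry-driven metastability.
# Capacity, timeout, and load numbers are illustrative only.
def simulate(offered_load, capacity=100, timeout=5, retries=1):
    backlog, goodput = 0.0, []
    for offered in offered_load:
        wait = backlog / capacity                # rough queueing-delay estimate
        timing_out = wait > timeout
        arrivals = offered * (1 + retries) if timing_out else offered
        backlog = max(0.0, backlog + arrivals - capacity)
        goodput.append(0 if timing_out else min(offered, capacity))
    return goodput

load = [80] * 20 + [150] * 20 + [90] * 60        # ramp above capacity, then back below
result = simulate(load)
for tick in (10, 35, 70):
    print(f"tick {tick:>2}: offered {load[tick]:>3}, goodput {result[tick]:>3}")
# tick 70 shows the metastable state: offered load (90) is back below capacity
# (100), but retries keep effective load at 180 and goodput stays at zero.
```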
In some cases, the formal methods enumerated so far in this article are not sufficient. For critical security boundaries such as authorization and virtualization, for example, proofs of correctness can be both desirable and worth the significant investment needed to create them.
In 2023, AWS introduced the Cedar authorization policy language for writing policies that specify fine-grained permissions. Cedar was designed for automated reasoning and formal proof,12,24 and its implementation was built in the verification-aware programming language Dafny. Using Dafny, the team was able to prove that the implementation satisfies a variety of security properties. This type of proof goes beyond testing. It is a proof in the mathematical sense. The team also applied differential testing, using the Dafny code as a correctness oracle to verify the correctness of the production-ready Rust implementation. Publishing the Dafny code and test procedures as open source, along with the Cedar implementation, allows Cedar users to check the team's work on correctness.
Another example is the Firecracker VMM (virtual machine monitor). Firecracker uses a low-level protocol called virtio to expose emulated hardware devices (such as a network card or solid-state drive) to guest kernels running inside the VM. This emulated device is a critical security boundary because it is the most complex interaction between the untrusted guest and trusted host. The Firecracker team used a tool called Kani,20 which can reason formally about Rust code, to prove key properties of this security boundary. Again, proof here goes beyond testing and ensures that the critical properties of this boundary hold no matter what the guest attempts to do.
Proofs around the behaviors of programs are an important part of AWS's software correctness program, and so we support development on tools such as Kani, Dafny,18 and Lean,16 and the underlying tools—like SMT (satisfiability modulo theories) solvers—that power them.
The ability to use formal models and specifications in several ways—for model checking systems at design time, for serving as correctness oracles during runtime monitoring of production behavior, for simulating emergent system behavior, and for building proofs of critical properties—allows AWS to amortize the engineering effort of developing these specifications over a larger amount of business and customer value.
Finally, as discussed in the aforementioned 2015 paper, formal methods are a crucial part of safely improving the performance of cloud systems. Modeling a key commit protocol for the Aurora relational database engine in P and TLA+ allowed us to identify an opportunity to reduce the cost of distributed commits from two network round trips to 1.5 without sacrificing any safety properties. These kinds of stories are common among teams that adopt formal methods, driven by at least two different dynamics.
First, the act of deeply thinking about and formally writing down distributed protocols forces a structured way of thinking that leads to deeper insights about the structure of protocols and the problem to be solved.
Second, having the ability to formally check (and, in some cases, prove) that proposed design optimizations are correct allows naturally conservative distributed-systems engineers to be bolder in their protocol design choices without increasing risk, boosting developer velocity toward delivering reliable services.
These productivity and cost benefits are not limited to high-level design optimizations; they also extend to low-level code that is normally overlooked. In one example, an AWS team identified optimizations to the implementation of the RSA (Rivest-Shamir-Adleman) public-key encryption scheme on our ARM-based Graviton2 processor that could improve throughput by up to 94 percent.17
Using the HOL Light interactive theorem prover, the team was able to prove the correctness of these optimizations. Given the high percentage of cloud CPU cycles spent on cryptography, this type of optimization can significantly reduce infrastructure costs and aid sustainability while at the same time improving customer-visible performance.
Despite significant success in scaling formal and semi-formal testing methods across AWS over the past 15 years, several challenges persist, particularly in industrial adoption of formal methods. The primary barriers for formal methods tools include their steep learning curve and the specialized domain expertise required. Additionally, many of these tools remain academic in nature and lack user-friendly interfaces.
Even well-established semi-formal approaches face adoption challenges. For example, deterministic simulation, a key distributed-systems testing technique used successfully at AWS and in projects like FoundationDB, remains unfamiliar to many experienced distributed-systems developers joining AWS. Similar gaps exist in the adoption of other proven methodologies such as fault-injection testing, property-based testing, and fuzzing. The challenge is educating distributed-systems developers about these testing methods and tools, teaching the art of rigorous thinking.
The education gap begins at the academic level, where even basic formal reasoning approaches are rarely taught, making it difficult for graduates from top institutions to adopt these tools. Although formal methods and automated reasoning are crucial for industry applications, they continue to be viewed as niche fields. We anticipate that increased industry adoption of formal methods and automated reasoning will attract more talent to this domain.
Metastability and other emergent properties of large-scale systems represent another critical research area facing similar awareness challenges. Common practices that lead to metastable system behavior, such as "retry N times on timeout," continue to be widely recommended despite their known issues. Current tools and techniques for understanding emergent system behavior are still in their early stages, making system stability modeling both expensive and complex. Ongoing research in this area holds considerable promise.
Looking ahead, we believe large language models and AI assistants will significantly help address the adoption challenges of formal methods in practice. Just as AI-assisted unit testing has gained popularity, we expect these tools will soon help developers create formal models and specifications, making these advanced techniques more accessible to the broader developer community.
Building reliable and secure software requires a range of approaches to reason about systems correctness. Alongside industry-standard testing methods (such as unit and integration testing), AWS has adopted model checking, fuzzing, property-based testing, fault-injection testing, deterministic simulation, event-based simulation, and runtime validation of execution traces. Formal methods have been an important part of the development process—perhaps most importantly, formal specifications as test oracles that provide the correct answers for many of AWS's testing practices. Correctness testing and formal methods remain key areas of investment at AWS, accelerated by the excellent returns seen on investments in these areas already.
1. Amazon Web Services. 2020. Amazon S3 now delivers strong read-after-write consistency automatically for all applications; https://aws.amazon.com/about-aws/whats-new/2020/12/amazon-s3-now-delivers-strong-read-after-write-consistency-automatically-for-all-applications. (Announces automatic strong read-after-write consistency for all S3 operations across all AWS regions.)
2. Amazon Web Services. 2021. Announcing general availability of AWS Fault Injection Simulator, a fully managed service to run controlled experiments; https://aws.amazon.com/about-aws/whats-new/2021/03/aws-announces-service-aws-fault-injection-simulator/.
3. Amazon Web Services. 2023. Announcing Amazon Aurora Limitless Database; https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-aurora-limitless-database/.
4. Bornholt, J., Joshi, R., Astrauskas, V., Cully, B., Kragl, B., Markle, S., Sauri, K., Schleit, D., Slatton, G., Tasiran, S., Van Geffen, J., Warfield, A. 2021. Using lightweight formal methods to validate a key-value storage node in Amazon S3. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, 836–850; https://dl.acm.org/doi/10.1145/3477132.3483540.
5. Bronson, N., Aghayev, A., Charapko, A., Zhu, T. 2021. Metastable failures in distributed systems. In Proceedings of the Workshop on Hot Topics in Operating Systems, 221–227; https://dl.acm.org/doi/10.1145/3458336.3465286.
6. Claessen, K., Hughes, J. 2000. QuickCheck: A lightweight tool for random testing of Haskell programs. In Proceedings of the Fifth ACM SIGPLAN International Conference on Functional Programming, 268–279; https://dl.acm.org/doi/10.1145/351240.351266.
7. de Moura, L., Ullrich, S. 2021. The Lean 4 theorem prover and programming language. In Automated Deduction – CADE 28 (28th International Conference on Automated Deduction), volume 12699 of Lecture Notes in Computer Science, 625–635. Springer; https://dl.acm.org/doi/10.1007/978-3-030-79876-5_37.
8. Desai, A., Gupta, V., Jackson, E., Qadeer, S., Rajamani, S., Zufferey, D. 2013. P: safe asynchronous event-driven programming. ACM SIGPLAN Notices 48(6), 321–332; https://dl.acm.org/doi/10.1145/2499370.2462184.
9. Fioraldi, A., Maier, D., Eißfeldt, H., Heuse, M. 2020. AFL++: Combining incremental steps of fuzzing research. In 14th Usenix Workshop on Offensive Technologies; https://www.usenix.org/conference/woot20/presentation/fioraldi.
10. Harrison, J. 2009. HOL Light: an overview. In Proceedings of the 22nd International Conference on Theorem Proving in Higher Order Logics, Munich, Germany, ed., S. Berghofer, T. Nipkow, C. Urban, and M. Wenzel, 60–66, volume 5674 of Lecture Notes in Computer Science. Springer-Verlag; https://link.springer.com/chapter/10.1007/978-3-642-03359-9_4.
11. Havelund, K., Rosu, G. 2019. Runtime verification – 17 years later. In Runtime Verification, 18th International Conference, RV 2018, Limassol, Cyprus, ed., C. Colombo and M. Leucker, 3–17, volume 11237 of Lecture Notes in Computer Science. Springer; https://link.springer.com/book/10.1007/978-3-030-03769-7.
12. Hicks, M. 2023. How we built Cedar with automated reasoning and differential testing. Amazon Science; https://www.amazon.science/blog/how-we-built-cedar-with-automated-reasoning-and-differential-testing.
13. Kingsbury, K., Alvaro, P. 2020. Elle: inferring isolation anomalies from experimental observations. Proceedings of the VLDB Endowment 14(3), 268–280; https://dl.acm.org/doi/10.14778/3430915.3430918.
14. Lamport, L. 2002. Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers. Addison-Wesley Professional.
15. Lamport, L. 2015. Who builds a house without drawing blueprints? Communications of the ACM 58(4), 38–41; https://dl.acm.org/doi/10.1145/2736348.
16. Lean Prover Community. 2024. Lean 4. GitHub; https://github.com/leanprover/lean4.
17. Lee, J., Becker, H., Harrison, J. 2024. Formal verification makes RSA faster—and faster to deploy. Amazon Science; https://www.amazon.science/blog/formal-verification-makes-rsa-faster-and-faster-to-deploy.
18. Leino, K. R. M. 2010. Dafny: an automatic program verifier for functional correctness. In Proceedings of the 16th International Conference on Logic for Programming, Artificial Intelligence, and Reasoning, 348–370. Springer-Verlag; https://dl.acm.org/doi/10.5555/1939141.1939161.
19. MacIver, D. R., Hatfield-Dodds, Z., et al. 2019. Hypothesis: a new approach to property-based testing. Journal of Open Source Software 4(43), 1891; https://joss.theoj.org/papers/10.21105/joss.01891.pdf.
20. Monteiro, F., Roy, P. 2023. Using Kani to validate security boundaries in AWS Firecracker. Technical report, Amazon Web Services; https://model-checking.github.io/kani-verifier-blog/2023/08/31/using-kani-to-validate-security-boundaries-in-aws-firecracker.html.
21. Newcombe, C., Rath, T., Zhang, F., Munteanu, B., Brooker, M., Deardeuff, M. 2015. How Amazon Web Services uses formal methods. Communications of the ACM 58(4), 66–73; https://dl.acm.org/doi/10.1145/2699417.
22. P Team. 2024. P: a programming language for formal specification of distributed systems. https://github.com/p-org/P. (Used extensively in AWS and Microsoft for formal verification of distributed systems.)
23. Rigger, M., Su, Z. 2020. Testing database engines via pivoted query synthesis. In 14th Usenix Symposium on Operating Systems Design and Implementation, Article 38, 667–682; https://dl.acm.org/doi/10.5555/3488766.3488804.
24. Rungta, N. 2024. Trillions of formally verified authorizations a day! Splash keynote; https://2024.splashcon.org/details/splash-2024-keynotes/3/Trillions-of-Formally-Verified-Authorizations-a-day-.
25. Vogels, W. 2021. Diving deep on S3 consistency. All Things Distributed; https://www.allthingsdistributed.com/2021/04/s3-strong-consistency.html. (Blog post describing the implementation of strong consistency in Amazon S3.)
Marc Brooker is a Distinguished Engineer at Amazon Web Services, where he focuses on AI and databases. He's interested in distributed systems, serverless, systems correctness, and formal methods. Marc holds a PhD in Electrical Engineering from the University of Cape Town.
Ankush Desai is a Principal Applied Scientist at Amazon Web Services, where he focuses on building tools and techniques that help developers deliver distributed services with high assurance of correctness. He's interested in improving developer productivity, formal methods, systematic testing, fuzzing, and distributed systems. Ankush holds a PhD in Computer Science from the University of California, Berkeley.
Copyright © 2024 held by owner/author. Publication rights licensed to ACM.
Originally published in Queue vol. 22, no. 6.