

The Antifragile Organization

Embracing Failure to Improve Resilience and Maximize Availability

Ariel Tseitlin

Failure is inevitable. Disks fail. Software bugs lie dormant waiting for just the right conditions to bite. People make mistakes. Data centers are built on farms of unreliable commodity hardware. If you're running in a cloud environment, then many of these factors are outside of your control. To compound the problem, failure is not predictable and doesn't occur with uniform probability and frequency. The lack of a uniform frequency increases uncertainty and risk in the system. In the face of such inevitable and unpredictable failure, how can you build a reliable service that provides the high level of availability your users can depend on?

A naive approach could attempt to prove the correctness of a system through rigorous analysis. It could model all different types of failures and deduce the proper workings of the system through a simulation or another theoretical framework that emulates or analyzes the real operating environment. Unfortunately, the state of the art of static analysis and testing in the industry hasn't reached those capabilities.4

A different approach could attempt to create exhaustive test suites to simulate all failure modes in a separate test environment. The goal of each test suite would be to maintain the proper functioning of each component, as well as the entire system when individual components fail. Most software systems use this approach in one form or another, with a combination of unit and integration tests. More advanced usage includes measuring the coverage surface of tests to indicate completeness.

While this approach does improve the quality of the system and can prevent a large class of failures, it is insufficient to maintain resilience in a large-scale distributed system. A distributed system must address the challenges posed by data and information flow. The complexity of designing and executing tests that properly capture the behavior of the target system is greater than that of building the system itself. Layer on top of that the attribute of large scale, and it becomes unfeasible, with current means, to achieve this in practice while maintaining a high velocity of innovation and feature delivery.

Yet another approach, advocated in this article, is to induce failures in the system to empirically demonstrate resilience and validate intended behavior. Given that the system was designed with resilience to failures, inducing those failures—within original design parameters—validates that the system behaves as expected. Because this approach uses the actual live system, any resilience gaps that emerge are identified and caught quickly as the system evolves and changes. In the second approach just described, many complex issues aren't caught in the test environment and manifest themselves in unique and infrequent ways only in the live environment. This, in turn, increases the likelihood of latent bugs remaining undiscovered and accumulating, only to cause larger problems when the right failure mode occurs. With failure induction, the added need to model changes in the data, information flow, and deployment architecture in a test environment is minimized and presents less of an opportunity to miss problems.

Before going further, let's discuss what is meant by resilience and how to increase it. Resilience is an attribute of a system that enables it to deal with failure in a way that doesn't cause the entire system to fail. It could involve minimizing the blast radius when a failure occurs or changing the user experience to work around a failing component. For example, if a movie recommendation service fails, the user can be presented with a nonpersonalized list of popular titles. A complex system is constantly undergoing varying degrees of failure. Resilience is how it can recover, or be insulated, from failure, both current and future.7
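The recommendation example can be sketched in a few lines. This is only an illustration of the fallback idea, not Netflix's implementation; the function names and the POPULAR_TITLES list are hypothetical.

```python
from typing import Callable, List

# Hypothetical static list served when personalization is unavailable.
POPULAR_TITLES = ["Popular Title A", "Popular Title B", "Popular Title C"]


def recommendations_with_fallback(
    user_id: str,
    fetch_personalized: Callable[[str], List[str]],
) -> List[str]:
    """Return personalized titles, or the popular list if the call fails."""
    try:
        return fetch_personalized(user_id)
    except Exception:
        # Contain the failure: degrade the experience instead of failing the page.
        return list(POPULAR_TITLES)


def failing_service(user_id: str) -> List[str]:
    """Stand-in for a recommendation service that is down."""
    raise ConnectionError("recommendation service unavailable")


def healthy_service(user_id: str) -> List[str]:
    """Stand-in for a working recommendation service."""
    return ["Picked for " + user_id]
```

Calling recommendations_with_fallback("u1", failing_service) returns the popular list, so the failure of one component changes the user experience without taking the whole page down.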

There are two ways of increasing the resilience of a system:

The first item is well covered in other literature. The remainder of this article will focus on the second.

The Simian Army

Once you have accepted the idea of inducing failure regularly, there are a few choices on how to proceed. One option is GameDays,1 a set of scheduled exercises where failure is manually introduced or simulated to mirror real-world failure, with the goal of both identifying the results and practicing the response—a fire drill of sorts. Used by the likes of Amazon and Google, GameDays are a great way to induce failure on a regular basis, validate assumptions about system behavior, and improve organizational response.

But what if you want a solution that is more scalable and automated—one that doesn't run once per quarter but rather once per week or even per day? You don't want failure to be a fire drill. You want it to be a nonevent—something that happens all the time in the background so that when a real failure occurs, it will simply blend in without any impact.

One way of achieving this is to engineer failure to occur in the live environment. This is how the idea for "monkeys" (autonomous agents really, but monkeys inspire the imagination) came to Netflix to wreak havoc and induce failure. Later the monkeys were grouped together and labeled the Simian Army.5 A description of each resilience-related monkey follows.

Chaos Monkey

The failure of a virtual instance is the most common type of failure encountered in a typical public cloud environment. It can be caused by a power outage in the hosting rack, a disk failure, or a network partition that cuts off access. Regardless of the cause, the result is the same: the instance becomes unavailable. Inducing such failures helps ensure that services don't rely on any on-instance state, instance affinity, or persistent connections.

To address this need, Netflix created its first monkey: Chaos Monkey, which randomly terminates virtual instances in a production environment, that is, instances that are serving live customer traffic.3

Chaos Monkey starts by looking into a service registry to find all the services that are running. In Netflix's case, this is done through a combination of Asgard6 and Edda.2 Each service can override the default Chaos Monkey configuration to change termination probability or opt out entirely. Each hour, Chaos Monkey wakes up, rolls the dice, and terminates the affected instances using AWS (Amazon Web Services) APIs.
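The hourly loop described above might look roughly like the following sketch. It is not the Chaos Monkey source: the registry is a plain dictionary rather than Asgard or Edda, the terminate callback stands in for the AWS API call, and the default probability and the one-victim-per-service rule are assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ChaosConfig:
    enabled: bool = True       # a service can opt out entirely
    probability: float = 0.1   # hypothetical default chance per run


def pick_victims(
    registry: Dict[str, List[str]],
    overrides: Dict[str, ChaosConfig],
    rng: random.Random,
) -> List[str]:
    """Choose at most one instance per service to terminate on this run."""
    victims = []
    default = ChaosConfig()
    for service, instances in sorted(registry.items()):
        config = overrides.get(service, default)
        if not config.enabled or not instances:
            continue
        # "Roll the dice" for this service.
        if rng.random() < config.probability:
            victims.append(rng.choice(instances))
    return victims


def run_once(
    registry: Dict[str, List[str]],
    overrides: Dict[str, ChaosConfig],
    terminate: Callable[[str], None],
    rng: Optional[random.Random] = None,
) -> List[str]:
    """One wake-up: pick victims and hand each one to the terminate callback."""
    rng = rng or random.Random()
    victims = pick_victims(registry, overrides, rng)
    for instance_id in victims:
        terminate(instance_id)
    return victims
```

A scheduler would call run_once every hour. Keeping the termination call behind a callback makes it possible to dry-run the selection logic against a real registry before anything is actually terminated.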

Chaos Monkey can optionally send an e-mail message to the service owner when a termination is made, but most service owners don't enable this option because instance terminations are common enough that they don't cause any service degradation.

Chaos Gorilla

With Chaos Monkey, a system is resilient to individual instance failure, but what if an entire data center were to become unavailable? What would be the impact to users if an entire Amazon AZ (availability zone) went offline? To answer that question and to make sure such an event would have minimal customer impact, Netflix created Chaos Gorilla.

Chaos Gorilla causes an entire AZ to fail. It simulates two failure modes:

Chaos Gorilla causes massive damage and requires a sophisticated control system to rebalance load. For Netflix, that system is still being developed, and as a result, Chaos Gorilla is run manually, similar to the GameDay exercises mentioned previously. With each successive run, Chaos Gorilla becomes more aggressive in the way it executes the failures—the goal being to run it in an automated unattended way, like Chaos Monkey.

Chaos Kong

A region is made up of multiple data centers (availability zones) that are meant to be isolated from one another. A robust deployment architecture has AZ redundancy by using multiple AZs. In practice, regionwide failures do occur, which makes single-region deployments insufficient to provide resilience to regionwide failures. Once a system is deployed redundantly to multiple regions, region failure must be tested analogously to instances and availability zones. Chaos Kong will serve that purpose. Netflix is working toward the goal of taking an entire region offline with Chaos Kong.

Latency Monkey

Once Chaos Monkey is running and individual instance failure no longer has any impact, a new class of failures emerges. Dealing with instance failure is relatively easy: just terminate the bad instances and let new healthy instances take their places. Detecting when instances become unhealthy, but are still working, is harder, and having resilience to this failure mode is harder still.

Error rates could become elevated, but the service could occasionally return success. The service could reply with successful responses, but latency could increase, causing timeouts.
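One way to picture the detection problem is a classifier over recent request samples that looks at both error rate and tail latency. The sketch below is an assumption-laden illustration: the thresholds, the sample format, and the function name are hypothetical and not taken from Netflix's tooling.

```python
import math
from typing import List, Tuple


def classify_instance(
    samples: List[Tuple[bool, float]],
    max_error_rate: float = 0.05,         # hypothetical threshold
    max_p95_latency_ms: float = 500.0,    # hypothetical threshold
) -> str:
    """Classify an instance from (succeeded, latency_ms) request samples."""
    if not samples:
        return "unknown"
    errors = sum(1 for ok, _ in samples if not ok)
    error_rate = errors / len(samples)
    latencies = sorted(latency for _, latency in samples)
    # Nearest-rank 95th percentile of observed latency.
    rank = max(1, math.ceil(0.95 * len(latencies)))
    p95 = latencies[rank - 1]
    if error_rate == 1.0:
        return "down"
    if error_rate > max_error_rate or p95 > max_p95_latency_ms:
        # Still answering some requests, but not well enough to be trusted.
        return "degraded"
    return "healthy"
```

The "degraded" case is the hard one: the instance passes a simple liveness check, yet an elevated error rate or slow tail is already hurting callers.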

What Netflix needed was a way of inducing failure that simulated partially healthy instances. This was the genesis of Latency Monkey, which induces artificial delays in the RESTful client-server communication layer to simulate service degradation and measures if upstream services respond appropriately. In addition, very large delays can be used to simulate node downtime, or even an entire service downtime, without physically bringing instances or services down. This can be particularly useful when testing the fault tolerance of a new service by simulating the failure of its dependencies, without making these dependencies unavailable to the rest of the system.
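The idea can be sketched with a wrapper that delays a dependency call and a caller that enforces a timeout with a fallback. This is a simplified stand-in, not Latency Monkey itself: the real tool works in the client-server communication layer, while the function names and the thread-based timeout here are assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, TypeVar

T = TypeVar("T")


def with_injected_latency(call: Callable[[], T], delay_seconds: float) -> Callable[[], T]:
    """Wrap a dependency call so that it sleeps before doing the real work."""
    def delayed() -> T:
        time.sleep(delay_seconds)
        return call()
    return delayed


def call_with_timeout(call: Callable[[], T], timeout_seconds: float, fallback: T) -> T:
    """Return the call's result, or the fallback if it exceeds the timeout."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(call)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block on the slow call; the worker finishes in the background.
        executor.shutdown(wait=False)


def dependency() -> str:
    """Stand-in for a downstream service call."""
    return "fresh"
```

Wrapping dependency with a delay longer than the caller's timeout simulates a dependency that is effectively down, while the dependency itself stays available to everything else. A very large delay plays the role of node or service downtime, as described above.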

The Remaining Army

The rest of the Simian Army, including Janitor Monkey, takes care of upkeep and other miscellaneous tasks not directly related to availability. A full reference is available at com/2011/07/netflix-simian-army.html.

Monkey Training At Netflix

While the Simian Army is a novel concept and may require a shift in perspective, it isn't as hard to implement as it initially appears. Understanding what Netflix went through is illustrative for others interested in following such a path.

Netflix is known for being bold in its rapid pursuit of innovation and high availability, but not to the point of callousness. It is careful to avoid any noticeable impact to customers from these failure-induction exercises. To minimize risk, Netflix takes the following steps when introducing a monkey:

The Importance of Observability

No discussion of resilience would be complete without highlighting the important role of monitoring. Monitoring here means the ability to observe and, optionally, signal an alarm on the external and internal states of the system and its components. In the context of failure induction and resilience, monitoring is important for two reasons:

One of the most important first questions to ask during a customer-impacting event is, "What changed?" Therefore, another key aspect of monitoring and observability is the ability to record changes to the state of the system. Whether a new code deployment, a change in runtime configuration, or a state change by an externally used service, the change must be recorded for easy retrieval later. Netflix built a system, internally called Chronos, for this purpose. Any event that changes the state of the system is recorded in Chronos and can be quickly queried to aid in causality attribution.
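A change log that answers "What changed?" can be very small. The sketch below is not Chronos, whose design is not described here; the class names, fields, and query method are hypothetical and only show the record-then-query pattern.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class ChangeEvent:
    timestamp: datetime
    source: str         # for example "deploy", "config", or "dependency"
    description: str


class ChangeLog:
    """In-memory record of state changes, queryable by time window."""

    def __init__(self) -> None:
        self._events: List[ChangeEvent] = []

    def record(self, event: ChangeEvent) -> None:
        self._events.append(event)

    def changes_before(self, incident_time: datetime, window: timedelta) -> List[ChangeEvent]:
        """Return events in [incident_time - window, incident_time], newest first."""
        start = incident_time - window
        hits = [e for e in self._events if start <= e.timestamp <= incident_time]
        return sorted(hits, key=lambda e: e.timestamp, reverse=True)
```

During an incident, querying the hour before the first alarm gives a short list of candidate causes, newest first, which is usually the right order in which to investigate them.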

The Antifragile Organization

Resilience to failure is a lofty goal. It enables a system to survive and withstand failure. There's an even higher peak to strive for, however: making the system stronger and better with each failure. In Nassim Taleb's parlance, it can become antifragile—growing stronger from each successive stressor, disturbance, and failure.8

Netflix has taken the following steps to create a more antifragile system and organization:


The more frequently failure occurs, the more prepared the system and organization become to deal with it in a transparent and predictable manner. Inducing failure is the best way of ensuring both system and organizational resilience. The goal is to maximize availability, insulating users of a service from failure and delivering a consistent and available user experience. Resilience can be improved by increasing the frequency and variety of failure and evolving the system to deal better with each newfound failure, thereby increasing antifragility. Focusing on learning and fostering a blameless culture are essential organizational elements in creating proper feedback in the system.


1. Robbins, J., Krishnan, K., Allspaw, J., Limoncelli, T. 2012. Resilience engineering: learning to embrace failure. Communications of the ACM 55(11): 40-47.

2. Bennett, C. 2012. Edda - Learn the stories of your cloud deployments. The Netflix Tech Blog.

3. Bennett, C., Tseitlin, A. 2012. Chaos Monkey released into the wild. The Netflix Tech Blog.

4. Chandra, T. D., Griesemer, R., Redstone, J. 2007. Paxos made live: an engineering perspective. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing: 398-407.

5. Izrailevsky, Y., Tseitlin, A. 2011. The Netflix simian army. The Netflix Tech Blog.

6. Sondow, J. 2012. Asgard: Web-based cloud management and deployment. The Netflix Tech Blog.

7. Strigini, L. 2009. Fault tolerance and resilience: meanings, measures and assessment. London, U.K.: Centre for Software Reliability, City University London.

8. Taleb, N. 2012. Antifragile: Things That Gain from Disorder. Random House.


Ariel Tseitlin is director of cloud solutions at Netflix, where he manages the Netflix Cloud and is responsible for cloud tooling, monitoring, performance and scalability, and cloud operations and reliability engineering. He is also interested in resilience and highly available distributed systems. Prior to joining Netflix, he was most recently VP of technology and products at Sungevity and before that was the founder and CEO of CTOWorks.

© 2013 ACM 1542-7730/13/0600 $10.00


Originally published in Queue vol. 11, no. 6


rich | Fri, 28 Jun 2013 16:41:32 UTC

This is a great summary of how to induce the development of a resilient system. What it doesn't do, though, is speak to what architectural principles are necessary to capitalize on such an approach. For example, in the design phase, one principle that seems important is to make sure that the system can always make "progress." Even if progress is to stop fielding requests of a certain type while remedial action takes place, it should be designed to continue in a controlled manner no matter what conditions it encounters. Often performance or expedience concerns nullify this principle, particularly when delivery deadlines are tight.

Also, the Armies are a great example of how to automate "negative QA" -- the ability to test that the system produces a predictable response to unforeseen runtime circumstances. One new simian that would be wonderful to meet would be the "Configuration Orangutan" -- an automated chaos monkey that misconfigures the system randomly and then runs it against nonchaotic workloads. Often the human element is the most unpredictable and pernicious.

Great article. Thanks for codifying the Netflix approach so neatly.

Frank Sowa | Mon, 22 Jul 2013 14:55:57 UTC

This was a well-written summary. As a consultant for 32 years, I'd just like to add a few pragmatic steps that would better ensure use of what you've written. First, create a "production-ready" alternate critical element system redundant to the running system -- one that can be fired up offline in the testing facility to deal with Simian Army elements that may enter (or even be maliciously placed by hackers) in the production system. And, if financially possible, also have a clean back-up architecture (in-house or cloud-based) to switch over to if the faults are corrupting your system in a sector of your network. This two-step redundancy in design is often seen as an unnecessary cost burden. But, as often as systems do have corruption issues, it is a low-cost benefit when trouble sets in.

Just as in a disease-control lab in healthcare, where viruses and infections are resolved off-line and the immunization is applied post-fix, the "production-ready" elements provide that capability (and in a critical instance can be quickly swapped into the running production system to avoid catastrophe). The latter IT governance approach, having the second "back-up" system to switch to, allows the system to keep running while components are removed from production to identify and take corrective action. Obviously, these are things IT professionals inherently understand and have been trained in. But, before one balks at what I've written, remember that the strategic drivers and decisions lie in the C-level offices and are usually made by financial people who may or may not have a background that intrinsically understands the depth of these issues (until after the crises occur). So, my point -- you may need to explain the importance of the "prepared approach" to proactively setting up the means to resolve these issues.

Bobby Lin | Wed, 21 Aug 2013 00:58:53 UTC

There is a need for help from software testers to test the system in an extreme manner to induce failures, not just based on the methods that are taught in school.
