Storage systems continue to lay the foundation for modern Internet services such as Web search,
Despite their simple interface and data model, distributed
As it turns out, choosing which subset of replicas to contact for a storage operation profoundly impacts the behavior of distributed storage systems. For strong consistency a client may read and write a majority of replicas. Since majorities overlap, this ensures that each read "sees" the latest write. In contrast, eventually consistent systems may read and write
Although many applications benefit from strong consistency,
This article looks at methods of quantifying consistency (or lack thereof) in eventually consistent storage systems. These methods are necessary for meaningful comparisons among different system configurations and workloads. First, the article defines eventual consistency more precisely and relates it to other notions of weak consistency. It then drills down into metrics, focusing on staleness, and surveys different techniques for predicting and measuring staleness. Finally, the relative merits of these techniques are evaluated, and any remaining open questions are identified.
Eventual consistency can be defined as either a property of the underlying storage system, or a behavior observed by a client application. For example, Doug Terry et al. give the following definition of eventual consistency in the context of the Bayou storage system: "All replicas eventually receive all writes (assuming sufficient network connectivity and reasonable reconciliation schedules), and any two replicas that have received the same set of writes have identical databases."26 This informal definition addresses both the propagation of writes to replicas (i.e., the eventual) and convergence onto a
This definition, while simple and elegant, does not say precisely what clients observe. Instead, it captures a range of behaviors encompassing both strongly consistent relational databases, and weakly consistent systems that may return stale or
Werner Vogels describes the consistency observed by clients in a concrete way: "The storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value."28 Vogels's definition captures in an intuitive way the convergence of replicas to the last updated value, but only in the special case where updates are suspended while clients read an
Since weak consistency is difficult to reason about, application developers often seek stronger properties such as a consistent prefix, monotonic reads, "read my writes," or causal consistency.25 Indeed, some systems support such
Abstract definitions of eventual consistency leave open a number of questions regarding system behavior in the presence of concurrent accesses, as well as in a
An emerging approach for describing these behaviors is to relate them to a strict form of consistency called linearizability.18 Informally speaking, this property states that the storage system behaves as though it executes operations one at a time, in some serial order, despite actually executing some operations in parallel. As will be explained later, this serial order is further constrained in a manner that forbids stale reads. As a result, linearizability is a suitable gold standard against which eventually consistent systems can be judged. In other words, the observed behavior can be described in terms of deviations from this standard, which in this context can be regarded as consistency violations. For example, one can think of Amazon's Dynamo as violating linearizability when reads return stale values, even though the system promises only eventual consistency.
Whereas linearizability is ubiquitous in centralized
The side effect of weakening consistency is increased staleness, which so far has been discussed informally. More precisely, a value is considered fresh from the moment it is written into a storage system until it is overwritten, and stale thereafter. Thus, staleness describes the age of a value returned by a read relative to the last updated value, and hence quantifies how badly a system's behavior deviates from the gold standard. Two interpretations of this concept have been discussed in the literature,2,15 arising from different ways to define "age":
• Version-based staleness defines age by counting versions of an object (e.g., a read returns the kth- latest version).
• Time-based staleness defines age in terms of
The concept of staleness, although intuitive, is fraught with technical subtleties. In the simple case where operations are executed one at a time, there is a natural order in which clients expect these operations to take effect, and so stale reads can be identified easily. When operations are executed in parallel, however, their order can be very difficult to determine from the client's perspective.
To make sense of eventual consistency, we turn to relaxed consistency
Consider the trace of operations in figure 1a, showing three operations applied to an object denoted by X. First, a write operation W(X,1) assigns 1 to object X, and then this value is updated to 2 by a second write operation, W(X,2). The third operation R(X) is a read of X and begins after both writes have ended. In this case, linearizability dictates that W(X,1) should appear to take effect before W(X,2), because W(X,1) ends before W(X,2) begins. Thus, 2 is the last updated value of X when R(X) is applied, but R(X) returns 1 instead, so the trace is not linearizable. The more complex case when operations overlap in time is addressed by the formal definition of linearizability,18 a discussion that is beyond the scope of this article.
Another method used to characterize eventual consistency is based on a combination of system modeling and prediction. Peter Bailis et al.6 present the PBS (Probabilistically Bounded Staleness) framework, in which a
The PBS model makes two simplifying assumptions. First, like Vogels's definition of eventual consistency,28 PBS does not consider workloads where writes overlap in time with reads, in which the width t of the gap between writes and reads is not well defined. Second, PBS does not model failures explicitly. Although storage node failures can be simulated using longer latencies, PBS does not account for network partitions.
Thus far, this article has addressed techniques for quantifying staleness in eventually consistent systems. This section discusses approaches that measure consistency empirically from the perspective of both the system and the client.
Measuring eventual consistency "in the wild" is as hard as defining it precisely. Concurrent operations make it difficult to identify the order in which operations take effect, and hence to classify reads as stale or not. To make matters worse, in the event of a network partition, clients on opposite sides of the divide may observe the last updated value differently, even if no new updates are made to an object after some point in time. Such an anomaly is possible because in an always- available system each partition will continue to accept writes.
Despite these challenges, a number of techniques have been devised for measuring eventual consistency, particularly data staleness. This article looks at two fundamentally different methodologies: active measurement, in which the storage system is exercised in an artificial way to determine the time lag from when a new value is written to the storage system until this value becomes visible to clients; and passive analysis, in which a trace of operations is recorded for an arbitrary workload and analyzed mathematically to obtain a measurement of staleness.
Active measurement underlies early studies of consistency in cloud storage systems. In this category of techniques, one client writes a new value to a key, and a different client then reads the same key repeatedly until the new value is returned. As in Vogels's definition of eventual consistency,28 writes do not overlap in time with reads in this scenario. The time from the write to the last read that returns the old
At a technical level, the main challenge in active measurement is to determine the difference in time between operations executed at different client
Staleness measurement in YCSB++ (Yahoo!
Active measurement can be used to discover the range of update propagation times in an eventually consistent system. For example, in experiments involving Amazon's SimpleDB,4 Wada et al. report that convergence occurred in at most one second in more than 90 percent of the runs, but took more than four seconds in a few (less than 1 percent) cases. On the other hand, active measurement does not indicate what proportion of reads in a real workload will return stale values, as this quantity depends on how the data objects are accessed. For example, the proportion of stale reads would be expected to vary with the rate at which operations are applied on a given object— close to zero when operations are applied infrequently and the gaps between them exceed the convergence time, and larger when reads follow writes more closely. Active measurement does not separate these two cases, as it is based upon a controlled workload designed to measure convergence time.
In the earlier section on relaxed consistency properties, the discussion focused on ways of defining staleness in a precise and meaningful way and provided a hint of how a consistency metric can be derived from such definitions. It begins with a
The most immediate technical challenge in passive analysis is the collection of the trace, which records for each operation its type, start and finish times, as well as its arguments and response (e.g., a read of object X, starting at time t1 and ending at t2, returning the value 5). This trace can be obtained at clients, as shown in figure 2.
Because the trace is obtained by merging data from multiple clients or storage nodes over a finite period of time, it is prone to two types of anomalies: dangling reads, which return values that lack a corresponding write (i.e., a write that assigns the value returned by the read); and reversed operations, whereby a read appears in the trace before its corresponding write. These anomalies must be removed if they occur; otherwise, the
Dangling reads indicate missing information, such as when the first access to an object in a trace is a read and the corresponding write occurs before the start of the trace. In this case the dangling read is identified easily and can be dropped from the trace with no ill effects. Reversed operations are equally easy to detect, but harder to remedy. Clock synchronization techniques such as atomic clocks and GPS10 can eliminate reversed operations altogether. Even ordinary NTP (Network Time Protocol) is sufficient if the synchronization of clocks is tight enough. Alternatively, one can also estimate clock skew directly from reversed operations and adjust the trace accordingly.
Given a trace free from dangling reads and reversed operations, the next challenge is to compute staleness metrics. Efficient algorithms are critical in this context because the trace may be very long, which means that both the space and computation required are high. In fact, the problem of testing whether a trace is linearizable was shown by Gibbons and Korach to be intractable.13 Testing whether a trace is
Fortunately, the intractability result breaks in the special case when every write on a given object assigns a unique value. This condition is straightforward to enforce (e.g., by embedding a unique token in each value written, such as a node ID and timestamp), which opens the door to efficient algorithms for computing staleness metrics from traces in practice. For
Passive analysis in general is not limited to staleness metrics, as illustrated by the technique of cycle analysis,5,30 which detects linearizability violations by analyzing a conflict graph representing the execution trace. The absence of cycles in such a graph indicates that the trace is linearizable, and so the number and length of such cycles can be defined as metrics for inconsistency.5,30 The relationship of these metrics to staleness is not known precisely.
Prediction, active measurement, and passive analysis are all useful ways to study eventual consistency. Although each method has specific strengths and weaknesses, it is difficult to compare them directly, as they meet different goals. Passive analysis is
The techniques discussed in this article also vary in terms of conceptual models of eventual consistency. Active measurement and PBS are based upon a simplified model, similar to Vogels's definition,28 in which writes occur before reads. In contrast, passive analysis considers a general model in which writes may overlap in time with reads and with other writes. As explained in another article by Golab et al.,16 storage systems may behave differently in these two models when clients follow the "R + W > N" rule,28 which ensures that read and write operations access overlapping subsets of replicas. The latter condition has long been considered sufficient for "strong consistency,"6,28 and in fact guarantees linearizability in the simplified model but permits linearizability violations in the general model.
Despite the somewhat high cost of analyzing a detailed trace of operations, passive analysis plays an important role in understanding eventual consistency. Whereas any perspective on consistency is ultimately affected by the state of data in replicas in a storage system, which can be thought of as the "ground truth," passive analysis is most closely tied to the actual consistency observed by client applications. Specifically, it reflects
For modern online service providers, small decreases in latency are known to have a measurable effect on user experience and, as a consequence, revenue.17 As weak consistency continues to gain traction as the means for reducing latency, the ability to reason clearly about this
Terry et al. describe a system that allows client applications to select different grades of weak consistency, such as "at most
Whereas the understanding of eventual consistency is expected to improve with additional empirical studies involving measurement, fundamental technical problems also remain to be solved. In particular, the problem of consistency verification remains an open and important challenge. Storage system designers need verification techniques to test whether their implementations correctly fulfill
One of the open problems related to verification algorithms is to determine the computational complexity of deciding
Eventual consistency is increasingly viewed as a spectrum of behaviors that can be quantified along various dimensions, rather than a binary property that a storage system either satisfies or fails to satisfy. Advances in characterizing and verifying these behaviors will enable service providers to offer an increasingly rich set of service levels of differentiated performance, ultimately improving the end user's experience.
We are grateful to Jay J. Wylie, Indranil Gupta, and Doug Terry for their helpful feedback.
1. Abadi, D. 2012. Consistency tradeoffs in modern distributed database system design: CAP is only part of the story. IEEE Computer 45(2):
2. Aiyer, A., Alvisi, L., Bazzi, R. A. 2005. On the availability of
3. Amazon Web Services. DynamoDB; http://aws.amazon.com/dynamodb/.
4. Amazon Web Services. SimpleDB; http://aws.amazon.com/simpledb/.
5. Anderson, E., Li, X., Shah, M. A., Tucek, J., Wylie, J. J. 2010. What consistency does your key- value store actually provide? In Proceedings of the 6th Usenix Workshop on Hot Topics in System Dependability.
6. Bailis, P., Venkataraman, S., Franklin, M. J., Hellerstein, J. M., Stoica, I. 2012. Probabilistically bounded staleness for practical partial quorums. In Proceedings of the VLDB (Very Large Data Base) Endowment 5(8):
7. Bermbach, D., Zhao, L., Sakr, S. 2013. Towards comprehensive measurement of consistency guarantees for
8. Bermbach, D., Tai, S. 2011. Eventual consistency: how soon is eventual? An evaluation of Amazon S3's consistency behavior. In Proceedings of the 6th Workshop on Middleware for Service-oriented Computing.
9. Brewer, E. A. 2000. Towards robust distributed systems (invited talk). In Proceedings of the 19th ACM
10. Corbett, J. C., et al. 2012. Spanner: Google's globally distributed database. In Proceedings of the 10th Usenix Conference on Operating Systems Design and Implementation:
11. Davidson, A., Rubinstein, A., Todi, A., Bailis, P., Venkataraman, S. 2012. Adaptive hybrid quorums in practical settings;
12. Decandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., Vogels, W. 2007. Dynamo: Amazon's highly available
13. Gibbons, P. B., Korach, E. 1997. Testing shared memories. SIAM (Society for Industrial and Applied Mathematics) Journal on Computing 26(4):
14. Golab, W., Hurwitz, J., Li, X. 2013. On the
15. Golab, W., Li, X., Shah, M. A. 2011. Analyzing consistency properties for fun and profit. In Proceedings of the 30th Annual ACM
16. Golab, W., Rahman, M. R., AuYoung, A., Keeton, K., Gupta, I. 2013.
17. Hamilton, J. 2009. The cost of latency; http://perspectives.mvdirona.com/2009/10/31/TheCostOfLatency.aspx.
18. Herlihy, M., Wing, J. M. 1990. Linearizability: a correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems (TOPLAS) 12(3):
19. Hunt, P., Konar, M., Junqueira, F. P., Reed, B. 2010. ZooKeeper:
20. Lakshman, A., Malik, P. 2010. Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review 44(2):
21. Lamport, L. 1986. On interprocess communication. Part I: basic formalism; and Part II: algorithms. Distributed Computing 1(2):
22. Lloyd, W., Freedman, M. J., Kaminsky, M., Andersen, D. G. 2011. Don't settle for eventual: scalable causal consistency for
23. Patil, S., Polte, M., Ren, K., Tantisiriroj, W., Xiao, L., López, J., Gibson, G., Fuchs, A., Rinaldi, B. 2011. YCSB++: benchmarking and performance debugging advanced features in scalable table stores. In Proceedings of the 2nd ACM Symposium on Cloud Computing:
24. Rahman, M. R., Golab, W., AuYoung, A., Keeton, K., Wylie, J. J. 2012. Toward a principled framework for benchmarking consistency. In Proceedings of the 8th Usenix Workshop on Hot Topics in System Dependability.
25. Terry, D. 2013. Replicated data consistency explained through baseball. Communications of the ACM.
26. Terry, D. B., Petersen, K., Spreitzer, M. J., Theimer, M. M. 1998. The case for
27. Terry, D. B., Prabhakaran, V., Kotla, R., Balakrishnan, M., Aguilera, M. K.,
28. Vogels, W. 2008. Eventually consistent; http://queue.acm.org/detail.cfm?id=1466448. ACM Queue 6(6):
29. Wada, H., Fekete, A., Zhao, L., Lee, K., Liu, A. 2011. Data consistency properties and the trade- offs in commercial cloud storage: the consumers' perspective. In Proceedings of the 5th Biennial Conference on Innovative Data Systems Research:
30. Zellag, K., Kemme, B. 2012. How consistent is your cloud application? In Proceedings of the 3rd ACM Symposium on Cloud Computing.
Wojciech Golab is an assistant professor in electrical and computer engineering at the University of Waterloo. His current research focuses on algorithmic problems in distributed computing with applications to storage, transaction processing, and
Muntasir Raihan Rahman is a computer science Ph.D student in the Distributed Protocols Research Group at the University of Illinois at
Alvin Auyoung is a researcher in the systems group at
Kimberly Keeton is a principal researcher at
Xiaozhou (Steve) Li is a software engineer at Google. In the past, he worked at
LOVE IT, HATE IT? LET US KNOW email@example.com
© 2014 ACM
Originally published in Queue vol. 12, no. 1—
see this item in the ACM Digital Library
Rick Richardson - Disambiguating Databases
Use the database built for your access model.
Mark Cavage, David Pacheco - Bringing Arbitrary Compute to Authoritative Data
Many disparate use cases can be satisfied with a single storage system.
Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, David G. Andersen - Don't Settle for Eventual Consistency
Stronger properties for low-latency geo-replicated storage
Lucian Carata, Sherif Akoush, Nikilesh Balakrishnan, Thomas Bytheway, Ripduman Sohan, Margo Seltzer, Andy Hopper - A Primer on Provenance
Better understanding of data requires tracking its history and context.