November/December issue of acmqueue The November/December issue of acmqueue is out now

acmqueue is free for ACM professional members. Non-members can purchase an annual subscription for $19.99 or a single issue for $6.99.

Download the app from iTunes or Google Play,
or view within your browser.

More information here


  Download PDF version of this article PDF

All Your Database Are Belong to Us

In the big open world of the cloud, highly available distributed objects will rule.

Erik Meijer, Microsoft

In the database world, the raw physical data model is at the center of the universe, and queries freely assume intimate details of the data representation (indexes, statistics, metadata). This closed-world assumption and the resulting lack of abstraction have the pleasant effect of allowing the data to outlive the application. On the other hand, this makes it hard to evolve the underlying model independently from the queries over the model.

As the move to the cloud puts pressure on the closed-world assumption of the database, exposing naked data and relying on declarative magic becomes a liability rather than an asset. In the cloud, the roles are reversed, and objects should hide their private data representation, exposing it only via well-defined behavioral interfaces.

The programming-language world has always embraced such an open-world assumption by putting behavior at the center of the universe and imperatively building reliable applications from unreliable parts. Object databases are back with a vengeance. To help developers transition to the cloud, the existing class libraries and tool infrastructure need to evolve to run as highly available services exposed using regular object-oriented programming language interfaces that reflect the relevant operational details.


Database developers (let’s call them modelers) and application developers (let’s call them programmers) have dual views of the world. If you are a modeler, then everything revolves around data—and data about that data (metadata). The world of database models is noun-based, talking about Customers, Orders, LineItems, etc. Once modelers have designed the data model correctly, they consider their job done.

In the realm of modelers, there is no notion of data abstraction that separates abstract properties of the model from the concrete details of the fully normalized realization in terms of tables with PK/FK (primary-key/foreign-key) relationships.

  (ID int PRIMARY KEY, …)
  (ID int PRIMARY KEY, CID int REFERENCES Customers(ID), …)
  (ID int PRIMARY KEY, OID int REFERENCES Orders(ID), …)

Normalized table declarations

The outside-in aspect of PK/FK relationships actually amplifies the need to expose all the details of the underlying rows and tables. For example, you cannot pick up one specific customer row from the Customers table and ask it what its list of orders is. Instead you must have access to and understand the schemas of both the Customers and Orders tables to perform a join across all the order rows in the Orders table and all the rows in the Customers table to check if the foreign key in the order matches the primary key of the customer, and then filter by your target customer.

Modelers consider this lack of data hiding an advantage since it allows for ad-hoc queries on the data that may not have been thought of before, such as joining all customers with all products whose price matches their age. Imposing no data abstraction arguably has the advantage that the data can outlive the application.

(Databases do support a limited type of data abstraction in the form of views, stored procedures, and table-valued functions. These are more like .NET extension methods, however, in that they operate on the public data and do not enforce data hiding. Moreover, this complicates the conceptual model, because these additional concepts are not rooted in the underlying mathematical foundations of relational algebra.)

This noun-centric view of the world carries over to how modelers look at computation. For them, computation recursively is yet another model, a plan that can be inspected, transformed and optimized, and ultimately interpreted by a query engine. Often a query execution engine uses constant space—that is, it interprets a fixed static plan without needing a call stack or using term rewriting to execute. Even the context of a computation is considered to be data via the system catalog that maintains metadata about other data such as the tables in a database and their properties.

The ability to treat execution plans as data to inspect and optimize before execution is extremely powerful, but naturally leads to a first-order, non-recursive computational model with a limited set of operators. Modelers also love to draw pictures of their plans, which is a great debugging tool, but gets hard when dealing with higher-order constructs (such as first-class functions or nested structures), complex control-flow, or arbitrary recursion. Recursion is allowed, in the form of so-called common-table expressions that express linear recursion of the form X=B∪F(X). This corresponds to a simple repetition and can still be drawn as a picture with a loop.

Database queries are declarative. For example, a typical SQL query such as the following, which computes the total price of all products by category for a customer, semantically defines a triple nested loop that iterates over each customer, order, and line item. Modelers, however, trust that the query optimizer will find a more efficient execution plan2 that exploits indexes and other advanced techniques to make it run fast.

SELECT Customer, LineItem.Category, Sum(LineItem.Price)
FROM Customers, Orders, LineItems
WHERE Customers.ID = Orders.CID AND Orders.ID = LineItems.OID
GROUP BY LineItem.Category
Total amount of purchases per category

This requires a perfect closed world where the optimizer can reason across all tables used by the query, and where modelers can be shielded from explicitly handling latency, exceptions, or other low-level operational concerns. These assumptions break down when tables get so big that they no longer fit on a single machine, or when you try to join two tables that live in different “worlds” or administrative domains, or do a distributed transaction across a slow and unreliable network. Queries suddenly need to become partition-aware, defeating many of the advantages of a declarative query language that hides the “how” from the “what.”

The inability to naturally perform arbitrary computations, the difficulties of reasoning about the operational behavior of code, and the lack of data abstraction arguably disqualify conventional relational database technology as an operational model for cloud programming. The stunningly beautiful mathematical abstractions of Ted Codd served us well at the micro level where the database can control everything, but they fall apart when we move to the macro level of the cloud. Just as in physics where you need to trade in classical mechanics for quantum mechanics when you move to the subparticle level, in the computer science world you need to trade in nouns for verbs when you move beyond the single-machine level.

The problem with SQL databases is not the data model or the superb query processing functionality, it’s the assumption that all the data lives in the same place and meets a bunch of consistency constraints, which is hard to maintain in an open distributed world. The details that are relevant to build a working system have shifted, which means the level of abstraction has to evolve likewise. In the local context of a closed world, however, the expressiveness, efficiency, and reliability of an RDMS (relational database management system) remain hard to beat.


Concepts such as “behavior,” “imperative,” “side effects,” “arbitrary nesting,” “higher-order functions,” or “unbounded recursion” are the bread and butter for programmers, whose world revolves around computation, as opposed to data. For programmers it is all about verbs such as DeleteFiles, OnDragDrop, etc., and enforcement of strong data abstraction by strictly separating the implementation and interface of their objects.

Taking a page from Wikipedia (, “Objects can be thought of as wrapping their data within a set of functions designed to ensure that the data are used appropriately, and to assist in that use. The object’s methods will typically include checks and safeguards that are specific to the types of data the object contains. An object can also offer simple-to-use, standardized methods for performing particular operations on its data, while concealing the specifics of how those tasks are accomplished. In this way alterations can be made to the internal structure or methods of an object without requiring that the rest of the program be modified.”

Take for example the using statement in C#, which obtains a disposable resource, executes a side-effecting statement using that resource, and then disposes of the resource via the following syntax:

using (var r = e) s

A resource is any object that implements the IDisposable interface, no further questions asked. The IDisposable interface has a single parameterless Dispose() method that releases the resource:

interface IDisposable{ void Dispose(); }

Because the compiler assumes only that the resource implements the IDisposable interface but does not care about the concrete type of the resource, it can blindly desugar the using statement into the following underlying primitive sequence of an assignment and a try-catch block:

var r = e; try s finally (r as IDisposable).Dispose();

The actual implementation and type of the resource are kept strictly private and are not exposed to the using statement. This allows the same syntax to be used over a range of objects such as TextReader and IEnumerable that follow the same general pattern of create/use/delete.

The method implementations of an interface or class can use arbitrary imperative code. In fact, because each method takes an implicit this parameter that contains the method itself, methods are inherently recursive,1 and code is executed using a runtime call stack that dynamically keeps track of recursive calls as the computation unfolds into a potentially infinite call tree.

Compilers try to optimize the static-code representation or trace the runtime execution paths before optimizing the resulting straight-line code. Developers expect only constant time improvements, however, and generally do not like the structure of their computation to be rearranged dramatically and unpredictably by the optimizer, because the correctness of a program may critically rely on the order of evaluation due to side effects. Moreover, programmers demand that the debugger be able to map the optimized code back to the actual source they wrote such that they can inspect the intermediate values in the computation to resolve bugs or problems.

The ability to specify a behavioral contract between specification and implementation, along with the fact that method implementations can invoke arbitrary code, allows applications to outlive data. This is especially important when leveraging disruptive technological advances (such as the cloud) to improve the implementation of a given interface, with minimal disruption for the programmer and application.


As a concrete example of data abstraction, consider the .NET standard collections library, System.Collections.Generic, or similarly, the Java collections library and C++ STL containers. These libraries were implemented when machines were single-core and most programs were single-threaded. Hence, the logical tradeoff was not to make collections thread-safe by default. For example, in the .NET generic collections library, the class Dictionary<K,V> implements three (standard) collection interfaces:

class Dictionary<K,V> : IDictionary<K,V>,
  ICollection<KeyValuePair<K,V>>, IEnumerable<KeyValuePair<K,V>>
.NET generic dictionary class

Programmers do not really care how the dictionary class is implemented, as long as it satisfies the contract defined by its base interfaces. As a bonus, because Dictionary<K,V> implements IEnumerable<KeyValuePair<K,V>>, you can automatically use LINQ to Objects to query over dictionaries viewed as collections of key-value pairs.

(Since the LINQ to Objects implementation is defined in terms of the IEnumerable interface, it cannot use specific features of the dictionary implementation to improve efficiency, which is what modelers would do. In practice, however, the implementation of LINQ to Objects “cheats” by checking for known concrete collection types and dynamically dispatches to a type-specific implementation.)

With the advent of many-core processors, concurrency has become commonplace, and, hence, it makes sense to make collections thread-safe by default. Also, we would like to implement LINQ to Objects to maximize the parallelism inherent in the LINQ “queries” via PLINQ (Parallel LINQ). To leverage many-core machines, the .NET Framework 4.0 introduced a new implementation of collections in the System.Collections.Concurrent namespace ( And guess which interfaces the new thread-safe class ConcurrentDictionary<K,V> implements? Precisely the same as those of the old Dictionary<K,V> class:

class ConcurrentDictionary<K,V> : IDictionary<K,V>,
  ICollection<KeyValuePair<K,V>>, IEnumerable<KeyValuePair<K,V>>
.NET concurrent dictionary class

Java has seen exactly the same evolution, where HashMap<K,V> and ConcurrentHashMap<K,V> both implement the unchanged base interface Map<K,V>.

This is data abstraction in its purest form—the same interface with a new and improved implementation. In this particular situation, the new implementation for collections could partition the dictionary into independent shards that can be accessed and updated concurrently, a relevant implementation detail that therefore should be exposed explicitly to the programmer in the .NET framework via the Partitioner<T> and OrderablePartitioner<T> abstract base classes.

(Note that just as the regular LINQ to Objects implementation cheats by downcasting to the concrete implementations of IEnumerable, the PLINQ implementation applies the same cruel and unnatural act with the new concurrent collections.)


As we move from the many-core world to the massively parallel and distributed world of the cloud, the next logical step3 in the evolution of the System.Collections namespace is to add support for collections as a service by introducing new types such as DistributedDictionary<K,V> that use the familiar collection interfaces, but are implemented as highly available scalable services running on some underlying cloud fabric such as Windows Azure, Amazon EC2 (Elastic Compute Cloud), or Google AppEngine:

class DistributedDictionary<K,V> : IDictionary<K,V>,
Hypothetical distributed dictionary class

Since the constraints of a distributed system such as the cloud with regard to latency and error conditions are different from the constraints of running objects locally on a single- or many-core machine, support for asynchronous operations must be added by modifying the interfaces to use the new Task-based asynchronous pattern supported by await in the latest versions of C# and Visual Basic. For developers this is a small and nondisruptive evolution of familiar interfaces to reflect essential operational changes in the underlying runtime.

Collections as a service are immensely useful. Imagine you want to build a Web-based game that maintains a high score among millions of players of that game (every aspiring game developer dreams of being as popular as Mafia Wars or Farmville). Assume that each player maintains a list instance of System.Collections.Distributed.DistributedSortedList<int,Player>, say in the variable highScores. You can then update and compare scores programmatically by mutating and invoking the list in code as follows (using C# 5.0 asynchronous methods):

var highScores = … get object instance connected to service …
await highScores.Add(100000, me);
var top10 = await highScores.TakeAsync(10);
Computing high scores

Since the service is highly available, players can disconnect and reconnect to it at any moment. Since it is scalable, it can support millions of concurrent players. When developers dream of platform as a service, they envision a class library of services like this.

Note that we choose to expose the services in System.Collections.Distributed using regular programming language APIs instead of via a REST or another network-level interface. Just as we like to hide the state of an object behind an interface, the exact protocol that the client proxy of an object uses to communicate with the service that implements its behavior is an internal detail that should be kept private for the same reasons. In this context it is interesting to note that certain Web browsers apply the same principles to accessing Web pages by switching from human-readable HTTP to the more efficient SPDY protocol. Most modern services, such as Twilio and the distributed event notification service of Google’s Thialfi, prefer language-specific APIs over protocol-level access. As explained by the Google Thialfi team, “Initially, we had no client library whatsoever, opting instead to expose our protocol directly. Engineers, however, strongly prefer to develop against native-language APIs. And, a high-level API has allowed us to evolve our client-server protocol without modifying application code.”

The idea of System.Collections.Distributed is not science fiction. A concrete example of a distributed “drop-in” reimplementation of the Java standard collection classes is KeptCollections ( , which uses Apache ZooKeeper as the backing store to provide a distributed and scalable implementation of the Map<K,V> interface. The Redis project announces itself as a “data structure server” and is another realization of the idea of providing distributed collections such as tables, lists, and sets as services. Services such as MemcacheDB and Windows Azure Caching and Service Bus are other examples of problem-focused services that implement standard collection classes.

Switching back to modelers for a second, Dynamo9-based noSQL databases such as Riak appeal to the CAP theorem to trade consistency (which modelers are used to) for availability (which certain applications such as shopping carts require) as they try to create databases that can cope with the open world of the cloud. The consequence of dropping consistency is that it is now up to the application to resolve conflicts, which deeply worries many modelers. Programmers, however, are already familiar with nonconsistent and nonavailable systems. Every time you call a method on an object, it may throw an exception and leave the hidden state of the object in an undefined state. The very fact that you are able to read this article on a Web site is testament to the fact that programmers know pretty well how to build highly scalable, practical, distributed systems out of unreliable components, using neither transactions nor consistency nor availability. They design their systems to be robust with regard to inconsistency to start with.

An underexposed consequence of choosing availability over consistency is that queries have to deal with version conflicts. For example, since a key may contain multiple versions of a value, the map function in a MapReduce query must resolve the conflict (without being able to write back the choice) as it is applied to each key. Most programmers probably prefer to drop availability rather than having to solve versioning conflicts.

The research community has been studying so-called CRDTs (convergent or commutative replicated data types8)—distributed data types that satisfy algebraic properties guaranteeing eventual consistency under asynchronous replication. These data structures strike a nice balance between programmer convenience and scalability of implementation. Also interesting and relevant in this area is the work on concurrent revisions7 and older work such as reversible computations in the Time Warp operating system.4


The same analogy can be drawn for threading and concurrency as for collections. On single-core machines, programmers dealt with (simulated) concurrency using the System.Threading.Thread class. For many-core machines the model was extended to System.Threading.Tasks. In this case threads and tasks do not implement a common interface (although see Java’s Executor framework and Rx schedulers), but conceptually they are very similar. Both threads and tasks are constructed by passing an action delegate to the respective constructors

var thread = new Thread
  (delegate{ … });

var task = new Task(delegate{ … });

and then are started by calling their Start() method. The task API is much richer than the original threading API and, for example, allows comonadic composition of existing tasks into new tasks using the ContinueWith method. Yet both threads and tasks embody the data-hiding maxim of providing a coherent set of behaviors that represent an abstract notion of concurrency without revealing the underlying implementation details, such as the thread pool using a work-stealing scheduling algorithm under the covers.

The cloud does not just have a pool of threads, but a gigantic ocean of machines. We would like to expose all that abundant potential concurrency as a service that programmers can tap into. Besides simple actions, we also want to enable long-running and highly available computations in the cloud-pool that would survive fault-domain failures, as well as cluster restarts. Carl Hewitt’s Actors are the prevalent model for concurrency at cloud-scale, as evidenced by languages and libraries such as Rx, Dart, Erlang, Akka, F#, Axum, and Ciel. In contrast to more static models for distributed computation such as MapReduce or Pregel that are favored by modelers, Actors allow programmers to express fully recursive and dynamic distributed computations. Observant readers will have correctly realized that distributed Actors are a natural way to implement the data structures as a service, as well as objects with a fixed behavior.

Operationally, programmers need to deal with four basic forms of dual computational effects:

1. The built-in effect of blocking computations (in a pure language such as Haskell this would surface as type IO<T>) that synchronously return a single value.

2. Blocking or pull-based collections of type IEnumerable<T> that synchronously return multiple values.

3. Nonblocking computations of type Task<T> or Future<T> that asynchronously return a single value.

4. Streaming or push-based collections of type IObservable<T> that asynchronously return multiple values.

The various concurrency primitives such as Threads or Actors and regular method calls produce their results using some particular combination of these four basic effects (see table 1).


When developing code, programmers rely not only on programming languages and libraries, but also on various tools such as debuggers and profilers. When talking about moving existing programming idioms to the cloud, we tend to sweep a lot of details under the rug that are necessary for building enterprise-grade services, such as monitoring, tracing, distributed locking, fault detection, replication, and fail-over. Interestingly, many of these can be exposed as objects implemented by a service. Implementing the building blocks for debugging, understanding, and monitoring distributed systems is no panacea, but it is a necessary condition to make programmers successful in the new world of distributed applications.

Other aspects that become relevant when composing larger services from smaller services are colocation and efficient service resolution. Even on a single Windows machine, it is necessary to reason about thread affinity—for example, in order to update the UI, you must run on the UI thread. A practical substrate for building distributed applications must provide abstractions that make location and placement boundaries explicit—and perhaps provide local atomicity or reliability guarantees. Within the confines of a single machine, programmers often gloss over such details (just as modelers do), but in a distributed world, exposing them becomes relevant and important to acknowledge for both programmers and modelers.


Even though we have argued that models are not the right operational basis for cloud computing, let there be absolutely no mistake that we do acknowledge the enormous economic and intellectual value of (relational) database technology, and programmers are eager to use the data that modelers have curated, cleansed, normalized, and refined.

Fortunately, models can be embedded into programs by appealing to a generalization of Codd’s theory6—namely, Category Theory. As it turns out, most of the notions of “collections” or “tables” in various database implementations are actually instances of a mathematical concept, really just a kind of interface, called monads. Queries can be translated into the underlying monadic operators implemented as code.

Programmers call this LINQ5 instead of monads, and there are LINQ bindings for countless databases such as SQL, Azure Tables, MongoDB, CouchDB, Hadoop, and HBase that expose models to developers, by embedding them as push or pull collections using a common interface.


Both relational databases and distributed actors shine in the right context. To paraphrase Pat Helland10, SQL is clearly the undefeated leader for transactional processing of data on the inside. The object-oriented databases from the 1980s tried too much to be like databases and too little like objects, and they did not play to their strengths. In the small closed world of tables, declarative queries, and sophisticated optimizers2 it is hard to beat RDMSs at their own game.

In the big open world of the cloud however, highly available, inconsistency robust, Web services exposed as as standard imperative objects/actors will rule. The cloud era is one of (monadic and comonadic) computation and verbs, as opposed of to data and nouns.

To ease the transition to the cloud for programmers, we have identified the following requisites:

• Expose every data source created by modelers to programmers using monads/LINQ.

• Create a class library of distributed collections implemented as highly available and scalable services but exposed using standard programming language bindings and interfaces.

• Give programmers access to the ocean of concurrency in the cloud via comonads/actors.

• Expose tracing, monitoring, debugging, and diagnostic infrastructure as another service in the cloud.

Many of the required materials are available today. What is missing is a tasteful assembly of all these pieces into a set of elegant and functional packages that target the challenges developers face when migrating into the cloud.


I would like to thank Brian Beckman, Terry Coatta, Gavin Bierman, Joe Hoag, Brian Grunkemeyer, and Rafael Fernandez Moctezuma for improving both the style and substance of this article.


1. Abadi, M., Cardelli, L. 1997. Theory of objects. ECOOP (European Conference on Object-oriented Programming) Tutorial;

2. DeWitt, D. J. 2010. SQL query optimization: why is it so hard to get right? Keynote address at PASS Summit;

3. Gribble, S. D., Brewer, E. A., Hellerstein, J. M., Culler, D. 2000. Scalable, distributed data structures for Internet service construction. Proceedings of 4th Usenix Symposium on Operating Systems Design and Implementation;

4. Jefferson, D., et al. 1988. The status of the Time Warp operating system. Proceedings of the third conference on Hypercube concurrent computers and applications: Architecture, software, computer systems, and general issues 1(738-744);

5. Meijer, E. 2011. The world according to LINQ. ACM Queue 9(8);

6. Meijer, E., Bierman, G. 2011. A co-relational model of data for large shared data banks. ACM Queue 9(3);

7. Microsoft Research. Concurrent revisions;

8. Shapiro, M., Preguiça, N., Baquero, C., Zawirski, M. 2011. A comprehensive study of convergent and commutative replicated data types. Inria No. RR-7506;

9. Vogels, W. 2007. Amazon’s Dynamo. All Things Distributed;

10. Helland, P. 2005. Data on the Outside Versus Data on the Inside (144-153);


ERIK MEIJER (, has been working on “democratizing the cloud” for the past 15 years. He is perhaps best known for his work on the Haskell language and his contributions to LINQ and Rx (Reactive Framework). He is a part-time professor of cloud programming at TUDelft and runs the cloud programmability team in Microsoft’s Server and Tools Business.

© 2012 ACM 1542-7730/12/0700 $10.00


Originally published in Queue vol. 10, no. 7
see this item in the ACM Digital Library



Heinrich Hartmann - Statistics for Engineers
Applying statistical techniques to operations data

Pat Helland - Immutability Changes Everything
We need it, we can afford it, and the time is now.

R. V. Guha, Dan Brickley, Steve MacBeth - Evolution of Structured Data on the Web
Big data makes common schemas even more necessary.

Rick Richardson - Disambiguating Databases
Use the database built for your access model.


(newest first)

Headinthebox | Sat, 24 May 2014 04:05:34 UTC

Ruediger, you are spot on. If you carefully read the section on distributed collections, you see that I do propose, and use in the code examples, that the APIs are asynchronous.

Ruediger Moller | Fri, 23 May 2014 10:44:05 UTC

What's completely overseen in the vision of transparent distribution of collections is the effect of latency. You won't be able to distribute synchronous interfaces effectively, as e.g. on a dictionary you'll get a complete network roundtrip latency added to your call.

What's needed on-top is a transition in programming and interface patterns from synchronous to asynchronous ('reactive') calls. And I'd say most languages and tools are poorly equipped to support asynchronous programming patterns in a convenient way.

Barfo Rama | Wed, 22 Aug 2012 22:36:55 UTC

I notice that html syntax is not allowed and will be removed. I suppose that implicitly comments on how exposing tracing and debugging to cloud interfaces will let random cyber-hoodlums iteratively redirect your credit card object containment to their Cayman Island banking monad while displaying little circles going around and around.

Ironically, the hoodlum behavior is more predictable than the objects.

David Broderick | Wed, 22 Aug 2012 13:31:44 UTC

I hate to argue against someone I admire as much as Erik, but I disagree. His argument makes sense only for 'mandatory' context and rules, and that's a minority subset.

If I'm querying for the length of something, is that 18mm or 18au? Units are mandatory context for something quantitative. But the moment the context or rules become situational (which ends up being a large amount of the time), the problem of using objects in this space means forcing context and rules which don't apply. And then you have something either useless or less effective.

The future, instead, is to model the context and rules as you do the data. Users can then simply query how much they want/need of the former.

Mark W. Farnham | Thu, 16 Aug 2012 20:53:59 UTC

"In the big open world of the cloud, highly available distributed objects will rule." Wow. What a leading statement.

For starters, you presume a world in which the cloud is highly available. For most infrastructure stacks of which I am aware, global network access is the weakest link regarding uptime, latency, and bandwidth.

Some of what you write after that is probably relevant, though the over verbose camel hump strings you posit tossing back and forth in all directions over the network to your global cloud will contribute to its failure.

As for generalizing Codd, the relational model is in fact a general model. Some folks allow particular (and so far in every case partial) implementations of the relational model to pollute understanding of the relational model. But Chris Date has already represented that argument better than I can.

Leave this field empty

Post a Comment:

© 2016 ACM, Inc. All Rights Reserved.