Over the past 30 years Michael Stonebraker has left an indelible mark on the database technology world. Stonebraker’s legacy began with Ingres, an early relational database initially developed in the 1970s at UC Berkeley, where he taught for 25 years. The Ingres technology lives on today in both the Ingres Corporation’s commercial products and the open source PostgreSQL software. A prolific entrepreneur, Stonebraker also started successful companies focused on the federated database and stream-processing markets. He was elected to the National Academy of Engineering in 1998 and is currently an adjunct professor of computer science at MIT.
Interviewing Stonebraker is Margo Seltzer, one of the founders of Sleepycat Software, makers of Berkeley DB, a popular embedded database engine now owned by Oracle. Seltzer now spends most of her time teaching and doing research at Harvard, where she is a full professor of computer science. She was kind enough to lend us her time and travel down the Charles River to speak with Stonebraker, her former Ph.D. advisor, at MIT’s striking Stata Center.
MARGO SELTZER It seems that your rate of starting companies has escalated in the past several years. Is this a reflection of your having more time on your hands or of something going on in the industry?
MICHAEL STONEBRAKER Well, I think it’s definitely the latter. What I see happening is that the large database vendors, whom I’ll call the elephants, are selling a one-size-fits-all, 30-year-old architecture that dates from somewhere in the late 1970s.
Way back then the technological requirements were very different; the machines and hardware architectures that we were running on were very different. Also, the only application was business data processing.
For example, there was no embedded database market to speak of. And there was no data warehouse market. Today, there are a variety of different markets with very different computing requirements, and the vendors are still selling the same one-size-fits-all architecture from 25 years ago.
There are at least half a dozen or so vertical markets in which the one-size-fits-all technology can be beaten by one to two orders of magnitude, which is enough to make it interesting for a startup. So I think the aging legacy code lines that the major elephants have are presenting a great opportunity, as are the substantial number of new markets that are becoming available.
SELTZER What new markets are more amenable to what we’ll call the small mice, as opposed to the big elephants?
STONEBRAKER There are a bunch of them. Let’s start with data warehouses. Those didn’t exist until the early 1990s. No one wants to run large, ad hoc queries against transactional production databases, as no one wants to swamp such systems.
So everyone scrapes data off of transactional systems and loads it into data warehouses, and then has their business analysts running whatever they want to run. Everyone on the planet is doing this, and data warehouses are getting positively gigantic. It’s very hard to run ad hoc queries against 20 terabytes of data and get an answer back anytime soon. The data warehouse market is one where we can get between one- and two-orders-of-magnitude performance improvements from a very different software system.
The second new market to consider is stream processing. On Wall Street everyone is doing electronic trading. A feed comes out of the wall and you run it through a workflow to normalize the symbols, clean up the data, discard the outliers, and then compute some sort of secret sauce.
An example of the secret sauce would be to compute the momentum of Oracle over the last five ticks and compare it with the momentum of IBM over the same time period. Depending on the size of the difference, you want to arbitrage in one direction or the other.
This is a fire hose of data. Volumes are going through the roof. It’s business analytics of the same sort we see in databases. You need to compute them over time windows, however, in small numbers of milliseconds. So, again, a specialized architecture can just clobber the relational elephants in this market.
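To make the "secret sauce" example above concrete, here is a minimal sketch of a windowed momentum comparison in Python. It is an illustration only, not how any trading system or stream engine actually works. The definition of momentum (newest price minus oldest price in a five-tick window), the `threshold` parameter, and the names `MomentumWindow` and `compare_momentum` are all assumptions introduced here; prices are integers (for example, cents) to keep the arithmetic exact.

```python
from collections import deque


class MomentumWindow:
    """Tracks the last `size` prices for one symbol (hypothetical helper)."""

    def __init__(self, size=5):
        self.prices = deque(maxlen=size)  # old ticks fall off automatically

    def add_tick(self, price):
        self.prices.append(price)

    def is_full(self):
        return len(self.prices) == self.prices.maxlen

    def momentum(self):
        # Assumed definition: newest price minus oldest price in the window.
        return self.prices[-1] - self.prices[0]


def compare_momentum(ticks, symbol_a, symbol_b, threshold, size=5):
    """Yield (signal, difference) whenever the momentum gap exceeds threshold.

    `ticks` is any iterable of (symbol, price) pairs. Nothing is emitted
    until both symbols have a full window.
    """
    windows = {symbol_a: MomentumWindow(size), symbol_b: MomentumWindow(size)}
    for symbol, price in ticks:
        if symbol not in windows:
            continue  # ignore symbols we are not tracking
        windows[symbol].add_tick(price)
        if all(w.is_full() for w in windows.values()):
            diff = windows[symbol_a].momentum() - windows[symbol_b].momentum()
            if diff > threshold:
                yield ("favor " + symbol_a, diff)
            elif diff < -threshold:
                yield ("favor " + symbol_b, diff)
```

The point of the sketch is the shape of the computation: state is kept per symbol over a small sliding window, and a result is produced as each tick arrives rather than by querying stored data afterward.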
I also believe the same statement can be made, believe it or not, about OLTP (online transaction processing). I’m working on a specialized engine for business data processing that I think will be about a factor of 30 faster than the elephants on the TPC-C benchmark.
Text is the fourth market. None of the big text vendors, such as Google and Yahoo, use databases; they never have. They didn’t start there, because the relational databases were too slow from the get-go. Those guys have all written their own engines.
It’s the same case in scientific and intelligence databases. Most of these clients have large arrays, so array data is much more popular than tabular data. If you have array data and use special-purpose technology that knows about arrays, you can clobber a system in which tables are used to simulate arrays.
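The difference between native arrays and tables that simulate arrays can be sketched in a few lines of Python. This is not a benchmark and makes no performance claim; it only shows the extra bookkeeping the tabular form needs. The 4x4 grid, the `cells` table layout (one row per cell), and both function names are assumptions chosen for illustration, with SQLite standing in for a generic relational engine.

```python
import sqlite3

ROWS, COLS = 4, 4

# Native array: a list of lists, addressed directly by position.
grid = [[r * COLS + c for c in range(COLS)] for r in range(ROWS)]


def window_from_array(grid, r0, r1, c0, c1):
    """Slice rows r0..r1-1 and columns c0..c1-1 by index."""
    return [row[c0:c1] for row in grid[r0:r1]]


# Table simulating the same array: one (r, c, value) row per cell.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cells (r INTEGER, c INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO cells VALUES (?, ?, ?)",
    [(r, c, grid[r][c]) for r in range(ROWS) for c in range(COLS)],
)


def window_from_table(conn, r0, r1, c0, c1):
    """Fetch the same window with a range query, then rebuild the 2-D shape."""
    cur = conn.execute(
        "SELECT r, c, value FROM cells "
        "WHERE r >= ? AND r < ? AND c >= ? AND c < ? ORDER BY r, c",
        (r0, r1, c0, c1),
    )
    out = [[None] * (c1 - c0) for _ in range(r1 - r0)]
    for r, c, value in cur:
        out[r - r0][c - c0] = value  # positions must be carried as data
    return out
```

In the array form, position is implicit in the storage layout. In the table form, every cell carries its own coordinates, and the two-dimensional shape has to be reassembled after the query.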
SELTZER If I rewind history 20 years, you could imagine somebody else sitting in this room, saying, “Today people are building object-oriented applications, and relational databases aren’t really any good for objects. We can get a couple of orders-of-magnitude performance improvement if we build a data model around objects instead of around relations.”
If we fast-forward 20 years, we know what happened to the object-oriented database guys. Why are these domains different?
STONEBRAKER That’s a great question: Why did OO (object-oriented) databases fail? In my opinion the problem with the OO guys is that they were fundamentally architecting a system that was oriented toward the needs of mechanical and electronic CAD. The trouble is, the CAD market didn’t salute to their systems. They were very unsuccessful in selling to the engineering CAD marketplace.
The trouble was that the CAD guys had existing systems that would swizzle disk data into main memory, where they would edit stuff with very substantial editing systems. Then they would reverse swizzle to put it back. If you were to go with object-oriented databases, the only thing you would save would be this swizzling code in both directions. There wasn’t enough pain for them to think about switching to something else.
The OODB guys weren’t faster than the CAD market’s proprietary home-brewed systems. The CAD guys already had mountains of proprietary code to do all this editing. The OODB guys just didn’t solve a big enough piece of their problem, and they weren’t faster, so they were never very successful in the CAD marketplace.
They failed because the primary market they were going after didn’t want them. I don’t think that is true of the other markets I’ve talked about.
SELTZER Let me push on that point a little bit. The Wall Street guys have piles and piles of software that they’ve built in-house to do exactly what you’re describing. What’s the compelling reason for them to switch, when the CAD guys didn’t think it was worthwhile?
STONEBRAKER There are two very simple answers. Answer number one is that feed volumes are going through the roof, and they’re tending to break their legacy infrastructures. That gives them a compelling reason to try something new.
The second reason is that electronic trading has the characteristic that the “secret sauce” works for a while—and then it stops working, so you have to keep changing stuff. The current Wall Street folks are dying because of rickety infrastructure and an inability to change their hardcoded interfaces quickly to meet business needs.
One of the pilot projects that StreamBase [founded by Stonebraker in 2003] did was with a large multinational investment bank with bond-trading desks in Tokyo, New York, London, Paris, and a few other places. Each of these bond desks was using home-brewed software, written locally. What happens is that all of the bond desks reprice bonds on the fly. For example, a typical algorithm would be: “If two-year treasuries tick up by five basis points, then reprice five-year General Motors corporate bonds by three basis points.” They have these built-in rules. So all of the bond desks are adjusting their prices and publishing them electronically. The internal traders inside this particular institution watch the same feeds that the bond guys are watching. If they can reach in and grab the bond that’s about to be repriced, before the bond guys manage to reprice it, then, of course, they win.
It’s basically a latency arms race. If your infrastructure was built with one-second latency, it’s just impossible to continue, because if the people arbitraging against you have less latency than you do, you lose. A lot of the legacy infrastructures weren’t built for sub-millisecond latency, which is what everyone is moving toward.
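The repricing rule quoted above ("if two-year treasuries tick up by five basis points, then reprice five-year General Motors corporate bonds by three basis points") can be written as data plus a small rule evaluator. The following Python sketch is hypothetical: the instrument names, the `RepricingRule` fields, and the choice to add the adjustment to a quote held in basis points are all assumptions made for illustration, not a description of any bank's software.

```python
from dataclasses import dataclass


@dataclass
class RepricingRule:
    trigger_instrument: str  # instrument whose move fires the rule
    trigger_bp: int          # minimum upward move, in basis points
    target_instrument: str   # instrument to reprice
    adjust_bp: int           # adjustment applied to the target, in basis points


def apply_rules(rules, quotes_bp, instrument, move_bp):
    """Apply every rule triggered by `instrument` moving up by `move_bp`.

    `quotes_bp` maps instrument name -> current quote in basis points and is
    updated in place. Returns the list of instruments that were repriced.
    """
    repriced = []
    for rule in rules:
        if rule.trigger_instrument == instrument and move_bp >= rule.trigger_bp:
            quotes_bp[rule.target_instrument] += rule.adjust_bp
            repriced.append(rule.target_instrument)
    return repriced
```

Keeping the rules as data rather than hardcoding them is one way to address the need to "keep changing stuff" mentioned earlier. The latency question is separate: whoever evaluates the rule and acts on the result first wins.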
SELTZER Many people would argue that we solved the performance problem; processors are fast enough. You’re saying, “No, there really still is a performance problem and a latency problem.” The hardware guys are giving us processors with multiple cores, so they’re increasing parallelism, but they’re actually slowing down the single-threaded instruction execution rate. How does that interact with what you’re seeing in the stream-processing world?
STONEBRAKER I can explain what’s happening with a very simple example. Until recently, everyone was using composite feeds from companies such as Reuters and Bloomberg. These feeds, however, have latency, measured in hundreds of milliseconds, from when the tick actually happens until you get it from one of the composite-feed vendors.
Direct feeds from the exchanges are much faster. Composite feeds have too much latency for the current requirements of electronic trading, so people are getting rid of them in favor of direct feeds.
They are also starting to collocate computers next to the exchanges, again, just to knock down latency. Anything you can do to reduce latency is viewed as a competitive advantage.
Let’s say you have an architecture where you process the data from the wire and then use your favorite messaging middleware to send it to the next machine, where you clean the data. People just line up software architectures with a bunch of steps, often on separate machines, and often in separate processes. And they just get clobbered by latency.
SELTZER So, it’s not the latency of the instruction execution; it’s the latency of the architecture?
SELTZER That argues that the software architectures we’re building now are wrong.
STONEBRAKER Well, as the founder of Sleepycat, you can readily relate to the following characteristic. If I want to be able to read and write a data element in less than a millisecond, there is no possible way that I can do that from an application program to any one of the elephant databases, because you have to do a process switch, a message to get into their systems. You’ve got to have an embedded database, or you lose.
In the stream processing market, the only kinds of databases that make any sense are ones that are embedded. With all the other types, the latency is just too high.
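What "embedded" means in practice can be sketched with Python's standard `sqlite3` module, used here only as a convenient stand-in for an in-process database (it is not Berkeley DB, and it is not any of the products discussed). The database runs inside the application's own process, so a read or write is a library call with no process switch and no message to a server. The table layout and the function names are assumptions for illustration, and the database is in memory, so the timing says nothing about durable writes.

```python
import sqlite3
import time

# In-process (embedded) database: no server process, no network hop.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")


def put(key, value):
    """Write one element via a direct library call."""
    conn.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value))
    conn.commit()


def get(key):
    """Read one element; returns None if the key is absent."""
    row = conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
    return row[0] if row else None


def timed_round_trip(key, value):
    """Write then read one element; return (value_read, elapsed_seconds)."""
    start = time.perf_counter()
    put(key, value)
    result = get(key)
    return result, time.perf_counter() - start
```

Calling `timed_round_trip("ORCL", "19.25")` returns the value read back along with the elapsed time on the local machine. The measured figure will vary by hardware and is not a benchmark.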
SELTZER You’re preaching to the choir on that one. But let’s talk about that side of the world, where the elephants may be elephants, but they’re not standing still. Can you really compete with the elephants in the long term? Are the elephants simply going to get smart and say, “OK, our big engine doesn’t do this; so we’ll build a little engine that does.” Right? They’ve got lots of programmers.
STONEBRAKER I think of things in a much more holistic fashion. At least in the database world, the large vendors move quite slowly. So it seems the way technology transfer happens is that the elephants just don’t do new ideas. They wait for startups to prove that they work. The good ideas go into startups first. Then the elephants pick and choose from them.
SELTZER So the startups are necessary for innovation, because the elephants can’t innovate—is that really the answer?
STONEBRAKER I think so.
SELTZER Let’s draw a distinction between emerging technology and disruptive technology. Emerging technology is anything that’s new and may be different from the old stuff. Disruptive technology is an emerging technology that ultimately replaces the old technology. My question is whether these new database verticals that you’ve identified are emerging or disruptive?
STONEBRAKER Well, the elephants never had the text market, so that is simply somebody else’s stuff.
Right now the elephants own the warehouse market, but they’re selling the wrong technology, and it’s not obvious how to morph from old to new. I think that will be very disruptive.
Stream processing is largely a new application. That’s simply a green field that didn’t exist 20 years ago, and now it does.
And I think if I’m successful in building an OLTP engine that’s faster by a factor of 30, that would be very disruptive.
SELTZER Let’s talk about how that disruption can occur, given that some people think that nobody actually buys databases anymore; people just buy applications. In order to truly disrupt, you’ve got to win the applications. How does a tiny startup do that?
STONEBRAKER It’s clearest in the data warehouse space, where it turns out that Teradata is doing very well. There’s a startup in Framingham, called Netezza, that’s doing very well, too. It’s selling proprietary hardware, which no one on the planet wants from the get-go, but it’s very successful. Why would anybody buy lock-ins and proprietary hardware? The answer is, you have to be in considerable pain.
In the data warehouse market, people are in tremendous pain. There are several ways to talk about this pain. One way is ad hoc queries on data warehouses. The complexity of queries tends to go up at about the square of the database size. So, if you have a small warehouse, you’re perfectly okay on Wintel and SQL Server.
But then, if you run out of gas on SQL Server, which doesn’t scale anymore, you’re facing a discontinuous forklift upgrade to something like, say, Sun Solaris and Oracle. That’s different hardware, a different database, and a different operating system. In short, a forklift upgrade—a horrible transition to manage.
If you’re staring at this wall, and the solution is a forklift upgrade, then you’re in real pain.
Similarly, Oracle has scalability problems that limit its ability to scale in the multi-terabyte range. What usually happens is that people who have a terabyte-size warehouse that is growing are looking at the same kind of wall, and they are forced to go to something like Netezza or Teradata.
If you’re looking at any one of these walls, you’re faced with great pain in moving to the other side. And if you’re in this kind of pain, it means you’re willing to take a gander at new technology.
SELTZER I’m going to argue that you just, in fact, agreed with the point of my question, which is that people don’t buy databases, they buy applications. The application that you just described is data warehousing. Each customer may run different queries on the warehouse, but the warehouse is still an application.
If you make that transition into the OLTP market, now suddenly OLTP is really a platform, and there are zillions of applications that run on top of it. How does a little guy disrupt the big technology?
STONEBRAKER An interesting way to answer that question is by looking at Tandem. It made a lot of hay by being a serious player in the OLTP market; the New York Stock Exchange runs Tandem. But Tandem didn’t start out in OLTP; it started in the machine tool market. The NYSE is not about to trust its data to a 20-person startup.
You have to sneak into the OLTP market some other way, because the people who do serious OLTP are very cautious—they wear both a belt and suspenders. They’re very risk-averse, and they’re not going to trust a startup, no matter what.
If you started a company, it would behoove you to get two or three huge application elephants to be backers who would agree to go through the pain to give you legitimacy. For example, Dale Skeen’s company, Vitria, in the beginning, had FedEx as its premier account. You need a pathfinder application.
Another alternative is if you’re in the warehouse market and you’re successful because there’s so much pain there, then you move into the mixed market, which is partly transactions and partly warehouses. Once you’re successful there, you just attempt to eat your way into the OLTP market.
SELTZER The classic disruptive technology approach.
STONEBRAKER All startups with disruptive technology have this problem. How do you get legitimacy in the enterprise software space, where stuff really has to work?
One of the things I find fascinating is that we’ve been writing software for 30 years and the tools we have to create reliable software are not significantly dissimilar from what we had a long time ago. Our ability to write reliable software is hardly any better now than it was then. That’s one of my pet peeves.
SELTZER Does that mean you’re going to become a languages guy or a tools guy?
STONEBRAKER I wish I knew something about that.
SELTZER That hasn’t stopped others before.
STONEBRAKER If you look at how you talk to databases right now, you use ODBC and JDBC, embedded in your favorite language. Those are the worst interfaces on the planet. I mean, they are so ugly, you wouldn’t wish them on your worst enemy.
C++ and C# are really big, hard languages with all kinds of stuff in them. I’m a huge fan of little languages, such as PHP and Python.
Look at a language such as Ruby on Rails [a web framework written in Ruby]. It has been extended to have database capabilities built into the language. You don’t make a call out to SQL; you say, “for E in employee do” and language constructs and variables are used for database access. It makes for a much easier programming job.
There are some interesting language ideas that can be exploited. If I knew anything about programming languages, I probably would attempt to do something.
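The "for E in employee do" style described above can be approximated in Python. This is a sketch of the idea, not Rails and not its API: `EmployeeTable` is a hypothetical wrapper, the `employee` schema and sample rows are invented for illustration, and SQLite is used only so the example is self-contained.

```python
import sqlite3
from collections import namedtuple

Employee = namedtuple("Employee", ["name", "salary"])


class EmployeeTable:
    """Hypothetical wrapper that lets a table be used in a plain for-loop."""

    def __init__(self, conn):
        self.conn = conn

    def __iter__(self):
        # The SQL is hidden here; callers see only Python objects.
        cur = self.conn.execute("SELECT name, salary FROM employee ORDER BY name")
        for row in cur:
            yield Employee(*row)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Ada", 120), ("Grace", 150), ("Ted", 90)],
)

employee = EmployeeTable(conn)

# Database access reads like an ordinary loop over a collection.
well_paid = [e.name for e in employee if e.salary > 100]
```

The calling code contains no SQL strings and no cursor handling. The wrapper still issues SQL underneath, so this shows only the programming-interface side of the argument. The filter here also runs in Python rather than in the database, which a real language embedding would want to avoid.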
SELTZER Now I’m really going to hold your feet to the fire. You were around not only at the birth of the relational stuff, but you were one of the movers and shakers that made it happen. Are you going to be one of the movers and shakers who helps lead to its demise, as well?
STONEBRAKER Let’s look at Ruby on Rails again. It does not look like SQL. If you do clean extensions of interesting languages, those aren’t SQL and they look nothing like SQL. So I think SQL could well go away.
More generally, Ruby on Rails implements an entity-relationship model and then basically compiles it down into ODBC. It papers over ODBC with a clean entity-relationship language embedding.
So you say, “Well, if that’s true, is the relational model going to make it?” In semi-structured data, it’s already obvious that it’s not. In data warehouses, 100 percent of the data warehouses I’ve seen are snowflake schemas, which are better modeled as entity relationships rather than in a relational model.
If you get a bunch of engines for a bunch of different vertical markets, both the programming language interface and the data model can be thrown up in the air. We aren’t in 1970. It’s 37 years later, and we should rethink what we’re trying to accomplish and what are the right paradigms to do it.
SELTZER One of the big arguments, if I recall correctly, was that you could prove things about the relational model. You could make strong mathematical statements. Is that important in building systems or in designing and developing this kind of database software?
STONEBRAKER If you look at what Ted Codd originally did with the relational model, and you compare it with SQL, you can prove almost nothing about SQL. In fact, there’s a terrific paper by Chris Date (A Critique of the SQL Database Language, ACM SIGMOD Record, 1984) that basically spent page after page, in area after area, explaining why SQL has terrible semantics. I think we’ve drifted far away from Ted Codd’s original clean ideas.
SELTZER Have we drifted sufficiently far away from our roots that the roots no longer matter?
STONEBRAKER I think that’s right, and I think with good reason: because Ted Codd’s original idea was to clean up IBM’s IMS (Information Management System) and business data processing. Now you want semi-structured data and data warehousing, and the problem is just vast, compared with what he was talking about 37 years ago. We’ve taken what started out as a simple standard and grown it into a huge thing, with layer upon layer of junk.
SELTZER Which no one understands.
STONEBRAKER Therefore, what the community does is “add only,” which is why we just get more and more stuff. You don’t create a skyscraper by growing it one floor at a time, year by year by year, by committee.
SELTZER I’ve always liked the attitude that we should start hiring programmers to remove lines of code, instead of hiring them only to produce lines of code.
I have one last question to ask: Now that you’ve done startups on both coasts, can you say there is a difference?
STONEBRAKER Having seen programmers, students, and technologists on both coasts, I have found that there are more of them on the west coast, but there sure are smart people everywhere.
In terms of the venture capital community, I think the east coast VCs are more conservative. You know, there are more of them who wear bowties.
I don’t detect any difference in the intellectual climate. I think MIT has some of the smartest people on the planet. So does Stanford. So does Berkeley.
SELTZER There’s another school up the river, Mike, that you’re missing.
STONEBRAKER I applaud your efforts to improve computer science at Harvard, and I wish Harvard would get deadly serious about computer science because there’s a tremendous upside that you can realize over time.
SELTZER Well, come meet our students!
Originally published in Queue vol. 5, no. 4.