From Liability to Advantage: A Conversation with John Graham-Cumming and John Ousterhout

July 14, 2008
Volume 5, issue 2

Software production has become a bottleneck in many development organizations.

Q: Welcome to this month's Queuecast, featuring a conversation with John Graham-Cumming and John Ousterhout, cofounders of Electric Cloud, Inc. Electric Cloud produces solutions that automate, accelerate, and analyze the entire software production management process, and with that, I'll turn it over to John Graham-Cumming. John?

JOHN GRAHAM-CUMMING: Thanks very much. And we're here today to talk about something that Electric Cloud has really staked a claim to, and that is software production management. So, John, this is a new term. What is it all about?

JOHN OUSTERHOUT: Well, the way we think about it is that you can view software production, or software development, as having a frontend and a backend. The frontend process involves the highly creative, very human-involved activities such as design and coding and debugging and so on. And if you look over the last 20 or 30 years, there have been dozens of high-quality tools built to help people with the frontend side of software development. Then there's the backend side, things like build and test and package and deploy -- processes that ideally would have no human interaction at all. You simply push a button and they run like clockwork, and those processes have not received much attention at all over the last 20 or 30 years, and that's where we're really focusing as a company.

GRAHAM-CUMMING: So why haven't they received any attention? We've been building software for a long time. Has this software production management thing become a new problem? Has it only just appeared? What's happening?

OUSTERHOUT: Yes, that's a good question. It's definitely not a new problem. It's certainly been around as long as I've been doing software development, but it's been getting a lot worse over the last 10 or 15 years, and there are at least four factors that are contributing: software complexity, agile programming, distributed development, and compliance.

Software complexity just means that software continues to get more and more complicated. People talk about Moore's Law for integrated circuits, where the complexity of an IC goes up by a factor of four every three years. Software complexity is probably rising even faster than that.

It's not just individual packages getting larger but a proliferation of variations of packages; embedded systems vendors, for example, have to write software that works on dozens of different chipsets. Then there's agile programming, where people are trying to do a much more iterative style of development, and in order to do that, you need a backend that runs like clockwork because you need to do lots more build and test to support agile programming.

Then third, there's distributed development, where more and more development is happening at sites that are a long way away from each other. In many cases, in fact, there's no overlap in the work days between groups at different sites, between, say, the United States and India, or Europe and Asia, and because of this, you depend on a much higher level of automation in organizing the interactions between these different groups.

And then finally, there's the increasing emphasis on compliance, which means you need to be able to prove certain properties of your software, and for this, you need really good reporting, and you need to keep track very carefully of software and how it is built.

GRAHAM-CUMMING: So people are really facing these problems today, and here comes Electric Cloud. But if you just put Electric Cloud aside for a moment, what are people doing to try to solve the problems with the things that are out there, and what is out there that they can go download and start playing with?

OUSTERHOUT: There's not a lot of stuff you can download today. There are a few open-source packages that manage pieces of the problems -- packages like Make or Ant, for doing builds, for example -- and some relatively small packages for managing build and test, things like Maven and Cruise Control.

But honestly, there's not a lot of stuff there. Mostly, people tend to build their own to handle software production.

GRAHAM-CUMMING: Yes. I think build managers for a long time have been used to building lots of tools and things ... packing-like files and ... scripts. That isn't good enough any more. We have to cut that.

Build managers really -- they're used to hacking things together. There was a time with Perl and Make. Why can't they continue? Why isn't that good enough any more?

OUSTERHOUT: Well, it's a combination of all of those factors I mentioned before that are just making the task of rolling your own more and more difficult. For example, if you have hundreds of developers who want to be able to invoke build and test at the push of a button, you need pools of machines available for that, and so you have to build resource management for those pools.

If your project is growing very large and you have builds that take hours and hours, you need some way of accelerating those.

And if you have distributed development, you need to automate all of these interactions between different teams.

And so just the accretion of responsibilities and functionality that has to go into this software is making it harder and harder to roll your own. So as a result, people tend to end up with three problems today: performance, manageability, and visibility.

The first one is performance, and the issue there is that things just take too long. For example, of the companies we talk to with more than a few dozen developers, the average build time is typically two to four hours. That's a huge impact on engineering productivity. On the manageability side, it's this proliferation of homegrown scripts to manage build and test that results in a very brittle system that's very hard to scale or evolve as your products change. And then the third area is visibility: again, coming mostly from homegrown stuff, it's very hard to see what's going on inside your build and test processes.

On the one hand, you have way too much output, often megabytes of output, say, from a Make. On the other hand, it's very hard to get the things you need, key metrics like what percentage of builds are succeeding or failing, and whether there are particular tests that are often failing, so that we should go in and figure out how to rewrite those tests.

So three things again: performance, manageability, and visibility.
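Editor's sketch: the two visibility metrics mentioned above (the percentage of builds succeeding, and the tests that fail often) can be computed from very little data once build results are recorded in a structured form. The record layout, the test names, and the function names below are all hypothetical and exist only for illustration; they do not describe any Electric Cloud product.

```python
from collections import Counter

# Hypothetical build records: (build_id, succeeded, names_of_failed_tests).
builds = [
    (1, True, []),
    (2, False, ["test_login", "test_cache"]),
    (3, False, ["test_cache"]),
    (4, True, []),
    (5, False, ["test_cache"]),
]

def success_rate(records):
    """Return the percentage of builds that succeeded."""
    if not records:
        return 0.0
    passed = sum(1 for _, ok, _ in records if ok)
    return 100.0 * passed / len(records)

def often_failing_tests(records, min_failures=2):
    """Return (test, failure_count) pairs for tests failing in at least
    min_failures builds, most frequent first."""
    counts = Counter(name for _, _, failed in records for name in failed)
    return [(name, n) for name, n in counts.most_common() if n >= min_failures]

print(f"{success_rate(builds):.0f}% of builds succeeded")
print(often_failing_tests(builds))
```

The point is less the arithmetic than the prerequisite: these numbers are only easy to get when build outcomes are stored as data rather than buried in megabytes of log output.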

GRAHAM-CUMMING: I know all the things you're talking about there have me nodding my head and remembering back to when we were founding Electric Cloud and talking about some of those problems.

We'd encountered other companies. Why don't we talk a little bit about Electric Cloud and I'll just give you the microphone here just to chat about what were those things that made us found Electric Cloud, and what are we doing today?

OUSTERHOUT: Well, of course, as you know, we started the company because we got tired of our own bitter experiences dealing with various software production problems, and the key experience for me was when our last company was bought by a larger company, and we had a chance to observe its development team, and all of the build problems they had, where builds were taking seven to 10 hours, and there was one period, late in the development of a major release of a new product, where more than a month went by without a single successful production build. By the time the last build got fixed, somebody else had checked in changes that broke the next one and it just went on and on.

So, the reason we decided to start the company was to see if we could build industrial strength tools, that would make it possible to solve these problems, and we started off, mostly, working in the speed area and build speed but the more experience we got, the more we became aware of these other problems, and so we've recently expanded our product line to cover the whole area of software production management.

GRAHAM-CUMMING: Yes, we were talking about the company we were in that got bought by the other one. You brought back awful memories for me of how long those builds were and all the problems we had in getting things working because it was brittle. Let's just talk a little bit about those three things you were mentioning -- performance, manageability, and visibility -- because one of the problems we had at that previous company was overall manageability: builds that were incredibly long, but also the sheer number of them and how you deal with them.

Tell us a bit about what we're doing. Let's talk a little bit about what we're doing with manageability these days.

OUSTERHOUT: Sure. So let me start with the problem, which is, typically it's a mass of homegrown build scripts that each organization builds for itself.

We had multiple experiences between us with this. For example, we actually built our own home-brew build system at Electric Cloud when we were building our first product, and probably invested an engineer a year over the first two years of the company in the system, and it ended up being a really bad system, consisting of bandaids piled on top of bandaids.

The problem is that you never consider your build system your core competency, so you tend to throw something together that does not have a very good architecture, does not scale, and does not provide very good reporting facilities.

So the goal of our product Electric Commander is to fix this by providing a powerful web-based platform that makes it easy for you to implement and manage distributed processes like those for build and test.

GRAHAM-CUMMING: I think a key thing anyone listening to this will say is that sounds wonderful, but I've got an enormous amount invested in my existing infrastructure. There's no way I could rewrite it or make it work with Electric Commander. What are you requiring from a new person who uses Electric Commander?

OUSTERHOUT: Yes, that's certainly a good point, and we've tried to build Electric Commander so that you can pretty easily incorporate your existing scripts. You can think of it as bringing your existing build scripts wholesale into the Electric Commander system, getting it running and getting some benefits, and then gradually over time breaking it up into the pieces that fit more naturally with the Electric Commander structure.

One of the interesting things about these nightly build scripts, if you look at them, it seems like everybody's build scripts are different, and they are at least superficially. Every team builds their own. We talk to companies that have 30 different build systems in the company, but if you look under the covers, in fact, they tend to be built out of the same components. Everybody has the same basic infrastructure they end up building into their build and test systems.

For example, the ability to run a command on a remote machine, you know, do my Solaris build over here and my Windows build over there.

Or, send out an e-mail report after the build finishes, or reboot a machine or detect an error if a machine crashes during a build.

So, even though superficially all these build systems look different, in fact, they tend to consist mostly of the same infrastructure. It's sort of ironic that you have dozens of teams within a company and thousands of teams all over the world basically re-implementing the same infrastructure over and over again.

So one of the ideas for Electric Commander was that we provide a very powerful architecture with all of that common infrastructure that everybody needs, along with a very simple web-based interface where you can define the stuff that's specific to your environment, and by doing that, we can reduce dramatically the amount of work you have to do to put together a build or test system.

GRAHAM-CUMMING: Right, and also we're not asking people to throw away their Make files or anything like that.

OUSTERHOUT: No, no, in fact, this is really -- at this level, we're talking about the infrastructure around us, so the system that runs your Makes, for example, as opposed to replacing 'Make.'

GRAHAM-CUMMING: So, John, give us an example of how this thing actually operates? How would I use it in my day-to-day work?

OUSTERHOUT: The way you use Electric Commander is that you come in through a web interface and you describe your processes. Processes are described in terms of what we call a procedure, which is a collection of steps. You say something like 'this compile step needs to be executed on this machine and this test step needs to be executed on this machine' and so on.

So you describe what are the processes you want to run, you describe what machines are available for running those processes -- we call those resources -- and then you describe a collection of schedules which indicate when a particular process should run. You might have something that runs every night at midnight, or you might do continuous builds that run every 20 minutes all day long, or you might do a build and test every time somebody checks in a change on a source control system.

So once you describe your processes, schedules, and resources, then Electric Commander does all the rest. We store that information in a database, and our server steps in and runs the various processes when you want them to run, and records a whole bunch of information about them that you can then use for reporting.
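Editor's sketch: the three things described above (procedures made of steps, resources, and schedules) map naturally onto a small data model. The Python below is a minimal illustration under stated assumptions; the class names, field names, machine names, addresses, and trigger strings are all hypothetical and are not Electric Commander's actual schema or API.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    command: str
    resource: str  # name of the machine that should run this step

@dataclass
class Procedure:
    name: str
    steps: list = field(default_factory=list)

@dataclass
class Schedule:
    procedure: str
    trigger: str  # hypothetical free-form trigger, e.g. "daily 00:00"

def plan(procedure, resources):
    """Pair each step with its machine address, rejecting unknown resources."""
    assignments = []
    for step in procedure.steps:
        if step.resource not in resources:
            raise ValueError(
                f"step {step.name!r} needs unknown resource {step.resource!r}")
        assignments.append((step.name, resources[step.resource]))
    return assignments

# Hypothetical resources: logical name -> machine address.
resources = {"linux-1": "10.0.0.5", "win-1": "10.0.0.6"}
nightly = Procedure("nightly", [
    Step("compile", "make all", "linux-1"),
    Step("test", "make test", "win-1"),
])
schedules = [Schedule("nightly", "daily 00:00")]

print(plan(nightly, resources))
```

Keeping resources as a separate lookup, rather than hard-coding machine names into scripts, is what lets the same procedure run unchanged when the pool of machines changes.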

GRAHAM-CUMMING: Okay, I want to move onto the second topic which is speed. This is something which I notice more and more being important because of two things, one is the distributed development that's going on, which essentially eliminates night. So the concept of nightly build is an oxymoron at this point.

The other one is agile development, which is leaning -- developers are clamoring for very fast builds. So fast, in fact, that they want to go have their coffee and come back to a full build.

So let's talk a little bit about how Electric Cloud deals with the speed problem.

OUSTERHOUT: Sure. This is our second product -- actually, it's our first product historically -- called Electric Accelerator, and it's a product that's focused on just the build problem. So imagine the stuff that happens at or below the level of Make or Ant or Visual Studio Build. Electric Commander is more, think of it as the global overall infrastructure for tying together all the pieces of your build and test. Electric Accelerator focuses on this one point problem, and I'm sure people are wondering how these products relate, and the answer is, you can use them either together or separately. They solve different but related problems.

So anyhow, the idea of Electric Accelerator is to make those builds run a lot faster, ideally, for example, 10 or 20 times faster, or even more.

And the way that Electric Accelerator does that is by implementing parallelism at a very fine grain, using the information you have provided for Make or Visual Studio or whatever. We break the build up into many small pieces at the level of individual files to be compiled, and then pass those individual compiles out to a cluster of machines where they're done in parallel.

So the simple overall answer is you get speed through parallelism.
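Editor's sketch: the idea of splitting a build into per-file compiles and running them concurrently can be illustrated with a worker pool. Here a local thread pool stands in for the cluster of machines, and `compile_one` is a hypothetical placeholder that only computes an object-file name instead of invoking a compiler. The sketch assumes every compile is independent of the others, which is exactly the assumption the conversation questions next.

```python
from concurrent.futures import ThreadPoolExecutor

def compile_one(source):
    """Stand-in for compiling one file; a real build would run a compiler here."""
    return source.rsplit(".", 1)[0] + ".o"

def parallel_build(sources, workers=4):
    """Run the per-file compiles concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compile_one, sources))

objects = parallel_build(["main.c", "util.c", "net.c"])
print(objects)
```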

GRAHAM-CUMMING: Okay, and I think that everybody listening, it sounds great. Can I just do, make ... and get parallelism for free, it is out there, isn't it? Why is it different? What is it that we've done different?

OUSTERHOUT: Right, there are tons of attempts people have made over the last couple of decades to make parallelism accessible for builds. I did some back in the late 1980s, when I was a professor at Berkeley, and we've had other experiments along the way, and if you go and you Google distributed Make or parallel build, you'll see lots of hits on that.

So this idea of a parallel build is great except for one little hitch, which is that it doesn't work. And the reason it doesn't work, at least not until now, is that in order to know when it is or is not safe to run build steps in parallel, you have to have perfect dependency information. You have to know which parts of the build depend on which other parts so you don't accidentally do things in parallel that should be run serially. And unfortunately, until now, the only way to get that dependency information was for a human to specify it. There are some tools like Make depend, for example, that can provide you with some of the dependency stuff, but ultimately, it comes down to an engineer writing down in a Make file which files depend on which other files.

Obviously, for a commercial project with thousands or tens of thousands of source files, that just isn't practical.

So when we started the company, we knew we had to solve the dependency problem. We couldn't depend on humans to provide this information, and so in Electric Accelerator we have built some pretty cool technology where we can deduce the dependencies while the build is running by watching file accesses. The simple idea is that if you read a file that I write, you'd better not run until after I've finished. So by doing that, we can get a perfect picture of the dependencies, and with that picture, we can make parallel builds safe. We can bring 10, 20, 30 or more CPUs to bear on a single build.
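Editor's sketch: the rule stated above ("if you read a file that I write, you'd better not run until after I've finished") can be expressed in a few lines. The interview does not describe how Electric Accelerator actually records file accesses, so this is only an illustration of the rule itself: it assumes each step's reads and writes have already been observed, in serial order, and all step and file names are hypothetical.

```python
def infer_dependencies(steps):
    """steps: list of (name, files_read, files_written) in serial build order.

    Returns {step: set of earlier steps whose output it reads}.
    """
    last_writer = {}  # file -> most recent step that wrote it
    deps = {}
    for name, reads, writes in steps:
        # A step depends on whichever earlier step last wrote each file it reads.
        deps[name] = {last_writer[f] for f in reads if f in last_writer}
        for f in writes:
            last_writer[f] = name
    return deps

# Hypothetical observed accesses for a tiny build.
observed = [
    ("gen_header", [], ["config.h"]),
    ("compile_a", ["a.c", "config.h"], ["a.o"]),
    ("compile_b", ["b.c"], ["b.o"]),
    ("link", ["a.o", "b.o"], ["app"]),
]
deps = infer_dependencies(observed)
for step, needs in deps.items():
    print(step, "waits for", sorted(needs))
```

In this example `compile_b` has no inferred dependency, so it could safely run alongside `gen_header`, while `compile_a` must wait for the header to be generated even if no Make file says so.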

GRAHAM-CUMMING: You know, when you started talking about how dependencies matter, you reminded me of the third topic we cover, which is visibility, because I think most people who deal with builds don't have visibility into the interior of the build: which jobs take a long time, or what depends on what. That's a very hard thing to do...

Can you talk a little bit about the Electric Insight tool?

OUSTERHOUT: Yes, Electric Insight started off as a debugging tool and then turned out to be so useful that we turned it into a separate product. The way it works is that it uses a database created by Electric Accelerator while it's running a parallel build. Electric Accelerator spits out an XML file we call an annotation file, with all sorts of information about the build, and then when the build is finished, Electric Insight can take that and provide you with a graphical display of what went on in the build. For example, when you're running a parallel build, Electric Insight will show you each of the CPUs that's involved in the build and what work that CPU is doing at any given time, so you can see the progress of the build. You can see, for example, that all of a sudden there's a gap in time where only one machine in the cluster is doing any work and the others are all waiting, and from that, you can figure out that there must be some dependency that's forcing everything else to wait, and using Electric Insight, you can go in and explore and figure out what that dependency is.

You can also get lots of other information about steps that fail and what went on.

One of the problems with Make, for example, is that people put a little at-sign in front of many of the commands in their Make file, which causes those commands not to be echoed to the Make log when Make runs. So when you go back and look at the log, it's very difficult to figure out what was going on at the time of an error because often the interesting commands weren't even echoed.

Well, with Electric Insight, we keep all those around and we can show you exactly which commands were running and furthermore, which Make files they came from and which Make files those Make files were invoked from and so on.

So at a whole bunch of different levels, Insight allows you to see what's going on inside your build.
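Editor's sketch: the gap analysis described above (finding stretches where only one machine in the cluster is working) can be approximated with a sweep over job start and end times. The annotation file format is not described in the interview, so the input here is a hypothetical list of `(agent, start, end)` tuples with made-up times; this is not Electric Insight's algorithm, only an illustration of the idea.

```python
def serialized_windows(jobs, min_length=1.0):
    """jobs: list of (agent, start, end).

    Returns [(start, end)] windows, at least min_length long, during which
    exactly one job is running anywhere in the cluster.
    """
    events = []
    for _agent, start, end in jobs:
        events.append((start, 1))
        events.append((end, -1))
    events.sort()  # at equal times, job ends (-1) sort before job starts (+1)

    windows, busy, prev = [], 0, None
    for time, delta in events:
        if prev is not None and busy == 1 and time > prev:
            if windows and windows[-1][1] == prev:
                windows[-1] = (windows[-1][0], time)  # extend adjacent window
            else:
                windows.append((prev, time))
        busy += delta
        prev = time
    return [(s, e) for s, e in windows if e - s >= min_length]

# Hypothetical timeline: three agents, with one long step everything waits on.
jobs = [
    ("agent1", 0, 4), ("agent2", 0, 4), ("agent3", 0, 3),
    ("agent1", 4, 10),
    ("agent2", 10, 13), ("agent3", 10, 12),
]
print(serialized_windows(jobs, min_length=2.0))
```

A long window reported here is a hint, not a diagnosis: it says where to look for the dependency that is forcing the rest of the cluster to wait.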

GRAHAM-CUMMING: Let's talk a little bit about these overall -- you've really talked about software production management involving visibility, which we just talked about, performance in terms of speed of builds and overall manageability.

What are the benefits you think organizations get if they install some piece of software production management, software or procedures?

OUSTERHOUT: Right. I think you're going to see the benefits in three areas. The easiest one to measure is developer productivity, and I'll come back in a second and give you an example of that, where if you run things faster, developers simply get work done more quickly. Second area is quality. This one is hardest to measure but perhaps the most important.

One of the laws of software development is that the cost of finding and fixing a problem rises dramatically, by orders of magnitude, if the bug is not found until late in the process. For example, if it gets out in the field, it's extremely expensive to fix.

If it's found on a developer's desktop before they check anything in, it's really cheap to fix.

So with better production tools you can find problems much more quickly, run many more tests, and by doing that, you get a much higher quality product at a lower overall cost.

And then the third benefit is time to market. We're seeing customers where they can measure weeks saved out of development cycles by eliminating the bottlenecks that currently occur because of software production management issues. Things like, if a build breaks late in the development cycle and your builds are taking eight or 10 hours, that's probably a lost day for you.

If you can respin that build in half an hour or an hour, then there may not be any days lost at all.

GRAHAM-CUMMING: Sounds great. You know better than me where we are in terms of customers these days. Can you talk about a couple of customers who are really getting benefits and using the tools?

OUSTERHOUT: Let me give one example. Qualcomm is a cellphone chip and software manufacturing company that has been using our Electric Accelerator product for a couple of years. The problem they face is that cellphone software complexity is rising dramatically. These cellphone wars are likely to be won by whoever can come out with the next cool feature that gets teenaged girls to trade in their old phones for new ones.

So their complexity is going up dramatically, and they have to support a variety of platforms. When they make a change to their software, they need to rebuild that and test that on as many as 30 or 40 different chipsets in order to make sure they haven't broken anything.

So they have a huge build problem that's becoming worse and worse as time goes on.

At Qualcomm, we were able to get dramatic speedups in the build, and after we installed Electric Accelerator, we were able to make measurements of developer productivity. Using the software that comes with Accelerator, we can measure how often people are doing builds and how long those builds take, and then since we know how much faster we're running the builds, we can estimate how long the builds would have taken without Electric Accelerator. From that we were able to figure out how much time we were saving the developers, and the answers are pretty cool.

We're saving about eight hours per developer per week in one team, and about five hours per week in other teams.

So that's a pretty large productivity gain. There aren't many ways that you can get five, 10, 20 percent gains in developer productivity.
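Editor's sketch: the estimate described above follows from three measurements: how often a developer builds, how long the accelerated build takes, and the speedup factor. Only the eight-hour and five-hour weekly savings come from the interview. The inputs below (20 builds per week, 3-minute accelerated builds, a 9x speedup) and the 40-hour work week are invented purely so that the arithmetic lands on eight hours.

```python
def hours_saved(builds_per_week, accelerated_minutes, speedup):
    """Estimate weekly hours saved for one developer.

    The unaccelerated build time is estimated as accelerated_minutes * speedup.
    """
    baseline_minutes = accelerated_minutes * speedup
    saved_minutes = builds_per_week * (baseline_minutes - accelerated_minutes)
    return saved_minutes / 60.0

def productivity_gain(saved_hours, work_week_hours=40.0):
    """Express saved hours as a percentage of an assumed work week."""
    return 100.0 * saved_hours / work_week_hours

# Invented inputs chosen to reproduce the eight-hour figure.
saved = hours_saved(builds_per_week=20, accelerated_minutes=3, speedup=9)
print(saved, "hours saved per week")
print(productivity_gain(saved), "percent of a 40-hour week")
print(productivity_gain(5.0), "percent for the five-hour teams")
```

Against a 40-hour week, eight hours is 20 percent and five hours is 12.5 percent, which is consistent with the range of gains quoted in the conversation.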

GRAHAM-CUMMING: Give them a lot of coffee.

OUSTERHOUT: We don't let them sleep. Then a second example is LSI Logic, interestingly, another embedded systems company, again with a very large, complicated test matrix. There, their biggest problem was managing the complexity of all of the different test sets they had to run on all the different chipsets. They'd been using our Electric Commander product -- actually our Accelerator product as well, but most recently, particularly the Electric Commander product -- which has allowed them to manage this huge number of test heads they had and the complex matrix of tests to keep up with that. Furthermore, with some of the tools in Electric Commander for reporting, they've been able to extract more data about their testing than they could before, so they have a much better sense of what's going on inside their tests and how stable their software is.

GRAHAM-CUMMING: That's very interesting. It's very nice having a chance to talk to you.

OUSTERHOUT: Thanks very much. It's been fun talking with you, too.

Originally published in Queue vol. 5, no. 2