

A Conversation with David Brown

The nondisruptive theory of system evolution

This month Queue tackles the problem of system evolution. One key question is: What do developers need to keep in mind while evolving a system, to ensure that the existing software that depends on it doesn’t break? It’s a tough problem, but there are few more qualified to discuss this subject than two industry veterans now at Sun Microsystems, David Brown and Bob Sproull. Both have witnessed what happens to systems over time and have thought a lot about the introduction of successive technological innovations to a software product without undermining its stability or the software that depends on it.

David Brown began his career in computing with the disruptive innovation brought about by the microprocessor generation: first on the SUN project at Stanford University and then as a founder of Silicon Graphics. He has now worked for more than 14 years on the Solaris system at Sun, where nondisruptive innovation (evolution and sustainability) is critical. Both the length and breadth of Brown’s career give him a unique perspective on systems and how they evolve. Brown led the Application Binary Compatibility Program for Solaris and discusses how new functionality is introduced in that system without breaking existing applications.

Interviewing Brown is Bob Sproull, a legend in the field, a Sun Fellow, and the head of Sun Labs. Sproull is well known for his early work in computer graphics and the seminal work, Principles of Interactive Computer Graphics (McGraw-Hill, 1973), which he wrote with William Newman. Sproull was also a member of the pioneering team at Xerox PARC that designed the first personal computer, the Alto, and its operating system. Sproull’s comprehensive systems expertise makes him deeply familiar with the key problems inherent in system evolution.

BOB SPROULL To begin, let’s hear about some of your roles in the Solaris engineering group at Sun.

DAVID BROWN I arrived at Sun in 1992, around the time of the change from SunOS 4 to Solaris. At that time we were beginning the Systems Architecture process, which I was hired to help design and get started. This was because we observed a couple of basic things.

One was that we now had several hundred engineers concurrently developing projects to put into Solaris, and there was a question about how we were going to do traffic control on all that and preserve the integrity of the system.

The second was this Homer Simpson “Doh!” moment when we found that our existing customers weren’t terribly excited about going to the new Solaris 2 system because it didn’t run a lot of the stuff they already had. This derived from the fact that in going from SunOS 4 to SunOS 5 we initially cut over a little bit too carelessly: We did a largely incompatible “big bang” at first.

My initial day-job in 1992 consisted of thinking about what our engineering review processes needed to be to get this whole System Architecture effort off the ground. We realized that we needed to be very careful about how we made changes to the system. One basic idea was to have every new project going into the system reviewed from an architectural perspective by the senior engineers. Another was to track the interfaces we were putting into the system that people were going to come to depend on, so we could preserve those and not disrupt applications or other layered products built on top of them. A big focus was defining what every new project’s “interfaces” would be. This was a bit of a challenge at the outset because many of the engineers—including the senior engineers like me who were supposed to be thinking about and reviewing this—often didn’t understand what interface actually meant: interface offered to whom, and for what purpose?

BS Could you summarize what you think the issues are that have to be balanced between the vendor (the operating system vendor, implementer, evolver, changer) on the one hand and the customer who wants to use it for his or her application or running a business on the other hand? What are the forces that animate changing the operating system in the first place?

DB The most obvious animating forces come from the customer. They have these things called applications that they use to run their businesses. They’ve made big investments in developing and deploying them, and they’re very interested in keeping them up and running. This concern about not disrupting their existing applications extends to the layered software upon which those things might run—and all in its compiled form.

In other words, they’re not just saying, “Oh gee, can we recompile the source code to get another working binary?” Rather, they are saying, “Gosh, I’ve already deployed thousands of these things across my enterprise, and I want those existing running binaries to continue to work without change when new releases of the operating system (or other parts of the system software) come out.”

BS Part of the problem is the customer wants the benefit of all of the advances, but no disruption of their world. They want the faster hardware, the bigger memory, the bigger, faster disks, the automatic file mirroring. They always want more performance.

DB Yes, that’s also true. But on the customer side, rule number 1 is: “If you undermine what I’ve already built, I’m not interested.”

End users and customers are very attached to what I refer to as the “problem set.” They’re very invested in solutions that help them go after their particular business problems. On the systems software (vendor) side of technology, we’re very absorbed by the “solution set”—all these technological opportunities that are before us. The big motivator for us is to bring these new benefits out to our customers (and thus attract them to buy our new systems). Although it’s clearly also attractive to the end users and customers as you point out so correctly, I tend to view this as being more vendor-driven.

But these two things—new features vs. stability of existing ones—tend to be in tension. In order to push new features into the system—and this may be the crux of what we’re focused on here—you have to be very careful about the way in which you deliver them. They can’t undermine any of the existing commitments that you’ve made in previous versions of the system, because people have already built applications or other layered software upon them that needs to continue to run.

BS Let’s move on to talk about the techniques, both technical and organizational—or just structural—that you can use to manage interface evolution, or just evolution generally. I’ll start with a simple question: What do you mean by backward compatibility? And what do you mean by sustainability?

DB Backward compatibility, or what we usually call “upward compatibility” in terms of the way we build the system, means at a given point in time—let’s say Solaris 2.1—the system has a bunch of features that can be exploited. These are exposed through some interfaces: the means of getting access to them, so that applications can build on them and use them. Upward compatibility means that in the next release of the system, Solaris 2.2, all of these same capabilities are still present, and more specifically, they’re accessible in the same way as they were in Solaris 2.1. You might be creating new features and new capabilities, but these are in addition to what was there before.

BS Moreover, you don’t have to recompile existing applications if you don’t want to use the new features.

DB That is a very important point. The focus is on existing applications in their binary (already compiled) form. This means we’re interested in the applications’ runtime (as opposed to build-time) interface to the system.

When Sun was a technical desktop company—in its early disruptive days—most of the customers were scientists and engineers who were happy as long as they could recompile their source code for the next release. But in the evolution to Solaris 2, when we shifted our focus to sell these systems to enterprise commercial customers, things changed. What you find is that it’s not just a couple of guys at Lawrence Livermore Labs who are writing their own code and retreading that stuff all the time anyway. Instead, it’s these guys at Morgan Stanley with this enormously complicated bond trading system that has taken years to build, and they got it running on Solaris 2.6. But now we’ve come out with Solaris 2.8, which supports the blazingly fast new hardware (or whatever): It offers various other advantages of scale, performance, or robustness that make it attractive, and we want them to be able to use that.

It’s not so much that the Morgan Stanley guys don’t want to have to recompile their application, but the recompilation to get a new binary also implies that they have to redeploy it to all these places where it’s already running. They want the existing application, in its already-compiled form and as deployed, to continue to run when they upgrade to the new version of the operating system. (“Build Once, Run Forever” was a play on the Java marketing slogan that was coined to describe this.)

BS How do you achieve that? Could you also talk a bit about shared libraries and their naming?

DB The key question is: How are we going to characterize what the existing deployed applications (and layered infrastructure) depend on? One way to look at this might be: What are the ways in which an application could be broken? There’s a subtle difference between those two questions that’s very important. Trying to characterize all the ways in which any application could be broken is really daunting. There are just too many ways in which something could be undermined.

So, an important principle for me in approaching this was to forego thinking about complete solutions (in order to deal with any possible way in which any application could be undermined), and instead to ask: What are the things that every application must do to get access to the functionality in the system? Then make those primary, necessary pathways as robust as possible—really look at what happens there and ensure that the common things that every app has to do will work well.

A primary matter is the way that applications get access to the system’s functionality. In the earlier days, it was by making system calls. Increasingly, however, most of the application-relevant functionality is wrapped up in these things called libraries: collections of related functions such as I/O-related functions, thread management, process management, or file access. The operating system’s kernel—the protected lower-level piece of the system—tends to be much less directly relevant to the application. It offers these more primitive low-level facilities such as the virtual memory abstraction and process scheduling and so on, whereas the applications are much more interested in things like: open a file; read and write the file; do some I/O. Those are offered by the system’s libraries. That’s the primary vehicle whereby applications get access to the system.

A further, particularly important feature of Solaris 2 (and most other systems of its generation—this is kind of a late ’80s, early ’90s technological shift) is dynamic linking. In Solaris, the way that applications bind to libraries (and the interfaces within them) is through linking at runtime. In the old days the application’s linkage to the libraries was done at compile-time (what we call static linking). This essentially copied parts of the implementation of the library in question into the application binary. The application’s runtime binding interface was then the system calls (those traps that these bits of library implementation used to access the kernel’s services).

A really important consequence in Solaris 2 (and more recent systems of its ilk) is that the runtime binding is between the application and the system’s libraries. Effectively it’s the surface that’s exposed by all of the libraries in the system that now constitutes the application’s runtime interface—what I like to call “the Solaris Virtual Machine.” As it happens, the executable code in each of the system’s libraries is also shared by all applications that use it: There’s only one copy of libc, for example.
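
A minimal sketch of what that contract looks like from the application side. The program below touches only public libc entry points; the build command and library name in the comments are illustrative. The point is that the compiled binary records a dependency on the library’s interfaces rather than a copy of its implementation, so an improved libc in a later release benefits this same, unmodified executable.

    /* hello.c -- binds to libc at runtime rather than copying it in.
     * Illustrative build:  cc -o hello hello.c
     * The executable records a dependency on the shared C library
     * (libc.so.1 on Solaris); the runtime linker resolves time() and
     * printf() when the program is loaded, so the same binary picks up
     * whatever libc implementation the running release provides.
     */
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        time_t now = time(NULL);            /* public libc entry point */
        printf("now = %ld\n", (long)now);   /* likewise resolved at runtime */
        return 0;
    }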

BS But as you point out, the key aspect is that it’s dynamically linked, not just shared.

DB The point really is that since they are dynamically linked libraries (since the binding between the app and the libraries happens at runtime), the application’s runtime interface is at a much higher level than what was happening before. Now (say, circa 1990 onward) you have to realize that the entire collection of the system’s libraries—and not just the kernel—represents the runtime system in total.

BS So this allows you as the systems developer to honor and maintain the contract at the library interface, which is exactly what software developers want.

DB Hopefully, that’s what they want. Sometimes there’s some confusion about that. Part of the sociological transition was to get application developers to realize that they do not want to bind the implementation of the C library into their applications. What they want to do is bind a dependency on the services that the C library offers through its interfaces. Then in Solaris we can actually improve the implementation of the C library in subsequent releases of the system (whether to increase its performance, maybe multithread its implementation, and so on), and your existing application binary gets all the benefits of that when running on a later system release, without making any change at all.

BS A classic example of the benefits of late binding.

DB Right. What you begin to realize is that the point at which runtime binding is occurring for applications of this class is not about making traps to get into the kernel—the traditional system call interface. What you’re now doing is making these runtime bindings one layer up: on top of these libraries. Now in Solaris, the thing that all applications must do is make dynamic bindings to these library interfaces (C language binding to library global symbols in our case). If you can characterize that surface to a first approximation, you’ve defined the application runtime interface to the system, and then this becomes the contract that you want to define and maintain clearly.

BS I’m guessing that as the number and complexity of these interfaces grew, it was important to have some tools—both for you and for customers of these interfaces—to help understand the inventory of interfaces that they were depending on (perhaps down to specific versions), not only to track down problems but also to understand the finer structure of the system that they were building and deploying.

DB Yes, the need for tools to inspect interface dependencies is a very important point. But I think there are a couple of stages leading to this. The first was recognizing that the system libraries are the important dividing line. But then you have to get everybody’s attention: Obviously, you have to get the attention of the operating system engineers who define and stabilize that boundary. But, also, anybody who’s developing applications has to realize what the site of the contract is, too. If they don’t know where the surface is, then they don’t know how to confine themselves to using the ABI (application binary interface) it offers. Now you can get to the tools to help look at all that.

Excuse me for saying this because it sounds so obvious, but you’d be surprised how few systems do this well.

BS I think it’s easy for this to get obscured, because often I’m not just using Solaris’s interfaces. For example, today I might import Perl over the Web and build applications on top of that. Whether I recompile it or not, I have no idea what Solaris interfaces Perl is using, unless there are tools that help me ferret out, or define, that surface. As I use bigger and bigger packages that are prepared by others, my visibility into this whole issue of managing change and evolution at the operating system’s interfaces gets reduced.

DB Yes, this level of indirection represented by middleware seems confounding. It’s a topology consideration that’s part of the complexity question you raised. Proper inspection tools address that, too, but first let me return to the point you made about scale. The number of interfaces that make up the application runtime interface (or ABI) in a system such as Solaris may be on the order of 10,000 (say, 200 libraries with an average of 50 functions each)—a somewhat daunting scale.

Given the scale of the interface in a serious system (whether it’s IBM’s MVS, DEC’s VAX/VMS, or Sun’s Solaris), you must have some mechanical way to examine an application (or a piece of layered middleware) to decide that it stayed within the bounds of the safe and stable interfaces that the system was offering. There was this standing joke: Some developer would say, “Hey, how was I supposed to know what I could and should use, versus what I shouldn’t?” Then this “RTFM” retort would sally forth: “Read the [fine] manuals.” But that’s absurd. You can’t reasonably expect anybody, by reading the manuals, to make an assessment about whether or not what they coded to was OK. There are just way too many interfaces in the system for that.

In practice, it’s hugely important that you have some mechanical way of examining an application: to see what it’s using—and to decide whether that’s OK or not.
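
As a rough sketch of such an inventory (at library granularity only; Solaris utilities such as ldd and pvs work at this level and finer), a program can ask the runtime linker what objects it is actually bound to. The dlinfo call and link-map walk below vary slightly across systems and are meant as an illustration, not as the auditing tooling being described.

    /* lsdeps.c -- list the shared objects this process is bound to.
     * Library-level inventory only; the per-symbol audit needs real tools.
     * Illustrative build (glibc wants _GNU_SOURCE for dlinfo):
     *     cc -D_GNU_SOURCE -o lsdeps lsdeps.c -ldl
     */
    #include <stdio.h>
    #include <dlfcn.h>
    #include <link.h>      /* struct link_map */

    int main(void)
    {
        void *self = dlopen(NULL, RTLD_LAZY);   /* handle for the running program */
        struct link_map *lm = NULL;

        if (self == NULL || dlinfo(self, RTLD_DI_LINKMAP, &lm) != 0) {
            fprintf(stderr, "dlinfo: %s\n", dlerror());
            return 1;
        }
        for (; lm != NULL; lm = lm->l_next)     /* walk the loader's link map */
            printf("%s\n", lm->l_name[0] ? lm->l_name : "(main object)");
        return 0;
    }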

BS By interface, do you mean a particular function call or do you mean a group of related function calls?

DB I mean an individual entry point. Each library (a group of related function calls) is also an important interface, since applications use the library names to get to the individual functions in question. But libraries are the containers, and that’s not a sufficiently fine granularity. Inspecting whether applications restrict themselves to the set of shared objects we said were OK (libraries such as libc, libthread, libaio, etc.) is necessary but inadequate. More specifically, they must restrict themselves to just those interfaces intended for applications. Applications must not use any internal or system-implementation interfaces within system libraries. Admittedly, in Unix-derived systems, it’s an artifact of the C programming language and its limited interface-scoping capabilities that these are visible to applications at all.
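
A sketch of that scoping artifact on the library side. The names are hypothetical, and the visibility note in the comments refers to compiler and link-editor mechanisms (Solaris relied chiefly on link-editor mapfiles and symbol versioning to control what a library exports) rather than anything in the C language itself.

    /* libfoo.c -- inside a hypothetical system library.
     * In C, any non-static function is a global symbol, so without extra
     * measures applications could bind to the internal helper below.
     */

    /* Public, committed interface: applications may call this. */
    int foo_count(const char *s);

    /* Internal helper: part of the implementation, not the contract.
     * 'static' hides it within this file; for helpers shared across the
     * library's own source files, a compiler visibility extension or a
     * link-editor mapfile keeps the symbol out of the exported surface
     * that applications can see. */
    static int foo_scan(const char *s, char c)
    {
        int n = 0;
        while (*s != '\0')
            if (*s++ == c)
                n++;
        return n;
    }

    int foo_count(const char *s)
    {
        return foo_scan(s, ',');   /* implementation detail behind the interface */
    }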

BS I think the point you’re making is that to be able to understand the dependency on an entry-point-by-entry-point basis is extremely useful in figuring out how changes may propagate to influence you.

DB Yes, and that’s because the system can be evolved not only by the introduction of new libraries, but also by the addition of functions to an existing library.
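
A hedged sketch of why such additions are non-disruptive: binaries built before a function existed never reference its symbol, and a newer binary that must also run on older releases can probe for it at runtime. The symbol name used below is just a stand-in for “something added in a later release.”

    /* probe.c -- use a newer library interface only if this release has it.
     * Illustrative build:  cc -o probe probe.c -ldl
     * (glibc needs -D_GNU_SOURCE for RTLD_DEFAULT)
     */
    #include <stdio.h>
    #include <dlfcn.h>

    int main(void)
    {
        /* Look the symbol up in the libraries already bound to this process. */
        void *sym = dlsym(RTLD_DEFAULT, "posix_fadvise");

        if (sym != NULL)
            printf("newer interface present; use it\n");
        else
            printf("older release; fall back to existing interfaces\n");
        return 0;
    }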

BS Let’s switch from compatibility to focus instead on evolution. How do you introduce change in a controlled way?

DB There are two aspects to this. One is technological, and the other is more conceptual or sociological. Indeed, I think the latter—the principles and adherence to them—is really the bigger thing.

We came to realize pretty early on that when you’re making changes to the system, if you do something that takes away existing functionality or makes some other incompatible change to the stuff that’s already there, that has negative impact on your customers because it’s going to undermine some applications that have already been camped there.

The commitment not to break any existing applications is an absolutely fundamental tenet. The understanding is that strict upward compatibility is a constraint, not just a goal. You cannot break anything that has already gone before.

This translates directly to how you offer new functionality (whether or not you define new interfaces to deal with it). First, it must be engineered so that it won’t break any existing interface or offering that’s part of the contract (ABI). Next, you must really think about the public access points to this new functionality, because you’re going to be committed to maintaining that functionality in a strictly upward compatible fashion subsequent to its introduction.

So, there’s an initial couple of principles: Maintain strict upward compatibility from one release to the next, and define the application interfaces clearly so you don’t allow any applications (or layered software products) that use them to be broken in a later release of the system.

Both of these things are easy to say, but they can have pretty broad implications in terms of what you must actually do when you’re making changes to the system. If you’re introducing new functionality and it’s completely independent of anything that’s in the system right now, it’s pretty straightforward what you do: Add new functions to a library, or add a new library containing the functions. It’s an additional component—new stuff you can use. The engineering technique is really obvious.

But then you can get slightly trickier circumstances when you impart change that is a bit more pervasive. For example, in going from uniprocessor to multiprocessor platforms, you would like to introduce the ability to have multithreaded applications. Now you have a change that is going to affect a lot of the existing interfaces that are already in the system.

It adds new capabilities or semantics that are going to be exposed through some existing interfaces. For example, the Unix system has this important interface called fork(), which relates fundamentally to the process model. It’s the basis of concurrent execution in the system. Prior to multithreaded applications, the fork interface meant: Make a copy of this process and then run that concurrently with this process. But what does fork() mean now, in the case of multithreading, where you have multiple concurrent threads of execution running in a single process?

It could mean copy the process for all its threads of execution, or it could mean copy the process to execute just this one current thread of execution. There are some decisions to be made about what the new semantics of this existing interface are going to be, in light of this new pervasive feature change. You also have to ensure that the introduction of the new semantics in no way affects the abstraction that single-threaded applications were relying on in the past. The introduction of 64-bit interfaces is a similar conundrum (although perhaps worse still because it impacts the semantics of fundamental data types at the language level).
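
A sketch of the fork() question in code. Solaris eventually exposed both choices explicitly as fork1() and forkall(), and POSIX settled on copying only the calling thread; the exact behavior of plain fork() has varied by release and threading library, so treat the specifics below as illustrative.

    /* forkdemo.c -- what should fork() mean once a process has many threads?
     * POSIX answered: the child contains only (a copy of) the calling thread.
     * Solaris exposed both choices explicitly as fork1() and forkall();
     * single-threaded programs see no change either way, which is the
     * compatibility constraint being described.
     * Illustrative build:  cc -o forkdemo forkdemo.c -lpthread
     */
    #include <stdio.h>
    #include <pthread.h>
    #include <sys/types.h>
    #include <unistd.h>

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;)
            sleep(1);           /* stands in for background work */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);

        pid_t pid = fork();     /* copy one thread, or all of them? */
        if (pid == 0) {
            /* With POSIX (fork-one) semantics, the worker thread does not
             * exist in the child; only this thread continues here. */
            printf("child: just me\n");
            _exit(0);
        }
        printf("parent: worker still running\n");
        return 0;
    }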

Those are the places where things can get a little trickier. A little bit more cogitation may be required while doing the engineering to ensure that you’ve thought through all that’s already out there and what it depends on and how you want to impart this new functionality.

BS When you introduce something brand new, you may not get it quite right, or put slightly differently, it may evolve relatively rapidly compared with things that are tried and true. The question I have is perhaps more sociological than technical. Are there any techniques that have worked for introducing new things and letting you, the offerer, change them more rapidly earlier on until you get them right?

DB In systems that are at a more mature point in their evolution, a huge amount of their functionality has already become quite stable. We might just observe much of this existing stuff and simply declare that it’s stable and we’re not going to mess with it.

But the sharp edge of this equation is just what you described: When you’ve got this new technology you want in the system—but you haven’t quite got your head around what it should look like, or perhaps even what features you think are going to be wanted—there can be a lot of pressure to get this experimental interface out there to gain some experience. You want to put this stuff in the system for developers to play around with and build test applications on top of so you can get some feedback. But you’re not committed to these interfaces yet, so it must be clearly distinguished somehow.

In Solaris quite early on, we came up with this idea of Interface Taxonomy. It’s a classification that essentially defines the intended scope of use of an interface: Is it intended for use by third-party applications; or for use by other parts of the Solaris implementation; or is it internal to the implementation of this particular project? We call the various levels in this taxonomy the interface’s commitment level or, analogously, its stability level. These levels reflect the kind of change that can or cannot be made to the interface when the system is changed. Must it be maintained in a strictly upward-compatible way, or could it be changed in an incompatible way?

For example, we have the Public level (which indicates that application software may use the interface), on down to commitment levels such as Consolidation Private and Project Private (to reflect interfaces that are to be used only within the implementation of a particular subsystem or even a particular implementation project). Each interface introduced by any engineering project that will add to or change the system must be defined in this way.

Simply stated, as the exposure and use of an interface becomes broader, the degree to which we must maintain its stability becomes much greater. Of course, primary interest is with those interfaces labeled Public, because that’s where third-party applications and other layered software are allowed to camp.

After giving some more thought to the full software lifecycle—including both the emergence of new functionality and the obsolescence of really aged stuff—we introduced commitment levels called Experimental and Evolving. The idea was that such interfaces would not be part of the guaranteed safe and stable terrain; they were for developers to experiment with. The Evolving level was the halfway house to Public: We would make our best effort to maintain compatibility with those, but it was too early in the lifecycle of the technology to be confident that this was what we really wanted.
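
The taxonomy is a statement about interfaces rather than code, but a sketch of how it reads in practice may help. The header and function names below are hypothetical; Solaris recorded the actual commitment levels in its architecture review materials and manual pages rather than in annotations like these.

    /* widget.h -- hypothetical interfaces from one delivering project,
     * annotated with the commitment level each would be assigned.
     */

    /* Public (stable): third-party applications may depend on this;
     * it must remain upward compatible in every later release. */
    int widget_open(const char *name);

    /* Evolving: best effort at compatibility, but the shape may still
     * change; not yet part of the guaranteed application contract. */
    int widget_tune(int handle, const char *knob, int value);

    /* Consolidation Private: only other parts of this OS consolidation
     * may call it; applications that bind here get no guarantee. */
    int _widget_remap(int handle, unsigned long flags);

    /* Project Private: internal to this project's implementation;
     * may change or vanish in any build. */
    int _widget_debug_dump(int handle);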

For Solaris, this definition of interface scope is really a primary aspect of the system’s software architecture, and the basis for how we evolve the system gracefully. It says who/what we expect can use any given interface, and what constraints are therefore imposed on us when we make changes to the system as a result of that. Ultimately, it has a lot to do with defining the rules of engagement with people who are building stuff on top of our system. It says, “Hey, third-party application developers who might resell this stuff to other end users, thou shalt not resell something that’s built on top of these Experimental interfaces because that’s not guaranteed to be safe.” The trouble with that is ISVs can’t get the in-the-field experience with customers.

BS Interfaces are the modularity or the construction techniques that we have in the software engineering world, and interface technology has progressed beyond where it is with C global symbols, for example. We now have languages that type-check across interfaces, but it seems to me that there are still a lot of things that are implicit in interfaces that we don’t check and that are implicit in the interface contract that we don’t check for invariance.

One of my favorite bugaboos is performance: that certain functions are assumed to be fast, while for others, it’s OK to be slow.

For example, in libc, if I get a character, I assume that has to be pretty fast. If I read a million bytes, I assume that can be slower, and almost all the performance contract is implicit. I’ve never seen even a comment, let alone a piece of descriptive syntax, that is actually interpreted in any way that commits an interface to certain performance properties. Yet the correctness, perhaps not in the functional sense but in the merchantability sense of our applications, depends completely on a performance contract that isn’t broken from one version of the system to the next.
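
That getc example can be made concrete with a rough timing sketch (illustrative only). Nothing in the interfaces themselves promises it, but callers assume the per-character call stays cheap because the library buffers underneath it, while the bulk transfer is allowed to cost a full system call.

    /* perf_contract.c -- the unstated assumption: getc() is cheap per call,
     * read()/fread() is allowed to be expensive but is called once per block.
     * Illustrative build:  cc -O -o perf_contract perf_contract.c
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 1000000

    int main(void)
    {
        FILE *f = fopen("/dev/zero", "r");   /* any large, readable source */
        if (f == NULL)
            return 1;

        clock_t t0 = clock();
        for (long i = 0; i < N; i++)
            (void)getc(f);                   /* assumed fast: buffered in user space */
        clock_t t1 = clock();

        char *buf = malloc(N);
        if (buf == NULL)
            return 1;
        (void)fread(buf, 1, N, f);           /* one bulk transfer: may be "slow" */
        clock_t t2 = clock();

        printf("getc x %d : %.3f s\n", N, (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("bulk read : %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        free(buf);
        fclose(f);
        return 0;
    }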

If you had a magic wand and could add something to interfaces that could be checked, whether dynamically or statically, what would give you the most benefit in terms of preventing the kind of breakage or backward compatibility problems when you evolve the system independently of a customer evolving an application?

DB There’s this question about the way we’ve gone about defining the interfaces, which I described unabashedly as an idiot-simple enumeration of the application contact points. It’s the set of knobs that you’re allowed to grab onto, and we give you an electric shock if you grab a disallowed one. But it’s much more challenging to characterize some of these usage-related properties of the interface. I like to use the word semantics for that, and performance is one good example. There are some expectations about how responsive each interface is, and that is in no way characterized in any interface specifications or definitions that we have. Many other behavioral aspects of the interface are equally unspecified in any rigorous way.

Currently, things such as performance and other semantic integrity questions are assessed by pre- and/or post-integration testing. We have something called Perf PIT for the former. These happen after project implementation and are outside the domain of interface definition. There’s a bunch of performance benchmarks, some of which are industry-standard things, such as TPC-C (the transaction processing benchmark of the Transaction Processing Performance Council). We just run these tests as part of the system’s build process, whether before or after the project has been integrated, but typically after it has been developed and certainly before release. Then if there’s a performance regression, we engineer like mad to fix it.

The provocative question was: How might we drive these semantic stipulations back into the interface definition practices? When I was involved in technically leading the ABI program, one of the big decisions I made was about how much energy and effort to put into interface definition and interface-related practices versus other mechanisms. One of the really big watersheds for us was the huge gulf between syntactic definition of the interface and the semantic definition of the interface.

Characterizing the semantics of the interface and testing was going to be such a huge, ominous, burdensome sort of thing. I wondered how we were going to get all of the engineers in the company who were developing interfaces to go through that process of definition, before even mentioning the process of testing. I couldn’t conceive of how we were going to impart that sociological change and manage the cost associated with doing that.

BS Let’s follow that last thought—that this was as much a sociological as a technical concern—with a more general one: this whole question of evolving a more rigorous engineering of the interface. Tell us a bit more about how you got the engineers to come along on this trip.

DB Step one is recognizing the key problem. Then a couple of long-haired guys—maybe as I was once upon a time—think about how to introduce some new technology or rocket science to go after this. This is necessary, but immensely far from sufficient to solve the problem.

What are the follow-on steps—after you’ve come up with these beautiful brainchildren and maybe implemented some glossy prototype that shows people how to do it? That is the rest of the “journey of 1,000 miles.” Whether you can actually get all of the engineers, the n-hundred engineers who might put a new interface into a library in Solaris, to engage in these practices is a fundamentally social question. You have to succeed at this to have your foundation for stability. It’s a contract, and you have to maintain it; otherwise, it has no value.

This is a pretty big hurdle because you’re adding overhead to what an engineer does. You’re putting another rock in his or her backpack. You’re saying, “Not only must it be both 32- and 64-bit capable, both SPARC and x86, work properly under multithreading, be internationalized, etc.”—the whole list of requirements that any project must already meet—“but now you also have to define all these interfaces carefully and make sure that they’re subject to proper upward compatibility.”

That is a social change management problem. Part of the reason for picking these highly simplistic necessity-focused as opposed to sufficiency-focused kinds of interface definitional practices is that there’s some vague hope that you can actually get the engineers to take this on board, and that you can build the tools to enforce this—that is to say, observe that it has been done in all cases and audit the interface from release to release to ensure that it has maintained upward compatibility.

BS I think you could put that slightly more positively. This is the hallmark of good engineering: You didn’t try to boil the ocean. You didn’t try to set an impossible goal or one that only certain gurus could practice. You found a middle ground that made huge improvements over the alternatives, yet that everybody could practice. Innovation is still distributed throughout the organization. Anybody is perfectly capable of going through the process of adding functionality to the system. The tests and the tools are all out there for everybody to use. You preserved the essential social structure of the engineering enterprise.

DB Yes, I was mostly being emphatic about respecting what other people in the organization do, and realizing how hard their job is and how much leverage you need to give them in order to get them to do it.

The first hurdle is your own engineers, but then what makes this a really huge task is that you’ve also got to get all of your third-party application developers and middleware developers on board with this idea. And you’ve got to get your end-user customers on board to explain to them what the value proposition is that you’re trying to impart to them, so that they’ll be the force feedback that pulls on everybody else in the ecosystem to deliver these benefits (in particular the ISVs who are sometimes a little bit at odds with this from a business perspective).

For me, this is a big awakening. Starting out as an engineer and being a technical software systems guy made it easy to underestimate that the vast majority of the effort is in communicating succinctly what the problem is, what the benefit is going to be, and then what everybody is going to have to do to retread the ecosystem to get to this better place.

Then there is constant reiteration and support, demonstrating that we’re not just wagging our lips, we’re really doing this. There’s this persistence: this isn’t a bright idea we pursued for just one or two releases. We started this back in Solaris 2.4, and here we are 12 years and seven releases later in Solaris 10, and these practices are in place. They’re institutional in our organization.

BS What do you think are the major lessons that you’ve learned over the years? Obviously, keep it simple, set achievable goals, look for the main effect. But maybe you could impart other lessons about how systems are structured or built or evolve in the first place.

DB A really important colleague in my life was Paul Haeberli, whom I worked with at Silicon Graphics in the early days, and he once said to me, “You know Dave, it’s not about these big genius ideas. What really makes a great system is thousands of detailed engineering decisions that to a first approximation we got all pretty much right.”

There’s definitely an element of that in here, which is very hard to talk usefully about. It’s this conceptual shift, in this case a zealousness that you have to have about compatibility and engineering for compatibility, and the belief that we’re going to go to reasonably extreme measures to come up with specific techniques. People such as Tim Marsland, who’s now a Sun Fellow, were responsible for a lot of these things in the implementation when we did the binary compatibility between SunOS 4 and SunOS 5—all the retrofitting of SunOS 5 to build bridges so that those earlier SunOS 4 binaries would continue to run without change in SunOS 5 (i.e., Solaris 2).

BS Or to put it slightly differently, buried beneath what may look like a simple result is some pretty tricky engineering.

DB Sometimes you get a high-level realization and you can base a whole campaign on it, as I did with the ABI. I said, “Let’s just focus on necessity: What are the things that every application has to do, and how can we just take an initial whack at this?” That’s what we did with this library- and symbol-level definition of the interface. It’s very simple and very far from sufficient to ensure robustness. But I would point out that a lot of the robustness comes from a whole bunch of other things that happen, whether it’s the architectural review process where the senior engineers look at everything, or all the testing we do before we let a release out.

There’s just a lot that goes on that’s extremely important and not necessarily glorified. There are a lot of “quiet heroes,” as Marsland once said, who have done incredibly important and unsung things to make this a reality.

Another high-level lesson for me is that certain attitudes and points of view are important in the way you come at this, things such as respecting other people in the organization and the amount of work you might be giving them and really looking for the value proposition and who’s going to get it, and where the incentive feedback is coming from.

At an even higher level, I think that some developers don’t even address this because they think innovation and stability are mutually exclusive. But it turns out that you have to change your perspective and realize you can impart new—arguably revolutionary—functionality, but in an evolutionary sort of way. This is the mindset you need to have.


Originally published in Queue vol. 4, no. 8









© ACM, Inc. All Rights Reserved.