A Conversation with Tim Bray
Searching for ways to tame the world’s vast stores of information.
Tim Bray’s Waterloo was no crushing defeat, but rather the beginning of his success as one of the conquerors of search engine technology and XML. In 1986, after working in software at DEC and GTE, he took a job at the University of Waterloo in Ontario, Canada, where he managed the New Oxford English Dictionary Project, an ambitious research endeavor to bring the venerable Oxford English Dictionary into the computer age.
Using the technology developed at Waterloo, Bray then founded Open Text Corporation and developed one of the first successful search engines. That experience led to his invitation to be one of the editors of the World Wide Web Consortium’s XML specification. He later founded a visualization software company called Antarctica Systems. He joined Sun Microsystems in 2004 as director of Web technologies.
Bray is a graduate of the University of Guelph with a B.Sc. in math and computer science.
Who better to quiz Bray than Jim Gray, manager of Microsoft’s Bay Area Research Center and a distinguished engineer in the scalable servers research group? One of his current projects is building online databases for the astronomy community, part of a larger agenda of getting all information online and easily accessible. Gray, a recent winner of the ACM Turing Award, is revered for his knowledge of database and transaction-processing computer systems.
JIM GRAY Many of the people reading Queue are in the early stages of their careers and quite curious about what’s in store for them. You’re a success story, so maybe walking us through from your early work at Waterloo to your more current interests would be helpful to our readers.
TIM BRAY The job with the dictionary project at Waterloo University was about as much fun as you can have and get paid for it.
The project was the OED (Oxford English Dictionary), the biggest dictionary ever produced in any language. It had grown so big that they realized this wasn’t tenable; they had to go back and integrate all this stuff, including applying computer technology, it being the ’80s. The shape of the data and the shape of the problem resisted a lot of the conventional technical approaches.
So a deal was cooked up between IBM and Oxford University Press and the University of Waterloo. A bunch of Canadian government money came in, on the stipulation that they produce not only academic research, but also working, usable software. They needed to hire some actual software engineers from the private sector, and that included me as the manager of the research project.
On the day they brought me in to interview, they showed me some of the electronic versions of the dictionary. It was what we would now call XML. It had little embedded tags saying entry, word, and then pronunciation, etymology, a brief quotation, and the date, source, text, and so on.
It was my Road to Damascus experience, really. I looked at it and I saw that the markup said what the information is, not what it looks like. Why isn’t everything this way? I still basically am asking that question.
We built a special-purpose language for manipulating this thing, and we built a search engine that could actually handle deeply nested structures. Most search engines operate on the assumption that the fundamental unit of everything is the document.
JG When I think of the OED, I think of the term and then a definition, and a typical use of that definition.
TB The big Oxford English Dictionary itself is a scholarly publication. This means that it doesn’t make any assertions about the use and meaning of words that aren’t backed up by direct evidence, which is included.
So if you think about a word that appears as a couple of different parts of speech and with lots of definitions—there are a lot of words like that in English—you’ll have the word and then you’ll have sort of header groups that will have its pronunciation and its etymology and variant spellings marked as to whether they’re obsolete or not. Then you’ll get into the senses. The senses are a hierarchy that can run as many as five or six layers deep.
At some level of granularity—an individual meaning or a group of meanings—there’ll be a paragraph full of illustrative quotations, each with its title, author, date, and the actual text. So right there, you have four or five or six levels of structure before you even start. A lot of the queries you might want to do cut across that in funny ways. You say, “OK, I want quotations prior to 1800 of words where Italian appears in the etymology.”
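[Editorial sketch, not part of the interview.] The kind of cross-cutting query described above can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the tag names (`entry`, `word`, `etymology`, `sense`, `quotation`, `date`, `text`), the sample entries, and the function `early_quotations` are all hypothetical, and the real OED markup and the Waterloo query language were certainly richer than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified entries. The tag names and the sample
# quotations are illustrative and are not the OED's actual markup or data.
SAMPLE = """
<dictionary>
  <entry>
    <word>piano</word>
    <etymology>Italian, short for pianoforte</etymology>
    <sense>
      <quotation><date>1774</date><text>He played upon the piano.</text></quotation>
      <quotation><date>1850</date><text>A piano stood in the hall.</text></quotation>
    </sense>
  </entry>
  <entry>
    <word>window</word>
    <etymology>Old Norse vindauga</etymology>
    <sense>
      <quotation><date>1225</date><text>Thurh the windowe.</text></quotation>
    </sense>
  </entry>
</dictionary>
"""


def early_quotations(xml_text, language, before_year):
    """Return (word, year, text) for quotations dated before `before_year`
    in entries whose etymology mentions `language`."""
    root = ET.fromstring(xml_text)
    results = []
    for entry in root.iter("entry"):
        etymology = entry.findtext("etymology", default="")
        if language not in etymology:
            continue  # the etymology filter applies at the entry level
        word = entry.findtext("word")
        # Quotations may sit at any depth below the entry, so search all levels.
        for quotation in entry.iter("quotation"):
            year = int(quotation.findtext("date"))
            if year < before_year:
                results.append((word, year, quotation.findtext("text")))
    return results


matches = early_quotations(SAMPLE, "Italian", 1800)
```

The point of the sketch is that the filter on etymology lives at one level of the hierarchy while the filter on date lives several levels below it, which is why a search engine that treats the whole document as its only unit struggles with such queries.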
JG So you fundamentally go to a data storage system and a query system?
TB Yes, that’s right.
JG The OED was the first open source project that I can think of. It was a collection of scholars who were working together and collecting quotes from all over the world. Did the fact that this was an open source project influence you at all?
TB It certainly influenced me, but I’m not sure I would sign off on the notion that it’s actually open source. I was damned by [GNU Project founder] Richard Stallman in egregiously profane language for working on it.
TB Well, literally thousands of people around the world diligently read books looking for usages of words and writing them on slips and sending them to Oxford. Many, many millions of these things are in filing cabinets in the basement of Oxford. Then Oxford, of course, turned them around to do a commercial product. It’s not as though the underlying citation store or the dictionary itself are open for free access to anybody except for Oxford.
So I don’t think it’s really open source in some of the essential characteristics. It is certainly community-based and community-driven. And it clearly became the case that some of the unpaid volunteers became thought leaders in terms of how you go about finding things.
JG After the OED, you founded a company, right?
TB In 1989 we founded a company based on the technology used by the OED project, called Open Text Corporation. We took all the technology to market. What seemed to stick was the search component, which was a system called PAT (short for PATRICIA, which stood for practical algorithm to retrieve information coded in alphanumeric). It ended up using what was subsequently named by Udi Manber [then a professor of computer science at the University of Arizona] as a suffix array data structure.
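[Editorial sketch, not part of the interview.] For readers unfamiliar with the suffix array mentioned above, the following Python fragment shows the general idea: sort the starting positions of every suffix of a text, then binary-search that sorted list for a pattern. It is a minimal illustration with a naive construction step, not a description of how PAT was implemented; the function names `build_suffix_array` and `find_all` are made up for this example.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of `text`, sorted by suffix.

    Naive construction for illustration only; real systems use much faster
    algorithms and avoid materializing the suffixes.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text, suffix_array, pattern):
    """Return every position where `pattern` occurs, using binary search."""
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


text = "banana"
suffix_array = build_suffix_array(text)
positions = find_all(text, suffix_array, "ana")
```

Because every substring of the text is a prefix of some suffix, all occurrences of a pattern end up next to each other in the sorted order, so a lookup costs a logarithmic number of comparisons regardless of where in the text the matches are.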
At that time, the search engine market was a fairly vibrant, bustling market with lots of little companies in it. We didn’t get rich quick, but we paid the rent and we were growing slowly.
Open Text’s virtue as a search engine was that we could do this deeply nested text—what was then becoming formalized as SGML (Standard Generalized Markup Language). Also, we were comfortable operating over the network on streams of tagged text. Those were only moderately interesting selling points at that time.
JG I’m surprised the IBM guys weren’t pushing SGML. They’ve had these SGML efforts for a good long time.
TB Well, they were, to some extent. I mean, they funded all that ISO work and got the standards done. They had the first ever full text search product. But, in those days, I never really saw that much of them in that market.
SGML was so expensive and so complex that it never really caught on outside the extreme high end of the publishing market—for example, boring technical documentation, the legislation of the European Union—where cost was only a secondary object and budgets could go into seven and eight figures comfortably. That’s why Open Text never became that big of a success story.
Then the Web came along. I was at the SGML conference in 1993 or ’94, and Eric van Herwijnen, who was a major figure in the early days of the Web—though you never hear of him anymore—gave a speech saying this Web thing and search engines are going to be pretty big. I was sitting there in the audience and the whole thing fell into place in my head: how you would have a crawler and a spider and you would copy things in and put URLs on the results page. I was so excited, I was physically shaking for two days because I could just see how the whole thing would work.
My CEO, Tom Jenkins, agreed to turn me loose to work on it myself, and I spent six months basically doing nothing else and built the crawler and the interfaces. We shipped in April 1995. At that point, there was Lycos, Infoseek, and us, and that was about it.
It was one of those times that will never be repeated. Our usage statistics went up by about 20 percent a week for eight months. There were investment bankers lined up outside the door wanting to do deals with us.
It was wonderful but also horrible, because it turns out the technical characteristics of our search engine were not well suited for that kind of a query load. It scaled beautifully with data size, but very poorly with high query rates. I lost weeks and weeks and weeks of sleep, hacking and patching and kludging to keep this thing on the air under the pressure of the load. Open Text has since rewritten the search software to handle this stuff better.
We did an IPO in January 1996, and the company turned away from Web search because that was such a flaky, shaky business model that nobody understood. Instead, it turned toward content management. I got bored and left.
Just as I was leaving Open Text, Jon Bosak got in touch and wanted to form a working group to put something like SGML on the Web. I originally said no because I wanted to work on the HTML working group, which seemed like a sexier place at the time, but I eventually changed my mind and decided he was right.
At that point, there were maybe 20 people in the world who understood SGML and had any hands-on Web experience, and 11 of us ended up on the working group. We had all known each other for a while and we knew the issues. The right thing to do was not technically very taxing, although Unicode was still pretty fresh at that point, and making the commitment to Unicode and then figuring out how to do that right was challenging.
The main design of XML was sketched in about 20 weeks between July and November 1996.
JG I assume that the burning issue was keeping it simple.
TB And we missed. XML is a lot more complex than it really needs to be. It’s just unkludgy enough to make it over the goal line. The burning issues? People were already starting to talk about using the Web for various kinds of machine-to-machine transactions and for doing a lot of automated processing of the things that were going through the pipes.
HTML obviously is the most successful data format in the history of the universe for delivering information. It was never designed to be processable, to the extent that it was designed at all. Ted Nelson [Internet pioneer known for coining the term hypertext] famously once said that trying to bring order into HTML would be like trying to graft arms and legs onto hamburger. He had a point. And SGML sort of had what HTML needed—a deterministic grammar and extensibility and a few other things.
It was pretty clear that if you were really going to do e-commerce and various other kinds of machine-to-machine transactions, you needed something that was designed for that. The idea was to take SGML and throw away the 95 percent that never got used and retain the 5 percent that did.
JG And the presentation stuff was part of what was thrown out?
TB XML on day one had zero presentation. The separation of content and presentation is kind of an elusive goal that, in almost no application, is ever really fully achievable. But to the extent you can achieve it, you usually win. XML at least doesn’t get in your way as you try to achieve that.
JG If you had it to do over again, what do you wish XML had been in that first incarnation? Is it what you wanted?
TB There’s a lot of cruft in there that turns out to be a bad return on investment. The whole notion of “entities” in the technical XML sense is attractive and certainly was useful in the publishing applications where we all grew up. But in practice, it turns out to cause all sorts of hairy implementation nightmares that are just not worth it.
The notion of, in particular, unparsed entities that were essentially external objects of nontextual type, with one more level of indirection than the Web gives you, sounds plausible, but it turns out that the one level of indirection doesn’t pay for itself. The Web itself gives you enough indirection.
There’s this thing called notations that nobody has ever used once in the history of XML, so probably it shouldn’t have been there. Should DTDs (document type definitions) have been part of the picture, or should we have just left them out? If you were doing XML now, yes, you would clearly leave DTDs out.
JG Would the element/attribute distinction stay?
TB I think so.
JG The link mechanism would stay?
TB The IDREF? We probably could have lost that without any damage. Just like anybody who has ever used Lisp who then learned markup, when I first learned SGML I was mostly against attributes. I just couldn’t see why you needed two different ways to mark up data structures. Eventually, I observed that they didn’t really add that much to the difficulty of processing. I also observed that there’s something in the human mind that likes having these two markup idioms to use.
I eventually decided that they were not harmful and people liked them, so why go against them? Consider “<a href="somewhere">a label</a>”. That just feels right; that just feels like a good idiomatic succinct piece of design to me, and it uses both elements and attributes.
JG What were the burning issues that the 11 people working on XML were trying to address? Was it contentious?
TB The solidarity in the group was incredible, largely because most of us had known each other for years and years and years. We had no idea how high the stakes were, how big it was going to be. We would fail to come to consensus on some things, and eventually, it would come down to a vote. But we knew each other so well that, after the vote, the losing side would then go out and defend the outcome, which is a real working definition of rough consensus.
Probably the most hotly contested issue was the error-handling rules. XML has draconian error handling. You’re not allowed to recover from some classes of error. There were probably hundreds, maybe thousands, of e-mail messages about it. One interesting thing is the group never met face to face until after the job was done. It was all done by teleconferencing and e-mail.
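[Editorial sketch, not part of the interview.] The "draconian" rule is easy to observe with any conforming parser. The fragment below uses Python's standard `xml.etree.ElementTree` module; the helper name `parse_or_report` and the two sample strings are invented for this illustration. A well-formed document yields a tree, while a document with a missing end tag yields no tree at all, only an error.

```python
import xml.etree.ElementTree as ET


def parse_or_report(xml_text):
    """Return (root, None) on success or (None, message) on a parse error."""
    try:
        return ET.fromstring(xml_text), None
    except ET.ParseError as error:
        # The parser stops at the first well-formedness error and does not
        # hand back a partially repaired tree.
        return None, str(error)


good_root, good_error = parse_or_report("<p>well <b>formed</b></p>")
bad_root, bad_error = parse_or_report("<p>not <b>well formed</p>")
```

This contrasts with typical HTML handling, where a browser would silently guess at the intended structure of the second string.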
Microsoft was obviously one of the early leaders—it gave Jon Bosak support in getting XML off the ground, and Jean Paoli was hired by Microsoft, in large part, to do this work. Those of us who had come from the Unix side of the planet didn’t want XML to be a Microsoft property, so we were leaning fairly heavily on Netscape to get involved. Finally, after an incredible amount of battering, Netscape condescended to notice XML, but said the company had no one who knew anything about it.
I’d been doing it myself—I was an independent consultant at the time—on a pro bono basis, so I went to Netscape and asked how much it would pay me to do it. So I got paid by Netscape $1,000 a week to do what I’d already been doing. I was fine with that.
When Netscape suddenly became active, the first thing it wanted to do with XML was MCF (Meta Content Framework), which eventually led to the work that became RDF (Resource Description Framework). Microsoft really went insane. There was a major meltdown and a war, and I was temporarily fired as XML coeditor. There was an aggressive attempt to destroy my career over that. That was really the only serious chunk of politics in the lifetime of the working group.
Then when we finished and went out to market it to the world, it was remarkable. It was like throwing your entire weight against a door that isn’t even latched. More or less the whole world said, “We can use that, sure.”
JG There is no XML “version 27.5”?
TB XML was frozen and published in February 1998. As it came toward the end and it became obvious—well, not obvious, but likely anyhow—that this was going to get a lot of momentum, we were besieged by requests for extra features of one kind or another. We basically lied and told the world we would do all that stuff in version 2. You have to shoot the engineers and ship at some point, right? I think there will never be an XML version 2. There is an XML version 1.1, but it’s controversial and not widely supported.
JG It seems to me that all of these standards take on lives of their own. What used to be a 12-page document now comes in three volumes. The sequel is an encyclopedia.
TB To start with, there’s so much stuff based on XML. The installed-base problem would be nightmarish. Secondly, I don’t perceive any consensus.
JG But you must have done something to stop it. Organizations have a way of propagating themselves.
TB Originally, XML was supposed to be a three-part project: there was going to be the actual language itself; there was going to be a new schema facility for it; and there was going to be a style-sheet facility for it. At some point in 1998, the XML working group turned itself off and formed some new working groups, including a hyper-linking working group, a style-sheet working group, and a schema working group. There is to this day an XML core working group that caretakes the language and does errata and that kind of thing.
But every time the notion has been proposed to do an XML 2.0, it has been fairly rapidly shot down.
JG What happened next in your career?
TB I got involved in the work that produced RDF, but I found RDF to be a really hard sell. That quickly became mixed up with a whole bunch of classic KR (knowledge representation) people who wanted to go refight the AI wars of the ’80s. And I just didn’t care.
JG Just what is RDF?
TB RDF is a general-purpose facility for expressing meta-data, which is to say assertions about resources. I use resource in the technical sense, Web resource. So RDF models the world as a series of resource-property-value triples: you have a resource that has a URI (uniform resource identifier), you have a property that also has a URI, and then a value that can be a literal value or another URI.
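[Editorial sketch, not part of the interview.] The triple model can be written down in a few lines of Python. Everything here is an assumption made for illustration: the `Triple` type, the `values_of` helper, and the `example.org` URIs are hypothetical, and a real RDF library would distinguish literal values from URI references instead of storing both as plain strings.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    resource: str  # URI of the thing being described
    prop: str      # URI of the property
    value: str     # a literal value or another URI (not distinguished here)


# Hypothetical assertions using made-up example.org URIs.
triples = [
    Triple("http://example.org/doc/1",
           "http://example.org/terms/author",
           "http://example.org/people/tim"),
    Triple("http://example.org/doc/1",
           "http://example.org/terms/title",
           "A Conversation"),
    Triple("http://example.org/people/tim",
           "http://example.org/terms/name",
           "Tim"),
]


def values_of(all_triples, resource, prop):
    """Return every value asserted for `prop` on `resource`."""
    return [t.value for t in all_triples
            if t.resource == resource and t.prop == prop]


author_uris = values_of(triples, "http://example.org/doc/1",
                        "http://example.org/terms/author")
author_names = [name
                for uri in author_uris
                for name in values_of(triples, uri,
                                      "http://example.org/terms/name")]
```

Because a value can itself be a URI, triples chain together: the second lookup follows the author URI found by the first one to reach the author's name.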
JG I generally identify RDF with the Semantic Web.
TB The whole Semantic Web was launched by the RDF activity and now has grown to include OWL (Web Ontology Language), which is a general knowledge representation language. But, boy, there are problems. The XML serialization of RDF is horrible; it’s a botched job.
You know, KR didn’t suddenly become easy just because it’s got pointy brackets. Doug Lenat has been off working in the desert on that for decades and nobody has ever made a buck on it yet, as far as I know.
Motivating people to provide meta-data is tough. If there’s one thing we’ve learned, it’s that there is no such thing as cheap meta-data. The whole point was to make search run better at some level. Google showed us the power of what was always used in the academic citation index—namely, the number of incoming links.
JG Inferring meta-data...
TB Inferring meta-data doesn’t work. Google doesn’t infer meta-data. It’s a deterministic calculation based on finding links and counting links and doing transitive closures on that. Inferring meta-data by natural language processing has always been expensive and flaky with a poor return on investment.
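[Editorial sketch, not part of the interview.] To make "finding links and counting links" concrete, here is a small Python illustration. It is not Google's actual algorithm: the four-page link graph, the function names, the damping value of 0.85, and the fixed number of rounds are all assumptions chosen for the example. The first function simply counts incoming links; the second repeatedly passes each page's score along its outgoing links, so that a link from a heavily linked page counts for more.

```python
def inbound_counts(links):
    """Count how many links point at each page."""
    counts = {page: 0 for page in links}
    for targets in links.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


def propagate_scores(links, rounds=20, damping=0.85):
    """Deterministically spread score along links for a fixed number of rounds.

    Simplified for illustration: pages with no outgoing links just keep
    their base score, so the scores are not normalized to sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    scores = {page: 1.0 / len(pages) for page in pages}
    for _ in range(rounds):
        new_scores = {page: (1.0 - damping) / len(pages) for page in pages}
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * scores[source] / len(targets)
            for target in targets:
                new_scores[target] += share
        scores = new_scores
    return scores


# Hypothetical link graph: pages "a" and "b" link to "c", and "c" links to "d".
LINKS = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}

counts = inbound_counts(LINKS)
scores = propagate_scores(LINKS)
```

In this toy graph page "c" has the most incoming links, yet "d" ends up with the highest propagated score because its single link comes from the well-linked "c". Nothing in the calculation looks at the words on the pages, which is the contrast being drawn with inferring meta-data from text.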
I spent two years sitting on the Web consortium’s technical architecture group, on the phone every week and face-to-face several times a year with Tim Berners-Lee. To this day, I remain fairly unconvinced of the core Semantic Web proposition. I own the domain name RDF.net. I’ve offered the world the RDF.net challenge, which is that for anybody who can build an actual RDF-based application that I want to use more than once or twice a week, I’ll give them RDF.net. I announced that in May 2003, and nothing has come close.
JG But you went into this feeling fairly optimistic. At a certain point, did you become disillusioned?
TB My original reason for believing in all this was just the notion that it ought to be easier and cheaper and there ought to be less friction in interchanging meta-data. My original vision of RDF was as a general-purpose meta-data interchange facility. I hadn’t seen that it was going to be the basis for a general-purpose KR vision of the world.
JG It seems that XML and, to some extent, RDF have been fairly successful in electronic data interchange where OSI (Open Systems Interconnection) and ASN.1 (Abstract Syntax Notation One) failed. Why?
TB Two huge lessons come out of ASN.1. What it does is tell you all about data types. If you have a stream of ASN.1, it says, “Here’s a 35-character string, and here’s a 64-bit IEEE double-precision number and floating-point data, and here’s a non-negative integer.” XML says, “Here’s some text called label, here’s some text called price.” Historically, it would appear that it’s more valuable to know what something is called than to know what data type it is. That’s an interesting lesson.
The second lesson is that one of the huge reasons why ASN.1 never took off was that the tools were crappy and they weren’t widely available. On the other hand, on the day that XML shipped—no, before XML shipped—there were multiple open-source, high-performing, well-shaken-down parsers and various engines. Who knows what would have happened with ASN.1 if it had been like that?
JG So you were sitting on the RDF committee, you were otherwise a free agent. Then what happened?
TB In 1999, the bubble was on. There were venture capitalists standing on every street corner giving out money. I just decided that I wanted some. Most of the technologies I’ve bet on over the years have worked out pretty well. But then there was VRML (Virtual Reality Modeling Language), which I got all excited about.
I had attended the first-ever VRML world conference in December 1995. There were maybe 150 people there, of whom at least half were venture capitalists. The level of excitement in the air was astounding. We were sure we had the future in our hands. It was going to be a 3-D Web that we would all be walking around inside.
So I founded Antarctica, which provides really slick interactive visual interfaces to complicated datasets. The first dataset was the dmoz.org subject directory of a few million Web sites. It shipped with 2D and 3D interfaces, and I was excited about the 3D side. But like every other attempt to apply 3D to business problems, it failed. We got funded, and finally, in 2003, the business started to get some traction.
The software started to find a good market in what’s called the business intelligence sector. It’s a long way from my original “cyberspacey” vision, but I think it’s going to become a very successful business. By the end of 2003 there was a good executive team and a good engineering team, and they didn’t need sweeping visions or guerilla development. I wasn’t really a big value-add anymore, and I quit January 1, 2004, and took a vacation in Australia. I really enjoyed my time off before I went to work for Sun.
JG Switching gears for a moment, there are at least three pretty widely known apps you’ve got under your belt. Tell us a little bit about Lark, Bonnie, and Genx.
TB Lark was the first XML processor, implemented in Java. I wrote it myself. I used it also as a vehicle to learn Java. It shipped in January 1997 and actually got used by a bunch of people. But then I realized by early ’98 that Microsoft was shipping one and IBM was shipping one and James Clark was shipping one. I didn’t want to compete with all those people. So, I let Lark go. It was fun to write and I think it was helpful, but it hasn’t been maintained since 1998.
Bonnie is a program I wrote when I was on the OED project. The size of the OED file was 572 megabytes. In those days, computers had memories of about 16, 32, and maybe 64 megabytes. Clearly, our applications were I/O intensive, and we cared a lot about I/O. So I wrote Bonnie, which is specifically aimed at the semantics of the Unix file system. It’s a Unix file system benchmark. It still exists in various versions and will substantially outlive me.
Genx came out of some debate that erupted on the syndication front. Some of the people working in syndication were extremely upset about XML’s strictness, saying, “Well, you know, people just can’t be expected to generate well-formed data.” And I said, “Yes they can.” I went looking around and found that there are some quite decent libraries capable of doing that for Java and Perl and Python, but there didn’t seem to be one for C.
So sitting on the beach in Australia I wrote this little library in C called Genx that generates XML efficiently and guarantees that it is well-formed and canonical.
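[Editorial sketch, not part of the interview.] Genx is a C library and its API is not shown here. The Python fragment below only illustrates the underlying argument, that a program can emit well-formed XML reliably if it builds output through a library instead of pasting strings together. It uses the standard `xml.etree.ElementTree` module; the element names, the attribute, and the function `make_entry` are hypothetical.

```python
import xml.etree.ElementTree as ET


def make_entry(entry_id, title):
    """Serialize one hypothetical feed entry, letting the library escape text."""
    entry = ET.Element("entry", {"id": entry_id})
    title_element = ET.SubElement(entry, "title")
    title_element.text = title  # special characters are escaped on output
    return ET.tostring(entry, encoding="unicode")


# Input containing characters that would break naive string concatenation.
serialized = make_entry("a&b", "Fish & Chips <fresh>")

# Round-trip check: the output parses and yields the original values.
parsed = ET.fromstring(serialized)
```

Start and end tags are paired by construction and markup characters in text and attribute values are escaped during serialization, so the caller never has to remember to do either by hand.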
JG Wow, it’s interesting to see, in the microcosm, how XML continues to march forward—that is to say, you’re on vacation, sitting on the beach, thinking about XML and this strictness issue, and you bang out Genx. So then zooming out to the macrocosm, what’s the main lesson you would draw from the whole XML experience?
TB In retrospect, I think the most remarkable thing about XML is that it didn’t show up until 1996. It seems so painfully obvious that you would like to have a way to get data from computer to computer that is radically independent of any operating system, file system, or application. The easiest way to do that is to say put everything in text and put a marker where it starts and a marker where it ends, and go from there. So it is surprising it hadn’t been done before.
I guess one explanation is that, to do it properly, you really need to crack the internationalization nut, and Unicode hadn’t really been well established before that point. But XML has really resisted anybody’s attempts to co-opt it or to make it application-specific. Definitely, we should be happy with the way it’s come out.
You know, the people who invented XML were a bunch of publishing technology geeks, and we really thought we were doing the smart document format for the future. Little did we know that it was going to be used for syndicated news feeds and purchase orders.
I’ve been doing some work with the Open Office document format since I went to Sun. As far as I know, that is sort of like the canonical example of what we thought we were building XML for. It’s a super-clever data format. An Open Office document is actually a Zip file, with different XML streams in it: one for the payload data, one for the style sheet, and one for all the various things that go around the document. Absolutely, remarkably clever.
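[Editorial sketch, not part of the interview.] The "Zip file with XML streams" layout can be imitated with Python's standard `zipfile` module. This is a toy package built in memory, not a valid Open Office document: the member names, the XML contents, and the helper functions `build_package` and `read_root_tags` are assumptions made for the example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def build_package(streams):
    """Write each named XML stream into one in-memory Zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, xml_text in streams.items():
            archive.writestr(name, xml_text)
    return buffer.getvalue()


def read_root_tags(package_bytes):
    """Parse every member of the archive and report its root element name."""
    tags = {}
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        for name in archive.namelist():
            tags[name] = ET.fromstring(archive.read(name)).tag
    return tags


# Hypothetical streams: one for the payload, one for styles, one for meta-data.
package = build_package({
    "content.xml": "<document><p>Hello</p></document>",
    "styles.xml": "<styles><style name='body'/></styles>",
    "meta.xml": "<meta><title>Example</title></meta>",
})
root_tags = read_root_tags(package)
```

Keeping payload, styles, and surrounding information in separate streams means a tool that cares about only one of them can open the archive and parse just that member.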
JG What are the technical bees in your bonnet at the moment?
TB Well, I should talk a little bit about syndication because that’s really where a large proportion of my energies are going. Clearly, if you look at RSS (Really Simple Syndication), and you look at the growth rates and adoption curves, it looks a whole lot like what we saw on the Web in ’93 and ’94. There are currently more than 3 million syndication feeds. They’re growing at the rate of approximately 14,000 per day. There are on the order of 300,000 new posts per day, and those curves all have that same shape that we were looking at 10 years ago.
This is hot stuff. People are using it. Clearly this is a new model of communicating information that has some sort of a sweet spot in the spectrum of communications. We don’t know what it is yet because we’re just getting into this. But clearly, people like it.
Two issues have come out of this. One is the whole notion of the spectrum of communication. If I wish to interact with somebody, I have a lot of choices: I can sit down in a room with that person, like I am with you; or I can talk over the phone or, increasingly, over videophone; or I can chat on IRC (Internet Relay Chat); or I can send an e-mail, post to my blog, or work on paper.
The incremental cost of a small or medium quantum of communication is about the same in all of them—it’s about zero. So, we’re faced with a choice about how we’re going to deal out our communications across this menu of media. It should be fun to watch.
In any case, the world of RSS has been kind of wild and woolly and prone to flaming and bad behavior, and partly as a result, there are a lot of versions and none of them is specified that well. So I’m co-chairing an IETF (Internet Engineering Task Force) working group that is publishing a protocol under the name Atom that tries to capture all of the prior art in this stage and might provide a good basis for winding down the syndication wars. I think it’s going to have an impact.
JG It should be a fun journey.
TB I think so.
Originally published in Queue vol. 3, no. 1.