Jason Hoffman has a Ph.D. in molecular pathology, but to him the transition between the biological sciences and his current role as CTO of Joyent was completely natural: “Fundamentally, what I’ve always been is a systems scientist, meaning that whether I was studying metabolism or diseases of metabolism or cancer or computer systems or anything else, a system is a system,” says Hoffman. He draws on this broad systems background in the work he does at Joyent providing scalable infrastructure for Web applications. Joyent’s “cloud-computing” infrastructure has become the foundation for many of the increasingly popular applications developed to feed into the social-networking site Facebook.com.
In our discussion with him this month, Hoffman discusses some of the key technologies behind that infrastructure. Among these technologies is virtualization, which we explore in depth in the four feature articles that make up this month’s Queue Focus. Hoffman also shares his insight on the popularity of Ruby on Rails, a technology he has been involved with since its inception and about which he is frequently asked to speak at conferences.
Joining him in the conversation is Bryan Cantrill, a Sun distinguished engineer in the Solaris kernel development group who recently joined Queue’s editorial board. Once included on MIT Technology Review’s “35 Innovators Under 35” list, Cantrill is an accomplished engineer who is best known as the author of DTrace, an observability tool that many engineers consider indispensable for making applications perform and scale better.
BRYAN CANTRILL One of the nice things about Silicon Valley is that CTOs often come from disparate academic backgrounds. That said, I do wonder if you’re the only CTO with a Ph.D. in molecular pathology working for a Web startup. You and I were on a Ruby on Rails scalability panel in February, and I swear after that panel, you were practically mobbed by people seemingly seeking your autograph. How do you go from a Ph.D. in molecular pathology to being mobbed at a Ruby on Rails scalability panel? What’s the path there?
JASON HOFFMAN The path was simple. I was a practicing scientist and a lecturer at a university in San Diego, and I was using a lot of blogging tools as part of my teaching methods. I got to know a few people who were writing such tools, and lo and behold, they’ve grown up to be people like Matt Mullenweg, who wrote WordPress. So I just happened to be around when those sorts of things were conceived.
In the case of Rails, I was using a lot of online project management tools and ended up using a product called Basecamp pretty much right when it came out. I found out it was written in Ruby, got introduced to the guy who did it on a contract gig, and asked him what he was up to next. He said he was thinking about spinning out the way he was doing it and creating this thing called Rails.
I started out by helping these guys put together infrastructure for their own projects and teaching them a bit about how to run development teams. A year and a half into it, I had a dozen full-time employees and this side business, and then it got to be a bit silly to keep on doing other things.
BC Take us through to the present and your current company Joyent.
JH The idea for Joyent was pretty simple. If I were to sum up what the total goal of the company has been, it’s that somebody with an idea for an application—not just a Web development framework but a whole system to operate inside of—should be able to write it, get it started, and have it go from no users to a billion users, one site to 12 sites, one continent to four continents, without rewriting it and without racking or stacking a single piece of hardware during that process.
BC I’d like to talk about the cloud that you have at Joyent, particularly in terms of Facebook, which has completely changed in the past couple of years from a Web site to a real platform for applications development. What was Joyent’s role in that?
JH I’ve been surprised by what some of the Facebook apps have done. When the platform first came out, I was, like, “Well, whatever.” I showed up for the last five minutes of the party, grabbed a brownie, got a cocktail real quick, and headed out of there. That’s the truth.
The first thing that caused a little bit of a change in my mind was a customer—a pretty well-established startup—that decided to feed into the platform for experimentation and marketing purposes, just to see what would happen. One developer, with input from some other people on a team, put together an application, and within 60 days had basically gained a million and a half users on Facebook. On its own Web site, that number took them two years to hit.
BC Wow! Talk about an extreme problem. Obviously a ton of these apps are stinkers, and people are throwing thousands of them out there. So I’ve got no idea if my weekend of work is a stinker or if all of a sudden, I’m going to have a million users.
JH And sometimes even more. Typically what happens is you throw some applications on somebody else’s server—let’s say, a generic hosting account—and the next thing you know, a couple of days have gone by and you’re pushing more bandwidth than you should be, and they just shut you down and you’re done.
We don’t do hosting. We do real infrastructure for people. And by that I mean gigabit-per-second-grade load balancers, real storage, and all those sorts of things. For these guys, where 30 days in they’ve pushed out 150,000 gigabytes of data, the typical wholesale pricing is about $20,000.
Then you start thinking, no wonder people have to monetize these things a bit, because the big problem is sending traffic to this Web services infrastructure from the infrastructure you have to be on to host the app.
So for us, in terms of not just limiting our own expenses, but also making it a real bonus for people, we decided just to run private lines. It’s pretty simple. Let’s spend $2,000 a month on something that is 10 times faster than something that costs $20,000 when it’s going out over the public Internet.
BC And then eliminate any of the bandwidth constraints on your customers.
JH Yes, and take latencies down from tens of milliseconds to a sub-millisecond.
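The bandwidth figures quoted above can be sanity-checked with a little arithmetic. The following is a rough sketch in Python, not Joyent's own accounting: it assumes a gigabyte is 10^9 bytes, treats the $20,000 and $2,000 figures as totals for the same 30-day volume, and uses function names that are purely illustrative.

```python
# Back-of-the-envelope check of the bandwidth figures quoted in the interview.
# Assumptions (not from the interview): 1 GB = 10**9 bytes, and both prices
# are compared over the same 30-day, 150,000 GB volume.

DATA_GB = 150_000            # data pushed out in 30 days (from the interview)
PUBLIC_COST_USD = 20_000     # wholesale price over the public Internet (from the interview)
PRIVATE_COST_USD = 2_000     # monthly price of the private line (from the interview)
SECONDS_IN_30_DAYS = 30 * 24 * 60 * 60


def average_gbit_per_second(data_gb: float, seconds: int) -> float:
    """Average sustained rate needed to move data_gb gigabytes in the given time."""
    return data_gb * 8 / seconds


def cost_per_gb(total_cost_usd: float, data_gb: float) -> float:
    """Effective price per gigabyte at the given volume."""
    return total_cost_usd / data_gb


rate = average_gbit_per_second(DATA_GB, SECONDS_IN_30_DAYS)
public_per_gb = cost_per_gb(PUBLIC_COST_USD, DATA_GB)
private_per_gb = cost_per_gb(PRIVATE_COST_USD, DATA_GB)

print(f"average rate:  {rate:.2f} Gbit/s")        # average rate:  0.46 Gbit/s
print(f"public price:  ${public_per_gb:.3f}/GB")  # public price:  $0.133/GB
print(f"private price: ${private_per_gb:.3f}/GB") # private price: $0.013/GB
```

On these assumptions the application averages a little under half a gigabit per second, and the private line works out to one-tenth of the public-Internet price for the same volume. A flat monthly line only gets cheaper per gigabyte as traffic grows, so the comparison holds at this volume and would look different for a much smaller application.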
BC That’s great. Obviously, virtualization has to play a huge role here. You want to give each Facebook developer his or her own sandbox to play in. You guys are obviously using virtualization. It’s a hot topic now, but we’ve conflated many different kinds of virtualization, virtualizing a bunch of different kinds of layers. Where do you guys virtualize and why?
JH We don’t use a hypervisor technology, meaning that typically there’s a server, there’s metal, and something has to go on it. The something is a traditional operating system or what people are calling hypervisors such as Xen or VMware’s ESX or the other couple of lesser-known ones out there.
We put a real operating system on hardware, and that has been OpenSolaris’s Nevada line. The virtualization we use is a unique flavor of Solaris Zones. It’s Solaris Zones that are actually wrapped inside their own ZFS pools that are coming off of their own LUNs (logical unit numbers) and a whole bunch of fun things around there.
There are a few reasons why we ended up going that route instead of the hypervisor route initially. One reason was, historically, we’d all been using FreeBSD jails for a while, and the administrative tools for making Solaris Zones are just nicer. The second reason was that we actually still wanted to maintain a certain degree of control. So the fact that there’s still a shared network stack means you can’t do things such as snoop traffic, and you can’t do ifconfig changes inside a zone so that people who end up in there can’t screw themselves up. People do it all the time. They have sent in support tickets complaining about how they can’t delete /dev or /proc, and they have no idea what these “folders” are doing there.
BC “Good news, I deleted /dev for you.”
JH Yes, and it’s really irritating because they’re trying to clean things up. It’s frustrating to them. So, we’ve gone with a virtualization that sits on top of an operating system. It is lighter and more flexible, and the stability that we get is, say, the stability of OpenSolaris, and it has a full repertoire of real tools. When it crashes, we can figure out why. When it’s having an issue, we can figure out why. If there’s a bug in ZFS, and there’s a null transaction, and there’s some data sitting behind it that’s in memory that hasn’t flushed to disk, we can actually drop into MDB (modular debugger) and make that happen.
BC Don’t try that at home, kids.
JH No, don’t try that at home. I couldn’t care less about even the sort of reliability or stability of an operating system that’s sitting there. I say that because before Solaris was open sourced, we were using a very nice, stable, reliable FreeBSD 4.
When we went to FreeBSD 5, a lot of that reliability and stability disappeared, especially on platforms that had more than one CPU and a bit more RAM and so on. What became completely apparent at that point was that we had no way to figure it out in production. It’s that simple. You can’t run a debug kernel, and even then the stuff is almost nonsensical. At the same time, you can’t reproduce the kinds of behaviors and loads you get when you’re messing around with a workstation.
So the criterion for what we wanted to move to next wasn’t stability or reliability. It was visibility. That’s it. I appreciate the fact that a lot of people are making efforts around Xen or Sun’s xVM. I think it’s pretty cool that when xVM dies, it somehow pushes its crash dump into Solaris, which somehow does this and that. Great. That’s nice. It’s a hack, and it’s a hack because fundamentally now with a hypervisor that’s sitting there, you’re dealing with something beyond the edge of the universe. You’re dealing with something that is put together to run on metal and support a guest operating system on top of it, and it really is going to be 10 to 15 years before people have really built in the visibility stuff.
BC I spent a long time thinking about that problem from the DTrace perspective. It’s a hard problem because just the discovery of what’s up there is a total layer violation. I don’t see how you solve that problem. You said 10 to 15 years. I wonder if it won’t be a lot longer.
JH I mean 10 to 15 years before people go, “Hmm, we should be able to see what’s going on.” I don’t know if this goes back to the pathologist in me, but if you don’t have a way of diagnosing issues, you actually don’t know what the issue is, and you have no idea what course to take to fix it. It’s nice that we’ve made these computer things, but we don’t actually know any way to figure out what the pathology is.
BC To sacrifice that observability seems to be a huge step backward.
JH Yes, that’s the thing.
BC Moving up the stack a bit, Ruby has really caught fire with Rails in particular. Joel Spolsky made the assertion on his blog that C succeeded because of the book, The C Programming Language, a claim that I reject on its face, but it’s kind of an interesting idea. Why was Ruby on Rails successful?
JH Like anything, there are a bunch of reasons, and they all contribute to it. We’ll just start with Rails and David Heinemeier Hansson, who specifically created it and released it.
One reason is that he released it as an MIT-licensed framework. That kind of license helps, since it’s about as close to being in the public domain as it could possibly be. He released it by showing a nice little screencast of it. It was amazing how popular that little video of him writing a little blogging app in five minutes was.
BC I saw it. Everyone I know saw it.
JH Another contributing factor was the timing. Rails 0.5 came out July 24, 2004, so we’re talking about a time before Flickr was bought by Yahoo. The whole idea of a Web 2.0 or another cycle picking up again was new at that point.
A lot of people were doing design and development at that time, especially in the Web space. A lot of them were unemployed, or they were doing contract work and were on basically year two or three coming out of the collapse that happened before that. It just wasn’t a good time.
Rails came out at about the time when things were starting to pick up. All of a sudden the industry is going into an upswing, and a framework is emerging that’s agile and allows people to get things done faster. Most importantly—and this is exactly what all these other sorts of frameworks miss—they made a product that converted them from contract workers to product producers.
Anybody who is doing contract work like that would be perfectly happy to have products that make them money while they’re sleeping. Now you have not only this framework that maybe for technical reasons is neat or for language reasons is pretty cool, but you also have a business case study that it came from.
Suddenly, you have one-, two-, three-, four-person shops out there that were doing contract work to stay alive over the past couple of years thinking, “Oh, my God, I, too, can have a product.” That example of them doing a framework coming out of a little business that made that transition is very important in people’s minds.
BC Did people see that model and know that they could contribute to Rails, or did they see that model and know that they could do their own framework?
JH David was very good about Rails not being forked, not becoming the foundation for other frameworks. If you actually look at the number of commits that have been done in Rails, the number of tickets that have been dealt with, and the fact that some features have been put in but more features have maybe been taken out, then you see that it’s a really, really disciplined open source project.
David obviously was very clever and intelligent about what he had to do there—and the fact that he was a dual computer science and business major helped. When he put the framework out it was very liberally licensed. Contributions were accepted, but he was the sole committer and decision-maker of those for the first year. A lot of the contributions that fed back into Rails came from real product development.
BC So it was constantly being used in anger and wasn’t just an intellectual exercise.
JH Exactly. Then at the same time, I think the open source project itself has been very well marketed and managed. David has done a stellar job of extracting a framework out of a contract gig and taking it all the way up to the point where things are relatively small, but you still have 2,000 people showing up for a conference somewhere for it.
The other thing that contributes to it is that if you look at how many Java frameworks are out there, how many PHP frameworks are out there, how many Ruby frameworks are out there, you realize that Rails has the market share of a language. Ninety-five percent of Ruby developers are using Rails—I mean, there are people who are learning Ruby because of Rails, the framework. So what you have is a framework, and in a lot of ways a product philosophy, now being the main influencer of the language.
BC That’s an interesting thought, in terms of the framework propelling people into the language. As I mentioned earlier, I disagree with Spolsky’s statement that it was a book that sold C. I think C is another example where that happened. It was Unix that propelled C.
JH There’s one thing that Ruby is very, very good at: writing domain-specific languages in Ruby. Rails is a framework for us and for building apps, but from Ruby’s perspective, Rails is a domain-specific language for expressing what a Web application should be.
When you look at the fact that Ruby as a fundamental language allows those things to come out of it, and then combine that with this framework feeding so much into the language adoption, it’s a really interesting dynamic to think about.
The funny thing is, I remember being on the Ruby list when Rails was released, and a lot of the long-time Ruby people didn’t show up for the whole first year of Rails. In fact, the last Ruby conference in San Diego—there were no Rails conferences yet—had only about 200 people and was dominated by Rails people. You had people complaining, “I couldn’t get a ticket to the Ruby conference because all these Rails people showed up.”
The initial feedback for Rails was, “Oh my God, this kid’s done too much PHP and Java in his life. This does not look like Ruby.” Then a year or a year and a half after that, nobody wrote Ruby like that anymore. Now, a lot of Ruby code that is not Rails looks like Rails.
BC It sounds like Rails succeeded because of a lot of factors, not all of them technical. What else is out there that is up-and-coming or interesting?
JH I think the biggest thing that does not yet exist is a way to answer the question, “What really are my data stores?” Let’s say you have an app that generates an object that is a pure object for that application. What if that object is now a file of a specific type, such as an image file or an audio file or a movie file? What if that object that is a movie file has to be transcoded into eight different things?
What I still have not seen—and I shouldn’t complain because I haven’t generated it myself—is the data taxonomy of online applications. What is the whole range of data that they produce and consume? What is the best way to store that data, retrieve that data, manipulate that data, and serve that data? That kind of framework on that sort of back end has not even remotely been put together by anybody.
BC How much of that data is transactional? That’s a part of the taxonomy, obviously.
JH Here’s the beauty of it. What I really want is a choice about whether it’s atomic or not, whether consistency matters, whether integrity matters, whether something has to be transactional or not.
BC It seems that you’ve got people going to one extreme or the other. You’ve got people rejecting ACID (atomicity, consistency, isolation, durability) lock, stock, and barrel and going with in-memory databases and saying, “Hey, if it’s in memory on three different machines, that’s practically nonvolatile, right?” It seems to me there’s some baby being tossed out with the bath water.
JH Well, yes, because relational databases, for example, are very good when you have to type SET in front of something. If you’d like data that a decade from now you can still retrieve for a new application, it’s fine. If you actually want to trade off, say, some aspects of consistency or integrity because you’re introducing your own replication, then fine.
My point is that what you have are guys with different types of data storage basically complaining about each other: “Oh, just use the file system.” “Oh, just use relational databases.” What I don’t see is anybody saying, “Hey, wait a minute. What data do we produce? What do we consume? What do we need to store? What do we have to retrieve? What do we have to manipulate? What do we have to serve? And, finally, how are we going to do that in the most efficient and scalable way?”
Most importantly, that decision should be pushed to the application developer. What I mean is that first you have control of your app methods, and you know what your objects are, right? You know what kind of data is coming out. You know what you’re going to be doing with it. What you should also be defining is how you’re expecting it to be stored, how redundant you’re expecting it to be, and what the service levels are on these things.
BC Should you be defining your bandwidth-versus-latency tradeoff and where you are in that matrix?
JH You should be able to do it. I was talking to a friend of mine who is having a child. He shows me his registry on babiesrus.com. I looked and I said, “Oh, you know, I really think you need a sterilizer. It’s not in your registry, and here’s the sterilizer I want to get you. Can you add it to your registry so I don’t have to enter in the shipping and everything?”
He adds it to his registry, and I hit refresh, but it’s not there for me. Hit refresh. Not there for me. Hit refresh. Not there for me. But it’s there for him.
That’s because the guys who wrote the primary key back end for Amazon made a decision that the person who adds something to a list should immediately see that data, and anybody new to the list should see that data (because when I fired up another browser, I saw the new data just fine). But anybody who just looked at it five minutes ago probably doesn’t care that there was just an update because they were just looking, and may already have clicked on something to add to a shopping cart. So why would they show me the update?
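The behavior described here, where the writer sees an update immediately while someone who looked a few minutes ago keeps a slightly stale view, can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Amazon's back end works: the `Registry` class, the per-session snapshot cache, and the 300-second lifetime are all hypothetical.

```python
import time
from typing import Callable, Dict, List, Tuple


class Registry:
    """Hypothetical gift registry with per-session cached views."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._items: List[str] = []
        self._ttl = ttl_seconds
        self._clock = clock
        # session id -> (time the snapshot was taken, snapshot of the items)
        self._views: Dict[str, Tuple[float, List[str]]] = {}

    def add(self, session: str, item: str) -> None:
        """Store the item and refresh only the writer's own view."""
        self._items.append(item)
        self._views[session] = (self._clock(), list(self._items))

    def view(self, session: str) -> List[str]:
        """Return the session's cached snapshot until it is older than the TTL."""
        now = self._clock()
        cached = self._views.get(session)
        if cached is not None and now - cached[0] < self._ttl:
            return list(cached[1])        # possibly stale, by design
        snapshot = list(self._items)      # new or expired session: read fresh
        self._views[session] = (now, snapshot)
        return snapshot


# A fake clock makes the staleness window easy to demonstrate.
fake_now = [0.0]
registry = Registry(ttl_seconds=300.0, clock=lambda: fake_now[0])

registry.add("owner", "crib")
first_look = registry.view("friend")           # friend looks; snapshot is cached
registry.add("owner", "sterilizer")

owner_view = registry.view("owner")            # the writer sees the update at once
friend_view = registry.view("friend")          # the friend still sees the old list
new_browser_view = registry.view("friend-2")   # a brand-new session reads fresh

fake_now[0] = 301.0                            # the friend's snapshot has expired
friend_later = registry.view("friend")
```

In this sketch, opening a second browser corresponds to `friend-2`: a session with no cached snapshot reads the current list, which matches what Hoffman saw.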
BC The question is, do you think they implicitly or explicitly restricted the coherence domain?
JH They did that on purpose. That’s the tradeoff. On a large-scale system, that’s what allows you to serve faster. I would love it if, and hopefully we’ll provide this, a guy would sit down with a framework like Rails, having gone through and defined the data he wanted, and then just as you set caching headers for Web servers, decided for a given piece of data how replicated it should be. How many copies of it should exist on a system? How resilient should this be? What should the integrity of this data be?
Then, you can actually say things like, “The rules for the presentation of this are as follows: if somebody wrote a blog post, they see it right away. Their friends see it within this time period. Strangers see it within this time period. Search engines see it within this time period.”
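As an illustration only, presentation rules of this kind might be declared next to the data definition, much as caching headers are declared next to a Web resource. The Python sketch below is hypothetical throughout: no such framework exists in the interview, and the `DataPolicy` name, the audience classes, the replica count, and every delay value are invented for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Audience(Enum):
    AUTHOR = "author"
    FRIEND = "friend"
    STRANGER = "stranger"
    SEARCH_ENGINE = "search_engine"


@dataclass
class DataPolicy:
    """Hypothetical per-data-type policy declared by the application developer."""
    replicas: int                              # how many copies should exist
    visible_after: Dict[Audience, float]       # seconds after publishing


# Illustrative values only; the interview names the audiences, not the delays.
BLOG_POST_POLICY = DataPolicy(
    replicas=3,
    visible_after={
        Audience.AUTHOR: 0.0,           # the writer sees the post right away
        Audience.FRIEND: 60.0,
        Audience.STRANGER: 600.0,
        Audience.SEARCH_ENGINE: 3600.0,
    },
)


def is_visible(policy: DataPolicy, audience: Audience,
               published_at: float, now: float) -> bool:
    """True once the audience's delay has elapsed since publication."""
    return now - published_at >= policy.visible_after[audience]
```

With a post published at time 1000.0, `is_visible(BLOG_POST_POLICY, Audience.AUTHOR, 1000.0, 1000.0)` is true immediately, while the same call for `Audience.FRIEND` becomes true only from time 1060.0 onward. The point of the design is that the delay becomes a declared property of the data, so the system is free to replicate and propagate in the background.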
You begin to define when things are allowed to show up in the system. The reality is that some things have to be immediate and low latency. The problem with this is that people are always writing Web apps as if every single bit and piece of it has to be immediate and low latency, and they almost demand that of things that cannot be immediate and low latency. They clog up the whole system with so many things that should be offloaded to the background or should have some sort of scheduling.
It’s stupid to think that every time I do something in Facebook, it immediately explodes out so that all my friends see it. No, it can be staged in. Every time I’ve seen applications that people are writing as Web apps, they’re doing realtime updates.
What you see time and time again is that people who end up in this situation come up with yet another one-off, yet another hack to stay up, yet another thing just to get by so they can maybe cash this baby out. What’s really lacking is a full system framework.
BC That’s a good view of computing...circa 2037 maybe.
For all the other great bits that are fit to print—but that didn’t fit on the page—please tune in to our upcoming Queuecast of their discussion at www.acmqueue.com.
Originally published in Queue vol. 6, no. 1.