A Conversation with Phil Smoot

January 31, 2006
Volume 3, issue 10

The challenges of managing a megaservice

In the landscape of today’s megaservices, Hotmail just might be Mount Everest. One of the oldest free Web e-mail services, Hotmail relies on more than 10,000 servers spread around the globe to process billions of e-mail transactions per day. What’s interesting is that despite this enormous amount of traffic, Hotmail relies on fewer than 100 system administrators to manage it all.

To understand how they do it, and to learn more about what it takes to manage such an enormous service, we invited Hotmail engineer Phil Smoot to speak with us. Smoot has been with Microsoft for 11 years and is a product unit manager in Microsoft’s MSN division. He manages the product development teams in Silicon Valley that are responsible for the Hotmail-MSN Communication platform, which includes storage, e-mail delivery, spam prevention, protocol services, directory services, and data warehousing. Prior to Hotmail, Smoot worked with a variety of groups at Microsoft including the Visual Basic team, Microsoft Research, WebTV, and Microsoft Sales and Consulting. His academic background is in physics and computer sciences.

Smoot is interviewed by Queue editorial board member Ben Fried, who has his own experience managing a large IT infrastructure. As managing director in Morgan Stanley’s IT department, Fried manages teams in New York, London, Tokyo, and Hong Kong that are responsible for the diverse infrastructure—including Web hosting, document management, workflow, instant messaging, collaborative tools, and desktop productivity applications—that supports the company’s global knowledge workers.

Fried’s background in IT includes stints as a dBASE II programmer, front-line support manager, Windows 1.0 programmer, and Unix systems programmer. Prior to joining Morgan Stanley, Fried was lead programmer for a California startup building mission-scheduling software for NASA’s orbital observatories.

BEN FRIED Let’s start by talking about how you got into this job.

PHIL SMOOT Prior to Microsoft I worked at Pacific Gas and Electric building client-server services applications. I joined Microsoft in 1994 as a consultant, and my job was helping enterprise customers and ISVs build or port their services on Windows and SQL Server. I worked with Tom Barclay and Jim Gray creating the original TerraServer database (1 TB single-node database system). I worked with the WebTV team after that acquisition. Over time, I noticed I was gravitating to services, both internal and external, that were creating large scale-out types of applications. About 5½ years ago I found the opportunity at Hotmail to work on one of the largest distributed systems in the world, and that’s what I’ve been doing ever since.

BF Can you give us some sense of just how big Hotmail is and what the challenges of dealing with something that size are?

PS Hotmail is a service consisting of thousands of machines and multiple petabytes of data. It executes billions of transactions over hundreds of applications agglomerated over nine years—services that are built on services that are built on services. Some of the challenges are keeping the site running: namely dealing with abuse and spam; keeping an aggressive, Internet-style pace of shipping features and functionality every three and six months; and planning how to release complex changes over a set of multiple releases.

QA is a challenge in the sense that mimicking Internet loads on our QA lab machines is a hard engineering problem. The production site consists of hundreds of services deployed over multiple years, and the QA lab is relatively small, so re-creating a part of the environment or a particular issue in the QA lab in a timely fashion is a hard problem. Manageability is a challenge in that you want to keep your administrative headcount flat as you scale out the number of machines.

BF I have this sense that the challenges don’t scale uniformly. In other words, are there certain scaling points where the problem just looks completely different from how it looked before? Are there things that are just fundamentally different about managing tens of thousands of systems compared with managing thousands or hundreds?

PS Sort of, but we tend to think that if you can manage five servers you should be able to manage tens of thousands of servers and hundreds of thousands of servers just by having everything fully automated—and that all the automation hooks need to be built in the service from the get-go. Deployment of bits is an example of code that needs to be automated. You don’t want your administrators touching individual boxes making manual changes. But on the other side, we have roll-out plans for deployment that smaller services probably would not have to consider. For example, when we roll out a new version of a service to the site, we don’t flip the whole site at once.

We do some staging, where we’ll validate the new version on a server and then roll it out to 10 servers and then to 100 servers and then to 1,000 servers—until we get it across the site. This leads to another interesting problem, which is versioning: the notion that you have to have multiple versions of software running across the sites at the same time. That is, version N and N+1 clients need to be able to talk to version N and N+1 servers and N and N+1 data formats. That problem arises as you roll out new versions or as you try different configurations or tunings across the site.
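As an illustration only, the staged rollout described above (1 server, then 10, then 100, then 1,000, until the whole site is covered) can be sketched in a few lines of Python. This is not Hotmail's deployment tooling: the names `rollout_waves` and `staged_rollout`, the growth factor of 10, and the `deploy` and `healthy` callbacks are all assumptions made for the example.

```python
def rollout_waves(total_servers, factor=10):
    """Return cumulative server counts per wave: 1, 10, 100, ... then the full site."""
    waves = []
    size = 1
    while size < total_servers:
        waves.append(size)
        size *= factor
    waves.append(total_servers)
    return waves


def staged_rollout(servers, deploy, healthy, factor=10):
    """Deploy wave by wave; stop as soon as any upgraded server looks unhealthy.

    `deploy` and `healthy` are hypothetical callbacks supplied by the caller.
    Returns (number_of_servers_upgraded, all_waves_succeeded).
    """
    done = 0
    for target in rollout_waves(len(servers), factor):
        for server in servers[done:target]:
            deploy(server)
        done = target
        # Validate everything upgraded so far before widening the rollout.
        if not all(healthy(server) for server in servers[:done]):
            return done, False
    return done, True
```

While a rollout like this is in flight, some servers run version N and others run N+1, which is why the mixed-version compatibility mentioned above matters.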

Another hard problem is load balancing across the site. That is, ensuring that user transactions and storage capacity are equally distributed over all the nodes in the system without any particular set of nodes getting too hot.

BF It sounds like there’s a capacity planning problem, making sure that as you’re changing things, you have the capacity to handle the variations you need to test.

PS Right, and that probably goes back to the specification process. As you specify the new feature, what is the anticipated effect that you’re going to see on the live site? What instrumentation needs to be in place to measure the effect? Can you measure that variance in the QA lab? Ultimately, when you deploy the new feature, you need to validate your original assertions.

BF Do you find that in doing development you spend a lot more up-front time preparing or thinking about things to make scaling easier than you would if you had only a few machines and didn’t have those problems?

PS The big thing you think about is cost. How much is this new feature going to cost? A penny per user over hundreds of millions of users gets expensive fast. Migration is something you spend more time thinking about over lots of servers versus a few servers. For example, migrating terabytes worth of data takes a long time and involves complex capacity planning and data-center floor and power consumption issues. You also do more up-front planning around how to go backwards if the new version fails.

Another big difference that I’ve found in shipping products versus shipping services is that you have to have a real awareness of exactly what effect an error or failure is going to have on the operations team. You have to ask: Is this really going to pull an engineer out of bed in the middle of the night?

BF Are there specific things one has to learn to write code that way, and if so, do they get taught in the standard computer science curricula? Do you spend a lot of time training people to understand your problem space?

PS Understanding versioning, up-front instrumentation, and cost and operational implications for a design comes with experience. Learning how to isolate site failures from the rest of the system is not something I was taught in school. At every level—failing LUNs (logical unit numbers), failing servers, failing services, failing networks—you want to make sure that all calling services don’t tie up all of their resources on any particular component. And you want to make sure that operationally you can isolate a broken or failing hardware component or server or service from the rest of the network. Most of this experience comes with time. We try to build our bench with folks who understand what mistakes not to make. New hires tend to want to do complex things, but we know complex things break in complex ways. The veterans want simple designs, with simple interfaces and simple constructs that are easy to understand and debug and easy to put back together after they break.

The best practice is to instill these ideas early, within the design review and code review processes.

BF The operational aspect of your environment is very interesting. In my experience people have a mindset that there are fixed ratios of administrators to systems. I would bet that a lot of practicing system administrators think, “Well, after I get up to, say, 1,000 servers, I’ll need my boss to hire another one of me.” And you’ve already suggested that we want to have the same number of administrators no matter how many systems you have.

PS Administrators per server is a metric we’ve tracked pretty closely, and it varies depending on the kind of server. When you ultimately get down to the cost of these big, huge megasystems, there’s going to be a good chunk of it for your hardware and another big chunk for human costs, so the more you automate processes and the more resilient you can be with handling failures, the more competitive you can be.

BF What skills do you look for in administrators, and what ways should we think about administering these systems that are different from common practice? Are there insights or mantras to keep in mind when building a system for simple administration—especially a system that might one day be big?

PS The administrative mantra is to automate. Scripting can also go a long way. From an engineering point of view, the requirement has to be to build automation and instrumentation into the service from the get-go.

BF That’s difficult to achieve. It’s very common to find engineers who don’t want to do administration, or operations people who are so heads-down that it’s difficult to get them thinking along those lines.

PS The reality is that managing a live site—and this is mostly because of spam and abuse—puts a ton of pressure on the development and system engineering resources. The folks on the front line don’t always have the time to provide timely feedback to the product group, and the product group, which is not working on the live site day in and day out, doesn’t understand everything that needs to be automated.

BF I’d like to talk a little about tools. In particular, what tools do you need to build rather than buy?

PS Clearly, we’re a Microsoft shop and we’re going to leverage everything that the public can leverage, which would be Visual Studio, SDK tools, and SQL and all the tools associated with it. Custom tools that we may build would be more in the area of deployment, metrics gathering, ticketing, bug tracking, code coverage, monitoring, inventory, failure detection, and build systems.

We do leverage the Windows operating system’s perfmon (performance monitor) counters, event logs, Active Directory, and things like that. But we also may supplement them with custom tools for additional granularity or debugging or logging. We also have a number of processes and tools in place to help us understand what the current state of the site is.

You need change management processes and ticketing systems to track which changes are being applied to the system, which hardware is being changed, which service pack patches are being installed, which application upgrades are being installed, and which operating system bits are out there. You need to track configuration changes, for example, if you want to tune things from X threads to Y threads. You need to be able to go back in two weeks’ time and say, “You know, we noticed this performance blip and we see that it originated a couple of weeks ago. What exactly changed on the site?”
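The "what exactly changed on the site?" question can be made concrete with a small sketch. It assumes every change (patch, upgrade, configuration tweak) is recorded with a timestamp; the `Change` record, its fields, and `changes_between` are hypothetical names for illustration, not a description of the tools Smoot's team used.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Change:
    when: datetime   # when the change was applied
    target: str      # hypothetical: server or service name
    kind: str        # e.g. "config", "patch", "upgrade"
    detail: str      # e.g. "threads X -> Y"


def changes_between(log, start, end):
    """Return the changes applied in [start, end], oldest first."""
    hits = [c for c in log if start <= c.when <= end]
    return sorted(hits, key=lambda c: c.when)
```

Given a performance blip that began two weeks ago, querying the window around that date narrows the search to the handful of changes that could have caused it.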

BF You’ve got a particularly hard problem there: you have lots and lots of servers, a rapidly changing code base at every tier, and lots of different versions of everything. Is it essential to be able at any point in time to understand exactly what’s running everywhere or what ran everywhere?

PS Yes, it’s a hard problem. The key is getting the underlying operational infrastructure in place and then being disciplined across all parts of the organization so that you’re all marching the same way, so that all deployments come through one way, all imaging comes through one way, and all applications generate errors and alarms and get monitored in the same way. You’re going to be able to get economic advantages by scaling out your operational people less and less because everything’s consistent.

BF Do you find that you have a class of problems that vendors won’t solve because there just isn’t a big enough market for the tools to manage that kind of complexity?

PS There are just not many megaservices in the market. ISVs are certainly not developing for them.

BF Does spam create scaling challenges?

PS Spam has been an ongoing battle for years. The ongoing scaling problem was that as you brought on more and more mail servers to handle the incoming load, the ever-increasing number of mail servers could easily overwhelm the storage servers. This was amplified by the fact that more mail servers were also required to handle more filtering CPU cycles. The good news is that we have seen a leveling off over the last year or so. We also feel we are roughly 98 percent effective at identifying spam messages and sources.

BF Can you quantify in some way the extent of the spam problem?

PS It is massive. Years ago we saw as many as 3 billion incoming messages. This has declined, but the estimates are that 75 percent of all e-mail is spam. Over the past couple of years our techniques have gotten better, and our partnerships with other major ISPs have improved. I would say spam is still gross and abusive, but it hasn’t been getting worse lately.

We do continue to react to spam on a daily basis as spammers continue to seek out holes in our defenses. What we see now is more sophistication in the spammers—more phishing schemes, people trying to get credit card numbers and that kind of thing.

BF The data-oriented problems—the data-scaling issues—that you have must be intimidating. What are the challenges in managing all of that data? What about backups, for example?

PS The notion of tape backups is probably no longer feasible. Building systems where we’re just backing up changes—and backing them up to cheap disks—is probably much more where we’re headed. How you can do this in a disconnected fashion is an interesting problem. That is, how are you going to protect the system from viruses and software and administrative scripting bugs?

What you’ll start to see is the emergence of the use of data replicas and applying changes to those replicas, and ultimately the requirement that these replicas be disconnected and reattached over time.

BF Are there scaling reasons to think about the benefits of a command line for managing over a GUI, or are there other things to think about?

PS Our operations group never wants to rely on any sort of user interface. Everything has to be scriptable and run from some sort of command line. That’s the only way you’re going to be able to execute scripts and gather the results over thousands of machines.
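To show why scriptability matters at this scale, here is a minimal sketch of running one command across many machines and gathering the results. The function name `fan_out` and the `run_on_host` callback are assumptions for the example; how a command actually reaches a host (remote shell, agent, and so on) is deliberately left to the caller.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(hosts, run_on_host, max_workers=64):
    """Run `run_on_host(host)` for every host in parallel.

    `run_on_host` is a hypothetical callback that executes the script on one
    machine and returns its output, or raises on failure.
    Returns (results, failures), both keyed by host name.
    """
    def attempt(host):
        try:
            return host, run_on_host(host), None
        except Exception as exc:  # one bad host must not abort the sweep
            return host, None, exc

    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for host, output, error in pool.map(attempt, hosts):
            if error is None:
                results[host] = output
            else:
                failures[host] = error
    return results, failures
```

Collecting failures separately, rather than stopping at the first one, reflects the assumption that at thousands of machines some hosts will always be unreachable.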

BF Earlier you touched on big problems with QA, such as how to replicate what you see at a massive scale without having to duplicate that scale in your lab. Are there ways of thinking about the problem that can help?

PS We strive to build tools that can replay live-site transactions and real-type live-site loads against single nodes. The notion is that the application itself is logging this data on the live site so that it can be easily consumed in our QA labs. Then as applications bring in new functionalities, we want to add these new transactions to the existing test beds.

BF How formal are your standards for application development and QA? Do you have many formal, highly structured processes?

PS We certainly have best practices, which are feature specs that tell the engineers what to build, and design specs that tell QA what to test, and QA specs that are used to validate that the engineers and the program managers are testing exactly what we’re intending to build. Then we have specifications that say, “How do you deploy and build the entire stack from the metal up to the application bits?” The pressure is on us to deploy new functionality at “Internet” speed. The trade-off is specification rigor versus time. The genius is getting the balance correct—not too much and not too little and within rapid iteration cycles.

BF Do you think about having ratios of QA people or QA cycles to development cycles? Is that a necessary practice?

PS In the past we’ve thought about a one-to-one ratio, but this is changing. The thinking now is having much more functional testing pushed back into engineering, and having QA focus on process and integration harnesses and testing. Insofar as the Hotmail service is part of the larger MSN service infrastructure, some of the harder problems we face are the integration testing issues.

BF Are there problems that are unique to scaling for the Internet as opposed to scaling for some large, internal computational problem or trying to build a computational grid for that kind of thing? When you are building a scaled system, what kinds of problems do you think the Internet introduces that you don’t see elsewhere?

PS The problems are those of basic client-server programming—that is, figuring out the browser/http/server data-access patterns and optimizing the protocols, extending these protocols as new functionality is introduced, and ensuring that these protocols work across geo-distributed data centers when the speed of light becomes a factor. Designing applications with built-in redundancy so that they are resilient to abuse is also a challenge.

BF What’s a good way to think about capacity planning? When you’ve got a lot of servers, how do you go about replacing a server or adding a new one?

PS One model we’ve had success with is the notion of a cluster, which is a unit of build-out. For example, for this many users we require X servers, Y networking, and it costs Z dollars. The ideal is that clusters can be built out in a cookie-cutter fashion. Another notion is that you want to minimize the number of hardware SKUs (stockkeeping units) because QA has to test all these SKUs, which introduces complexity into the deployment processes and slows the product life cycle. Minimizing the SKUs, though, has to be balanced with cost considerations (i.e., some SKUs require optimizations to keep prices down, which implies more SKUs).
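The cluster-as-unit-of-build-out idea reduces capacity planning to simple arithmetic. The sketch below assumes a cluster is fully described by the users it serves, the servers it contains, and its cost; `ClusterSpec`, `plan_buildout`, and all numbers in the usage note are made up for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    users_per_cluster: int     # "for this many users..."
    servers_per_cluster: int   # "...we require X servers..."
    cost_per_cluster: float    # "...and it costs Z dollars"


def plan_buildout(spec, projected_users):
    """Return how many cookie-cutter clusters a user projection requires."""
    # Clusters are indivisible units, so round up.
    clusters = math.ceil(projected_users / spec.users_per_cluster)
    return {
        "clusters": clusters,
        "servers": clusters * spec.servers_per_cluster,
        "cost": clusters * spec.cost_per_cluster,
    }
```

With a hypothetical spec of 1,000,000 users, 50 servers, and $250,000 per cluster, a projection of 3.5 million users rounds up to four clusters, 200 servers, and $1,000,000.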

BF So you’re thinking about everything down to the iron and the wire in order to do QA accurately?

PS Yes.

BF That’s something a lot of us don’t think about. We think about a computer as a computer and a network as a network and don’t want to worry about any of that.

PS If you talk to my operations buddies, they will tell you the way they think of the stack, too, goes all the way down to the floor—to the electricity, to cooling, to bandwidth. We have to be efficient at laying out the entire stack, repeatably, and verifying that it’s built correctly.

BF What do you think might surprise our readers about managing large numbers of servers or deploying a megascale system? Are there surprising things that drive how you think about writing software?

PS We spend a lot of time analyzing the specific SKU for the application we need—in case, for example, we overbuild some sort of X box for which we thought the application requirement was, say, two processors, and it turns out it was one processor. Those processors suck a lot of juice and require a lot of cooling and floor space and therefore limit how much can be deployed over time.

BF Are any innovations or technologies just maturing now that will make any of these problems easier, other than the next turn of the crank on Moore’s law?

PS The one problem I think about is disk arms versus disk space. A lot of applications are I/O bound rather than disk-space bound; yet what you see are disks getting bigger and bigger and bigger over time. I wonder how that market works. I mean, at what point do disk manufacturers say, “OK, I think a terabyte is enough”? One of the harder problems is how to balance I/O and data. Do you put all of the hot data in one place and run it on different kinds of disks that are very good for that kind of thing, or do you run it with the cold data so that you can blend it? That has always been a fairly interesting question. I’m a big fan of warm data.

BF Is storage going to change enough so that maybe just the next round of disks will be fast enough that you don’t need to worry about that?

PS If you rely on scale up, you’ll probably get killed. You should always be relying on scale out.

BF What about handling failures in the software or the hardware?

PS As you go to, let’s say, a commodity model, you have to assume that everything is going to fail underneath you, that you have to deal with these failures, that all the data has to be replicated, and that the system essentially self-heals. For example, if you are writing out files, you put a checksum in place that you can verify when the file is read. If it wasn’t correct, then go get the file somewhere else and repair the old file.
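The checksum example Smoot gives can be sketched as follows. This is a minimal illustration under stated assumptions, not the Hotmail storage code: it assumes SHA-256 as the checksum, a sidecar `.sha256` file next to each data file, and replicas reachable as ordinary file paths. The function names are hypothetical.

```python
import hashlib


def write_with_checksum(path, data):
    """Write the file plus a sidecar checksum file (assumed layout)."""
    with open(path, "wb") as f:
        f.write(data)
    with open(path + ".sha256", "w") as f:
        f.write(hashlib.sha256(data).hexdigest())


def read_verified(path):
    """Return the file's bytes if they match the stored checksum, else None."""
    try:
        with open(path, "rb") as f:
            data = f.read()
        with open(path + ".sha256") as f:
            expected = f.read().strip()
    except OSError:
        return None  # a missing file or checksum counts as a failed copy
    return data if hashlib.sha256(data).hexdigest() == expected else None


def read_with_repair(primary, replicas):
    """Read `primary`; on a checksum mismatch, fetch a good replica and repair."""
    data = read_verified(primary)
    if data is not None:
        return data
    for replica in replicas:
        data = read_verified(replica)
        if data is not None:
            write_with_checksum(primary, data)  # heal the bad copy in place
            return data
    raise IOError("no valid copy of %s found" % primary)
```

The reader never sees the corruption: a bad primary is detected on read, served from a replica, and rewritten, which is the self-healing behavior described above.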

BF So you have to put a lot of thought into a commodity solution versus a “premium” solution—the commodity has up-front cost benefits, but there are also a lot of operational or other costs.

PS There’s probably an interesting crossover, which you allude to. In some small applications—and I don’t know what small is anymore—the cost of the engineering is the dominant factor. At a certain point, however, the engineering cost is overwhelmed by the operational costs.

There are lots of productivity tools that make an engineer’s life much easier because you can get services for free. But those services themselves might not be the most efficient. So in a smaller application, you can get away with it. But in a megaservice, things are built from the ground up and optimized for cost as the R&D effort essentially is dwarfed by the operational cost.

BF Here’s the ultimate open-ended question: If you were to talk to someone who was about to walk into a situation managing or operating and engineering a megaservice, or was thinking about creating such a service from scratch, what kind of advice would you give? What questions would you want this person to keep in mind?

PS I’d see if this person really wants to do it. Has he or she considered teaching? The best advice is just basically to keep everything as simple as possible—simple processes, simple SKUs, simple engineering. These systems get to be very big very fast. I don’t think there’s really any one particularly hard, gnarly problem, but when you add them all up, there are lots and lots of little problems. As long as you can keep each of those pieces simple, that seems to be the key. It’s more of a philosophy, I think, than anything else.

Originally published in Queue vol. 3, no. 10