MICHAEL VIZARD: Hello, and welcome to today's edition of ACM Queuecast with your host, Mike Vizard. This edition of the show focuses on the system management issues associated with large scale computing systems and networks. Joining us today to discuss these issues is Rob Gingell, formerly a chief engineer of Sun, who today is the CTO of Cassatt, a company that makes tools for automating server management. Rob, welcome to the first inaugural (I guess that's slightly redundant) Queuecast.
ROB GINGELL: Well, thank you for having me.
MV: Oh, my pleasure. Why don't you tell the folks a little bit about what Cassatt is before we jump into it because I'm not sure that Cassatt is quite a household name yet, but it's working its way there.
RG: Well, we're trying, anyway. We're a company building service-level automation software for enterprise networks, and we're about trying to build a system for scale. And the value we're offering is to help people address their complexity problems. What we're building is effectively the operating system between the other operating systems.
MV: So the topic for today is large scale networking and computing in general. People throw around the buzzwords of "utility computing" and "on-demand" all the time. From a technical perspective, is this really in your mind a new style of computing, and what distinguishes it from the previous generations of distributed computing?
RG: Well, it is a word that people throw around a lot. It's one of these things that if you ask 10 people, you're likely to get about 15 answers for what people think it is, and the variations that people have about it are all about things such as increasing flexibility, charging models for how you get computer resources, either becoming or being a utility in a sort of electrical utility sense. It's almost a sociological change going on in organizations where lines of business are taking greater responsibility for the information content of the information services that drive the business, but are expecting to consume more and more utility-like services. And I think it's something that's being promoted by the fact that the cost of the underlying resources, primarily servers, is dropping a good bit, so that the habits that we've had as an industry that were formed largely in days when the mainframes were the computers, and they were expensive and rare, and everybody had to sort of worship at the altar of the IT environment, are changing a lot. But among all these definitions, the one technical thing that I think permeates all the various definitions that you'll hear for it is the idea that between applications and the deployment resources that are used to run them, there's a separation being created so that you can move those applications among different deployment resources as you have the need to do so, either to create an amplification of the service by adding replicated resources or simply to make better use of the resources you have. And in that usage, it's almost something that's related to another word that you hear tossed around a lot, namely grid computing. And that's another one of those terms that has 15 definitions for every 10 people, it seems.
And they're related in the sense that if utility computing is about creating a separation, then grid computing is about aggregating resources, either at a logical or physical level, but to solve some specific problem. So I actually think of these things as sort of duals of each other.
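[Editor's sketch: the separation between applications and deployment resources described above can be illustrated with a few lines of Python. This is only a minimal illustration under stated assumptions, not anything Cassatt ships: the function `place`, the application names, and the notion of a server having a number of free "slots" are all hypothetical.]

```python
def place(replicas_wanted, free_slots):
    """Assign each application replica to whichever server has the most room.

    replicas_wanted: dict of application name -> number of replicas (hypothetical)
    free_slots:      dict of server name -> free capacity slots (hypothetical)
    The application never names a server; the placement layer decides.
    """
    if not free_slots:
        raise RuntimeError("no servers available")
    free = dict(free_slots)  # copy so the caller's inventory is untouched
    placement = {}
    for app, count in replicas_wanted.items():
        placement[app] = []
        for _ in range(count):
            # Pick the server with the most free slots (first one wins ties).
            server = max(free, key=lambda name: free[name])
            if free[server] == 0:
                raise RuntimeError("not enough capacity for " + app)
            free[server] -= 1
            placement[app].append(server)
    return placement


if __name__ == "__main__":
    print(place({"web": 3, "db": 1}, {"a": 2, "b": 2}))
```

[Adding a replica or a server changes only the inputs to `place`; the applications themselves stay unaware of which boxes they land on.]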
MV: I guess going forward, that would seem to mean that just about every computing asset becomes some kind of virtual entity that I can manage. So in terms of what that means for how we go about provisioning servers and managing our networks, will that be a fundamental change in philosophy?
RG: Yeah, I think it's going to be a large change, although it's going to be a change that's going to be reminiscent of some changes we've seen in the past. One of the big implications of it is we're going to rediscover sort of resource management as a function of how you have to operate these environments. In the last couple of years, as people have built computing environments out of gangs of servers, whether it's real servers or virtual servers, we've accumulated a variety of programming models (J2EE, .NET, things like that) which have addressed the needs of how programs are structured across these environments, but we haven't really addressed how it is the resources themselves get managed, the other half of what an operating system provided you in a single box. And the challenge is going to be to do this over a very different scale of complexity than we've traditionally dealt with. We spent sort of the first half-century of computing learning how to do resource management within a single computer, and that's what operating systems started out being as a way of utilizing the resource across multiple demands. And now we're going to have to accumulate that same function across a collection of resources. So in some ways there's a bit of a "Back to the Future" thing going on here, where I guess fortunately for those of us who are old enough to have been around the first time we went through, we're going to get to reapply some of these experiences to the network environment that's going to be created out of them. And I think my only fear about this is that the Journal of the ACM, for instance, is going to get another whole set of papers that look like paging algorithms, only different, just at the time I thought we were finally done with that.
MV: I guess if I understood you correctly, we've basically done some level of virtual computing assets within the operating system, where we have JVMs or we have something like VMware. The challenge now is to take that concept out to a distributed computing kind of mindset. How long do you think it will be before we get there, and where are we in the arc of that development?
RG: Well, I think it's probably something that will develop over the remainder of the decade. And we're pretty early, I think, in making progress across that arc, although we've probably spent the last 10 years as an industry making relatively serious efforts at doing this. And the reason I don't think we've gotten all that far is simply the inertia of many of the applications in environments that we've built up previous to this, many of which unfortunately are probably ill-suited to being part of a real networking environment. And one of the things I'd like to make a distinction about is the difference between distributed computing and network computing. And one of the sources of difficulty we have is that in the early part of networking, we too often let the operating system people have a lot to say about how it is that networking showed up in the context of an operating system, where most operating system designers basically wanted to build an operating system that was effectively a single system across all of these resources. I was actually one of those people, so I'm going to speak a bit with the zeal of a reformed person in the sense that I think it was erroneous to let operating system people do this because operating system people are forever trying to partition parts of the problem away and just present ever and ever greater idealized resources to people. And the trouble is you can't really do that with networking in the way we've done it with things like memory and local storage and so forth. Networks introduce a series of properties that at some level the applications actually have to participate in. Before coming to Cassatt, I was with Sun for 20 years, and a lot of what we did there was about doing network computing rather than distributed computing, in part because the network was a really different entity than just a big computer sort of distributed. 
And Peter Deutsch, when he was at Sun, formed a collection of fallacies of distributed or network computing environments that were basically things that are laughably false, but which every system to date seems to have included as part of its premises. If you'd like, I think I can rattle these off for you. He came up with seven, and then later added an eighth because the last one was at first thought to be obvious, but later was found not to be. And the fallacies are that first of all, the network is reliable in the same way that a single system is reliable. And the meaning of that was that systems as single computers tend to fail in their entirety when they fail, such that when you go about recovering them, you get a chance to know the state of everything. But the reality is networks are not reliable in that way. They fail partially and in ways you can't tell, so you have to program them with the expectation that in fact failures can occur and you have to have the application prepared to participate in the reliability of it. Another fallacy is that the latency of a network is zero, so that echoing characters across the network is as efficient as echoing them across a single terminal. Bandwidth is infinite. The network is secure. The topology never changes. There's one administrator. You know, that's fallacious in sort of two directions: There's either no administrator or there are many of them, and you're only lucky if they're likely to be working conspiratorially toward a common end. The cost of networking is zero, or effectively nil in terms of the overhead it produces. And that finally, the network is homogeneous. And if you look at almost all of the systems that we've built thus far, they all suffer from quite a number of those fallacies, and we're only now really learning to sort of cope with that.
And so this drive towards virtualization because it creates more things that have to be connected is going to pressure this and I think we'll begin to make some progress in building network environments that are really friends of the network, as opposed to things that are trying to deny its existence.
MV: So as I go to extend my applications across the Web, I'll try and emulate something that Amazon and Yahoo have done: I'm going to discover that I probably need to think about rewriting that application in the first place because extending it is going to be harder and fraught with all kinds of technical issues that are just going to result in time-outs and whatever else that may happen because the network is not as stable as the operating system that the network was originally built for.
RG: Yeah, I think that's true. And actually Amazon and things like that are pretty good examples because they've gone part of the way, but the other factor that will be a change that started developing over the last five years has been the emergence of so-called network or Web services, where it isn't humans interacting with a service by driving a browser, but where there may be other programs driving the network by using some of the same interfaces that the humans now work off of. And a property of having the programs operate the network, as opposed to humans, is that the humans can often do things that programs are not really inclined to do naturally. For instance, if you get a spurious browser error, you know to go retry it, or there's a retry button, and you can make sense visually of what sort of shows up out of that. Another program won't really do that. And so part of what will cause the drive to have better models of things will be the fact that it's programs driving cascaded uses of these services, and not just humans.
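[Editor's sketch: the "retry button" that a human presses has to be written down explicitly when a program is the client. The Python below is a minimal illustration, assuming a hypothetical `TransientError` that marks failures worth retrying and a stand-in service built by `make_flaky_service`; none of these names come from the interview or from any real library. It retries with a doubling delay and gives up after a fixed number of attempts.]

```python
import time


class TransientError(Exception):
    """Hypothetical marker for a failure worth retrying (e.g. a timeout)."""


def call_with_retry(operation, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on TransientError wait, then retry with doubled delay."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)


def make_flaky_service(failures):
    """Build a stand-in service that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def service():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientError("spurious error")
        return "ok after %d calls" % state["calls"]

    return service


if __name__ == "__main__":
    # Skip real sleeping in the demo by passing a no-op sleep function.
    print(call_with_retry(make_flaky_service(2), sleep=lambda seconds: None))
```

[A retry like this is only safe when repeating the request is harmless; deciding that is part of what the application has to contribute to reliability.]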
MV: A lot of the conversation these days coming out of Intel and AMD talks about putting support for virtual computing in at the hardware level. I imagine over time there will be more and more intelligence built around virtual computing and at the chip level. What's your take on what they're up to and what impact that will have as far as accelerating this whole process?
RG: Well, part of it I think is a natural evolution of what we've seen historically in processor and system development, which is that as functions become ever more common, they tend to sediment to lower order in the technology stack and to get built into more fundamental layers. Earlier attempts at creating virtual notions such as the sort of virtual machines that operating systems exported, like UNIX or NT, which were providing not just virtualizations of the hardware, but of an idealized environment, were of course supported by things in the hardware, such as memory protection and other devices that made it more efficient to offer those functions. And the fact that people are now creating instances of virtualized machines in order to enhance utilization is another natural drive to cause acceleration support for that kind of functionality to appear in the chips. I actually think there's a bunch of other things that can begin to appear in chips, because one of the effects of moving the programming environment to the network is that the applications that people are writing are increasingly being expressed in things like Java or languages like it, where the instruction set that interprets the program is not usually the instruction set of a physical or even virtualized microprocessor. It's an abstract instruction set. And one of the features about that, although it's not clear that the hardware people have yet appreciated this as a feature, is that it relaxes the compatibility constraints that have existed for the last ten to twenty years around microprocessors, which effectively made it very hard to innovate in the functions that could be offered there, because to do so would disrupt the compatibility environment that made them sort of economically very useful.
Now that that boundary is moving off the hardware, there's a real opportunity for the hardware designers to go through another round of innovations to exploit the freedom they're getting by the fact that they can do things that are somewhat radically different than what they've historically done. In Intel's case I'm aware of at least one other effort that they've thought about doing, where, for instance, they're planning on putting support for management technologies into the chips. They've already started that with some of the desktop chips, and I believe they're going to be doing that with the servers in the next year or so. And that's another form of the sedimentation that's going on to deal with changes in function.
MV: So then it gives me the opportunity going forward to think more aggressively about how to deploy say 64-bit applications alongside 32-bit applications on this shared platform without having to have this massive disconnect that says, you know, well, everything I invested in beforehand is no longer going to be supported on this new platform.
RG: Right.
MV: What is your take on when will this come together in a way that it's easier to manage? Because one of the things people think about with large scale systems all the time is, "Well, I need 15 guys in lab coats to run that, and I don't have that capability." I guess the cherished goal of lights-out management may never come to be the end result, but somewhere in the middle there's got to be an ability to manage this stuff in a way that the average IT organization can get its head around.
RG: Well, I hope there's something to help your management, and hopefully products like the ones we're building will be part of that for a number of people. But, yeah, I think this is going to result in a fairly large change to the demographics of both the numbers of people that we currently think of as system and network administrators, as well as the kinds of tasks that they do. With all of this virtualization eruption that's going on, as well as the surrounding complexities of network computing, we're going to have to change the level of efficiency by which each person that participates in administration operates these systems, because otherwise the cost of the combinatorics if we continue to operate them the way we do today will simply limit their growth and success. So it's clear that we have to have computers doing much more of the management of these things, and frankly a lot of that's going to become menial, as well, so it's a task well-suited for computers. So in that sense, those efficiencies will help make it possible to operate and administrate these environments without the same level or same rate of acquisition of administrators. But I don't think it ever goes to zero: Instead, what happens is that the functions you ask the administrators that are left to do will become higher level and probably more related to the policy setting or business activities of the organization that's employing them. So rather than being someone who concerns themselves with backups or recabling systems in order to meet increases in demand or new functionality or something like that, the administrators of the future will engage in tasks of policy-setting that are more integrated into what it is they're doing. 
For instance, if they're part of a manufacturing organization, they'll be setting up the policies that say that at the end of a sales quarter they want the systems to be giving priorities to the sales functions so that they can book orders, and at the beginning of the next quarter they will give priorities to the financial people so that they can do the book-closing functions and so forth, and to operate at that higher level policy, rather than just the mechanics of how each of the pieces work. And that's partly because the way the pieces work today is very focused on the individual boxes, and in this world that we're proceeding to build, the individual boxes don't matter that much. It's really a function of the aggregate of them that delivers the service that you're caring about.
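[Editor's sketch: the quarter-end policy described above could be written as a small rule rather than as per-box configuration. The Python below is a minimal illustration under stated assumptions: quarters are calendar quarters, the two 14-day windows are invented for the example, and the names `priority_group`, `QUARTER_END_DAYS`, and `QUARTER_START_DAYS` are hypothetical.]

```python
from datetime import date

# Hypothetical policy windows; the day counts are illustrative assumptions.
QUARTER_END_DAYS = 14    # last N days of a quarter favor the sales functions
QUARTER_START_DAYS = 14  # first N days of a quarter favor the finance functions


def quarter_bounds(day):
    """Return (first day of day's calendar quarter, first day of the next one)."""
    first_month = 3 * ((day.month - 1) // 3) + 1
    start = date(day.year, first_month, 1)
    if first_month == 10:
        next_start = date(day.year + 1, 1, 1)
    else:
        next_start = date(day.year, first_month + 3, 1)
    return start, next_start


def priority_group(day):
    """Name the business function that gets resource priority on a given day."""
    start, next_start = quarter_bounds(day)
    if (next_start - day).days <= QUARTER_END_DAYS:
        return "sales"    # booking orders before the quarter closes
    if (day - start).days < QUARTER_START_DAYS:
        return "finance"  # closing the books on the previous quarter
    return "default"


if __name__ == "__main__":
    for day in (date(2006, 3, 25), date(2006, 4, 3), date(2006, 5, 15)):
        print(day, priority_group(day))
```

[The administrator edits the windows and group names; how the aggregate of servers honors the resulting priority is left to the management software.]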
MV: So do you think there'll be fewer people required to run a large scale IT environment, and maybe in general fewer people in IT because the machines and the systems are going to be that much smarter about managing themselves?
RG: I think there'll be fewer people doing those sorts of tasks. Whether that results in a total reduction in the number of people doing IT, I'm not sure, because some of those people, particularly the ones that accomplish this up-leveling of skills to be more policy and operations related, may get diffused into other parts of the organization. And in general, I think that's true of computing generally, that computing as a stand-alone subject is becoming less and less interesting as computing becomes embedded behind other disciplines. I think it's going to become incumbent on a lot of information technologists to not only be skilled at IT, but to become skilled in an application area that allows them to be useful in the context of how IT is used in an organization.
MV: Rob, thanks for being on the show today. That was pretty insightful stuff. And this is Mike Vizard signing off for the first Queuecast, and thanks for listening.
RG: Thank you.
Originally published in Queue vol. 4, no. 6