Large Scale Systems: Best Practices - Transcript

Transcript of interview with Jarod Jenson, Enron Online

This is a transcript of Large Scale Systems: Best Practices, our Queuecast interview with Jarod Jenson, formerly of Enron Online and now chief architect for Aeysis.

MICHAEL VIZARD: Hello, and welcome to this edition of the ACM Queuecast with your host, Mike Vizard. That's me. Today we're going to be talking about performance-tuning issues as they relate to really high-performance systems in general, and how everything we touch is getting more and more complex. Joining us today to discuss this issue is Jarod Jenson, who was formerly a technical architect with Enron Online, and today is the chief technical architect for Aeysis, a consulting firm that specializes in business-critical applications and related performance tuning. Jarod, what are the most common mistakes people make when they first start running into performance issues on really complex systems?

JAROD JENSON: Well, there are several of them, aside from just causing the problems themselves through sort of happenstance or intention. If I break the list down into a few elements, the number one problem would have to be that they never had a good test harness for doing performance analysis to begin with. So a lot of times they run into these problems in production, which reduces your ability to debug the problem. The test harness is the number one thing; that's how these problems get into production. Probably the second thing is everybody's going to deny it's their problem, right? "Hey, our app works when we tested it over here in development," or "The network looks good," or "The system looks good." You know, it's always that. Everybody up front is going to deny it, so it takes a little while just to get some ownership around the problem. Next would probably be that people don't use the tools they have at hand. You'll walk into a problem and say, "Okay, give me the data you've collected to date," and everybody will be like, "Well, we have the metric that shows it's slow," but they haven't collected the data from the tools that come with the OS, or from their own QA systems, or the metrics that come with the applications, or network sniffers or packet analyzers or anything. Probably the next thing, as they start getting through these others, is they fail to eliminate things that are complete improbabilities, so they're chasing this ever-widening universe of things, and they don't go through the list and say, "You know what? It can't be this particular thing over here." Or, conversely, they don't recognize the things that are probable causes of the performance problem. The last thing on the list, once you get through all of these, would be just the lack of patience. Doing this on a day-to-day basis, I've sat and stared at a sniffer for literally three hours -- not moving, no lunch, no food, no water, just literally staring at a sniffer -- and solved a problem. So there's got to be some patience associated with it, and the knowledge that you've got to get back in and dig into this data numerous times. That would probably be my top-five list, I guess.
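
[A test harness for performance work can be as small as a multithreaded driver that puts a code path under repeatable load and reports throughput. The Java sketch below is only an illustration of the idea, not a tool Jenson names; operationUnderTest() is a hypothetical stand-in for whatever is being measured.]

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class LoadHarness {
        // Hypothetical stand-in for the real code path being measured.
        static void operationUnderTest() { Math.log(System.nanoTime()); }

        public static void main(String[] args) throws InterruptedException {
            final int threads = 8;
            final int opsPerThread = 100_000;
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            CountDownLatch done = new CountDownLatch(threads);
            long start = System.nanoTime();
            for (int t = 0; t < threads; t++) {
                pool.execute(() -> {
                    for (int i = 0; i < opsPerThread; i++) operationUnderTest();
                    done.countDown();
                });
            }
            done.await();   // wait for every worker to finish its batch
            double secs = (System.nanoTime() - start) / 1e9;
            System.out.printf("%.0f ops/sec%n", threads * opsPerThread / secs);
            pool.shutdown();
        }
    }

[Even something this small, run before every release, gives you a baseline to compare against when production starts to slow down.]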

MV: It feels like complexity is the enemy of performance, and yet as computing continues to evolve, we see these higher levels of abstraction and more components, whether they're JVMs or VMware or whatever else is coming down the pike. There are more operating systems and more middleware sitting between the application code -- the business logic -- and the underlying systems. Are we forever doomed to having this increasing amount of complexity, or how will we ever get to the point where we can manage it?

JJ: Yes, I do think we are doomed to have this complexity for at least the foreseeable future -- at least while I'm doing this. And in fact, a funny anecdote: I was just at a conference recently, and one of the most senior technical people for the company hosting the conference stood up and said, "So everyone knows, we've added another layer of abstraction so that you don't have to worry about the operating system." My head lowered, and I'm like, "Okay, great." But it's true. As we add these layers of abstraction -- the hardware, the OS, the container, middleware, all these different pieces -- it does get significantly more difficult to debug the problem. The only hope we have is that people start adding very low probe-effect instrumentation, either within each layer or in some way that we can monitor externally: for instance, BEA's JRockit JVM. They've started adding quite a bit of instrumentation to see what's happening with garbage collection and object allocation, at a lot lower overhead than JVMPI or JVMTI. Coming in Sun's Mustang release, there are going to be DTrace probes natively in the JVM. These types of things will help us break into each one of those layers, because clearly we have good monitoring for networks and OSs and such. So as we can break into each of these layers and these middleware components, it gets a lot easier. But to be honest, it's not a terribly new problem. It's been around for a while. It just happened that before, there were third-party classes and libraries -- things like STL or whatever -- that were doing this. Now these layers are a little more well-defined. They're not just libraries that are loaded into the app. So it does get more complex, especially when we don't have the source code for them. But with instrumentation I definitely think it's a solvable problem. We'll be able to debug these things.
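
[As one concrete illustration of low-overhead, in-process visibility -- this uses the standard java.lang.management API, not JRockit's own interfaces -- a running JVM will report its collectors' activity without recompiling the application:]

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcReport {
        public static void main(String[] args) {
            // Each bean represents one collector (e.g., young and old generation).
            for (GarbageCollectorMXBean gc :
                    ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: %d collections, %d ms total%n",
                        gc.getName(), gc.getCollectionCount(),
                        gc.getCollectionTime());
            }
        }
    }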

MV: Well, how well is that instrumentation process going? Because you mentioned one company that seems to be doing it right and there's 9 million products, so it sounds like we've got a long way to go on that front.

JJ: The good news is that technology generally sort of follows the leader, right? And one of the technologies I use most right now is DTrace. Clearly the advent of DTrace has sparked other people into doing very similar things. We see that they're working on a port of DTrace for FreeBSD, and I think that what we're seeing in JRockit may be an artifact of the stuff that's coming in Mustang. So I think once we get these, and BEA makes the changes, then IBM's going to say, "Look, we're going to need this instrumentation as well." I think it's going to have a carryover effect. We're going to start seeing it in all these different components, because you don't want to be the odd man out. So I definitely think there's going to be this waterfall effect, and we're going to see this happening a lot more.

MV: Now, for those who don't know what DTrace is, why don't you describe it. And while you're at it, what are the other performance tuning tools that you kind of like out there?

JJ: DTrace -- dynamic tracing -- is a facility available in Solaris 10 that allows for instrumentation of the Solaris 10 OS itself and of applications. And this is done dynamically, so nothing has to be instrumented ahead of time; DTrace modifies the OS or the running application on the fly, so there's no recompiling, no relinking, no changes necessary. And we can use this to find out the interaction between applications, or within the application itself, or between the application and the OS, systemically, on a real-time basis, which is great. Like I said, DTrace is probably the thing I go to the most, obviously, especially if I'm on Solaris 10. As for other tools out there, there are the old standbys that people have used for years: truss on Solaris, strace on Linux. Another one that I like, especially for Linux, is OProfile, and VTune is similar; they take a statistical profile of where you're spending time in your code. Those are very good tools. And then one of the old-school ways -- for native code, not for managed code -- is LD_PRELOAD, where we create the instrumentation ourselves as an interposition library. You have to rerun the app for that, which is a little painful, so it's clearly not a production solution, but it's a good way to help solve the problem. Those are the things I lean on the most. And then there are the standard tools -- compilers, analyzers, debuggers, and things like that -- that we can use.

MV: So it sounds like we're getting to the point where we can actually tune the application in flight versus having to pull over to the proverbial pit stop and take the car apart.

JJ: Right. I mean, the number one design center for DTrace, for instance, is that it's usable in production. And this is great. Even before Solaris 10 was GA'd, a major financial customer said, "Our problem is big enough that we're willing to put Solaris 10 in production and use DTrace to solve it, because we cannot reproduce the volume that we have in production." And I'm glad to say we were able to find the problem. So that's where I want to see it go. First you have to have a test harness, because you want to try to prevent these problems from getting into production if at all possible. If they do make it to production, everyone knows that you don't go over to your user-acceptance environment or your staging environment and reproduce the problem, right? You reproduce the symptom -- because if you could reproduce the problem, you'd know what the problem is and you'd just solve it. So you reproduce the symptom and hope that's right. And clearly that's suboptimal. So the more tools that are available for us to debug these things in production, the faster we'll be able to solve these problems. And my goal personally is to make SAs, network administrators, and developers need to find a new job, because I know that performance issues are the ones that generally keep people there after hours.

MV: So that kind of sounds like what my wife says when I lose my keys, and I say, "Have you seen my keys?" And she says, "Well, where was the last place you saw them?" And I always go, "Well, if I knew that, I wouldn't have lost them, right?" So the performance thing is the same idea.

JJ: Exactly.

MV: In the old days -- or not-so-old days -- of client/server, everything fed into the database; you kind of knew where you stood. Today, with service-oriented architectures, you've got the database, the file system, the applications, and everything points to everything. It's just very distributed. Has the database become something of a congestion point, or is the performance problem bigger than the database these days? What challenges does SOA really represent?

JJ: Well, I can tell you, on every engagement I go on where there's a database involved in the application at all, the database has some impact on the performance. It's very rare that the database isn't a major component, because it's generally the checkpoint for every transaction, or the persistent data store, whatever it happens to be. At some point, everybody's got to go to it. So clearly it has a big impact on performance. In fact, so much so that one of the trends I've seen recently -- and I haven't really decided whether I agree with it or not -- is that a lot of people are trying to write their own databases. They're not using Oracle or Sybase or MySQL or Postgres or anything else; they're trying to write their own, which is, I guess, kind of a desperate measure. So clearly the database is still a point of contention if people are doing this. Although we've dispersed the computational piece of it, we make the transactions such that there's a shorter contention window in the database, so it becomes a latency issue more than a throughput issue. And from what I've seen, more people are comfortable dealing with throughput issues than latency issues. It's a lot easier to see, over the course of 60 seconds, what your throughput was and find some way to improve it. It's a lot harder, when somebody says, "We need sub-millisecond transactions," to find out where your problem is within that sub-millisecond. So what we're doing is moving the database from "Hey, we can't pull out 100,000 records quickly enough" to "We can't do this transaction in 500 microseconds." It's just changing the nature of where the contention is in the database.
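
[The measurement shift Jenson describes is easy to sketch: a 60-second throughput average can look healthy while the latency tail is awful, so you record every operation and look at percentiles. In this hypothetical Java sketch, doTransaction() stands in for the real database call.]

    import java.util.Arrays;

    public class LatencyProbe {
        // Hypothetical stand-in for one database transaction.
        static void doTransaction() { Math.sqrt(System.nanoTime()); }

        public static void main(String[] args) {
            long[] ns = new long[100_000];
            for (int i = 0; i < ns.length; i++) {
                long t0 = System.nanoTime();
                doTransaction();
                ns[i] = System.nanoTime() - t0;   // per-operation latency
            }
            Arrays.sort(ns);
            // Percentiles, not averages, expose sub-millisecond outliers.
            System.out.printf("p50=%dns p99=%dns max=%dns%n",
                    ns[ns.length / 2],
                    ns[(int) (ns.length * 0.99)],
                    ns[ns.length - 1]);
        }
    }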

MV: So given the nature of the transactions, they're not as synchronous as they used to be. There's more partners involved. There's more companies, more databases, more applications. How does that affect system performance when everything is essentially an asynchronous process?

JJ: Well, like I said, the number one thing it affects is really latency. Latency and scalability are the two big things I'm seeing as the trend now, because with these asynchronous events, when they come in, to be honest, it's how your application responds to them. For instance, one of the big things that has kind of been my Holy Grail to stamp out is, in the Java world, notifyAll, and in C/C++, pthread_cond_broadcast. Threads are now so cheap to create -- in C/C++ or Java or whatever it happens to be -- that people create tens, hundreds of threads, literally. And then when they get these asynchronous events, they do a notifyAll or a cond_broadcast to every single thread. So on four-way or even eight-way machines, you have 100 runnable threads simultaneously, and you have scheduling issues and everything else. So latency and scalability -- and that will actually turn into a negative-scalability problem -- become huge issues. The asynchronous nature of it is making people stop: things they could get away with before, when it was a throughput-oriented system, or when so much work was done in the database that your code accounted for 10 percent of the overall transaction time -- these bad practices they had before, they've had to go back and readdress. So like I say, right now in the performance world, latency is the big thing people are after. People tend to have a better grasp on throughput. The asynchronous nature of what's happening is just shifting the focus people have had for however many years.
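
[The notifyAll pattern is easy to reproduce. In this hypothetical Java hand-off queue, only consumers ever wait on the monitor, so notify(), which wakes exactly one thread, is enough; notifyAll() would make every blocked consumer runnable for a single item -- the stampede Jenson describes. (When producers and consumers wait on the same monitor, a single notify() can be unsafe, which is why notifyAll() becomes the reflex.)]

    import java.util.ArrayDeque;

    class HandoffQueue<T> {
        private final ArrayDeque<T> items = new ArrayDeque<>();

        public synchronized void put(T item) {
            items.addLast(item);
            // notifyAll() would wake every waiting consumer; all but one
            // would re-check the condition, find the queue empty again,
            // and go back to sleep -- pure scheduling overhead. One item
            // can satisfy only one waiter, so notify() suffices here.
            notify();
        }

        public synchronized T take() throws InterruptedException {
            while (items.isEmpty()) {
                wait();   // the loop guards against spurious wakeups
            }
            return items.removeFirst();
        }
    }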

MV: Now, in the old days, I would just throw hardware at the problem, right? So today we have 64-bit systems. We have dual-core and quad-core coming down the pike. So why can't I just throw hardware and more network bandwidth at the problem and rest easy?

JJ: You know, that sounds great, and wallet tuning is a wonderful thing, but it doesn't work. For instance, like I've said, that cond_broadcast or that notifyAll is actually going to give you negative scalability. The more processors you throw at it, especially on big SMP machines, the worse the performance you can actually get. You think, "Well, I have more cores to schedule on, so it's going to be better." But you have all these things like cache-line conflicts and TLB shootdowns, and all this wonderfulness that happens within the hardware that makes it SMP or multi-core that you don't see, especially on these NUMA-type machines. I mean, Opteron machines are NUMA by nature. Big SPARC boxes are NUMA by nature. So when we start doing this, behind the scenes a lot of work is going on at the hardware level that's killing us. So you can't just do that. There are times, though, when things like 64-bit computing can be a great thing. Let's say you have a financial application -- maybe a trading engine or something -- and now, in a 64-bit address space, I can cache all the events that happen so I don't have to go to the database. That saves you the look-up, the query to the database, and the contention on the database. And if you build that distributed system, you can let the local caches deal with it so you don't go back to the database. So it's a tradeoff, but bad coding practices generally get worse if you just throw hardware at them. I've been on a number of engagements where they're like, "We went from a 4-way to a 16-way, and performance just tanked." There's lock contention and all sorts of things that create these negative-scalability issues.
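
[A minimal sketch of the 64-bit caching upside, with invented names (TradeStore, Trade): once the working set fits in the address space, a hit is a local hash lookup, and only a miss pays the database round trip and its contention.]

    import java.util.concurrent.ConcurrentHashMap;

    interface TradeStore {
        Trade load(long id);   // assumed DAO that actually queries the database
    }

    class Trade { /* fields elided */ }

    class TradeCache {
        private final ConcurrentHashMap<Long, Trade> cache =
                new ConcurrentHashMap<>();
        private final TradeStore db;

        TradeCache(TradeStore db) { this.db = db; }

        Trade lookup(long id) {
            // A hit is a local in-memory lookup; only a miss touches the
            // database -- exactly the contention the big heap lets you avoid.
            return cache.computeIfAbsent(id, db::load);
        }
    }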

MV: And that goes back to that complexity issue. What about grid computing and on-demand computing? I mean, IBM's out there trying to tell people, "Well, we'll just have systems that automatically scale up and down on demand whenever you need it." That sounds like a nice pipe dream, but does it really solve the problem?

JJ: That is very common. Almost every company is trying to come up with a grid solution; you see them everywhere. The problem is mapping your business problem to the grid, right? That is where the complexity is -- and then coding it once you get there. If we take a transactional system -- the same example I used a second ago, a financial trading application -- you have to have a single point that guarantees the serialization of these transactions, so that a cancel doesn't get in before the trade it refers to, because then you'd miss the cancel and the trade would happen anyway. So at some point you have to lock, and you have to serialize these things. That application does not lend itself very well to a grid. And if you say, "Yeah, but we can split up the parsing of the message and all this," then you run into "Yeah, but what is my interconnect latency?" and all these other issues, and you've got to code for those. Then there are things like, say, the search that just found a new Mersenne prime. That application lends itself very well to being distributed on a grid. Hey, go for it -- the coding is not much more difficult. Each participant can easily take their unit of work and do it off in isolation, without interacting with the rest of the app. So it's more the business problem that determines whether the grid is going to help you or not. But the one thing it does, in 99 percent of the cases, is push the complexity from a third party -- your operating-system vendor, Red Hat or Sun or whoever it happens to be -- to you, because building a distributed application is a nontrivial exercise, and getting it right is an even bigger problem. I see a lot of people that build them, but getting it right is very, very complex. And I know that when I go and look at a grid application, those are by far the biggest brain-busters. One, they're difficult to debug -- almost impossible to debug. And two, there's the complexity of the code behind it. It's generally written by people who are a lot smarter than I am, and the problems are hidden in really esoteric pieces of the application that you would never have indicted if it wasn't on a grid.
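
[The serialization point he describes reduces to a single-consumer queue. In this hypothetical sketch, any number of threads submit orders, but one thread applies them in arrival order, so a cancel can never overtake the trade it refers to; this loop is exactly the piece that resists being spread across a grid.]

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    class Sequencer implements Runnable {
        private final BlockingQueue<String> orders = new LinkedBlockingQueue<>();

        // Any number of producer threads may submit; FIFO order is fixed here.
        void submit(String order) throws InterruptedException {
            orders.put(order);
        }

        // The single consumer thread is the serialization point: it applies
        // every order in arrival order, one at a time.
        public void run() {
            try {
                while (true) {
                    apply(orders.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();   // shut down cleanly
            }
        }

        private void apply(String order) { /* order-book update elided */ }
    }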

MV: Now in these scientific applications, a lot of people use fancy distributed parallel systems for specialized applications. I guess they're built from the ground up. Are we ever going to see that kind of stuff go mainstream, or is that always going to be a niche way of solving the problem?

JJ: I hope not. I mean, it really has to be a niche way of solving the problem, because if everybody goes out and builds their own custom system for solving their application, it's bad for the company in that the employees are in control at that point. I guess that's not bad for us, right? But the employees kind of get control because you've got such specialized knowledge. And then, two, if you have problems, there are no outside resources. I mean, I go to Google all the time to look up a problem. If it's a known problem and somebody else has solved it, why keep fighting it? And that's going to be more difficult with a custom system. So clearly, for certain apps -- missile guidance systems, say -- I hope those are specialized and somebody's taking care to make sure they solve just the problem at hand. I'm not trying to say that all computing should be truly commodity, but we have to try to use as much as we can, build scalable applications within the constraints of areas where people have general expertise, and only go custom in those specific areas where there's true competitive advantage for the business, or big savings, or lives at stake, right? To sit and try to create your own custom solution for every app you have at your business is going to doom somebody in the long run. Of course, it keeps me working, so I can't be too upset, right?

MV: Everybody talks about real time computing, and yet if we have all these performance challenges around these large scale systems, how do we ever get there?

JJ: Well, the term "real time" is incorrectly used in a number of cases. True real time is more about deterministic latency than anything. You can have a real-time system that has a deterministic latency of 20 seconds, but you have to be within 20 seconds -- 20.1 seconds means something really bad. The example I just used, missile guidance: that is a real-time system. Maybe they have 10 milliseconds to compensate, or whatever it happens to be, I don't know -- but anything above 10 milliseconds means somebody dies. Whereas if you're stock trading, and you see those companies that guarantee your trade happens in 60 seconds or less -- if it goes over to 61, fine, they don't get to charge you the $14.95. It's not like you blew up a hospital instead of a tank, or whatever. So there are different things. But in terms of what most people mean by real-time computing, which is very low latency, there's a lot of work left to be done, because, as I said before, a lot of people focused on throughput because low latency wasn't realistic with the hardware and the constraints we had before. So now, with low latency, people are going to have to change what they're thinking about -- and I think they can. I've looked at a number of systems where we've been able to reduce the latency down to almost the theoretical physical limits by removing all the fluff. And again, these are very specialized systems. I do think it's achievable, but people have to understand what's meant by real time -- you're never getting that latency entirely out of the picture.

MV: All right. Now, last question, and I want to touch on something you said earlier on. We see this all the time: the developers blame the network people, the network people blame the systems people, and the systems people blame the developers. How do you break that somewhat unvirtuous cycle and get people to focus on the problem instead of just pointing fingers?

JJ: Yeah, this is a major problem, and it's one of the big things I see. And I admit that I used to be a party to it as well. The number one thing is people really do need to take ownership. There are two good pieces to taking ownership of a problem. One, if it was your problem, you look good because you got to it and solved it faster. And two, if it wasn't your problem, at least you look like the guy who was trying to participate. You know, logic failures or anything that results in a core dump -- those are cases where it's generally very easy to tell who is at fault. If the system panics, okay, it's an OS issue. If the application core dumps, there's a 99 percent probability it's an application issue. So those are the easy ones. Performance is the one that keeps people sitting in conference rooms around round tables after 5 o'clock. So everybody should come together and literally say, "Okay, look. I'm going to take ownership of my piece. I'm going to go look at my piece with the tools I have at my disposal, and I'll come back and tell you what I find" -- not "I didn't find anything," but come back and say, "Here's what I saw. IO was sub-millisecond. Networking: we're seeing low latency, only 1,000 packets a second on a gig network, no problem." Whatever it happens to be, come back and give the metric so that other people understand it. And then take those other two positions and have at least enough information to be dangerous. Be able to ask the networking guy, "Okay, I don't want the throughput number -- what's your packet-per-second rate?" Because minimum-size, 64-byte packets are going to give you a very different throughput than full-size frames, so that throughput number may be irrelevant; it may be a packets-per-second issue. And on the SA side, be able to ask, "Where are we spending our time -- user or kernel? And what were we doing when we were in the kernel?" Because clearly, if we're in the kernel, we're not directly solving the business problem, although it may be necessary if we're doing reads, writes, whatever. But why were we in the kernel? Things like that -- just be able to ask the appropriate questions. It's at least enough to make somebody go, "Hmm. How much does he know? What isn't he telling me? I'd better go back and make sure I have that information." So it just spurs people along to have this information. But you're right, it's a huge problem. And in every case I've seen where people want to work together, the problems have been solved faster, bar none. So it does work, and it's not a sign of weakness -- or the old adage, you know, "My shit doesn't stink" kind of thing. If you go and try to solve it together, everybody will go home sooner, guaranteed. That's it.
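
[The packets-per-second point is worth a back-of-envelope check. On the wire, each Ethernet frame also carries roughly 20 bytes of preamble and inter-frame gap, so a gigabit link tops out near 1.5 million minimum-size (64-byte) packets per second but only about 81,000 full-size (1,518-byte) frames per second -- the same bits-per-second figure can hide wildly different packet rates. A small sketch of the arithmetic:]

    public class PacketRate {
        public static void main(String[] args) {
            long linkBitsPerSec = 1_000_000_000L;
            // 64-byte minimum and 1,518-byte maximum Ethernet frames,
            // plus ~20 bytes of preamble and inter-frame gap on the wire.
            for (int frameBytes : new int[] {64, 1518}) {
                long wireBits = (frameBytes + 20) * 8L;
                System.out.printf("%,5d-byte frames: %,d packets/sec%n",
                        frameBytes, linkBitsPerSec / wireBits);
            }
        }
    }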

MV: So it sounds like you need a group counseling session. Jarod, I'd like to thank you for being on the show today and sharing your thoughts, and we wish you the best of luck in your future endeavors. And hopefully most of the folks got a lot out of this. I know I did. And I suspect that this issue will continue to be with us for many years to come, especially as we move more and more business processes out over the web. So thanks again, Jarod, and have a good day.

JJ: Great. Thank you.


Originally published in Queue vol. 4, no. 8