
A Conversation with Jarod Jenson

Pinpointing performance problems

One of the industry’s go-to guys in performance improvement for business systems is Jarod Jenson, the chief systems architect for a consulting company he founded called Aeysis. He received a B.S. degree in computer science from Texas A&M University in 1995, then went to work for Baylor College of Medicine as a system administrator. From there he moved to Enron, where he played a major role in developing EnronOnline. After the collapse of Enron, Jenson worked briefly for UBS Warburg Energy before setting up his own consulting company. His focus since then has been on performance and scalability with applications at numerous companies where he has earned a reputation for quickly delivering substantial performance gains. Recently, Jarod made a splash at JavaOne: he ran a booth to which attendees could bring their Java applications, and he guaranteed application performance improvements in under an hour.

Leading the questioning of Jenson is Kirk McKusick, well known for his extensive work in the open source software community, primarily with FreeBSD, although he points out that he has worked with other systems, including Solaris. McKusick has a consultancy in Berkeley, where he also teaches and writes. He is a past president of the Usenix Association and is a member of the editorial board of ACM Queue.

KIRK McKUSICK One of the highlights of your career was developing EnronOnline. What were some of the scaling issues and bottlenecks that you had to deal with in the Enron system?

JAROD JENSON EnronOnline was a Web-based trading application. We had several hundred, even thousands of commodities that we would price in realtime, the same way that equities are priced. We were trying to push realtime pricing information out to clients who could do instantaneous transactions on them. People who are familiar with financial markets—the commodity markets—would recognize EnronOnline as sort of the same thing.

We had a lot of the same issues that the markets had trying to push out realtime data, not only within our local network but also to the customers, as quickly as we could globally, and trying to make sure that what every trader saw on the screen matched what every company in the world had on theirs. It would kill us if somebody called to say, “Hey, I’m seeing a price of $1.10, somebody else sees $1.08.” If we were off that much, we couldn’t have been successful, so being able to get that information out in a timely manner was a big deal.

Obviously, as with most applications, there’s a database that provides that persistent store for the application. It’s the checkpoint for every transaction. We had some issues back at the database, as most people would for an application of this nature. Internally, we had a number of systems that the traders used for their pricing and getting that information out and for dealing as trades came in.

We had a lot of the same scaling issues that most people would with an e-commerce application. When we first started building the application, even before we went live, we knew we were going to have performance and scalability issues. For example, no Web server at the time that we benchmarked could push out the pricing information as fast as we thought it would have to go out, because getting all of those price ticks involves basically just a request-response, request-response as quickly as you can.

A traditional financial application has dedicated networks, where you can guarantee it. Pushing over the Web, however, introduces different issues. We learned that this was going to be a bottleneck, so even before we went live with EnronOnline, we started a parallel development effort that basically started with a clean slate and began redesigning based on the information that we were learning from the first version of EnronOnline. We took just the people we needed to complete the development and deploy EnronOnline, and we took as many people as we could for that parallel development effort to completely redo EnronOnline.

That was probably the best decision we made overall, because we forced ourselves to start with a clean slate. This avoided dealing with historical code where people would say, “Hey, I don’t want to touch that.” We just took the use cases that we had defined, but did them in whatever manner we thought would be best based on the information we learned. If we had gone live and then just tried to go back and fix what was known to be bad code, then it wouldn’t have worked. Starting from scratch was by far the best decision.

KM Starting with a clean slate with an architecture that you knew you had some chance of being able to scale?

JJ Right. We knew how to implement this from a logic standpoint, but we then had to ask, how do we scale, how do we perform? That was the real focus of the second version of EnronOnline.

KM While at Enron you straddled between the developers and deployers and the system administrators. Bridging the gap between those groups is a huge problem in many organizations. Do you have any insight on how you made that work?

JJ It is probably one of the biggest issues that I see. Everywhere you go it’s almost as if those two groups are at odds. There’s a tremendous amount of finger pointing between them.

I went to each of the groups and tried to understand what they needed. I asked them to tell me about their applications and what would be the issues they were going to run into. Then I would go to the system administration side, where at that time I had a significantly stronger background, and try to apply what I had learned.

There were questions as simple as malloc(3C) versus mmap(2). If we talk about native code, the developers may understand how to use each one but they don’t understand what the impact is going to be on the system and how that’s going to interact from a systemic point of view. You try to help them understand. I would tell the developers, “Let’s find out how you can take advantage of or abuse the operating system.”

Instead of them having to learn how to do the application, understand the business logic, and then on top of that, try to pile on this other specific knowledge, I would try to give them help in that arena. I think that telling them, “OK, I know we’re going to have a huge memory footprint, so maybe we need to use large pages,” is a benefit to them because you’re providing information rather than just being reactive to a performance problem.

People ask simple questions, such as: Which is faster, memmove(3) or memcpy(3)? You can ask if they have overlapping regions, which they can generally answer very quickly, and then you can help them understand which is going to be better and where there are, for example, processor-specific routines that can make their lives better.

People just have to learn that we don’t have to be at odds if we get involved early in the development process, and if we really try to help each other understand what we’re doing and how we can help each one of those groups. A lot of these issues can get resolved early on. The problem is, when those issues come to light when you’re already in production or close to production, then it’s a firefight. Temperatures go up and people are frustrated, and at that point, it’s too late.

KM So it really comes down to spanning the various parts of the organization. Do you try to do that at the developer level, or are you carrying on these conversations with the architect for the developers and the head of the people doing production?

JJ What is always best, in my opinion, is to go to where the rubber meets the road. The developers are going to be the ones doing the implementation. Don’t try to force something down their throats and make them conform to a standard.

You have to have these standards, but you have to understand what their use case is going to be, what’s going to be important to the developers, and how you can help make that easier on them. I think you have to deal directly with the person who is writing the code. If you don’t do that, then there’s still going to be the potential for that old game, where you whisper in one person’s ear, and it goes around the circle and comes out completely different. It’s going to be hearsay, and it may not be exactly what they wanted. And if it’s not exactly what they wanted, they won’t use it and you’re at a loss again.

KM What tools are available for doing performance tuning that you find useful? Where are the problems, and what are the shortcomings?

JJ When I started doing performance work for EnronOnline, there weren’t any really good tools. There were tools that required either relinking your application or putting code in it.

The number-one thing I did when I was doing performance work before was create library preloads and use the library interpositioning to analyze some aspect of that application without modifying it. Clearly that works only with native code, not with managed code, so Java was almost undebuggable with the tools that existed.

JVMPI (Java Virtual Machine Profiler Interface) helps a lot, but with it enabled, the performance is just pretty bad generally. It can completely change what you’re trying to observe.

A lot of new tools have come out. DTrace is the one I lean on the most heavily because it provides unmatched observability into the system from kernel to application and across applications, and with Solaris containers, across containers. Another tool I’ve found helpful is Intel’s VTune, which provides basically a statistical sampling of where your application is spending the majority of its CPU time. I’ve used OProfile for Linux a few times. It is along the same lines as VTune and is tremendously helpful.

Outside of that, there are very few really strong tools for doing the analysis.

KM How do you go about isolating the performance problems then?

JJ I think that most failures in solving performance problems happen in the initial approach. If I go to a customer and ask that they characterize the performance problem or define what they are looking for—is it latency, is it throughput, what are your issues?—a lot of times even that simple question cannot be answered, and you have to describe what the difference is between a latency problem and a throughput problem.
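The latency-versus-throughput distinction can be made concrete with a small calculation. The sketch below is illustrative only: the class name, method names, and timestamps are hypothetical and not from the interview. It shows the same four one-second requests handled serially and then concurrently. Latency is identical in both cases, while throughput differs by a factor of four.

```java
/** Illustrative only: class and method names are hypothetical. */
final class LatencyVsThroughput {

    /** Average time a single request took, in milliseconds. */
    static double meanLatencyMillis(long[] startMs, long[] endMs) {
        long total = 0;
        for (int i = 0; i < startMs.length; i++) {
            total += endMs[i] - startMs[i];
        }
        return (double) total / startMs.length;
    }

    /** Requests completed per second across the whole observation window. */
    static double throughputPerSecond(long[] startMs, long[] endMs) {
        long first = Long.MAX_VALUE;
        long last = Long.MIN_VALUE;
        for (int i = 0; i < startMs.length; i++) {
            first = Math.min(first, startMs[i]);
            last = Math.max(last, endMs[i]);
        }
        return startMs.length * 1000.0 / (last - first);
    }

    public static void main(String[] args) {
        // Four requests handled one after another, one second each.
        long[] serialStart = {0, 1000, 2000, 3000};
        long[] serialEnd = {1000, 2000, 3000, 4000};
        // The same four requests handled at the same time.
        long[] parallelStart = {0, 0, 0, 0};
        long[] parallelEnd = {1000, 1000, 1000, 1000};

        System.out.println(meanLatencyMillis(serialStart, serialEnd));       // 1000.0
        System.out.println(throughputPerSecond(serialStart, serialEnd));     // 1.0
        System.out.println(meanLatencyMillis(parallelStart, parallelEnd));   // 1000.0
        System.out.println(throughputPerSecond(parallelStart, parallelEnd)); // 4.0
    }
}
```

A customer complaining that "each request takes too long" has a latency problem. One complaining that "we cannot handle enough requests" has a throughput problem. The two call for different investigations.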

In the case of Solaris, I run mpstat(1M) and get a view of whether we are user-land intensive—is there a tremendous amount of system time, context switches, and cross-calls? Mpstat provides a good systemic view of what’s happening on the box. Many times the customer believes it’s the operating system’s fault that the application is slow. You look at it and discover it’s spending 100 percent of its time in user land, with almost no system calls. There’s not a lot the OS is doing to hinder them. So the first thing that I do is figure out exactly where the potential problems are. Are we doing I/O, are we doing networking, etc.? Once you get that picture, then based upon the type of application that you are looking at, you start coming up with theories. If we are seeing a lot of system time, let’s take a look at the system calls—are they network-related or I/O-related? With DTrace, we can very easily examine all the system calls.

A couple of other OS-related tools we use are truss and strace. I guess those go without saying, though.

We use all these tools to find out exactly what we’re doing. The profile providers from DTrace, OProfile, and VTune are some of the best because with very, very low overhead to the system, you can see, in terms of our user-land time, that there is a lot of malloc or a lot of memcpy. Just that statistical representation will allow you to focus in on the code and determine where you’re having the potential problem.

Then you can start refining that hypothesis and digging deeper. Generally, once you characterize what’s happening on the system and make that initial hypothesis about where the performance problem is, it actually starts getting very easy with the tools that we have now.

KM Part of the problem often comes out when you have this giant tarball that’s Java or C++, and there is a lot of middleware in the system and there are interactions between the layers. The layers individually may be OK but they are somehow interacting in bad ways. How do you unwind that ball of string to figure out who’s responsible or where you can make changes to make things work better?

JJ As more and more layers of abstraction are added, the problem gets more and more difficult. For example, you take a statistical sampling and find you’re in your Java code only 4 percent of the time; the rest is in the application container and garbage collection. I was recently looking at a Java application that was allocating more than 200 megabytes of objects a second, and the majority of the work being done by the system was pure garbage collection.

If you don’t reduce the number of objects you’re allocating, there is no change you can make to your code that will benefit your performance.

The profile provider in this particular case let us see the garbage collection threads. Fortunately, the customer was using ParallelGC, one of the garbage collection options that you get with Sun’s JVM. The downside is ParallelGC binds a thread to each CPU to do garbage collection in parallel.

When this garbage collection happened, you had on this particular system 12 CPUs that were busy doing stop-the-world garbage collection as fast as they possibly could. Obviously they’re using a lot of CPU time, so the scheduler kicks them off the CPU at a less than optimal point—none of the application threads is runnable now, anyway. Being able to correlate between these different layers is a little easier now with these statistical tools, but it’s still a huge problem. If you’re using something that completely lives within the JVM and it can tell you all your method calls and how long you spend in them, but you get no understanding of that application container or that JVM or the OS around you and you can’t correlate those things, then you’re going to have a lot of trouble solving a performance problem.

You have to look at things systemically. If you don’t, you’re done. There’s almost no way to solve a problem without doing that. In fact, I would be willing to say that in the vast majority of cases where I look at things that are inside of an application container or Java code, the changes we make are things that were observed either in native code or in the operating system. The Java HotSpot compiler, for example, does a pretty good job of fixing even bad code. With the runtime information that it gets as the application is running, it can move things around and really help deal with those issues.

For example, if you ask someone if they have any native code, they may say no, but then you find out that their connection to the database is done via calls through native code through the classes they are using. You look inside that native code and see they are allocating memory, by default on Solaris. There can be scalability issues around the native malloc, and something as simple as dropping in libumem(3LIB), which gives them a scalable memory allocator, fixes the problem.

So they were looking at their Java application, not understanding why it’s slow. Well, that’s because it was out in native code, and that is a common problem.

KM What is the big problem in performance tuning? Is it tools, is it garbage collection, is it lack of knowledge between the groups?

JJ It’s a combination of all those. In the Java world, I’m glad there’s garbage collection, but the number-one thing people have to stop doing is allocating so many objects. One of the first things I do is look at object allocation, because most garbage collectors stop the world, and with applications that are especially latency-sensitive, that’s going to kill you.

When every thread gets forcibly stopped from doing what it’s doing while we do garbage collection, that’s going to be a problem. Different JVMs and JVM options can help mitigate the problem somewhat, but people are going to have to learn that allocating objects should be done as sparingly as possible, especially in hot code paths. People are going to have to think about that from a Java and native code perspective.
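One common source of hidden allocation in a hot path is autoboxing. The sketch below is a hypothetical illustration, not code from any application discussed here. It sums the same array twice, once through a boxed Long accumulator and once through a primitive long. The boxed version may create a new Long object on each iteration for values outside the small cached range, although a JIT compiler can sometimes optimize that away. The primitive version allocates nothing in the loop.

```java
/** Illustrative only: class and method names are hypothetical. */
final class HotPathAllocation {

    /** Boxed accumulator: each += unboxes, adds, and boxes the result again. */
    static long sumBoxed(int[] values) {
        Long total = 0L;
        for (int v : values) {
            total += v; // may allocate a new Long per iteration
        }
        return total;
    }

    /** Primitive accumulator: no object allocation inside the loop. */
    static long sumPrimitive(int[] values) {
        long total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] values = new int[1_000_000];
        for (int i = 0; i < values.length; i++) {
            values[i] = i;
        }
        // Both methods compute the same result; only the allocation behavior differs.
        System.out.println(sumBoxed(values) == sumPrimitive(values)); // true
    }
}
```

Whether the difference matters in a given application is something to measure, for example by watching allocation rates or garbage collection activity, rather than assume.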

What keeps me up at night now is this: in the Java space, notifyAll(). In the C++ space, it’s pthread_cond_broadcast(3C). Especially in Java, the creation of a thread is easy and it happens behind the scenes. A lot of the time you’ve got potentially hundreds of threads, and then somebody does notifyAll() to every thread—say, for the receipt of a single unit of work, a single network message—and all of a sudden, you’ve got 100 threads on a four-CPU machine that all wake up and say, “Gee, I want to run.” And so we fight for the locks associated with this. Once we get the locks, we realize we have no work, so we fight to get the lock again to get out. You end up with this massive number of context switches, and very little work getting done. I’ve seen many applications where I’ve just said drop those three letters—the word all—and performance goes up tremendously.

In one such application I saw, there were so many threads on a four-way machine that, literally, just changing notifyAll() to notify() resulted in a 170 percent performance gain.
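A minimal sketch of the pattern described above, assuming a simple unbounded queue where every waiting thread is a consumer and any one of them can handle any item. The WorkQueue class is hypothetical. Under those assumptions, one arriving item needs only one woken thread, so notify() suffices. If threads wait on the same monitor for different conditions, notify() can wake the wrong one and notifyAll() may still be needed.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical work queue; all waiters are interchangeable consumers. */
final class WorkQueue<T> {
    private final Queue<T> items = new ArrayDeque<>();

    /** Adds one unit of work and wakes a single waiting consumer. */
    public synchronized void put(T item) {
        items.add(item);
        // One item arrived, so one waiter is enough. notifyAll() here would
        // wake every waiting thread; all but one would find the queue empty
        // and go straight back to waiting.
        notify();
    }

    /** Blocks until an item is available, then removes and returns it. */
    public synchronized T take() throws InterruptedException {
        // The loop guards against spurious wakeups and against another
        // consumer having taken the item first.
        while (items.isEmpty()) {
            wait();
        }
        return items.remove();
    }

    public static void main(String[] args) throws InterruptedException {
        WorkQueue<String> queue = new WorkQueue<>();
        Thread consumer = new Thread(() -> {
            try {
                System.out.println("got " + queue.take());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();
        queue.put("one message");
        consumer.join();
    }
}
```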

Another problem is that people tend to want to log absolutely everything the application does. That means string creation, and they build these strings by concatenating strings together. So each time you get that concatenation, you get a new string object, and then they have to build a new one that can concatenate this next thing.
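As a rough illustration of the concatenation pattern, the sketch below builds the same log line two ways. The class and method names are hypothetical. The first method grows a String with repeated +=, which creates a new String (copying everything built so far) on each step. The second appends into a single StringBuilder and creates one String at the end.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: class and method names are hypothetical. */
final class LogLineBuilder {

    /** Repeated +=: every step produces a fresh String holding all prior text. */
    static String joinWithConcat(List<String> fields) {
        String line = "";
        for (String field : fields) {
            if (!line.isEmpty()) {
                line += " ";
            }
            line += field;
        }
        return line;
    }

    /** One StringBuilder: fields are appended into a single growable buffer. */
    static String joinWithBuilder(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (String field : fields) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(field);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("trade", "42", "filled");
        System.out.println(joinWithConcat(fields));  // trade 42 filled
        System.out.println(joinWithBuilder(fields)); // trade 42 filled
    }
}
```

A further saving, where the logging framework allows it, is to skip building the message entirely when the relevant log level is disabled.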

The funny thing is, a lot of people know that this is a bad practice, but they still do it. Sometimes, people feel they need to make the performance trade-off for debuggability. That’s fine, but occasionally the mechanism can make the issue even worse. One customer had rolled their own logging framework and they were writing to the file one byte at a time—I don’t even know how this gets implemented. From an OS perspective, you’re doing a system call each time and you have to trap into the kernel. Luckily, most operating systems will buffer that and perform a single physical I/O, but there is still a tremendous overhead. The other thing that I see in these threaded applications is synchronization problems.
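To illustrate the byte-at-a-time logging problem mentioned above, the sketch below counts how many write calls reach the underlying stream with and without a buffer. Everything here is hypothetical: CountingStream is an in-memory stand-in for a file, and its call count is only a proxy for the system calls a real file stream would make.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Illustrative only: class and method names are hypothetical. */
final class BufferedLogDemo {

    /** In-memory sink that counts the write calls it receives. */
    static final class CountingStream extends OutputStream {
        final ByteArrayOutputStream sink = new ByteArrayOutputStream();
        int calls = 0;

        @Override
        public void write(int b) {
            calls++;
            sink.write(b);
        }

        @Override
        public void write(byte[] b, int off, int len) {
            calls++;
            sink.write(b, off, len);
        }
    }

    /** Writes each byte directly to the sink; returns the number of sink calls. */
    static int writeByteAtATime(byte[] message) throws IOException {
        CountingStream raw = new CountingStream();
        for (byte b : message) {
            raw.write(b);
        }
        return raw.calls;
    }

    /** Same loop, but through a buffer that forwards bytes to the sink in bulk. */
    static int writeBuffered(byte[] message) throws IOException {
        CountingStream raw = new CountingStream();
        try (OutputStream out = new BufferedOutputStream(raw, 8192)) {
            for (byte b : message) {
                out.write(b);
            }
        } // closing flushes the buffered bytes to the sink
        return raw.calls;
    }

    public static void main(String[] args) throws IOException {
        byte[] message = "order 42 filled at 1.10\n".getBytes(StandardCharsets.UTF_8);
        System.out.println("unbuffered calls: " + writeByteAtATime(message));
        System.out.println("buffered calls:   " + writeBuffered(message));
    }
}
```

For a message smaller than the buffer, the unbuffered version makes one call per byte, while the buffered version forwards everything in a single call when the stream is closed.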

I generally experience my biggest wins in heavily threaded applications. In the past there haven’t been good tools to look at lock contention. Now, however, if you’re using Java, there are some JVMPI/JVMTI events that you can look at to find those contention points; and in native code, it’s very easy on Solaris using the plockstat(1M) command.

I would say that the majority of the fixes that we do are fewer than 10 lines of code—either manipulating it some way, removing it, or adding it—to get major performance gains.

KM Perhaps this naturally leads into Jarod’s “top mistakes” in improving performance. Number one clearly is that people are spending time in the wrong place because they have never checked to see where the application is actually spending all its time.

JJ Chances are your algorithm is good, so you should do some hunting elsewhere. Use the tools you have today, such as DTrace or VTune. Ask where the application is spending its time. If it is in the code, you’re back to where you started, but I would be willing to bet if you’re having massive performance problems, it’s not directly in your code or, more precisely, in what you think is your code. Something is happening that you don’t know about. Sometimes you may find that a C++ application is spending a lot of time in class foo. But the developer says, “What do you mean class foo? We ripped that code out years ago.” Sorry, but this one quick script clearly shows it’s still there.

When you pull out your data structures book from college, and you’re going to implement something, don’t always turn to the last chapter. Just because it was in the last chapter doesn’t mean it’s the best. I looked at one application, for example, where they were creating a memory cache so that they didn’t have to go to the database a lot. Brilliant idea. The problem is, they were storing transaction IDs that were all sequential in a red-black tree. Because the IDs are sequential, the tree is rebalancing about every other insert; also, 99.9 percent of the time, it’s only updated. The only time it’s going to be searched is if a cancel or a bust comes back from the market. For a successful trade, it’s almost never searched again. The performance on this application was just horrible. The read-write lock that was protecting the tree was being held for 37 seconds.

KM Yow!

JJ Just by changing that to a hash, performance rose substantially. So don’t always turn to that last chapter and assume that it’s the best. You need to look at the algorithm you’re using and ask if it fits the problem that you’re trying to solve.
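A minimal sketch of that kind of change, assuming a cache keyed by transaction ID that is only ever looked up by exact ID. The TradeCache class is hypothetical and is not the customer's code. In Java, TreeMap is a red-black tree and HashMap is a hash table, so the swap is a one-line change when nothing depends on key ordering. The sketch is single-threaded and leaves out the locking the original application needed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical cache of trades keyed by transaction ID. */
final class TradeCache {
    private final Map<Long, String> tradesById;

    /** The backing map decides the data structure: sorted tree or hash table. */
    TradeCache(Map<Long, String> backingMap) {
        this.tradesById = backingMap;
    }

    void record(long transactionId, String trade) {
        tradesById.put(transactionId, trade);
    }

    /** Exact-match lookup; no ordering of keys is required for this. */
    String lookup(long transactionId) {
        return tradesById.get(transactionId);
    }

    public static void main(String[] args) {
        TradeCache sorted = new TradeCache(new TreeMap<Long, String>());
        TradeCache hashed = new TradeCache(new HashMap<Long, String>());
        // Sequential IDs, as in the application described above.
        for (long id = 1; id <= 100_000; id++) {
            sorted.record(id, "trade-" + id);
            hashed.record(id, "trade-" + id);
        }
        // Both structures answer the rare lookup identically.
        System.out.println(sorted.lookup(42_000L).equals(hashed.lookup(42_000L))); // true
    }
}
```

The tree pays to keep keys sorted on every insert, which only helps if something later needs range or ordered access. When the workload is almost all inserts plus an occasional exact lookup, that ordering is work with no payoff.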

KM I’d like to bring the discussion back to teams and people. I’m curious how we can best make performance expertise scale within an organization. How do we clone 50 Jarods?

JJ That question comes up a lot, actually, for the most part because I do Solaris, Linux, Sybase, Oracle, and all these different systems. When I was at Baylor College of Medicine we had mostly free rein and we had all those technologies, and then at Enron, we had some pretty broad power to look at all these different technologies. I’ve been fortunate in that respect. At a lot of companies, people may not have that variety. They may be an Oracle shop, where you’ll never get to see Sybase. What I tell people is you don’t necessarily want a go-to guy. It’s great if you have one, but that’s going to be a rare occurrence. What you want are go-to people. So as I said at the beginning of this conversation, first understand where the problem is: Is this a latency issue? Is this a throughput issue? Are we CPU-intensive? Are we OS-intensive? Then go to the person with expertise in the appropriate area.

If you don’t want to be a kernel developer, you don’t have to be. If you’re the guy in the storage group who deals with the storage infrastructure, maybe all you need to do is know the I/O tools, and know them very well. If you’re the network guy, know the mib provider, and know it very well. If you’re the application guy, know the pid provider, and know it very well. If you think you have an OS issue, then go to the person who has that expertise—who lives it every day—for help. He may come back and say, yes, it was his problem and here’s the issue involved; or he may say he found another piece of data that now implicates the network, so then the problem goes over to the network guy.

You almost create a little SWAT team. On a SWAT team, there’s a sniper and there’s a guy who busts down the door, and then there are the guys who barrel through the door. You need your own SWAT team with the guy who can bust down the door and find the general problem and then pass it over to the next guy. Some companies are doing this, and it has been a huge success.

I’ll tell a customer that their problem, for example, is starting to look like an issue with the application container that they’re using, and the customer will go get the guy with expertise in that field. When we do an engagement like that, it goes so much faster. The expert comes over and we can solve these problems very, very quickly, but the number-one advice is, don’t point those fingers.

If you can’t identify what the problem is, then everybody gets involved and takes the stance that it is their problem.

KM Have you been successful at pulling together these SWAT teams within organizations?

JJ Yes, there are some organizations that I’ve done some training with and helped them understand this team concept. I can take a system and run an app on it, then go through the approach with the team. When you get to that portion that’s relevant to each person, you can see that person perk up and pay attention.

That’s good. People don’t have to know everything, if they know their piece well—and I’m starting to see some of that. If you know when you have to get involved and you know that there are tools that can help you—now that we finally have them—you should leverage them. I mean just absolutely leverage them to the full extent.

KM But it sounds like you still need that generalist to lead the SWAT team. You talk about particular people lighting up when it gets to their part, but somebody has got to be leading this thing.

JJ Yes, and to be honest, that generally is one of the easier parts. That person has to look at as many problems as possible and has to get involved in every performance issue, even if it’s not directly related to that person’s area. I’m by no means a strong Windows person, but if all the Windows desktops were having a performance problem, I got involved. Most of the time, I would sit silently. If I heard something that sounded plausible or if I could provide input, I would; but just hearing other people go through that troubleshooting process and hearing how they approach a problem only makes you stronger. Occasionally, some networking issues came up and I could provide input—had I not been there, they would have fought that problem longer. Everybody got a knowledge transfer from that.

KM It sounds like one of the things that you’re doing is transitioning organizations to look at the data instead of just talking about how the problem isn’t theirs. How do you teach a reverence for the data and have it trump the finger pointing?

JJ First, take ownership. If you say this is my problem, you’re going to want every piece of data that you can possibly find to help you solve this performance problem.

The second thing is, you have to believe. Computers are fancy and all, but at the core of it, there’s only so much they can do. You can’t be intimidated by it, so you know that the data has to be there. Hopefully, you now have a tool suite that allows you to collect a lot more data, so you are able to focus your hypothesis a lot faster.

I recently went to a customer who was having a perceived problem and all they gave me was a legal piece of paper that had numbers written on it. They had actually collected the data. Without ever touching a keyboard, we were able to find the problem.

You just have to believe that the solution is in that data, if you can collect all the data.

You can collect some data while you’re sitting there thinking. You may find something that helps you make progress toward solving the problem. Some people laugh at me because I’ll sit and watch mpstat output scroll by the screen for 10 minutes. Maybe I don’t know where to go or maybe I’m just hoping to find a pattern.

About eight years ago, we were having a network problem at Enron, so I sat in front of a sniffer. We would collect a minute’s worth of data and just go through it, page by page. I did this for about two hours, and I finally found a pattern. I think it ended up being a bug in the software, and the router was sending out corrupt packets, but it was sending them every n packets, and then the pattern would change. It was one of seven different ones that would happen after n packets, so, if you just looked for one minute or 10 minutes, you wouldn’t have seen the problem. But just by taking the exact same data and staring at it for longer, it finally materialized. The solution is there. Sometimes you may just need a little patience to stare at a screen for a while.

Originally published in Queue vol. 4, no. 1

© ACM, Inc. All Rights Reserved.