
You Don't Know Jack
about Application Performance

Knowing whether you're doomed to fail
is important when starting a project.

David Collier-Brown

You can't just measure a program's performance, Dave. You have to run a benchmark!

That's what I hear all the time when I ask about the performance of some program. Well, it's only sort of true. If you want to know at what load the program will bottleneck, how big a machine you need to do 100 requests, or how soon you need to upgrade the server, then you can measure it. You just have to do it in a different way than you expect.


How We All Usually Do It

If a program has to do a specific amount of work per second, you usually look at the amount of CPU it uses at a given load and try multiplying. If you're at five requests per second and 40 percent CPU and want to triple the load, you can credibly say that "15 requests would take 120 percent CPU. That won't work."

If you wanted to run 10 requests per second, however, that would be 80 percent CPU, and you wouldn't be so confident. Of course, you would want to do a benchmark and plot out the performance curve.
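If you like, that back-of-the-envelope multiplication can be sketched in a few lines of Python, using the numbers from this example. It bakes in the assumption that CPU scales linearly with load, which is exactly the assumption that fails near saturation:

```python
# The usual back-of-the-envelope estimate: assume CPU use scales
# linearly with load (the assumption that breaks down near saturation).
def projected_cpu(load, cpu_pct, target_load):
    """Linearly scale measured CPU percent to a target request rate."""
    return cpu_pct * target_load / load

print(projected_cpu(5, 40, 15))  # 120.0 percent: "that won't work"
print(projected_cpu(5, 40, 10))  # 80.0 percent: not so confident
```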

That's because you're using the wrong diagram.

Remember seeing the two diagrams (figures 1 and 2) in a textbook? The one shown in figure 1 was often labeled utilization or throughput, rose to some value, and then leveled off.

[Figure 1. The utilization/throughput curve]

That's likely what you're used to, and it's in the units you want to use, so it's the one you probably try to work with. You can see that 150 percent will be off the curve entirely, so that won't work, while 50 percent is on the linear part of the curve, so that will work.

The diagram in figure 2 doesn't get much attention, as it's in units that aren't used much, and it doesn't appear to offer anything, even though you can tell they're interrelated in some way.

[Figure 2. The response-time curve]


Capacity Planners Use the Second Graph

The graph in figure 2 is the response-time curve, often called the "hockey-stick curve," or just "_/". The only people who always use it are capacity planners and performance engineers. They use it because they can take a few measurements, plot some lines, and credibly estimate both curves.

They start with a time, not a percentage. They measure how long the program takes to do one operation—with the cache all warmed up and nothing else running—across a bunch of samples, each with exactly one operation happening.

Let's say you're requesting an image and when you measure, an individual request to your program takes 100 milliseconds—a tenth of a second. This is the service time, which is the main component of response time: the curve in figure 2 that's headed up and to the right. The point at which it turns upward is called the inflection point and is also the point at which the utilization or throughput curve in the first graph starts to flatten out.

At this point, you've probably caught one thing: the inflection point is the load you want to stay at 80 percent of, for best performance.

A capacity planner or performance engineer can estimate the second graph with lines in a spreadsheet, draw the curves using a queuing-network program, or do a formal experiment with a model-builder that instruments and then builds one or more queuing networks from the application. That's the information you need to know:

1. At what load will the program bottleneck?

2. How big a machine is needed to handle 100 requests?

3. How soon does the server need to be upgraded?


Start Simple

Let's say you have a spreadsheet and one data point—100 milliseconds at one request per second—and you want to work in requests per second. You want to answer the previous three questions.

Modeling problems like this with a queuing network has previously shown that you're drawing a hyperbola between two straight lines, and that the equations of the two lines are almost completely dictated by:

• The one data point you measured.

• A careful choice in the way you set up the other data points that you hope to plot.

Let's start with drawing the lines and see how much that alone reveals.

The first line is easy: It's the horizontal line that the "lower leg" of the hyperbola is going to start off close to. It starts at 100 milliseconds, and because it's horizontal, it stays there for all X values you want to use. This is the yellow horizontal line in figure 2.

The second line has a slope computed from 100 milliseconds and a point at which X is zero. The latter is at negative (1 second − 100 milliseconds) = -0.9 seconds. Those two related data points dictate the slanted yellow line in the graph: It starts at -0.9 and slopes upward at y = 0.1x − 0.9. It crosses the horizontal line at almost exactly 10 requests per second, which is the inflection point—not just in the lower curve but the upper one as well.
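If you want to check that arithmetic, both lines can be written out in a few lines of Python. This is just a sketch of the two lines described above, assuming the 100-millisecond service time measured at one request per second:

```python
# The two bounding lines from figure 2, built from one measurement:
# a 100 ms service time observed at one request per second.
service = 0.1            # seconds per request (measured)
sleep = 1.0 - service    # "sleep" time Z: 0.9 s

def lower_leg(x):
    """Horizontal line: best-case response time at any load x."""
    return service

def upper_leg(x):
    """Slanted line: y = 0.1x - 0.9."""
    return service * x - sleep

# The lines cross at the inflection point:
inflection = (sleep + service) / service
print(inflection)  # 10.0 requests per second
```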

Now you can answer some of the questions about performance.

1. When will it bottleneck? Between eight and 10 requests per second. The inflection point is at 10, and it's going to start slowing down a lot around eight requests per second.

2. How big a machine is needed for 100 requests? You just worked out a safe non-bottlenecking value of eight requests for a single CPU, so you need 100/8 = 12.5 CPUs.

3. When does the machine have to be upgraded? That depends on how fast your business grows.

Assume that the load on the machine grows exactly as fast as the business. In that case, you can sit down with senior management and get a growth number to use for capacity planning. It will probably be some percentage per year, so you will have to work out a compounding-increase graph, and then see how long it will be before you hit eight requests per second.
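That compounding calculation fits in one small loop. The 3-requests-per-second starting load and the 30 percent annual growth here are made-up numbers for illustration—get the real growth figure from management:

```python
# Years until the load crosses the 8 req/s safety limit under
# compound growth. The starting load (3 req/s) and growth rate
# (30 percent/year) are illustrative assumptions.
def years_until(limit, current, annual_growth):
    years = 0
    while current < limit:
        current *= 1 + annual_growth
        years += 1
    return years

print(years_until(8, 3, 0.30))  # 4 years at 30% annual growth
```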


Are We Done Yet?

At this point, you could stop: You know when the program will bottleneck, and you can estimate and budget based on that.

But why does this work? How do you get a nice curve like the one in figure 2? And how can you apply this to something large, such as the capacity planning for a machine or cluster?

The answers to these questions all involve queuing networks, so let's have a look at what is really happening. Hint: It involves queues building up.



Let's start by drawing a horizontal line, representing one second, and some blocks, each representing one request being served.

Request number one takes 100 milliseconds (a tenth of a second). So does request two. Request three, however, has a problem. Two is still running. Three arrives 310 milliseconds into the second but can't run for 90 milliseconds. That's the gray area, before the green service time. Three takes 190 milliseconds—100 running and 90 waiting. Not good!

The total response time for request three is 190 milliseconds, because it has to sit in a queue. Request four has the same problem, and five is even worse.

That explains the delay: If you have more work coming in than you can handle at the moment, something has to sit around in a queue, waiting for its turn to run.
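The timeline just described can be reproduced with a tiny simulation. The first three arrival times come from the story above (request 3 arrives at 310 milliseconds while request 2 is still running); the arrival times for requests 4 and 5 are my guesses:

```python
# One server with a 100 ms service time. Arrival times in ms follow
# the story above where given; requests 4 and 5 are illustrative.
service = 100
arrivals = [0, 300, 310, 380, 450]

free_at = 0      # when the server next goes idle
waits = []
for arrived in arrivals:
    start = max(arrived, free_at)  # sit in queue if the server is busy
    waits.append(start - arrived)  # the gray area in figure 3
    free_at = start + service

print(waits)  # [0, 0, 90, 120, 150]: request 3 waits 90 ms
```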


Where Did the Curve Come From?

The curve came from probability.

At one request per second, there is very little chance that two requests will show up at the same time. At 10 requests per second, it's very probable a bunch will show up at almost the same time, and the laggards will have to sit and wait in queue.

If you average a large enough sample, you will see a curve that starts out almost parallel to the x-axis and bends up to parallel the diagonal line.

There are queuing network solvers that will draw the curve: One of the best is Neil Gunther's PDQ,1 which I used to draw table 1 and figures 1 and 2. If you're interested in the science behind capacity planning, Gunther's books are must-haves.
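PDQ does this properly; if you just want the shape of the curve, the textbook open M/M/1 formula R = S/(1 − λS) bends the same way. To be clear, this is a simple approximation, not the queuing network PDQ builds:

```python
# The hockey-stick response-time shape from the open M/M/1 formula,
# R = S / (1 - lam * S), valid for lam < 1/S. An approximation,
# not PDQ's model, but it bends up the same way.
S = 0.1  # service time in seconds

def response_time(lam):
    """Mean response time at arrival rate lam, in requests per second."""
    return S / (1 - lam * S)

for lam in (1, 5, 8, 9.5):
    print(lam, round(response_time(lam), 3))
```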


A Short Detour onto the Sloping Line

Notice that I haven't yet explained the equation of the sloping line in the spreadsheet. That's because I wanted to draw the diagonal line of green boxes first.

Looking at just the green squares in figure 3, you'll notice that they make a diagonal line up and to the right. That represents the best the machine can do. There can be gaps if there isn't much load, as between requests one and two, but if the requests come in one after another, they make a line of squares, corner to corner.

[Figure 3. Requests as blocks along a one-second timeline]

It is in fact the same diagonal line as in figure 2. In figure 4, it is plotted in a spreadsheet, using the formula y = 0.1x − 0.9 (in black on white) where:

[Figure 4. The diagonal line plotted in a spreadsheet]

• The 0.1 is the slope of the line, which is equal to the service time in seconds. It makes the line go up and to the right at 45 degrees.

• The 0.9 is one second minus the service time. It makes the slanted line go above zero at just below 10 requests per second. I chose to work in requests per second to make the calculations easy, so 0.9 seconds is the amount of time between requests at one request per second. It's sometimes called the "sleep" time, Z.

Because you can't do two things at once (at least per core), the machine will bottleneck, jobs will end up in queue, and you can draw a line on the spreadsheet representing the delay that results when you give the machine too much work to do.


A Harder Example

Say you are asked to serve cover illustrations for a book-sales site, which takes 100 milliseconds for a standard-sized image. For this case, however, you need to complete in less than 150 milliseconds. If you are too slow, the book won't be displayed, the end customers won't buy it, and you would be in deep trouble for promising something you couldn't do.

To respond in an average of 150 milliseconds, you have to set a limit of something less than 12 requests per second. But how much less? Taking too many and being too slow on all of them will waste all your efforts.

That's where the queuing network shines: solving for response time at loads from low up through 80 or 90 percent. Using PDQ, you'll get a result like table 1: Look down the response column to see when you reach 0.15 seconds, and it's at a far smaller load than expected. You can take only five requests per second if you are to stay under 0.15 seconds per request.

[Table 1. Response time versus load, computed with PDQ]

Your customer reports show about 80 requests per second at the peak time of day, and you have a 20-core machine dedicated to this task. If you can handle no more than five requests per second per CPU, the machine will handle 100 requests per second.
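The sizing arithmetic, spelled out with the numbers from this example:

```python
# Sizing check for the book-cover service: 5 req/s per core keeps
# response under 150 ms; there are 20 cores and an 80 req/s peak.
import math

per_core = 5
cores = 20
peak = 80

capacity = cores * per_core                 # 100 req/s total
cores_needed = math.ceil(peak / per_core)   # 16 of the 20 cores
print(capacity, cores_needed)
```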

You're confident you can achieve 80 requests per second, so it's worth trying, even under these stringent conditions.


Isn't That a Lot of Work?

Well, no. I don't so much build spreadsheet and PDQ models as I use them.

For big jobs, I use Teamquest Predictor,2 a package that already knows how to collect data about multiple places in a server where queues can build up. It's the classic tool to ask, "How soon do I need to upgrade the server?"

With a modeler, I can collect data from a machine that isn't failing yet and predict what will happen if load increases by a certain percent each quarter. When I find a bottleneck, I can tell the model to pretend I've added CPU, memory, or disk, and see how much I need to fix the bottleneck.

Figure 5 illustrates simulated CPU queue delay building up and starting to bottleneck a small file server until a second CPU is added to the simulation late the next year, where it recovers. I/O queues then start to grow in year three, until more disk heads are added, in proportion to the CPU that has been added.

[Figure 5. Simulated CPU queue delay, before and after upgrades]

With a model like this, you can experiment with adding planned future loads to a machine and find out well ahead when you will have to budget for extra CPU and disk or memory.



You don't need to do a full-scale benchmark any time you have a performance or capacity planning problem. A simple measurement will provide the bottleneck point of your system: The example program will get significantly slower after eight requests per second per CPU.

That's often enough to tell you the most important thing: whether you're going to fail.

It won't tell you whether you'd definitely succeed, but knowing whether you're doomed to fail is really important when you're being asked to start a project. And you don't have to run a benchmark to find out.



1. Gunther, N. J. 2005. Analyzing Computer System Performance with Perl::PDQ. New York: Springer. (I particularly recommend Gunther's Guerrilla Capacity Planning.)

2. Teamquest Predictor, formerly Teamquest Model.


David Collier-Brown is an author and systems programmer, formerly with Sun Microsystems, who mostly does performance and capacity work from his home base in Toronto.

Copyright © 2023 held by owner/author. Publication rights licensed to ACM.


Originally published in Queue vol. 21, no. 2
