Comments

(newest first)

Bill Ryder | Sun, 08 Sep 2013 09:23:15 UTC

I like Howard L. Kaplan's explanation.

It's about 7-8 ms from the bottom to the top of the slope, which is in the right ballpark for a full disk rotation.

I remember the days of disk arrays where they would synchronise their spindle rotation to help deal with this.
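The 7-8 ms estimate above can be sanity-checked with a little arithmetic. This is only a sketch: the spindle speeds listed are common drive speeds chosen for illustration, not values taken from the article, and `rotation_period_ms` is a hypothetical helper name.

```python
def rotation_period_ms(rpm: float) -> float:
    """Time for one full platter rotation, in milliseconds."""
    return 60_000.0 / rpm


# Assumed spindle speeds; the thread does not say which drives were used.
for rpm in (5400, 7200, 10_000, 15_000):
    print(f"{rpm:>6} RPM -> {rotation_period_ms(rpm):.2f} ms per rotation")
```

A 7200 RPM drive comes out at roughly 8.3 ms per rotation, which is consistent with the "right ballpark" reading of the slope.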

Kellyn Pedersen | Tue, 19 Oct 2010 02:24:50 UTC

I'd be interested to see the cost correlation via heat mapping when Oracle is able to process in memory (PGA-specific hash/sorts) vs. on standard disk (temp) vs. on specialized disk such as FusionIO and others able to perform high-capacity reads/writes.

Mike Meehan | Tue, 19 Oct 2010 01:53:36 UTC

Is the latency heatmap graphing library available?  What are good tools for implementing this kind of visualization?

Howard L. Kaplan | Fri, 20 Aug 2010 21:06:33 UTC

I think I understand the behavior shown in figure 5. I think it's related to the beats caused by playing two similar-frequency sine waves or to the visual effects of Moire patterns.

The upward and downward slopes are always equal to each other, though the slopes themselves sometimes change. The test program writes to the two disks in strict alternation. If the rotational speeds of the two disks are slightly different, then the two disks' platter orientations will drift slowly with respect to each other. For a while, immediately after a write to disk 2, disk 1 will be in position to respond immediately. Over time, disk 1's position will become less and less optimal, leading to longer latencies, until it's suddenly just more than one full rotation behind the optimal point, in which case it can respond immediately again. As the latency to write to disk 1 increases, disk 2 becomes in a better position to respond immediately after each write to disk 1. Since each disk's latency is measured from the completion of the other disk's operation, we see the resulting "X" pattern. If the rotational speeds sometimes change slightly, that would cause changes in the slopes of the lines making up each "X".
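The drift described above can be reproduced with a toy model. This is a sketch under stated assumptions, not the test program from the article: each disk is reduced to a single target sector that passes under the head once per rotation, the two rotation periods (`T1_NS`, `T2_NS`) are made-up values that differ by about 0.1%, and all names are hypothetical. Times are integer nanoseconds to avoid floating-point rounding.

```python
# Assumed rotation periods: about 7200 RPM, with disk 2 roughly 0.1% slower.
T1_NS = 8_333_333
T2_NS = 8_341_666


def next_pass(t: int, period: int, phase: int = 0) -> int:
    """First time strictly after t at which the target sector is under the head."""
    k = (t - phase) // period + 1
    return phase + k * period


def simulate(period1: int, period2: int, n_pairs: int):
    """Write to disk 1 and disk 2 in strict alternation.

    Each write is issued when the other disk's write completes, and its
    latency is the wait until the target sector next comes around.
    """
    t = 0
    lat1, lat2 = [], []
    for _ in range(n_pairs):
        done = next_pass(t, period1)
        lat1.append(done - t)
        t = done
        done = next_pass(t, period2)
        lat2.append(done - t)
        t = done
    return lat1, lat2


lat1, lat2 = simulate(T1_NS, T2_NS, 2000)
for i in range(0, 2000, 100):
    print(f"pair {i:4d}: disk1 {lat1[i] / 1e6:6.3f} ms   disk2 {lat2[i] / 1e6:6.3f} ms")
```

In this model one disk's latency ramps up while the other's ramps down, and each wraps around after roughly one rotation period of accumulated drift, which is the crossing-lines shape the comment attributes to figure 5. Changing either period changes how fast the ramps climb, matching the remark about changing slopes.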

Steve | Fri, 02 Jul 2010 09:04:26 UTC

Interesting article.   Can you make the data available for analysis?

Brendan Gregg | Sun, 06 Jun 2010 00:17:25 UTC

Sorry about the image resolution; the PDF does look better. The patterns are still visible, but if you would like to look at the original screenshots, they are at http://blogs.sun.com/brendan/resource/analytics-5/figure1.png through figure9.png, and are linked here: http://blogs.sun.com/brendan/entry/visualizing_system_latency

Michael | Fri, 04 Jun 2010 20:09:13 UTC

@Ben I'm not sure what the problem is. I don't think anyone would fail to presume that they display time-series data, which they all do. As for the Y-axis, it obviously denotes a quantity related to the subject at hand; the scale is largely unimportant apart from being linear, and I doubt anyone would presume otherwise.

Graphs are for patterns, trends - anyone who takes measurements from graphs should be taken out back to get shot. 

The only reason you'd care about the actual time is if you had to map it to a particular event or modification, but those are already pre-marked by a vertical line. The only reason you'd care about the scale of the quantified data would be if you actually had to compare it to another system.

invisible | Thu, 03 Jun 2010 20:27:13 UTC

What tools were used to collect IO statistics and generate graphics?

Simon | Thu, 03 Jun 2010 12:34:18 UTC

Graphs look pretty, but what do they say? I have no clue what I am looking at, and the article does not help me either.

S80Admin | Thu, 03 Jun 2010 11:40:04 UTC

What a damn beautiful thing!!!

Nicolas Doye | Thu, 03 Jun 2010 09:40:17 UTC

For those complaining about the graphs: look at the PDF version of the article; they're a bit clearer. However, in general, the numbers are less important than the "lines" in the heat graphs.

NeoAngelic | Thu, 03 Jun 2010 04:01:51 UTC

Can you please link to some higher resolution versions of the charts? I have a hard time reading them, but they look very interesting. :)

Thanks,
Neo

David Collier-Brown | Thu, 03 Jun 2010 00:26:28 UTC

To Ben: I read it as an article about how heat maps (1) make it easier to deal with huge amounts of data, by hiding the numbers and showing the patterns, and (2) expose new and interesting patterns in the data. I really didn't expect a discussion of, for example, disk/SD latency.  --dave

Bob | Thu, 03 Jun 2010 00:06:22 UTC

#7 is pretty

Ben | Wed, 02 Jun 2010 22:21:22 UTC

For an article that is trying to demonstrate the benefits of 'heat map' diagrams, the illustrations are particularly useless. They are tiny, having been shrunk such that you can't easily read any of the text in them. And there is no simple key / explanation to show what it is you are looking at, and what the blobs mean.

Laboriously explaining, in gigantic paragraphs following each picture, what on earth each image is visualising and how to interpret the patterns, is a sure sign of failure. With suitably clear graphs, with readable axis descriptions and a key to the data, this article could be made so much more useful.

David Collier-Brown | Mon, 31 May 2010 16:47:20 UTC

A tiny niggle to start: I'm a capacity planner and concentrate on latency much as you do. Resource utilization is a poor second-best, but sometimes all that one has.

Something that may help identify what is happening with the lake and the pterodactyl is to break the latency into two parts, the latency proper and the transfer time. If we show the data transfer period separated from the initial delay, we can observe two *causally* different periods.

The first period is the time it takes the request to arrive, possibly sit in queue, be processed, and have all its prerequisites provided. On a disk this is mostly seek and rotation time, with a little processing. With a network app, it's mostly the processing time to figure out what the response is to be. In both cases, it contains the queuing time, so a sudden increase in this latency is a strong indication of having hit a knee and being forced to sit around in queue, twiddling one's thumbs while one waits for a chance to be processed.

The second period is the time transferring data, which tends to be more of a straight line with increasing load. Until it finally bottlenecks, of course, and independently inflects upwards.

When I describe this, I usually call the first period latency, the second transfer time and the sum of the two response time.

Bob Sneed and I have used this separation in the past, to diagnose the strange behavior of a network accelerator board (the initial latency was hugely variable, and completely unrelated to the amount of data transferred), and to identify the first and second bottlenecks in a system, the first from lack of processing power and the second from a limit on I/O bandwidth.

--dave
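
The split described in this comment (latency, transfer time, and their sum as response time) can be sketched as below. The `IoSample` class, its timestamp fields, and the sample values are hypothetical, chosen only to illustrate the terminology; they are not measurements from the article.

```python
from dataclasses import dataclass


@dataclass
class IoSample:
    issued_ms: float      # when the request was issued
    first_byte_ms: float  # when data transfer began
    done_ms: float        # when data transfer finished

    @property
    def latency_ms(self) -> float:
        """Initial delay: queueing, processing, seek and rotation."""
        return self.first_byte_ms - self.issued_ms

    @property
    def transfer_ms(self) -> float:
        """Time spent actually moving data."""
        return self.done_ms - self.first_byte_ms

    @property
    def response_ms(self) -> float:
        """Latency plus transfer time."""
        return self.latency_ms + self.transfer_ms


# Made-up sample: 6.5 ms of initial delay, then 1.5 ms of transfer.
sample = IoSample(issued_ms=0.0, first_byte_ms=6.5, done_ms=8.0)
print(sample.latency_ms, sample.transfer_ms, sample.response_ms)
```

Plotting `latency_ms` and `transfer_ms` as separate series is one way to see whether a jump in response time comes from queueing or from a bandwidth limit.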
