The Kollected Kode Vicious

Kode Vicious - @kode_vicious

Latency and Livelocks

A koder with attitude, KV answers your questions. Miss Manners he ain't.

Sometimes data just doesn't travel as fast as it should. Sometimes a program appears to be running fine, but is quietly failing behind the scenes. If you've experienced these problems, you may have struggled for a while and then become baffled and/or tired. Kode Vicious knows your frustration, and this month serves up some instructive words on how to deal with both of these annoying problems. Fatigued or mystified by other quandaries? E-mail your problem to [email protected].

Dear KV,

My company has a very large database with all of our customer information. The database is replicated to several locations around the world to improve performance locally, so that when customers in Asia want to look at their data, they don't have to wait for it to come from the United States, where my company is based.

A few months ago the company upgraded its software, which required all of the records in the customer database to be updated as well. When we tested the upgrade program it took only a few minutes to update a large number of records, but when we had to update the Asian customers, a process that had to run in Asia and that touched data in the U.S., the process began to take a lot longer. Since the company has a very fast network connecting the U.S. and the Asian offices, it's hard to understand how the distance could matter. There must be another reason for the time it is taking to run these programs.

Baffled with Bandwidth

Dear Baffled,

There is a big difference between having a big pipe and knowing how to use it. Two things matter in networking: bandwidth and latency. Unfortunately, most people think only of the former, not the latter. Latency is the time a message takes to get from point A to point B (e.g., a client and a server). If I were to guess, I would figure that you tested the conversion program on a local network, probably 100-Mbit Ethernet, where latencies are generally less than 1 ms. You then ran the program remotely across your very fast network and found that it ran much more slowly.

I bet it ran about 100 times slower. How did I pick 100?

Easy: the average round-trip time across the Pacific is about 100 ms. What you forgot was a very important constant, and then you forgot how networks work.

The very fast network you speak of was probably sold on a bits-per-second basis, so maybe you have a 1-Gbps link between Japan and the U.S., but that's a measure of bandwidth, not latency. It's the latency that matters in your case. Why? Because your conversion process very likely takes one record at a time, packs it up, makes a request to the server, and then waits for a response.

When the underlying network has very low latency, like a local network, this packing up of a single request and waiting for a response is barely noticed; but if you move to a higher-latency network, it crushes your system's performance under its iron heel.
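
To make the pattern concrete, here is a minimal sketch of that per-record loop. The send() helper and the 100-ms figure are stand-ins for illustration, not your actual code or your measured latency; the point is that the loop's running time is roughly records times round-trip time, no matter how fat the pipe is.

    import time

    RTT = 0.100  # assumed trans-Pacific round-trip time, in seconds

    def send(request):
        # Stand-in for the remote database server: every request costs
        # one full round trip, no matter how small the payload is.
        time.sleep(RTT)
        return request

    def convert_one_at_a_time(records):
        # One request per record, so the loop spends roughly
        # len(records) * RTT just waiting, regardless of bandwidth.
        for record in records:
            send(record)

    start = time.time()
    convert_one_at_a_time(range(50))
    print("%.1f seconds" % (time.time() - start))  # about 5 seconds for 50 records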

The thing you forgot is c, the speed of light. Let's say you were running the conversion process in Tokyo and it was talking to a database in California. It's about 5,000 miles from Tokyo to California, and the speed of light is 186,000 miles per second, so a beam of light should be able to make it from Tokyo to California in about 0.027 seconds. That's 27 ms for the absolute fastest one-way time between those two points, or a round trip of 54 ms. That's already some 50 times slower than your LAN.

Of course, packets don't travel point to point. They are stored and forwarded at various points along their journey—that's how the Internet works. Each waypoint (the technical term is router) introduces its own bit of delay to the packet's journey, until what we have is an average of 50 ms each way between Japan and California. Now you have a difference of a factor of 100 between your LAN and the real network. Reality bites, and in this case it bit you hard. If converting some number of records took five minutes locally, it's going to take 500 minutes (8 1/3 hours) remotely. If I were you, I would bring a book to work, or download a movie like the rest of your coworkers, before you start any more database conversions.
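
The arithmetic is worth writing down. Here is a back-of-the-envelope sketch using the rough numbers above; they are assumptions pulled from the letter, not measurements of your network.

    miles = 5000                # rough Tokyo-to-California distance
    c = 186000                  # speed of light, in miles per second
    print(round(2 * miles / c, 3))   # ~0.054 s: the best possible round trip

    lan_rtt_ms = 1              # ~1 ms round trip on the local network
    wan_rtt_ms = 100            # ~100 ms round trip across the Pacific, routers included
    local_minutes = 5           # the measured local conversion time

    # If the run is dominated by per-record round trips, it scales with the RTT.
    print(local_minutes * wan_rtt_ms // lan_rtt_ms)   # 500 minutes, about 8 1/3 hours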

There are ways to ameliorate these problems, though getting around the speed of light has eluded better minds for quite a while. The first is to run all the conversions locally; the second is to write the conversion program so that it batches requests. This makes more efficient use of the high-bandwidth network that your company has probably paid a small fortune to rent. If the database conversion on the server side can process a batch of records in less time than it takes to move all those records across the link, then you will gain some time; but if each record must be processed serially, then you're always going to suffer with slow performance over a long-distance link.
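
Here is a sketch of what batching might look like, assuming the server can accept a list of records in a single request (an invented interface, used only for illustration). N records now cost roughly N divided by the batch size in round trips, instead of N.

    import time

    RTT = 0.100  # assumed trans-Pacific round-trip time, in seconds

    def send(request):
        time.sleep(RTT)          # one round trip per request, batch or not
        return request

    def convert_batched(records, batch_size=500):
        # One request carries a whole batch, so the per-request latency
        # is paid once per batch rather than once per record.
        records = list(records)
        for i in range(0, len(records), batch_size):
            send(records[i:i + batch_size])

    start = time.time()
    convert_batched(range(50), batch_size=10)      # 5 round trips instead of 50
    print("%.1f seconds" % (time.time() - start))  # about 0.5 seconds under this model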

KV

Dear KV,

I've been trying to debug a problem in a network server. Under load the server seems to lock up, but when I look at the process status on the system, the state is always RUN, which means the process isn't blocked. The program is not in an infinite loop, because each time I attach a debugger, the code is in a different part of its processing loop, but there is never any output or progress. How can a program be running, not stuck in an infinite loop, and yet produce no output?

Dead Tired from Looking for a Deadlock

Dear Dead,

What you're looking at—actually what is staring you in the face—is a livelock, not a deadlock. Though most people learn about deadlock when they study computer science, they rarely learn about livelock until they're in one. Livelock is common in software where the general flow of the code is: read, process, write, read, process, write.

In a livelock situation the code is overwhelmed by the incoming load of work. Many systems—including, it would seem, your software—have a safety valve so that if overwhelmed, they can drop unfinished work. The problem is that the safety valve may not be good enough.

If a server drops a request because it runs out of time, memory, or other resources, then it will get stuck in a loop where it reads a request, tries to process it, and fails, then goes back and reads the next request, tries to process it, and fails. For all intents and purposes the code looks like it's running, and it is, but it's not succeeding at doing what it should. It's actually flailing and failing.

One way to overcome livelock is to buy more resources, such as faster processors or more memory. This "throw money at the problem" solution is intellectually cowardly and should be kept as a last resort. In many situations it's just not possible to get more raw hardware power since many companies already run their servers with the fastest processors and the maximum amount of memory. Usually companies do this because they have hired the wrong people to write their software (i.e., engineers who don't think about what resources their code is using—you know, idiots). For the nonidiots there are a couple of other ways to deal with livelock.

A common trick is to build code into the system that allows someone to tune the processing loop. The tuning is fairly simple: it says that the system will process only so many pieces of work per unit of time. Initially the knob is set to infinite, until the system winds up in livelock, and then the knob is turned down to just under the number of requests that caused the livelock. Yes, this means that the system will throw away requests for work, but it will still be able to do work, instead of being swamped and unable to do any work at all. Using this trick requires your software to know how many requests it's processing per unit of time, so you'll need some code to count that.

To implement a system in which the number of requests can be tuned, you need to design the code so that it can throw away requests as early as possible in the processing loop, perhaps even as far back as the read. If the system knows that it's overloaded, it should not waste additional resources trying to bring in any more work.

Often the problem in a livelock situation is related to processing time, in that there isn't enough of it. If you think that time is your problem, then you need to profile your software and see where it's wasting, er, spending its time. If the processing can be sped up, then obviously your code will be able to stay out of livelock longer.

Tuning software is a hard problem, and one that has been covered in Queue before (February 2006). Since I'm sure you've kept all your back issues, you should easily be able to go back and reference the excellent advice therein.

KV

Originally published in Queue vol. 6, no. 2

© ACM, Inc. All Rights Reserved.