I’m working on a network server that gets into the situation you called livelock in a previous response to a letter (Queue May/June 2008). Our problem is that our system has only a fixed amount of memory to receive network data, but the system is frequently overwhelmed and can’t make progress. When I ask our application engineers about how much data they expect, the only answer I get is “a lot,” which isn’t much help. How can I figure out how to size our systems appropriately?
Memory Not Unlimited
Wait, doesn’t your company roll out all your servers with a minimum of four or eight gig of RAM? Doesn’t everyone do that now? How can you be out of memory? I just do not understand; it is all too much for my little brain to comprehend.
Actually, it’s not too hard for me to comprehend, though there are days when I consider going into my favorite bar until the only thing I can comprehend is that the big bright ball in the sky means it’s time not to go home but to work. Avoid KV on days when he goes from the bar to work, trust me.
There are ways to handle people who don’t want to size their applications properly, but most of them are not allowed under the Geneva Conventions, even if you have a hall pass from an administration official. You can, of course, trick the application engineers. It turns out that these tricks are still allowed under international law.
My favorite legally usable trick is basic recording. Your systems surely have a way to record the overall usage of resources by the operating system and applications. Run the application in a lab with a small amount of memory—perhaps 512 megabytes—and see when it crashes. Double the memory, try again. Each time, record the usage pattern and see if something jumps out at you. Does the application use memory up slowly or quickly? Perhaps there are spikes as a result of certain conditions.
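Once you have a recording, "see if something jumps out at you" can be made a little more mechanical. The following is a minimal sketch, not anyone's official tool: it assumes you already have a list of memory readings in megabytes taken at equal intervals. The function name `summarize_usage` and the `spike_factor` threshold are made up for illustration.

```python
from statistics import median

def summarize_usage(samples_mb, spike_factor=1.5):
    """Summarize equally spaced memory readings (in MB).

    spike_factor is an arbitrary, hypothetical threshold: a reading counts
    as a spike when it exceeds spike_factor times the median reading.
    """
    n = len(samples_mb)
    if n < 2:
        raise ValueError("need at least two samples")

    # Least-squares slope: average growth in MB per sampling interval.
    # A steady positive slope suggests slow consumption (or a leak).
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mb) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mb))
    den = sum((i - mean_x) ** 2 for i in range(n))
    slope = num / den

    # Spikes: readings well above the typical (median) level.
    typical = median(samples_mb)
    spikes = [i for i, y in enumerate(samples_mb) if y > spike_factor * typical]

    return {
        "growth_mb_per_interval": slope,
        "peak_mb": max(samples_mb),
        "spike_indexes": spikes,
    }
```

Feed it the readings from each lab run (512 megabytes, then double, and so on) and compare the summaries. A run with a steady positive slope and no spikes is using memory up slowly, while a flat run with a few spike indexes points at particular conditions worth chasing.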
Those spikes might actually be bugs. See if they are, and if so, report them and get the engineers to fix them. Maybe the application just has a memory leak but it runs for so long with the usually available four or more gigabytes of RAM that it takes forever to show up. Memory leaks are still bugs in KV’s book, so report, fix, etc. If you don’t have a lab, then make sure to do the recording on the live system. Recording on a live system has its own problems, however; in particular, it can affect system performance, so make sure to sample at infrequent intervals to prevent the sampling from taking too much time from the application.
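For the live-system case, a recorder that samples at infrequent intervals might look something like the sketch below. It deliberately leaves the measurement itself to you: `read_usage` is a hypothetical callable you supply (whatever your operating system offers for reading memory in use), and the 60-second default interval is an assumption, not a recommendation.

```python
import csv
import time

def record_usage(read_usage, out, interval_s=60.0, count=10,
                 sleep=time.sleep, clock=time.time):
    """Write `count` timestamped readings to `out` as CSV.

    read_usage: hypothetical caller-supplied function returning one reading.
    interval_s: pause between samples; keep it long so the sampling itself
                takes little time away from the application.
    sleep/clock: injectable so the recorder can be exercised without waiting.
    """
    writer = csv.writer(out)
    writer.writerow(["timestamp", "usage"])
    for i in range(count):
        writer.writerow([clock(), read_usage()])
        # No need to wait after the final sample.
        if i < count - 1:
            sleep(interval_s)
```

The recorder does nothing between samples except sleep, so its cost is one call to `read_usage` and one line of output per interval. If even that shows up in your performance numbers, lengthen the interval.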
The same advice holds true for any other resource an application uses. CPU, interrupt load, input/output, disk space, and all the rest are amenable to measurement. The only way to correct a problem is to understand it, and the best way to understand it is measurement.
Yes, it would help if people actually planned and knew how much they were using in the way of system resources, but the fact is that sometimes that’s not the case, and you can’t just throw up your hands—or just throw up—and walk away, much as you might want to.
I’m working for a banking firm on some of its larger trading applications. By law we’re required to record a whole slew of data, much of which we never use or see again. On some occasions I have actually had to go and find this data, but each and every time it requires a bit of programming to do so. I and others on my team seem to write these throw-away programs on a quarterly basis. Before you ask, yes, we store these programs in a source-code control system. It’s not that we lose the code, it’s that the data we need changes and the original system design did not account for getting directly at all the data being recorded. What’s the right way to make sure we can always get to the data we need?
Query on Queries
Let’s get one thing straight. There are no “throw-away” programs. I don’t mean that in the “there are no stupid questions” kind of way (in fact, there are many stupid questions). What I mean is that if you’re writing code that you intend to throw away, then you are specifically wasting your time, and the time of your team.
To your original point of a system that records data that it doesn’t expose: well, that I just don’t get. How do the engineers at your company even know if they are recording the data correctly? If this is for regulatory purposes, then what will happen when some auditor comes around and says, “Have you been recording all records of type X?” and then follows up with, “Well, then please show them to me.” It would seem that your company is following the letter but not the spirit of the law, and that can lead only to trouble. KV’s simple rules of data collection: if you don’t need it, don’t keep it; if you do need it, keep it safe, and keep it accessible. Keeping data around because you were told to, but for no other purpose, is like collecting figurines. They give pleasure to only one or two slightly obsessive individuals and they are the first things to go in the trash after those individuals die.
KODE VICIOUS, known to mere mortals as George V. Neville-Neil, works on networking and operating system code for fun and profit. He also teaches courses on various subjects related to programming. His areas of interest are code spelunking, operating systems, and rewriting your bad code (OK, maybe not that last one). He earned his bachelor’s degree in computer science at Northeastern University in Boston, Massachusetts, and is a member of ACM, the Usenix Association, and IEEE. He is an avid bicyclist and traveler who currently resides in New York City.
Originally published in Queue vol. 6, no. 4
Have a question for Kode Vicious? E-mail him at firstname.lastname@example.org. If your question appears in his column, we'll send you a rare piece of authentic Queue memorabilia. We edit e-mails for style, length, and clarity.