I'm working on a networked system that has become very sensitive to timing issues. When the system was first developed the bandwidth requirements were well within the tolerance of off-the-shelf hardware and software, but in the past three years things have changed. The data stream has remained the same but now the system is being called on to react more quickly to events as they arrive. The system is written in C++ and runs on top of Linux. In a recent project meeting I suggested that the quickest route to decreasing latency was to move to a realtime version of Linux, since realtime operating systems are designed to provide the lowest-latency services to applications. Our code already runs on Linux, so it should be a no-brainer to switch to realtime Linux and reap the rewards of lower latency.
Of course, there are always the naysayers, and we have one such on the team. He claims that swapping the version of the operating system won't really improve performance and that what we really should do is measure where the system is slow and then change our own code to improve it. The problem is, we've done this several times, and each time we've gotten a small improvement but not enough to justify the investment of engineer time and effort. I really think he just wants to reimplement the system because that's more interesting than changing the underlying operating system. Clearly if you want lower latency and better response time you really ought to start at the lower levels and work up, shouldn't you?
Quicker off the Mark
Repeat after me, "There are no silver bullets." Now, write that 500 times on the blackboard, clean the erasers, and go home. Just because your code runs on Linux does not automatically mean that it will gain any improvement from running on top of a "realtime" version of Linux, for many reasons, which I'm now going to go through at some length. You have only yourself to blame.
First off, I do not understand how you came to the conclusion that because the underlying operating system might be a realtime system, your code would see an improvement. Does your code use any special mechanisms to talk to the operating system that might be improved by using an RTOS (realtime operating system)? If you're using standard facilities such as processes, threads, and—since you mention networking—the socket API, you're unlikely to gain any advantage by moving to an RTOS. People who write code for realtime systems use a mix of standard APIs and system-specific APIs, depending on which parts of their code require the realtime facilities. It would be nice if code were "write once, execute everywhere," but that's not even true across various versions of Linux outside of an RTOS environment.
What makes you think that moving to a more highly specialized environment such as an RTOS will be any different from moving between a 2.x kernel and a 2.x+n kernel? If anything, moving to an RTOS will be a huge time sink as bits of the code that your team hasn't looked at in months start to break because of the changes in underlying operating-system assumptions.
A second problem that you would face is the shifting definition of realtime that is now in force. Once upon a time, realtime had a fairly narrow definition. It meant a system that reacted to an input in a bounded amount of time. Companies built their own realtime operating systems internally for specific projects or bought commercial systems from vendors. Each RTOS was designed with the narrow goal of providing service to events in a bounded period. Specific algorithms, data structures, and tricks were used to achieve this goal. Everything in the system was designed around the idea of low-latency service, from the scheduler to the synchronization primitives, I/O subsystem, network stacks, memory management, and so on.
Unfortunately, the term realtime has been broadened, mostly by marketing people, to mean any system that's not a desktop system. Now realtime and embedded mean almost the same thing. This is misleading because even though a system may fit in only a few megabytes of memory (i.e., is embedded), that says nothing about its timing characteristics. In fact, these systems are often based on much larger systems, such as Linux, the BSDs, and even Windows, and have just had most of the "fat" cut out of them. Often the only piece of the system that has been changed is the scheduler, which though necessary, is not sufficient to transform a desktop or server OS into an RTOS.
Why not? The reason that you can't "just scale down" an OS to an RTOS is that there are hundreds of components that can trash the realtime requirements of a system. Device drivers are one example. What good is it to have a preemptive priority scheduler that makes sure that the highest-priority task runs to completion if any driver loaded into the system can hold a resource, preventing another task from running? Each and every component in an RTOS must be beholden to the same timing characteristics as the whole system, or the whole endeavor will fail. In an RTOS one bad apple does spoil the whole bunch. On switching to an RTOS you and your team would have to verify that every component you used, every driver, every kernel facility, adhered to the realtime qualities you are expecting—a nontrivial task even with an open source system where you can read the code.
The next thing to consider with your silver bullet is the wolves you will set loose by switching to a system where you now have to debug not only your application but also its interaction with a very different kind of operating system. If you think that finding timing bugs and race conditions is hard in your user-space code, it is orders of magnitude harder inside an RTOS. Are you prepared to sit with a logic analyzer hooked to your board to find where a timing bug lies? I've done this; it's not fun, and I try to avoid it at all costs. Besides, there is nothing worse than having to slap down the hardware guys around you who make snide remarks about software engineers with screwdrivers.
Although you mention that your team has worked on your system to the point where you think you've exhausted all the optimizations you can make, you don't say whether or not you've measured where the extra latency is coming from. It seems that you're simply assuming that it's the operating system that's slowing you down. It's quite possible that that's the case, but you had better be damned sure first, because you're going to be sliding down a slippery slope to suffering in changing the operating system, and you really don't want to hit bottom and find out that it actually was something else, such as the type of network card you were using or some problem that you had yet to consider in your code.
None of which is to say that an RTOS cannot benefit an application that requires lower-latency service, but before you swap out that rather big component, you had better have nailed every other thing that could go wrong, including the list I've just provided.
KODE VICIOUS, known to mere mortals as George V. Neville-Neil, works on networking and operating system code for fun and profit. He also teaches courses on various subjects related to programming. His areas of interest are code spelunking, operating systems, and rewriting your bad code (OK, maybe not that last one). He earned his bachelor's degree in computer science at Northeastern University in Boston, Massachusetts, and is a member of ACM, the Usenix Association, and IEEE. He is an avid bicyclist and traveler who currently lives in New York City.
Originally published in Queue vol. 6, no. 6.
Have a question for Kode Vicious? E-mail him at firstname.lastname@example.org. If your question appears in his column, we'll send you a rare piece of authentic Queue memorabilia. We edit e-mails for style, length, and clarity.