Dear KV,
I recently developed an unhealthy interest in learning how operating systems and systems software work because I had reached the end of an application debugging session that seemed to point to a bug not in the application but in the code that it was calling, which resided in the operating system. Luckily, the OS I am working with is open source, so I hoped to be able to continue debugging my problem, as I was told many years ago as an undergraduate that an operating system is just another program, albeit one with special powers. When I attempted to debug the problem, I found that, unlike the tools I am used to in application development, the ones used to debug an OS are primitive at best. In comparison to my IDE and its tooling, the tools I had on hand to continue debugging had more in common with stone knives and bear skins than with modern software. Since I know from your bio that you work on operating systems, I thought I’d write and ask, “Is that all there is?” Or perhaps the people who write operating systems are simply so much better at software that they do not feel they lack good tools for their work. I feel like the cobbler’s child who has no shoes.
Dear Cobb,
A venture capitalist once told me, “There is no money in tools.” Since this person was pretty smart at investing in companies, I was willing to take their word for it.
If you look at the software tooling landscape, you see that the majority of developers work with either open-source tools (LLVM and gcc for compilers, gdb for debugger, vi/vim or Emacs for editor); or tools from the recently reformed home of proprietary software, Microsoft, which has figured out that its Visual Studio Code system is a good way to sucker people into working with its platforms; or finally Apple, whose tools are meant only for its platform. In specialized markets, such as deeply embedded, military, and aerospace, there are proprietary tools that are often far worse than their open-source cousins, because the market for such tools is small but lucrative.
Let us first dispense with a myth you bring up in your letter, that those who write operating systems are somehow better developers than those who write applications or any other type of software. Writing code in a difficult environment—such as directly on top of hardware—can definitely improve your coding skills. It will certainly make you more careful because a failure in your code can have dire side effects like crashing the whole computer. Learning to be careful in this way makes you no more of a software genius than any other attempt to understand and extend a large corpus of software.
The difficulties of programming an operating system come from two major places: (1) hardware does not allow for certain types of illusions; and (2) there is a lack of good tooling, as you point out.
Many of the conveniences available for application programming exist because of software illusions created by the operating system on top of the hardware. Consider what happens when your application program hits a fault: It crashes, but it doesn’t crash anything else on the system, and it often leaves a record of what went wrong in the form of a core dump. The fact that the program can’t crash others on the system is due to the illusion, provided by the virtual memory system, that each program has all the memory to itself and cannot affect memory owned by other programs. An OS could act in this way, and, indeed, microkernel OS designs, which are common enough in research, can exploit this feature to make more of the code in the OS restartable. But this feature comes at a cost in terms of overhead that OS designers have been loath to pay, and so operating systems remain “one large program” that, when there is an error, dies.
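To make the isolation illusion concrete, here is a minimal sketch in C, assuming a POSIX system. A child process writes through a null pointer and dies, while the parent carries on and simply learns which signal ended the child. The function name run_faulting_child is hypothetical and exists only for illustration.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that deliberately writes through a null pointer, then
 * report how the child ended. Returns the number of the signal that
 * killed the child (typically SIGSEGV), 0 if the child exited
 * normally, or -1 if fork() or waitpid() failed.
 */
int run_faulting_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: the fault happens in its own private address space. */
        volatile int *bad = NULL;
        *bad = 42;
        _exit(0); /* normally not reached: the fault kills the child first */
    }

    /* Parent: untouched by the child's fault; it only observes the result. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

The same bad write inside a monolithic kernel has no parent process to report back to, which is the whole problem.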
Hardware limitations are not the major roadblock to better tooling for operating systems, since, after all, application writers are provided with plenty of conveniences using software alone. In fact, an operating system’s major purpose is to be a software library that aids in the writing of applications, since no one actually cares about the OS except those who work on it.
Systems could be built so that they were more amenable to good tooling, and better tooling could be built, if we wanted to pause long enough to think about what that might mean. In systems software, the pressure is always on to “just make the machine work!” This means hacking up hardware drivers and other bits to make the box work—not even work well—just work at all. People are so pleased that the OS works and that the applications don’t crash, that they never go back to consider whether the design of the system they are using is amenable to the application or the hardware. Making a system work doesn’t mean the design was the right design, just that you actually made the machine go without the magic smoke escaping.
On my more philosophical days (the ones where I’m drinking more heavily after a long debugging session) I think of OS software as being like the child in the Ursula Le Guin short story, “The Ones Who Walk Away from Omelas.” A child in the story is locked in a basement, barely fed, and suffering greatly, but the child’s existence ensures a happy life for the rest of the town. The reader is informed that if the child were ever let out and treated properly, everyone in the town would suffer. The child exists so that others can have happy lives, much as an operating system exists so that applications can have happy lives/runtimes.
Were we to step back and think about how to make systems software better, we might have principles to bring to the table when designing such systems. They might be something like this: All large pieces of software in an operating system should be designed to be (1) extended, (2) measured, and (3) debugged. These principles would relate to how tools interact with the system overall.
Extending a system is easiest to do when it is built around a set of well-known and well-documented patterns: I need a thing that looks like X, so I’ll make a Y that looks mostly like X but with changes. Often the only place that any such patterns exist is in the device drivers for an operating system, and even there, the hardware usually dictates the form, and the driver has to twist itself into knots to provide data in the right form and format for the rest of the system.
The computing industry has spent untold amounts of money trying to solve this problem for applications, from the original introduction of software libraries to decades of work with object-oriented languages and tooling. Very little of this kind of work has made it into a major operating system in the past 50 years. The code used to build operating systems has only the most primitive of data types (lists, hashes, the occasional tree), while the libraries used in applications are a veritable cornucopia of modern data types. The original argument against complex data types in the operating system was size, but this argument holds no water in a world where 1GB of RAM is the starting point for a watch or a phone, let alone a modern server.
The idea of extending something complex and intrinsic to the system, such as the scheduler or how memory is handled, is nearly anathema because the interfaces are poorly documented and brittle.
Therefore, the first principle to follow when designing systems software is extensibility. Every subsystem that makes up the operating system itself must be designed by default to be replaceable, with clearly documented APIs, unit tests, and all the other attributes demanded from application software.
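One way to picture a replaceable subsystem is a small, documented table of operations that the rest of the system calls through. The sketch below is not taken from any real kernel; struct task, struct sched_ops, and both policies are hypothetical names invented for illustration.

```c
#include <stddef.h>

struct task {
    int id;
    int priority; /* larger means more urgent */
};

/*
 * The documented contract every scheduler must satisfy.
 * pick_next returns the index of the task to run next,
 * or -1 if there are no tasks.
 */
struct sched_ops {
    const char *name;
    int (*pick_next)(const struct task *tasks, size_t ntasks);
};

/* Policy 1: always run the first task in the array. */
static int pick_first(const struct task *tasks, size_t ntasks)
{
    (void)tasks;
    return ntasks > 0 ? 0 : -1;
}

/* Policy 2: run the highest-priority task; earlier tasks win ties. */
static int pick_highest_priority(const struct task *tasks, size_t ntasks)
{
    int best = -1;
    for (size_t i = 0; i < ntasks; i++)
        if (best < 0 || tasks[i].priority > tasks[best].priority)
            best = (int)i;
    return best;
}

const struct sched_ops fifo_sched = { "fifo", pick_first };
const struct sched_ops prio_sched = { "priority", pick_highest_priority };

/* The rest of the system depends only on the interface, never on a policy. */
int schedule(const struct sched_ops *ops, const struct task *tasks, size_t ntasks)
{
    return ops->pick_next(tasks, ntasks);
}
```

Because the contract is a handful of lines, it is also trivially unit-testable: swap fifo_sched for prio_sched and the caller never changes.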
The second principle, measurement of software, has improved over the past 20 years with systems such as DTrace and its child, bpftrace, now available for both application and systems code, but most of the software being measured was never designed for measurement. Current measurement tools were created long after the software they were meant to measure, twisting themselves into knots to unscramble the underlying system and providing a useful, if primitive, method for looking at what the system is doing. A system designed for measurement would already have built-in trace points that call out important transitions in system state so that the tooling (or worse yet, your humble programmer) doesn’t have to hunt around trying to figure out what’s going on in the system. Most software, not just operating systems, is created without an idea of measurement, which is brought in only later when people say, “The system is slow,” a bug report that is both common and infuriating.
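As a rough illustration of what built-in trace points might look like, the following C sketch records each important state transition in a fixed-size ring buffer. It is not how DTrace or bpftrace work internally; the event names, the handle_packet subsystem, and the 1500-byte limit are all assumptions made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Transitions this hypothetical subsystem promises to announce. */
enum trace_event { TRACE_PKT_RX = 1, TRACE_PKT_DROP, TRACE_PKT_DELIVER };

struct trace_record {
    uint64_t seq;           /* global order of the event */
    enum trace_event event; /* which transition happened */
    uint32_t arg;           /* event-specific detail, here a length */
};

#define TRACE_RING_SIZE 64
static struct trace_record trace_ring[TRACE_RING_SIZE];
static uint64_t trace_seq;

/* Record one event, overwriting the oldest record once the ring is full. */
static void trace_point(enum trace_event event, uint32_t arg)
{
    struct trace_record *r = &trace_ring[trace_seq % TRACE_RING_SIZE];
    r->seq = trace_seq++;
    r->event = event;
    r->arg = arg;
}

/* The subsystem announces every state change as it happens. */
int handle_packet(uint32_t len)
{
    trace_point(TRACE_PKT_RX, len);
    if (len == 0 || len > 1500) {
        trace_point(TRACE_PKT_DROP, len);
        return -1;
    }
    trace_point(TRACE_PKT_DELIVER, len);
    return 0;
}

/* Count the records of one event type currently held in the ring. */
size_t trace_count(enum trace_event event)
{
    size_t n = 0;
    uint64_t live = trace_seq < TRACE_RING_SIZE ? trace_seq : TRACE_RING_SIZE;
    for (uint64_t i = 0; i < live; i++)
        if (trace_ring[i].event == event)
            n++;
    return n;
}
```

With the transitions named up front, a tool can answer “how many packets were dropped?” by reading records rather than reverse engineering the code path.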
Lastly, but not leastly, we come to the tool you probably needed, and the reason you wrote in the first place: the debugger. An awful lot goes into making it possible to debug applications, not the least of which is OS support for debugging via special system calls. OS designers know that not being able to debug an application is a nonstarter, but for some reason, they still often think it’s perfectly fine to debug the operating system itself with print statements (printk() or printf(); take your pick).
When bringing up a system on new hardware, maybe “this is all you have,” but a properly designed system would start with debugging hooks, not with something as complex as a variable argument call into a complex console output system. In fact, all that’s required to do something smart with a debugger is a small monitor program that exposes direct memory reads and writes (gdb supports this via gdb stubs). Generally, though, when someone brings up systems software on new hardware, the race is on to make printf() work, because that’s familiar, and humans seem to love the familiar, even when it leads to poorer outcomes.
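To show how little such a monitor needs, here is a sketch in C of a command handler that only reads and writes memory. It is loosely modeled on the m (read) and M (write) packets of gdb’s remote serial protocol, but it is not a working gdb stub: it omits the $...#checksum packet framing and everything else a real stub handles. As a further assumption, it operates on a small array, target_mem, standing in for real memory so that the example is safe to run. Note that it needs no printf() at all.

```c
#include <stddef.h>
#include <string.h>

#define TARGET_MEM_SIZE 256
static unsigned char target_mem[TARGET_MEM_SIZE]; /* stand-in for real memory */

static const char hex_digits[] = "0123456789abcdef";

static int hex_val(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parse up to six hex digits at *p and advance *p; -1 if there are none. */
static long parse_hex(const char **p)
{
    long v = 0;
    int digits = 0, d;
    while (digits < 6 && (d = hex_val(**p)) >= 0) {
        v = v * 16 + d;
        (*p)++;
        digits++;
    }
    return digits ? v : -1;
}

/* Copy a short reply into out, always leaving it NUL-terminated. */
static void reply(char *out, size_t outlen, const char *msg)
{
    if (outlen == 0)
        return;
    strncpy(out, msg, outlen - 1);
    out[outlen - 1] = '\0';
}

/*
 * Handle one command and write the reply into out:
 *   "m<addr>,<len>"        reply is the bytes as hex digits
 *   "M<addr>,<len>:<hex>"  store the bytes, reply is "OK"
 * Anything malformed or out of range gets the reply "E01".
 */
void monitor_handle(const char *cmd, char *out, size_t outlen)
{
    const char *p;
    long addr, len;

    if (cmd[0] != 'm' && cmd[0] != 'M')
        goto error;
    p = cmd + 1;
    addr = parse_hex(&p);
    if (addr < 0 || *p++ != ',')
        goto error;
    len = parse_hex(&p);
    if (len < 0 || addr > TARGET_MEM_SIZE || len > TARGET_MEM_SIZE - addr)
        goto error;

    if (cmd[0] == 'm') {
        if (*p != '\0' || (size_t)len * 2 + 1 > outlen)
            goto error;
        for (long i = 0; i < len; i++) {
            unsigned char b = target_mem[addr + i];
            out[2 * i] = hex_digits[b >> 4];
            out[2 * i + 1] = hex_digits[b & 0xf];
        }
        out[2 * len] = '\0';
        return;
    }

    /* 'M': validate the whole payload before touching memory. */
    if (*p++ != ':' || strlen(p) != (size_t)len * 2)
        goto error;
    for (long i = 0; i < 2 * len; i++)
        if (hex_val(p[i]) < 0)
            goto error;
    for (long i = 0; i < len; i++)
        target_mem[addr + i] =
            (unsigned char)(hex_val(p[2 * i]) * 16 + hex_val(p[2 * i + 1]));
    reply(out, outlen, "OK");
    return;

error:
    reply(out, outlen, "E01");
}
```

On real hardware the array would be replaced by raw pointer accesses and the strings would arrive over a serial line, but the core of the monitor stays about this small.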
If systems were designed with these questions in mind (How do I extend this? How do I measure this? How do I debug this?), it would also be easier to build better tools. The tooling would have something to hang its hat on, rather than guessing what might be the meaning of some random bytes in memory. Is that a buffer? Is it an important buffer? Who knows, it’s all memory!
Someday I hope to have tooling that is as good for systems software as what exists for applications, but first we will all have to walk away from Omelas.
KV
George V. Neville-Neil works on networking and operating-system code for fun and profit. He also teaches courses on various subjects related to programming. His areas of interest are computer security, operating systems, networking, time protocols, and the care and feeding of large code bases. He is the author of The Kollected Kode Vicious and co-author with Marshall Kirk McKusick and Robert N. M. Watson of The Design and Implementation of the FreeBSD Operating System. For nearly 20 years, he has been the columnist better known as Kode Vicious. Since 2014, he has been an industrial visitor at the University of Cambridge, where he is involved in several projects relating to computer security. He earned his bachelor’s degree in computer science at Northeastern University in Boston, Massachusetts, and is a member of ACM, the Usenix Association, and IEEE. His software not only runs on Earth, but also has been deployed as part of VxWorks in NASA’s missions to Mars. He is an avid bicyclist and traveler who currently lives in New York City.
Copyright © 2023 held by owner/author. Publication rights licensed to ACM.
Originally published in Queue vol. 21, no. 3