Another Day, Another Bug:
We asked our readers which tools they use to squash bugs. Here’s what they said.
As part of this issue on programmer tools, we at Queue decided to conduct an informal Web poll on the topic of debugging. We asked you to tell us about the tools that you use and how you use them. We also collected stories about those hard-to-track-down bugs that sometimes make us think of taking up another profession.
Debugging Devices:
What is the proper way to debug malfunctioning hardware?
I suggest taking a very sharp knife and cutting the board traces at random until the thing either works, or smells funny! I gather you’re not asking the same question that led me to use the word changeineer in another column. I figure you have a piece of hardware that really does malfunction, and that you’ve already sent three previous versions back to the manufacturer, complete with nasty letters containing veiled references to legal action should they continue to send you broken products.
Enhanced Debugging with Traces:
An essential technique used in emulator development is a useful addition to any programmer’s toolbox.
Creating an emulator to run old programs is a difficult task. You need a thorough understanding of the target hardware and the correct functioning of the original programs that the emulator is to execute. In addition to being functionally correct, the emulator must hit a performance target of running the programs at their original realtime speed. Reaching these goals inevitably requires a considerable amount of debugging. The bugs are often subtle errors in the emulator itself but could also be a misunderstanding of the target hardware or an actual known bug in the original program. (It is also possible the binary data for the original program has become subtly corrupted or is not the version expected.)
A Paucity of Ports:
Debugging an ephemeral problem
I’ve been debugging a network problem in what should be a simple piece of network code. We have a small server process that listens for commands from all the other systems in our data center and then farms the commands out to other servers to be run. For each command issued, the client sets up a new TCP connection, sends the command, and then closes the connection after our server acknowledges the command.
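A minimal sketch of the exchange the letter describes may help make the setup concrete. It assumes a hypothetical newline-delimited protocol in which the server answers each command with `ACK`; the names `CommandHandler`, `start_server`, and `send_command` are invented for illustration. As in the letter, the client opens a fresh TCP connection for every command, and `send_command` returns the local (ephemeral) port the operating system assigned to that connection so it can be logged.

```python
import socket
import socketserver
import threading

# Assumed protocol (hypothetical): one newline-terminated command per
# connection; the server answers "ACK\n" before the client hangs up.


class CommandHandler(socketserver.StreamRequestHandler):
    def handle(self):
        command = self.rfile.readline().strip().decode()
        # A real server would farm the command out to another machine here.
        self.server.received.append(command)
        self.wfile.write(b"ACK\n")


def start_server(host="127.0.0.1"):
    """Start the toy command server on an OS-chosen port in a background thread."""
    server = socketserver.ThreadingTCPServer((host, 0), CommandHandler)
    server.received = []  # commands seen, in arrival order
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server, thread


def send_command(host, port, command):
    """Open a new TCP connection, send one command, wait for the ack, close.

    Returns the reply and the client's local (ephemeral) port, which the
    operating system assigns when the connection is made.
    """
    with socket.create_connection((host, port), timeout=5) as sock, \
            sock.makefile("rb") as reader:
        local_port = sock.getsockname()[1]
        sock.sendall(command.encode() + b"\n")
        reply = reader.readline().strip().decode()
    return reply, local_port


if __name__ == "__main__":
    server, _ = start_server()
    host, port = server.server_address
    for i in range(3):
        print(send_command(host, port, f"job-{i}"))
    server.shutdown()
    server.server_close()
```

Every command in this pattern costs a full connection setup and teardown, and a new local port, which is worth keeping in mind when the same client issues commands at a high rate.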
Debugging on Live Systems:
It’s more of a social than a technical problem.
I’ve been trying to debug a problem on a system at work, but the control freaks who run our production systems don’t want to give me access to the systems on which the bug always occurs. I haven’t been able to reproduce the problem in the test environment on my desktop, but every day the bug happens on several production systems.
Wanton Acts of Debuggery:
Keep your debug messages clear, useful, and not annoying.
Dear KV, Why is it that people who add logging to their programs lack the creativity to differentiate their log messages? If they all say the same thing—for example, DEBUG—it’s hard to tell what is going on, or even why the previous programmer added these statements in the first place.
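A short sketch of the alternative the letter is asking for: log messages that say where they come from, what is happening, and with which values, so each one can be found with grep and understood without reading the source. The module name `billing.invoice` and the function `apply_discount` are hypothetical; only the standard `logging` module is real.

```python
import logging

# Hypothetical module and function, for illustration only.
log = logging.getLogger("billing.invoice")  # logger name says where messages come from


def apply_discount(invoice_id, total, rate):
    # Instead of log.debug("DEBUG"), state what is happening and with which values.
    log.debug("apply_discount: invoice=%s total=%.2f rate=%.2f", invoice_id, total, rate)
    discounted = total * (1 - rate)
    if discounted < 0:
        # A distinct message (and level) for the case someone will actually hunt for.
        log.warning("apply_discount: invoice=%s negative total %.2f after discount",
                    invoice_id, discounted)
    return discounted


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG,
                        format="%(name)s %(levelname)s %(message)s")
    apply_discount("INV-7", 100.0, 0.25)
    # emits: billing.invoice DEBUG apply_discount: invoice=INV-7 total=100.00 rate=0.25
```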
Outsourcing Responsibility:
What do you do when your debugger fails you?
Dear KV, I’ve been assigned to help with a new project and have been looking over the admittedly skimpy documentation the team has placed on the internal wiki. I spent a day or so staring at what seemed to be a long list of open-source projects that they intend to integrate into the system they have been building, but I couldn’t find where their original work was described.
The Debugging Mindset:
Understanding the psychology of learning strategies leads to effective problem-solving skills.
Software developers spend 35-50 percent of their time validating and debugging software. The cost of debugging, testing, and verification is estimated to account for 50-75 percent of the total budget of software development projects, amounting to more than $100 billion annually. While tools, languages, and environments have reduced the time spent on individual debugging tasks, they have not significantly reduced the total time spent debugging, nor the cost of doing so. Therefore, a hyperfocus on eliminating bugs during development is counterproductive; programmers should instead embrace debugging as an exercise in problem solving.
Research for Practice: Tracing and Debugging Distributed Systems; Programming by Examples:
Expert-curated Guides to the Best of CS Research
This installment of Research for Practice covers two exciting topics in distributed systems and programming methodology. First, Peter Alvaro takes us on a tour of recent techniques for debugging some of the largest and most complex systems in the world: modern distributed systems and service-oriented architectures. The techniques Peter surveys can shed light on order amid the chaos of distributed call graphs. Second, Sumit Gulwani illustrates how to program without explicitly writing programs, instead synthesizing programs from examples! The techniques Sumit presents allow systems to "learn" a program representation from illustrative examples, allowing nonprogrammer users to create increasingly nontrivial functions such as spreadsheet macros.
To Catch a Failure: The Record-and-Replay Approach to Debugging:
A discussion with Robert O’Callahan, Kyle Huey, Devon O’Dell, and Terry Coatta
When work began at Mozilla on the record-and-replay debugging tool called rr, the goal was to produce a practical, cost-effective, resource-efficient means for capturing low-frequency nondeterministic test failures in the Firefox browser. Much of the engineering effort that followed was invested in making sure the tool could actually deliver on this promise with a minimum of overhead. What was not anticipated, though, was that rr would come to be widely used outside of Mozilla, and not just for sleuthing out elusive failures but also for regular debugging.
Debugging Incidents in Google’s Distributed Systems:
How experts debug production issues in complex distributed systems
This article covers the outcomes of research performed in 2019 on how engineers at Google debug production issues, including the types of tools, high-level strategies, and low-level tasks that engineers use in varying combinations to debug effectively. It examines the research approach used to capture data, summarizing the common engineering journeys for production investigations and sharing examples of how experts debug complex distributed systems. Finally, the article extends the Google specifics of this research to provide some practical strategies that you can apply in your organization.
Divide and Conquer:
The use and limits of bisection
Bisection is of no use if you have a heisenbug that fails only from time to time. These subtle bugs are the hardest to fix and the ones that cause us to think critically about what we are doing. Timing bugs, bugs in distributed systems, and all the difficult problems we face in building increasingly complex software systems can't yet be addressed by simple bisection. It's often the case that it would take longer to write a usable bisection test for a complex problem than it would to analyze the problem whilst at the tip of the tree.
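To make the limit concrete, here is a minimal sketch of the binary search that tools such as `git bisect` automate. The function name `bisect_first_bad` and the revision labels are hypothetical; the point is the assumption spelled out in the docstring, which is exactly what a heisenbug violates.

```python
from typing import Callable, Sequence


def bisect_first_bad(revisions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary-search an ordered history for the first revision where is_bad() is True.

    Assumes revisions[0] is known good, revisions[-1] is known bad, and that
    is_bad is deterministic and monotone: once a revision is bad, every later
    one is bad too. A flaky (heisenbug) test breaks that assumption.
    """
    if len(revisions) < 2:
        raise ValueError("need at least one good and one bad revision")
    lo, hi = 0, len(revisions) - 1  # invariant: lo is good, hi is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid  # first bad revision is at mid or earlier
        else:
            lo = mid  # first bad revision is after mid
    return revisions[hi]


if __name__ == "__main__":
    history = [f"r{i}" for i in range(100)]
    # Hypothetical regression introduced in r37; each probe is one build-and-test run.
    print(bisect_first_bad(history, lambda rev: int(rev[1:]) >= 37))  # prints r37
```

With n revisions the test runs about log2(n) times (at most 7 probes for the 100 revisions here). The same code also shows why bisection fails on intermittent bugs: if `is_bad` answers wrongly even once, the search silently discards the half that contains the real culprit, and nothing in the loop can detect it.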
Getting Off the Mad Path:
Debuggers and assertions
KV continues to grind his teeth as he sees code loaded with debugging statements that would be totally unnecessary if the programmers who wrote the code could be both confident in and proficient with their debuggers. If one is lucky enough to have access to a good debugger, one should give extreme thanks to whatever they normally give thanks to and use the damn thing!
Stone Knives and Bear Skins
If you look at the software tooling landscape, you see that the majority of developers work with one of three kinds of tools: open-source tools; tools from the recently reformed home of proprietary software, Microsoft, which has figured out that its Visual Studio Code system is a good way to sucker people into working with its platforms; or, finally, tools from Apple, which are meant only for its platform. In specialized markets, such as deeply embedded, military, and aerospace, there are proprietary tools that are often far worse than their open-source cousins, because the market for such tools is small but lucrative.
Deterministic Record-and-Replay:
Zeroing in only on the nondeterministic actions of the process
This column describes three recent research advances related to deterministic record-and-replay, with the goal of showing both classical use cases and emerging use cases. A growing number of systems use a weaker form of deterministic record-and-replay. Essentially, these systems exploit the determinism that exists across many program executions but intentionally allow some nondeterminism for performance reasons. This trend is exemplified in GPUReplay in particular, but also in systems such as ShortCut and Dora.
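A deliberately tiny sketch of the idea in the subtitle: only the nondeterministic actions need to be logged, because everything else the process does is a deterministic function of them. It assumes, hypothetically, that the program routes every clock read and random draw through a `NondeterminismLog` object invented for this example; production record-and-replay systems capture nondeterminism at a far lower level and are not structured like this.

```python
import random
import time
from typing import Any, Callable, List, Optional


class NondeterminismLog:
    """Record the results of nondeterministic calls, or replay them in order.

    Record mode: call the real source and append its result to the log.
    Replay mode: skip the real source and return the next logged result.
    """

    def __init__(self, replay_from: Optional[List[Any]] = None):
        self.replaying = replay_from is not None
        self.events: List[Any] = list(replay_from) if replay_from is not None else []
        self._cursor = 0

    def capture(self, source: Callable[[], Any]) -> Any:
        if self.replaying:
            if self._cursor >= len(self.events):
                raise RuntimeError("replay diverged: more nondeterministic calls than recorded")
            value = self.events[self._cursor]
            self._cursor += 1
            return value
        value = source()
        self.events.append(value)
        return value


def program(nd: NondeterminismLog) -> str:
    # Deterministic logic; every nondeterministic input goes through nd.capture.
    start = nd.capture(time.monotonic)
    roll = nd.capture(lambda: random.randint(1, 6))
    elapsed = nd.capture(time.monotonic) - start
    return f"roll={roll} elapsed={elapsed:.6f}"


if __name__ == "__main__":
    recorder = NondeterminismLog()
    original = program(recorder)
    replayer = NondeterminismLog(replay_from=recorder.events)
    print(original, program(replayer) == original)  # replay reproduces the run
```

The log holds three values rather than a trace of every instruction, which is the source of the efficiency. The weaker variants the column mentions would, in this picture, deliberately let some `capture` calls go unrecorded and tolerate the resulting differences.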