The Kollected Kode Vicious

Kode Vicious - @kode_vicious


Dear KV,
What is the proper way to debug malfunctioning hardware?
Hard Up Against a Bug

Dear Hard Up,
I suggest taking a very sharp knife and cutting the board traces at random until the thing either works, or smells funny! I gather you’re not asking the same question that led me to use the word “changeineer” in another column (“Permanence and Change,” Communications of the ACM, December 2008). I figure you have an actually malfunctioning piece of hardware and that you’ve already sent three previous versions back to the manufacturer, complete with nasty letters containing veiled references to legal action should they continue to send you broken products.

Along with race conditions, a subject for another time, hardware problems are probably the most difficult things to figure out. While hardware engineers may scoff at software engineers with screwdrivers, if you want to make them truly afraid, get out a logic analyzer or a scope and hook it up to their board. Most software engineers are not, alas, trained in using logic analyzers—or even in basic electrical engineering—so you will have to content yourself with poking at the board through whatever software the board vendor or operating-system vendor has provided you.

Believe it or not—and I am sure if you’re a typical software engineer you won’t want to hear this—the best place to start is with the hardware vendor’s documentation. Of course, many hardware vendors take as dim a view of documentation as software vendors do. The quality of the documentation I have seen has run the gamut from unusably terrible all the way up to “bang my head on the desk and cry.” Rarely have I seen hardware documentation that was both correct and had a structure that made sense to anyone but the people who originally put it together. Happily, it is rare these days to be able to completely destroy a piece of hardware by putting the wrong value into the wrong memory location; the days of exploding computers à la the original “Star Trek” are still a couple of centuries in the future.

That being said, it is definitely possible to cause damage to hardware via software, or, more commonly, to mask whatever problem you were having by tripping some seemingly unrelated bit of configuration magic in the device. Not that KV is against magic; it’s just that he tends not to trust it... at all.

If you’re lucky, you have the documentation for the system, or can get the lawyer where you work to send a nondisclosure agreement and a letter to the vendor to get whatever it’s willing to give you.

Read the documentation first. Really, trust me on this. It may be completely useless in the end, but it may also save you a lot of time if you find just the right bit of information in the docs. I tend to read over all the available registers and configuration options, of which there are often hundreds, and mark the ones I think might be related to my bug. I then tweak them one by one until I get a result. While this is a tedious process, it is the one I have seen work best.
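The one-at-a-time tweaking described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the register names, the values, and the simulated device (a plain dictionary) are all made up, and the read, write, and check functions stand in for whatever access method your vendor or operating system actually provides. The point is the discipline: change one register, record the result, and put the original value back before moving on.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class TweakResult:
    register: str
    original: int
    tried: int
    bug_gone: bool


def sweep_registers(
    candidates: Iterable[Tuple[str, int]],
    read_reg: Callable[[str], int],
    write_reg: Callable[[str, int], None],
    bug_is_gone: Callable[[], bool],
) -> List[TweakResult]:
    """Try one candidate value per register, one at a time, logging each result."""
    log: List[TweakResult] = []
    for name, value in candidates:
        original = read_reg(name)
        write_reg(name, value)
        try:
            gone = bug_is_gone()
        finally:
            # Always restore, so each experiment starts from the same state.
            write_reg(name, original)
        log.append(TweakResult(name, original, value, gone))
    return log


# Hypothetical device: a dict standing in for real register access.
fake_regs = {"RX_CTRL": 0x01, "DMA_BURST": 0x10, "IRQ_MASK": 0xFF}


def fake_bug_is_gone() -> bool:
    # Assumption for the demo: the bug vanishes only with a smaller DMA burst.
    return fake_regs["DMA_BURST"] == 0x08


results = sweep_registers(
    [("RX_CTRL", 0x03), ("DMA_BURST", 0x08), ("IRQ_MASK", 0x00)],
    fake_regs.__getitem__,
    fake_regs.__setitem__,
    fake_bug_is_gone,
)
for r in results:
    verdict = "bug gone" if r.bug_gone else "no change"
    print(f"{r.register}: {r.original:#04x} -> {r.tried:#04x}: {verdict}")
```

Restoring each register after its experiment is a design choice, not a rule; it keeps a change that merely masks the problem from contaminating the next test.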

Often you will not have a good way to interact with the hardware other than an already malfunctioning device driver. As devices have become more complex, vendors have released test and configuration programs that can be used to talk directly to the device—for example, over the PCI bus. If your hardware has such a program, and it works, then you are truly blessed. If, on the other hand, it does not come with such a program, there is a set of tools you can use to debug PCI-based devices, PCI Utilities, described in the accompanying sidebar.

PCI Utilities have been ported to several operating systems and something similar may exist in Windows, but, happily, that is not a form of pain to which I have been subjected.

If none of these yields results, and you still have to “just get the thing working,” it’s time, alas, to call for help. The quality of the help you can get from a vendor seems to be linearly related to the price of the device. A cheap device usually comes from a low-cost producer who does not have the money to keep high-quality engineers on hand to help with problems, whereas an expensive device is more likely, but by no means guaranteed, to be produced by a company with experienced engineers. If you’re specifying a device for a project at work, pick the one from the company that seems to have the better engineers. All devices have problems, but the ones that get fixed are the ones that have good engineering resources behind them. Cheap goods are cheap goods, in the end.

Once you reach a field or customer-support engineer, you need to be nice to them. I know, you’re thinking, “What have you done with Kode Vicious?”, but it’s true. Screaming at people and telling them they are idiots because they didn’t consider your personal corner case is not the way to get your bug fixed quickly, even if you work for a large corporation and you have your CEO calling their CEO every day for a fix. You will need to work with this person or these people at least for the duration of your bug, so it’s important to deal with them politely and professionally. Go back and read that again—I’ll wait.

Lastly, you need to take good notes on the problem. Nothing is more frustrating than a bug report that says, “It’s busted.” Don’t dare laugh; I’ve seen more than a few bug reports that say pretty much that. You need to be able to say how it is busted, when it was busted, whether it stays busted, how to get it into the busted state, and any other information that seems related to the bug you’re seeing. You should take notes not only on the bug but also on the fix. As you work with the engineers from your vendor, you need to track the patches they give you, if any; version changes in the hardware or driver; the various theories about what might be wrong and whether they pan out; and pretty much everything else that is related to fixing or working around the bug. At this point you will often be both the project manager of the bug fix and the remote hands for the vendor’s engineers. While this may not be what you thought you signed up for, it’s more often than not part of solving a hardware problem.
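One way to keep such notes honest is to give them a fixed shape. The following is a minimal sketch, not a prescribed format: the class, its field names, and every sample value are invented for illustration. The fields simply mirror the questions above (how, when, whether it stays busted, how to reproduce it) plus the patches, version changes, and theories you collect while working with the vendor.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class BugNotes:
    """Running notes on one hardware bug (all field names are illustrative)."""
    how_busted: str
    when_busted: str
    stays_busted: bool
    steps_to_reproduce: List[str]
    patches: List[str] = field(default_factory=list)
    version_changes: List[str] = field(default_factory=list)
    theories: List[Tuple[str, str]] = field(default_factory=list)  # (theory, outcome)

    def report(self) -> str:
        """Render the notes as plain text suitable for pasting into a bug report."""
        lines = [
            f"How it is busted: {self.how_busted}",
            f"When it was busted: {self.when_busted}",
            f"Stays busted: {'yes' if self.stays_busted else 'no'}",
            "How to get it into the busted state:",
        ]
        lines += [f"  {i}. {step}" for i, step in enumerate(self.steps_to_reproduce, 1)]
        lines += [f"Patch received: {p}" for p in self.patches]
        lines += [f"Version change: {v}" for v in self.version_changes]
        lines += [f"Theory: {t} -> {o}" for t, o in self.theories]
        return "\n".join(lines)


# Made-up example values.
notes = BugNotes(
    how_busted="Card stops receiving packets; transmit still works",
    when_busted="After roughly ten minutes of sustained traffic",
    stays_busted=True,
    steps_to_reproduce=["Boot with the stock driver", "Run a sustained receive test"],
)
notes.theories.append(("Bad cable", "ruled out after swapping cables"))
print(notes.report())
```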

I hope you’re lucky enough to have decent documentation and support from your vendor. If not, then I’ll see you at the bar. I’m the guy sitting alone at the far end, crying into a chip manual with an always-full gin and tonic. My bartender knows me well.
KV

KODE VICIOUS, known to mere mortals as George V. Neville-Neil, works on networking and operating system code for fun and profit. He also teaches courses on various subjects related to programming. His areas of interest are code spelunking, operating systems, and rewriting your bad code (OK, maybe not that last one). He earned his bachelor’s degree in computer science at Northeastern University in Boston, Massachusetts, and is a member of ACM, the Usenix Association, and IEEE. He is an avid bicyclist and traveler who currently lives in New York City.

PCI Utilities
The PCI Utilities package contains various utilities dealing with the PCI bus, as well as a library for portable access to PCI configuration registers. It includes lspci, which lists all PCI devices (very useful for debugging both kernels and device drivers), and setpci, which manually configures PCI devices (http://atrey.karlin.mff.cuni.cz/~mj/pciutils.shtml).
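To give a feel for the configuration registers these tools read, here is a small Python sketch that decodes the first few fields of a standard PCI configuration header from raw bytes. It is not how lspci itself is implemented. The function and class names are made up, and the sample bytes are fabricated for the demo; on Linux the real bytes can typically be read from a device's sysfs config file (for example /sys/bus/pci/devices/0000:00:03.0/config, where the address is a hypothetical one).

```python
import struct
from typing import NamedTuple


class PciHeader(NamedTuple):
    vendor_id: int
    device_id: int
    command: int
    status: int
    revision: int
    class_code: int  # 24 bits: base class, subclass, programming interface


def parse_pci_header(config: bytes) -> PciHeader:
    """Decode the leading fields of a PCI configuration header (little-endian)."""
    if len(config) < 16:
        raise ValueError("need at least the first 16 bytes of configuration space")
    # Offsets 0x00-0x0B: vendor, device, command, status (16 bits each),
    # then revision, programming interface, subclass, base class (8 bits each).
    vendor, device, command, status, rev, prog_if, subclass, base = struct.unpack_from(
        "<HHHHBBBB", config, 0
    )
    class_code = (base << 16) | (subclass << 8) | prog_if
    return PciHeader(vendor, device, command, status, rev, class_code)


# Fabricated sample: vendor 0x8086, device 0x100e, class 0x020000 (Ethernet controller).
sample = bytes.fromhex("86800e10" "07001000" "02000002" "00000000")
header = parse_pci_header(sample)
print(f"vendor={header.vendor_id:04x} device={header.device_id:04x} "
      f"class={header.class_code:06x} rev={header.revision:02x}")
```

Reading these fields is harmless. Writing to configuration space, which is what setpci does, is where the earlier warning about reading the documentation first applies.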


Originally published in Queue vol. 6, no. 7
© ACM, Inc. All Rights Reserved.