
Trials and Tribulations of Debugging Concurrency

You can run, but you can't hide.

KANG SU GATLIN, MICROSOFT

We now sit firmly in the 21st century, where the grand challenge for the modern-day programmer is neither memory leaks nor type issues (both of those problems are now effectively solved), but rather issues of concurrency. How does one write increasingly complex programs in which concurrency is a first-class concern? Or, even more treacherous, how does one debug such a beast? These questions bring fear into the hearts of even the best programmers.

Furthermore, the landscape of computing is changing. Clusters are increasing in popularity, and now even mom-and-pop shops have clusters running on-site. There is also the added twist, in this world of increasing parallelism, that CPU manufacturers have committed to moving multicore processors onto the desktop in a very short timeframe. (Multicore processors are chips that have more than one core processing unit. These cores compute in parallel, while typically sharing memory.) With all this activity, there is clearly a new hardware wave pushing parallelism into problem domains where it traditionally has not been relevant.

So what do parallel programs look like today? They come in many different flavors—from finite-element programs used for crash simulation to word processors. This article is about debugging these concurrent/parallel programs, and the debuggers used to do the debugging. The article opens with insight into what the actual problems look like. The next section describes some of the approaches used for parallel debugging, and the article concludes with a look toward the future. Note that the technology of debuggers varies from vendor to vendor, and this article tries to shed light on the general problems, not on how specific vendors implement their solutions.

Common Issues in Concurrent Programs

Much as the bugs in standard sequential programs tend toward a well-documented set (the dangling pointer, the dereferenced NULL pointer), the bugs in concurrent programs also tend toward a relatively small, yet common, set of problems. This section focuses on what typical problems for concurrent programs look like.

Before going further, we should establish the class of programs we are referring to. Most parallel programs are written in a model known as SPMD (single program multiple data). In this model, all processes run the same program image, albeit with different data, and the instruction pointers are not required to execute the same instruction in lockstep; two different processes can even be running completely different functions. Multithreaded programs are written in the task-parallelism model, in which each thread executes a different task, but global and static data are shared among threads (local and thread-local data are private to each thread). OK, that's what the class of programs we care about looks like; a minimal SPMD sketch follows to make it concrete, and then we can move on.
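The sketch below assumes MPI and its usual mpiexec launcher: every process runs the same executable, but each queries its own rank and works on its own slice of the data.

// spmd_example.cpp -- minimal SPMD sketch: same program image, different data per process.
#include <mpi.h>
#include <cstdio>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    int rank = 0, nprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // which process am I?
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  // how many processes in total?

    // Same code everywhere, but each rank computes over its own chunk
    // (any remainder elements are ignored, for brevity).
    const int N = 1000;
    int begin = rank * (N / nprocs);
    int end   = (rank + 1) * (N / nprocs);
    long local_sum = 0;
    for (int i = begin; i < end; ++i)
        local_sum += i;

    std::printf("rank %d summed [%d, %d): %ld\n", rank, begin, end, local_sum);

    MPI_Finalize();
    return 0;
}

Running it with mpiexec -n 4 gives four processes all executing this same image; running it with mpiexec -n 1 gives exactly the serial variant recommended next.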

In general, parallel programs should be debugged as parallel applications only after they have been debugged as serial applications. You can effectively turn many parallel applications into serial versions by setting the number of processes/threads to 1. This conversion works more often for multiprocess applications, and less often for multithreaded applications. The reason for doing this is to debug all of the problems in the program that have nothing to do with concurrency. Debugging sequential programs is an order of magnitude easier than debugging parallel programs, and it’s important to get all those issues taken care of first. After the sequential variant of the application is running flawlessly, then it is time to test and begin debugging the program as a parallel application.

The following list is not exhaustive, but should cover 90 percent of the problems that you’re likely to run into when debugging parallel programs. It’s worth noting that process-level parallelism and thread-level parallelism usually exhibit different sets of the problems listed here. For example, mismatched communication tends to happen more with process-level parallelism; race conditions and false sharing happen almost exclusively with thread-level parallelism; deadlocks can occur with both.

Mismatched Communication

Process-level parallel programs typically send messages back and forth between various processes, and these processes can be located on distinct physical computers. When you use an explicit message-passing API such as MPI (the Message Passing Interface standard; see http://www-unix.mcs.anl.gov/mpi/), there are explicit sends and receives, one for each side of the communication.

The mismatched communication bug occurs when there is a send/receive call without the corresponding receive/send on the other end of the communication. For example, process X may send a message to process Y, but what if process Y doesn’t attempt to receive the message? Or what if process X is waiting to receive a message from process Y, but process Y does not send a message?
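As an illustration, here is a deliberately broken MPI sketch (not production code): rank 0 sends with tag 0, but rank 1 waits for a message with tag 1 that never arrives.

// mismatch.cpp -- deliberately broken: the tags do not match, so rank 1 hangs.
// Run with at least two processes.
#include <mpi.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 42;
    if (rank == 0) {
        // Send to rank 1 with tag 0 ...
        MPI_Send(&value, 1, MPI_INT, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // ... but wait for tag 1, which is never sent: this receive blocks forever.
        MPI_Status status;
        MPI_Recv(&value, 1, MPI_INT, /*source=*/0, /*tag=*/1, MPI_COMM_WORLD, &status);
    }

    MPI_Finalize();  // never reached by rank 1
    return 0;
}

When this runs, rank 1 hangs in MPI_Recv while the unmatched tag-0 message sits in its unexpected-message queue.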

The result of this sort of bug is typically a hung process or thread, which in turn often causes downstream mismatched communications, and can additionally cause a deadlock (covered in the next section). Fortunately, at the site of the mismatched communication, there is a great deal of information—including the unrequited call. Thus, it often is not too difficult to debug such programs. A technique called message queue viewing (discussed later) also helps make the debugging process for this type of bug easier.

Deadlock

Deadlock is one of the most famous problems in computer science and is illustrated by the Dining Philosophers problem (for more on this, see http://www.nist.gov/dads/HTML/diningphilos.html). In multithreaded programs, deadlocks are even more prevalent, since locks are a fundamental part of the task-parallelism programming paradigm. With locks for various resources (often memory and I/O in parallel programs) dispersed throughout the program, there is greater potential for deadlock unless great care is taken to avoid it.

For example, consider the following sequence of actions on two threads:

Thread 1 acquires Lock A

Thread 2 acquires Lock B

Thread 1 blocks attempting to acquire Lock B

Thread 2 blocks attempting to acquire Lock A

Threads 1 and 2 each end up blocking on a distinct lock that the other thread holds. Unless some way to break this deadlock is put in place, they will freeze indefinitely—and that is the common result of deadlock in programs: a freeze or crash (in kernel mode, for example, a deadlock often ends in a bug check). Fortunately, like the mismatched communication problem, the source of a deadlock is typically easy to find after a freeze or crash. For a deadlock to occur, at least one thread must hold a lock and at least one thread must be attempting to acquire a lock; at any given point in the execution of the program, or when examining the crash dump file, this state is readily apparent. Of course, fixing the problem after identifying it can be much harder, as it sometimes requires re-architecting the program (although sometimes it is as easy as releasing a lock that was inadvertently held too long).
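Here is a minimal sketch of that sequence in modern C++ (the sleeps are there only to make the unlucky interleaving nearly certain):

// deadlock.cpp -- reproduces the A/B lock-ordering deadlock described above.
#include <chrono>
#include <mutex>
#include <thread>

std::mutex lockA, lockB;

void thread1() {
    std::lock_guard<std::mutex> a(lockA);                          // 1. acquire A
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    std::lock_guard<std::mutex> b(lockB);                          // 3. block on B (held by thread 2)
}

void thread2() {
    std::lock_guard<std::mutex> b(lockB);                          // 2. acquire B
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    std::lock_guard<std::mutex> a(lockA);                          // 4. block on A (held by thread 1)
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join();   // never returns: both threads are frozen
    t2.join();
    return 0;
}

The usual fix is to impose a single global lock-acquisition order, or, in C++, to acquire both locks at once with std::lock, which avoids the circular wait.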

Race Condition

A race condition occurs when there is shared state—for example, a shared variable that can be accessed by two or more different threads with no mechanism to control temporal ordering of their accesses. Thus, depending on which thread accesses the shared state first, the behavior of the application can change.

One of the most common characteristics of race conditions is their nondeterminism. A program may not exhibit bad behavior as a result of a particular race condition, because the order of access just happened to be “favorable” 99 percent of the time. Every once in a while, however, the program does something completely wrong, and then you have a crash. Debugging this can be painful, since 99 percent of the time the program works as expected; only during that other 1 percent can you catch it misbehaving. Even worse, there is the occasional added complication that comes from merely attaching a debugger to the program: this changes the timing just enough that the program always runs correctly under the debugger, yet crashes when not run under the debugger. Yes, it can really be that painful.
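The canonical example is an unsynchronized counter; a sketch:

// race.cpp -- two threads increment a shared counter with no synchronization.
#include <cstdio>
#include <thread>

int counter = 0;  // shared state, with no lock protecting it

void worker() {
    for (int i = 0; i < 1000000; ++i)
        ++counter;  // read-modify-write: increments from the two threads can interleave and be lost
}

int main() {
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    // Expected 2000000; the actual value varies from run to run and is usually smaller.
    std::printf("counter = %d\n", counter);
    return 0;
}

Guarding the increment with a mutex (or making the counter a std::atomic<int>) removes the race—and, true to form, running this under a debugger or adding extra output can change the timing enough to change how often the lost updates show up.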

False Sharing

Of the bugs listed here, false sharing is unique in that it is not a correctness issue, but rather a performance issue—one that is often hard to diagnose and whose effects can be debilitating. To understand this problem, you must understand the basic structure of memory hierarchies in modern SMPs (symmetric multiprocessing computers). In most SMPs the data cache is private to each processor, whereas main memory is shared. Furthermore, the physical granularity of data transfer between memory and cache is the cache line, which is typically multiple words in length. The problem is that two threads can each be working on the same cache line, yet on different words within that line. The cache-coherence mechanism sees that both threads are reading and modifying the line, but it cannot tell that they are touching distinct parts of it. The result is that the cache line must continuously be written back and re-fetched as it bounces between the processors' caches, even though no data is logically shared.
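A sketch that typically exhibits the problem follows: two threads each update their own counter, but the two counters occupy adjacent words and therefore (on typical hardware) the same cache line. The 64-byte line size used for the padded variant is an assumption about the target machine.

// false_sharing.cpp -- the counters share a cache line, so the line ping-pongs between caches.
#include <thread>

struct Counters {
    long a;  // updated only by thread 1
    long b;  // updated only by thread 2 -- but it sits in the same cache line as 'a'
};

// Padded variant: each counter gets its own (assumed 64-byte) cache line.
struct PaddedCounters {
    alignas(64) long a;
    alignas(64) long b;
};

Counters shared;          // slow: false sharing
PaddedCounters padded;    // fast: the threads never contend for a line

int main() {
    std::thread t1([] { for (long i = 0; i < 100000000; ++i) ++shared.a; });
    std::thread t2([] { for (long i = 0; i < 100000000; ++i) ++shared.b; });
    t1.join();
    t2.join();
    // Rerunning with 'padded.a' and 'padded.b' instead is typically several times faster,
    // even though the two threads never touch the same variable in either version.
    return 0;
}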

Multiparadigm Debugging

Although we’ve focused on traditional parallel models, there is at least one more model worth discussing here. This model is based on the increasing prevalence of loosely coupled services that communicate with each other. How does one debug an application that runs a spreadsheet as the front end, talks to a cluster and idle desktop computers on the back end—all of which are talking to a database server—and uses a combination of OpenMP (an API for writing multithreaded applications; see http://www.openmp.org), MPI, and Web Services as its mechanism for communication? The problems include not only all of the above, but also the intellectual incongruity associated with crossing these various paradigm boundaries. It’s a problem that no one has even begun to solve, but the importance of the problem grows every day. Nudge, nudge: it’s a great research topic.

Approaches of Parallel Debuggers and Tools

Several parallel debuggers are available today. None is near perfect, but they are getting better, and even in their imperfect state they are great aids in finding and fixing bugs. This section describes some of the approaches used by parallel debuggers and tools. For many of the solutions described here, there are several different approaches a debugger or tool could use. In this article I try to present a representative, or at least widely accepted, approach—but what is listed is certainly not comprehensive.

Let’s first define some terminology that you will encounter when reading about debuggers:

Pause(d). A process or thread is paused when it has halted execution; it can resume execution of the program when appropriately prompted.

Step. While debugging an application, the programmer can pause the program and then advance the instruction pointer a single instruction at a time. Each such advance is referred to as a step, and the process of doing this is referred to as stepping.

Attach. A debugger being attached to a process means the debugger can control the process (for example, pausing or stepping a process) and inspect the memory of the process.

Breakpoint. A breakpoint is a mechanism used by the debugger to pause a program at a specified line in the source code. Breakpoints can be per program, per process, or per thread.

Standard Parallel Debugging

I call the approach described here standard simply because it has become the most common approach to debugging parallel programs. When most people refer to parallel debugging, they are usually referring to the model described in this section.

In the standard parallel debugging model, users have a debugger (often with a graphical user interface front end) on their local desktops where they can navigate the debugging session. From this debugger they invoke the parallel program (or attach to a currently running parallel program). This starts the parallel program on multiple processes. The debugger then attaches to the processes that can exist on multiple machines. The user can then specify breakpoints in the program, as one would do when debugging a sequential program, but the user can also specify breakpoints that are active only for a given set of processes and/or threads (process- and thread-level breakpoints, respectively).

Additionally, some debuggers give users the capability to synchronize processes such that they all execute up to a certain line of code, and then block until all processes have reached this point. The debugger can then single-step all of these processes in lockstep, if the user so desires.

The types of issues that typically come up in constructing such a debugger include the following:

Scalable start-up mechanisms. The time it takes to start up a debugging session with a single process on your local computer is typically negligible. Starting up a program that runs on 512 processors (or, even more difficult, 10,000 processors) is much more problematic with respect to start-up time. There are several ways to address this. A common method is a tree-based start-up mechanism in which each node tells k other nodes to start up (where k is some small integer constant); a minimal sketch of such a fan-out appears after this list.

Secure debugging. Since most of the jobs being debugged are remote, it becomes imperative to address security concerns. In particular, malicious users may attempt to inject their processes into your debug session, masquerading as one of your processes. To counter this, an authentication mechanism is necessary to ensure that each process in the session actually belongs to the programmer doing the debugging.

Auto-attach. Debugging an application, even a parallel one, requires that the user be able to start debugging from the earliest moment of process invocation; after all, in a C or C++ program, work such as global initialization occurs even before main() is called. Additionally, some parallel programs will spawn new threads and/or processes, and the debugger should automatically attach to these newly spawned entities as well.

Process-level breakpoints. Setting a breakpoint that fires for only a specific process (or thread) becomes invaluable when you’re trying to drill down into where a bug is located in a parallel program. Part of the requirement here is that setting a process-level breakpoint for one computer should not adversely affect the running of a process on another computer.
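Returning to the start-up issue above, here is a sketch of a k-ary fan-out. The launch_on() helper is hypothetical—a stand-in for whatever remote process-creation mechanism the cluster provides—and the point is only that the launches spread over roughly log-base-k of N rounds rather than N sequential steps.

// fanout.cpp -- k-ary tree start-up sketch. launch_on() is a hypothetical stand-in
// for the cluster's remote process-creation mechanism.
#include <cstdio>

const int k = 4;  // fan-out factor: each node starts k others

void launch_on(int node) {
    // Hypothetical: remotely start the debuggee (and its debug agent) on 'node'.
    std::printf("launching on node %d\n", node);
}

// Called on node 'me'; each launched node would in turn call start_children(child, N),
// so together the nodes start all N processes in roughly log_k(N) rounds.
void start_children(int me, int N) {
    for (int c = 1; c <= k; ++c) {
        int child = me * k + c;   // children of node 'me' in a k-ary tree rooted at 0
        if (child < N)
            launch_on(child);
    }
}

int main() {
    start_children(0, 512);  // node 0 kicks off the fan-out for a 512-process job
    return 0;
}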

The functionality of this standard parallel debugging is available in products such as Etnus’s TotalView, Streamline Computing’s DDT (Distributed Debugging Tool), and the upcoming Microsoft Visual Studio 2005 C/C++ debugger.

Relative Debugging

Another approach is relative debugging. The fundamental idea behind relative debugging is that the state of the data structures in two running copies of the same program should be the same (at specified points in time), even if one copy is running on a single processor and the other is parallelized over 1,024 processors.

For example, given the following program snippet, assume this loop has no data dependencies:

for (int i = my_process_ID * size; i < (my_process_ID + 1) * size; ++i)
    k[i] += j[i] - z[i];

During each iteration of the loop, the array element k[i] is assigned a value. This value should be the same whether the code runs as a sequential program or as a parallel program. A relative debugger simultaneously runs both a single-process reference version of the code and a multiprocess version of the application, and checks that, at the points in the program where the two are supposed to agree, their state is the same (or within a specified tolerance). If the values differ, the user can break into the code to inspect what caused the discrepancy. This can be a great tool when you know the parallel program is diverging from the reference program but are having a hard time tracking down when and where the divergence begins.
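A minimal sketch of the core check, under the assumption that each process owns size elements of k and that rank 0 also holds the array k_ref produced by the serial reference run (this is only an illustration of the idea, not how any particular product implements it):

// relative_check.cpp -- sketch of a relative-debugging comparison for the loop above.
#include <mpi.h>
#include <cmath>
#include <cstdio>
#include <vector>

void compare_against_reference(const double* k_local, int size,
                               const double* k_ref, double tolerance,
                               MPI_Comm comm) {
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    // Gather every process's slice of k onto rank 0.
    std::vector<double> k_full(rank == 0 ? size * nprocs : 0);
    MPI_Gather(const_cast<double*>(k_local), size, MPI_DOUBLE,
               k_full.data(), size, MPI_DOUBLE, 0, comm);

    if (rank == 0) {
        for (int i = 0; i < size * nprocs; ++i) {
            if (std::fabs(k_full[i] - k_ref[i]) > tolerance) {
                // A real relative debugger would pause both programs here.
                std::printf("divergence at k[%d]: parallel=%g reference=%g\n",
                            i, k_full[i], k_ref[i]);
            }
        }
    }
}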

Relative debugging is a feature of Guardsoft's Guard debugger.

Execution-Trace Debugging

For some of the larger parallel computer installations, machine time is still not the cheap commodity that desktop computing is. For this reason, batch noninteractive jobs are still used to access some clusters. In these situations, users no longer have interactive access to the application and thus cannot debug it interactively. A solution to this problem is execution-trace debugging.

Execution-trace debugging in the parallel computing world is similar to postmortem debugging in the sequential desktop world, except in the sequential desktop world you typically use postmortem debugging on a crash dump file to figure out why the program crashed. With execution-trace debugging, the trace file is generated whether or not the program crashes.

To do an execution-trace-based debug session, you must use a special instrumented build of the application. This instrumented build will write the execution-trace to a log file as the application executes. The log file allows the debugger to replay all of the messages sent, preserving the relative order in which they were sent. The log file is then the input into the debugger, and the user debugs the replayed past session using either standard parallel debugging or relative debugging. One of the additional advantages of log files is that the actual process of debugging does not disrupt the timing of messages as much as an interactive debug session usually does.
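MPI makes this kind of instrumentation relatively straightforward through its profiling interface: every MPI_X routine has a PMPI_X twin, so a tracing layer can intercept the former and forward to the latter. A minimal sketch that logs each send follows (it assumes an MPI-3 mpi.h, where the send buffer is const void*; the trace.log format is made up for illustration, and a real tracer would also log receives and collectives, record timestamps, and buffer its output):

// trace_send.cpp -- sketch of send-tracing via the MPI profiling (PMPI) interface.
#include <mpi.h>
#include <cstdio>

// Our MPI_Send overrides the library's; PMPI_Send is the real implementation.
extern "C" int MPI_Send(const void* buf, int count, MPI_Datatype datatype,
                        int dest, int tag, MPI_Comm comm) {
    int rank;
    PMPI_Comm_rank(comm, &rank);

    // Append one event record per send.
    if (FILE* log = std::fopen("trace.log", "a")) {
        std::fprintf(log, "SEND from=%d to=%d tag=%d count=%d\n", rank, dest, tag, count);
        std::fclose(log);
    }
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}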

Visualization

Many people consider visualization simply as a way to view the result of computation, but it also turns out to be a powerful tool to debug applications, especially parallel applications. The fundamental use of visualization tends toward identifying two types of bugs:

• Those that are identified as outliers in a pattern. These patterns are often hard to see when looking at the raw data in a text file, but with visualization they become apparent.

• Those that create a rather drastic change in pattern from the expected. These bugs are easy to see when looking at the raw data, but it’s often hard to determine when and where they start. With visualization, this is often made much easier.

We categorize the visualizations for debugging parallel applications into three different types:

Message queue visualization. In some message-passing APIs, such as MPI, there is a notion of message queues, which contain information about messages that are currently being processed. Typically, there are three types of queues: send queue, receive queue, and unexpected message queue. The visualization uses a snapshot of the state of the message queues in a process and constructs a graph to represent the queue. This graph is composed of nodes, representing the processes, and edges (of different colors, or dotted versus dashed), which represent the type of queue for each message.

Figure 1 is an example of what message queue visualization might look like.

With this type of visualization, you can quickly see if an imbalance exists in the message queues along some path.

Message chronology visualization. During the execution of the program, messages are sent from one process to another, and having a graphical view of these messages and their chronology can be useful in finding bugs.

Most parallel programs have some degree of regularity with respect to their message-passing pattern. Although these patterns may be opaque to someone reading the code, a visualization of the message-passing chronology can lead to some interesting insights. For example, take a look at the visualization in Figure 2. In this figure we can see that there is clearly something weird going on with process 6, as it is not following the pattern of sending a message to the next process (process 8 sends to process 1, since process 8 has no “next process”).

 

This is just one example. You can also debug bottleneck performance issues (all the messages going to or coming from a small number of processes) with this type of visualization.

Distributed array visualization. Arrays are the data structures of choice for distribution among multiple processors. It is common in scientific programs to have large arrays of scalar values, which are straightforward to visualize as a 2D or 3D grid (of course, dimensions higher than that are difficult to visualize). Yet these simple visualizations can quickly reveal anomalies that are a direct result of a bug.

For example, it is common for the values in the arrays to have relatively smooth transitions, as they may indicate heat over some surface. If every value in a region goes from 0.0 to about 0.2, but there is a value of 9.8e57 in the middle of everything—well, then you have something to investigate.

Some products offer visualizations similar to those presented here. For example, Etnus's TotalView supports message-queue visualization, Intel's Trace Analyzer supports message-chronology visualization, and Streamline Computing's DDT supports distributed array visualization.

Value Comparison

The distributed array visualization section noted how visualization can make obscure anomalies apparent in graphical format; value comparison does this through statistical analysis of the data.

Value comparison takes a set of data and looks for outliers based upon some set of statistical criteria. The statistical criteria can be equality, values that fall outside of two standard deviations, values that don’t conform to a user-specified function, etc. For example, if all of the data in a two-dimensional array is within some delta of a neighbor value, except for a particular small chunk of data in a certain area, then this chunk might be a place where you would expect to find anomalies first.
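A minimal sketch of the two-standard-deviation criterion mentioned above:

// outliers.cpp -- flag values more than two standard deviations from the mean.
#include <cmath>
#include <cstdio>
#include <vector>

void flag_outliers(const std::vector<double>& data) {
    double sum = 0.0, sum_sq = 0.0;
    for (double x : data) {
        sum += x;
        sum_sq += x * x;
    }
    double mean = sum / data.size();
    double stddev = std::sqrt(sum_sq / data.size() - mean * mean);

    for (size_t i = 0; i < data.size(); ++i)
        if (std::fabs(data[i] - mean) > 2.0 * stddev)
            std::printf("outlier at [%zu]: %g (mean %g, stddev %g)\n",
                        i, data[i], mean, stddev);
}

int main() {
    std::vector<double> heat = {0.1, 0.12, 0.15, 0.11, 9.8e57, 0.14, 0.13};
    flag_outliers(heat);  // reports the 9.8e57 entry from the earlier example
    return 0;
}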

A set of features similar to this is available in Streamline Computing’s DDT debugger.

Race Condition Detection

Race conditions are difficult to detect, but some tools and techniques can help. These fall into two broad categories: static detection and dynamic detection.

Static detection of race conditions is an especially difficult problem; in fact, the general problem is NP-hard. That said, there are static methods that can prove that certain accesses cannot be involved in a race condition. They may not find every such access, but they can find at least a partial list. The accesses about which a static method cannot make a decision must be examined further—either manually by the developer or with the help of a dynamic detection tool.

Dynamic detection is based on using an instrumented build of the program (or a special runtime or binary-rewriting technology) to find race conditions during execution of the program. The instrumented version of the program can detect when access to a shared variable is not properly protected by a lock and raise an error to the user. Dynamic detection algorithms are typically conservative and may flag accesses that are in fact safe, but it is better to be too strict than to miss a data race.

Dynamic and static detection can be used together in complementary fashion: static detection rules out the accesses that cannot be involved in race conditions, and dynamic detection then narrows down where race conditions might occur among the accesses that remain.

The academic project Eraser is a popular implementation of dynamic race condition detection. Intel’s Thread Checker is a popular commercial tool that also includes a race condition detector.
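To give a flavor of how dynamic detection works, here is a much-simplified sketch of the lockset idea that Eraser popularized: track, for each shared location, the set of locks that has been held at every access to it so far; if that set ever becomes empty, no single lock consistently protects the location, and a potential race is reported. The on_lock/on_unlock/on_access hooks are assumed to be inserted by the instrumented build.

// lockset_sketch.cpp -- much-simplified lockset check; the hooks below would be
// invoked by compiler or binary instrumentation at every lock operation and memory access.
#include <cstdio>
#include <map>
#include <set>

using LockSet = std::set<void*>;

thread_local LockSet held_locks;           // locks this thread currently holds
std::map<void*, LockSet> candidate_locks;  // per-variable: locks held at *every* access so far
// (A real tool would also need to protect candidate_locks itself and track thread identities.)

void on_lock(void* lock)   { held_locks.insert(lock); }
void on_unlock(void* lock) { held_locks.erase(lock); }

void on_access(void* var) {
    auto it = candidate_locks.find(var);
    if (it == candidate_locks.end()) {
        candidate_locks[var] = held_locks;  // first access: candidates = locks held right now
        return;
    }
    // Refine: keep only the locks that are also held at this access (set intersection).
    LockSet refined;
    for (void* lock : it->second)
        if (held_locks.count(lock))
            refined.insert(lock);
    it->second = refined;

    if (refined.empty())
        std::printf("potential race: no lock consistently protects %p\n", var);
}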

Challenges for the Future

The challenges here are vast. Simply put, it’s not clear if we even understand all the problems involved with writing complex parallel programs—given this, how can we hope to solve the debugging problem? The answer is that it will be difficult. The complexity of the programming model is intimately tied into the difficulties in debugging.

For example, the key technology to stop memory leaks was not better memory leak detection tools, but rather garbage collection (and languages that support garbage collection). I suspect the Holy Grail of correct concurrent programs will have to do with a paradigm shift in how actual programs are written, rather than the tools used to debug them.

As you may have noticed, the multiparadigm debugging challenge is largely unanswered by the various approaches listed here. That's because nothing approaching a satisfactory solution to it exists yet. The closest thing is multilanguage debugging, which currently exists in some high-end development tools such as Visual Studio. If you're looking for a product to develop, a good multiparadigm debugger (or an infrastructure to tie multiple debuggers together) is much desired.

Given the importance of writing parallel programs to the future of computer science and the information technology industry, I expect both great research and commercial products to continue tackling this problem in the coming years.

RESOURCES

A variety of products available today provide much of the parallel debugging functionality discussed in the accompanying article.

Microsoft’s Visual Studio 2005 will add support for standard parallel debugging with process and thread-level breakpoints and stepping. http://lab.msdn.microsoft.com/vs2005/

Etnus’s TotalView has support for standard parallel debugging, as well as message queue visualization. http://www.etnus.com

Intel’s Trace Analyzer supports visualizations of parallel programs including message chronology visualization. http://www.intel.com/software/products/cluster/tanalyzer/

Intel’s Thread Checker has functionality to help find deadlocks and race conditions. http://intel.com/software/products/threading/tcwin/

Streamline Computing’s DDT (Distributed Debugging Tool) supports a variety of the features described in the article, including value comparison and distributed array visualization. http://www.streamline-computing.com/softwaredivision_1.shtml

Guardsoft’s Guard debugger features the relative debugging mechanism. http://www.guardsoft.net/products.html


KANG SU GATLIN is program manager at Microsoft on the Visual C++ compiler team (http://msdn.microsoft.com/visualc), where he focuses on code generation and optimization, as well as 64-bit and high-performance computing development tools. He received a B.S. in computer science from Cal Poly San Luis Obispo and an M.S. and Ph.D. in computer science from the University of California at San Diego.

© 2004 ACM 1542-7730/04/1000 $5.00


Originally published in Queue vol. 2, no. 7