System Failures

Vol. 2 No. 8 – November 2004

Interviews

A Conversation with Bruce Lindsay

If you were looking for an expert in designing database management systems, you couldn't find many more qualified than IBM Fellow Bruce Lindsay. He has been involved in the architecture of RDBMS (relational database management systems) practically since before there were such systems. In 1978, fresh out of graduate school at the University of California at Berkeley with a Ph.D. in computer science, he joined IBM's San Jose Research Laboratory, where researchers were then working on what would become the foundation for IBM's SQL and DB2 database products. Lindsay has had a guiding hand in the evolution of RDBMS ever since.

Designing for failure may be the key to success.

Photography by Tom Upton

Articles

Automating Software Failure Reporting

We can only fix those bugs we know about.

Brendan Murphy, Microsoft Research

There are many ways to measure quality before and after software is released. For commercial and internal-use-only products, the most important measurement is the user’s perception of product quality. Unfortunately, perception is difficult to measure, so companies attempt to quantify it through customer satisfaction surveys and failure/behavioral data collected from their customer bases. This article focuses on the problems of capturing failure data from customer sites. To explore the pertinent issues I rely on experience gained from collecting failure data from Windows XP systems, but the problems you are likely to face when developing internal (noncommercial) software should not be dissimilar.

A LITTLE HISTORICAL PERSPECTIVE

Traditionally, computer companies collected failure data through their customers’ or their own service arm, manually submitting bug reports. Back in the 1970s and 1980s, a number of computer companies (IBM, Tandem, Digital, etc.) began to service their customers’ computers through electronic communication (usually a secure telephone link). It was a natural progression to automate the collection of failure data: whenever the computer/application crashed, its failure data was automatically collected and sent back to the manufacturer, where it was forwarded to the engineering department for analysis.

by Brendan Murphy

Oops! Coping with Human Error in IT Systems

Coping with Human Error

Errors Happen. How to Deal.

Aaron B. Brown, IBM Research

Human operator error is one of the most insidious sources of failure and data loss in today’s IT environments. In early 2001, Microsoft suffered a nearly 24-hour outage in its Web properties as a result of a human error made while configuring a name resolution system. Later that year, an hour of trading on the Nasdaq stock exchange was disrupted because of a technician’s mistake while testing a development system. More recently, human error has been blamed for outages in instant messaging networks, for security and privacy breaches, and for banking system failures.

Although these scenarios are not as spectacularly catastrophic as their analogues in other engineering disciplines—the meltdown of the Chernobyl nuclear plant or the grounding of the Exxon Valdez oil tanker, for example—their societal consequences can be nearly as severe, causing financial uncertainty, disruption to communication, and corporate instability. It is therefore critical that the designers, architects, implementers, and operators of today’s IT infrastructures be aware of the human error problem and build in mechanisms for tolerating and coping with the errors that will inevitably occur. This article discusses some of the options available for embedding “coping skills” into an IT system.

by Aaron B. Brown

Error Messages: What's the Problem?

Real-world tales of woe shed some light

Paul P. Maglio and Eser Kandogan, IBM Research

Computer users spend a lot of time chasing down errors—following the trail of clues that starts with an error message and that sometimes leads to a solution and sometimes to frustration. Problems with error messages are particularly acute for system administrators (sysadmins)—those who configure, install, manage, and maintain the computational infrastructure of the modern world—as they spend a lot of effort to keep computers running amid errors and failures.

Over the past few years, we have spent time observing the problems and practices of sysadmins and have found them frequently frustrated or led astray by error messages. In fact, our data indicates that as much as 25 percent of a sysadmin’s time may be spent following blind alleys suggested by poorly constructed and unclear messages.1 We hope that once developers know how much time end users (such as sysadmins) spend dealing with error messages and how error messages shape problem-solving behavior, they will spend more time crafting messages to try to improve overall productivity.

by Paul P. Maglio, Eser Kandogan

Lack of Priority Queuing Considered Harmful

We're in sore need of critical Internet infrastructure protection.

Vijay Gill, America Online

Most modern routers consist of several line cards that perform packet lookup and forwarding, all controlled by a control plane that acts as the brain of the router, performing essential tasks such as management functions, error reporting, control functions including route calculations, and adjacency maintenance. This control plane has many names; in this article it is the route processor, or RP. The route processor calculates the forwarding table and downloads it to the line cards using a control-plane bus. The line cards perform the actual packet lookup and forwarding. Although individual vendors or models may differ slightly in implementation, the salient points remain the same.

by Vijay Gill

Outsourcing: Devising a Game Plan

What types of projects make good candidates for outsourcing?

Adam Kolawa, Parasoft

Your CIO just summoned you to duty by handing off the decision-making power about whether to outsource next year’s big development project to rewrite the internal billing system. That’s quite a daunting task! How can you possibly begin to decide if outsourcing is the right option for your company?

There are a few strategies that you can follow to help you avoid the pitfalls of outsourcing and make informed decisions. Outsourcing is not exclusively a technical issue, but it is a decision that architects or development managers are often best qualified to make because they are in the best position to know what technologies make sense to keep in-house. Deciding what should and should not be outsourced is key to a successful game plan.

by Adam Kolawa

Curmudgeon

Programming in Franglais

Rodney Bates, Wichita State University

When I was studying French in high school, we students often spoke “Franglais”: French grammar and words where we knew them, English inserted where our command of French failed us. It was pretty awful, and the teacher did not think highly of it. But we could communicate haltingly because we all had about the same levels of knowledge of the respective languages.

Today, there is a kind of programmer’s Franglais that is all too pervasive. Those who are old enough will remember the pitched controversy in the late 1960s and early 1970s over whether compilers, operating systems, and other systems programs should be written in assembly code or a high-level language. The prime argument for high-level languages was that their higher-level computational model allowed far more function to be coded in the same amount of development time.

by Rodney Bates

Kode Vicious

Kode Vicious Strikes Again

Dear Kode Vicious, I have this problem. I can never seem to find bits of code I know I wrote. This isn't so much work code--that's on our source server--but you know, those bits of test code I wrote last month, I can never find them. How do you deal with this?

A koder with attitude, KV answers your questions. Miss Manners he ain’t.

Kall us krazy, but we’re making Kode Vicious a regular. And so we say again, all together now: Never fear, Kode Vicious is here! Answering your questions, solving your problems, and making the world a better place.

by George Neville-Neil