System Failures

Vol. 2 No. 8 – November 2004

System Failures

Lack of Priority Queuing Considered Harmful:
We’re in sore need of critical Internet infrastructure protection.

Most modern routers consist of several line cards that perform packet lookup and forwarding, all controlled by a control plane that acts as the brain of the router, performing essential tasks such as management functions, error reporting, control functions including route calculations, and adjacency maintenance. This control plane has many names; in this article it is the route processor, or RP. The route processor calculates the forwarding table and downloads it to the line cards using a control-plane bus. The line cards perform the actual packet lookup and forwarding. Although individual vendors or models may differ slightly in implementation, the salient points remain the same.

by Vijay Gill

Outsourcing: Devising a Game Plan:
What types of projects make good candidates for outsourcing?

Your CIO just summoned you to duty by handing off the decision-making power about whether to outsource next years big development project to rewrite the internal billing system. That’s quite a daunting task! How can you possibly begin to decide if outsourcing is the right option for your company? There are a few strategies that you can follow to help you avoid the pitfalls of outsourcing and make informed decisions. Outsourcing is not exclusively a technical issue, but it is a decision that architects or development managers are often best qualified to make because they are in the best position to know what technologies make sense to keep in-house. Deciding what should and should not be outsourced is key to a successful game plan.

by Adam Kolawa

Programming in Franglais:
Six of one, half a dozen of d’autre

When I was studying French in high school, we students often spoke “Franglais”: French grammar and words where we knew them, English inserted where our command of French failed us. It was pretty awful, and the teacher did not think highly of it. But we could communicate haltingly because we all had about the same levels of knowledge of the respective languages. Today, there is a kind of programmer’s Franglais that is all too pervasive. Those who are old enough will remember the pitched controversy in the late 1960s and early 1970s over whether compilers, operating systems, and other systems programs should be written in assembly code or a high-level language. The prime argument for the languages was that their higher-level computational model allowed far more function to be coded in the same amount of development time.

by Rodney Bates

Error Messages:
What’s the Problem?

Computer users spend a lot of time chasing down errors - following the trail of clues that starts with an error message and that sometimes leads to a solution and sometimes to frustration. Problems with error messages are particularly acute for system administrators (sysadmins) - those who configure, install, manage, and maintain the computational infrastructure of the modern world - as they spend a lot of effort to keep computers running amid errors and failures.

by Paul P. Maglio, Eser Kandogan

Oops! Coping with Human Error in IT Systems:
Errors Happen. How to Deal.

Human operator error is one of the most insidious sources of failure and data loss in today’s IT environments. In early 2001, Microsoft suffered a nearly 24-hour outage in its Web properties as a result of a human error made while configuring a name resolution system. Later that year, an hour of trading on the Nasdaq stock exchange was disrupted because of a technicians mistake while testing a development system. More recently, human error has been blamed for outages in instant messaging networks, for security and privacy breaches, and for banking system failures.

by Aaron B. Brown

Kode Vicious Strikes Again:
Kall us krazy, but we’re making Kode Vicious a regular.

Dear Kode Vicious, I have this problem. I can never seem to find bits of code I know I wrote. This isn’t so much work code--that’s on our source server--but you know, those bits of test code I wrote last month, I can never find them. How do you deal with this?

by George Neville-Neil

Automating Software Failure Reporting:
We can only fix those bugs we know about.

There are many ways to measure quality before and after software is released. For commercial and internal-use-only products, the most important measurement is the user’s perception of product quality. Unfortunately, perception is difficult to measure, so companies attempt to quantify it through customer satisfaction surveys and failure/behavioral data collected from its customer base. This article focuses on the problems of capturing failure data from customer sites. To explore the pertinent issues I rely on experience gained from collecting failure data from Windows XP systems, but the problems you are likely to face when developing internal (noncommercial) software should not be dissimilar.

by Brendan Murphy