The Bike Shed

What Went Wrong?

Why we need an IT accident investigation board

Poul-Henning Kamp

What was intended to be a brief hiatus for "The Bikeshed" while I built a house became a somewhat longer Absence Without Leave, because we do, indeed, live in interesting times, but I'm back now.

In April, a UK court cleared 39 postmasters and sub-postmasters of wrongdoing after they had been accused of and sentenced for various forms of fraud, in some cases serving multiyear prison sentences [bbc.com].

In total, around 700 people have been prosecuted based on the "evidence" from a single IT system installed by the UK Post Office, and while some of them probably did embezzle money, it looks like the majority of them did not. They were sentenced based on evidence from an IT system, which... ehhh... to be honest, we don't actually know what that IT system did, except we know it did it really, really badly.

Press reports have contained various mumblings and hand-waving about the shortcomings of the IT system, but nobody sat down and documented precisely what went wrong and what can be learned from it so that nobody ever makes a mistake like this again.

Had this been a ship sinking, a train derailing, or a plane crash, one of the UK's official accident investigation boards would have come in and written a report everybody would be allowed to read, explaining what went wrong and how to avoid it ever happening again. But because no ships, trains, or airplanes were involved, there will be no such report.

For well over a decade, I have been arguing that governments should create IT accident investigation boards for the exact same reasons they have done so for ships, railroads, planes, and in many cases, automobiles.

Denmark got its Railroad Accident Investigation Board because too many people were maimed and killed by steam trains, and it has kept the board around because a thousand tons of steel hurtling along at 180 km/h, just below a 25 kV power line, can do a lot more damage than a steam locomotive with wooden wagons ever could.

The UK's Air Accidents Investigation Branch was created for pretty much the same reasons, but, specifically, because when the airlines investigated themselves, nobody was any the wiser.

Does that sound slightly familiar in any way?

The crucial feature of any accident investigation board is that it focuses only on what went wrong and how to avoid it happening again, and not on whom to blame.

Sometimes the board may find out that somebody failed to do something crucial, did something illogical, or even did something stupid, but that information is published only if it is necessary to prevent the same type of accident from happening again.

As far as I have seen, the information is relayed in impersonal terms ("The pilot did...," "The clerk did not..."), because it is not important who that person was; what is important is that no other person repeats that mistake.

There are three kinds of incidents an IT accident investigation board should look into:

• When an IT system is involved in loss of life, limb, or liberty.

• When development of an IT system fails spectacularly.

• When an IT system leaks personal information.

The first point is a matter of consistency. Two Boeing 737 MAX airplanes crashed because of IT systems, and because those IT systems happened to be installed in airplanes, we get reports, whereas we get no reports about the UK Post Office's IT problem because its system was bolted into 19-inch racks.

That makes no sense: The human toll caused by both of these IT accidents is way beyond anything any civilized society can just let pass.

The second point is a matter of sound fiscal policy. Denmark, like all other countries, has an abysmal track record with development of governmental IT systems. Millions, and in some cases billions, in tax money pour into projects that almost invariably run late, over budget, fail to deliver, etc., etc., etc.

But nobody is being paid to—or given sufficient access to—write a technical report detailing the crucial mistakes and how to avoid and prevent them in future projects. If an IT accident investigation board were to write a report when such a project failed, and if the contracts for all future projects stipulated that recommendations from the board must be followed, then at least taxpayers would not have to pay to repeat the same mistakes.

The third point should barely need mentioning: Personal information is the helium of IT systems—it leaks out of every crack or imperfection faster than seems possible. This is obviously a subclass of "loss of liberty," but it is so dominating that it deserves its own category.

While pretty much everybody agrees that Something Has To Be Done™, nobody wants to give an official IT accident investigation board the authority to find out what that "something" should be. Software houses hem and haw about how their trade secrets and intellectual property will be violated. What they really mean to say is that they don't want anybody to stop their gravy train.

Individual developers fear that they will be made scapegoats, even though this is precisely not what accident investigation boards do. And politicians and management in private companies are nothing if not unified in their desire to avoid accountability for cutting corners and best-case management.

One particularly bogus argument is that it is not possible to write IT accident reports in the first place. I don't know where that idea comes from, but surely not from reading accident reports. For example:

In 2017 one of an airplane's engines exploded over the southern part of the Greenland icecap. Part of the engine landed on the ice while the plane continued to the first suitable airport way up north in Canada.

Nobody got hurt.

Two years later the accident investigation board located and dug up the missing parts a couple of meters under the surface of Greenland's ice.

If you think that sounds easy, I can highly recommend the 69-page report about how they did it.

A year later, the board issued the final report, revealing that a failure mode called "cold dwell/cold creep" had caused the fan hub to disintegrate. That came as a surprise to everybody, because nobody, not even a mad scientist in a secret lab, had ever imagined that as a failure mode for the Ti-6-4 titanium alloy [worldairlinenews.com].

So, yes, surely an IT accident investigation board would find it "impossible" to figure out what went wrong with the UK Post Office's IT system. Not!

Another bogus argument is that people would refuse to talk and would destroy and hide evidence. This vastly underestimates lawmakers: It is a crime to do that for all other accident investigation boards, and even small infractions lead to jail time. And no, it is not "self-incrimination" unless you did something criminal.

Finally, and most perplexing to me, people claim that an IT accident investigation board will cost too much money.

Compared to what?

Compared to destroying the lives of almost 700 people with bogus criminal records and years in jail, separated from their family and kids?

Compared to the 100 million euros Denmark spent on a new IT system for the police, a project that never delivered anything? That amount of money could easily have paid for the first 20 years of a Danish IT accident investigation board.

There really are no valid arguments against IT accident investigation boards, and all the bogus arguments proffered are the same ones that people put forth to counter all the other very successful accident investigation boards now in operation.

These boards work. We need one for IT, and we need it now.

Note: Shortly after this column was written, the United States announced in May the establishment of a new Cybersecurity Safety Review Board, similar to what is described above [whitehouse.gov].

"The Executive Order establishes a Cybersecurity Safety Review Board, co-chaired by government and private sector leads, that may convene following a significant cyber incident to analyze what happened and make concrete recommendations for improving cybersecurity. Too often organizations repeat the mistakes of the past and do not learn lessons from significant cyber incidents. When something goes wrong, the Administration and private sector need to ask the hard questions and make the necessary improvements. This board is modeled after the National Transportation Safety Board, which is used after airplane crashes and other incidents."

Poul-Henning Kamp spent more than a decade as one of the primary developers of the FreeBSD operating system before creating the Varnish HTTP Cache software, which around a fifth of all web traffic goes through at some point. He lives in his native Denmark, where he makes a living as an independent contractor, specializing in making computers do weird stuff. One of his most recent projects was a supercomputer cluster to stop the stars twinkling in the mirrors of ESO's new ELT telescope.

Copyright © 2021 held by owner/author. Publication rights licensed to ACM.

Originally published in Queue vol. 19, no. 3