*Originally published in Queue vol. 8, no. 10*


Related:

Mihir Nanavati, Malte Schwarzkopf, Jake Wires, Andrew Warfield - **Non-volatile Storage**

Implications of the Datacenter's Shifting Center

Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau - **Crash Consistency**

Rethinking the Fundamental Abstractions of the File System

Adam H. Leventhal - **A File System All Its Own**

Flash memory has come a long way. Now it's time for software to catch up.

Michael Cornwell - **Anatomy of a Solid-state Drive**

While the ubiquitous SSD shares many features with the hard-disk drive, under the surface they are completely different.

(newest first)

The article is excellent, and I strongly agree that the correction reinforces the overall case. I would raise a related topic, hopefully leading to further studies: the industry states that a DVD should last 50 to 100 years, which is a fraction of Sun's claim for the ST5800. Keeping the data safe is a challenge in its own right. Some file formats, such as plain text files, are relatively safe from applications' natural evolution. However, there are vast amounts of information saved by audio and video recorders, word processors, spreadsheets, databases and so on. Most applications have limited backward compatibility. There is a real risk that future generations will inherit data nobody can decode, invalidating all efforts to keep the information safe.

Thank you for pointing out that my statistical mistakes are only slightly less than those of the manufacturers! Fortunately, the correction seems to reinforce my overall case. And I agree that the failure probability looks strangely close to "one in a million".

This is a great article and brings up many good points; however, the 'math' in the probabilities section is horribly wrong.

As a statistician I am always sad to see statements of this nature:

Sirius watched the entire production of SC5800s ($10^10 worth of storage systems) over their entire service life, the experiment would end 20 years from now after accumulating about 2×10^6 system-years of data. If its claim were correct, Sirius would have about a 17 percent chance of seeing a single data-loss event

The random variable here is the number of data-loss events among all of the systems in 10 years. First, let's see what the probability of failure of one machine in 10 years is. I am using the horribly chosen normal distribution (failure rates are usually modeled with a Poisson distribution).

P(Failure in 1 machine within 10 years) = P(Z(2.4e6,0.4e6) < 10) = 9.867e-10

This rate is ridiculously small, about 1 in 1 million. It seems like they just made it up from the phrase "one in a million"...

Now, if we take all 2e5 machines, each one is an independent Bernoulli trial.

So the number x of machines that fail is distributed Binomial(200000, 9.867e-10).

P(no machine fails) = P(0 failures) = nCr(200000, 0) * (9.867e-10)^0 * (1 - 9.867e-10)^200000 = 0.9998

P(at least one machine fails) = 1 - P(no machine fails) = 0.000197

This is much smaller than 17 percent, but then again that 17 percent assumed we could just add up all the machine years.
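The arithmetic in this comment can be reproduced with a short script. The sketch below is only a check of the numbers, not an endorsement of the model: it takes the normal distribution, its mean of 2.4e6 years, its standard deviation of 0.4e6 years, the 10-year horizon, and the 200,000 machines directly from the comment. The function name `normal_cdf` and the constant names are illustrative choices.

```python
import math

def normal_cdf(x, mean, sd):
    """P(X < x) for X ~ Normal(mean, sd), via the complementary error function."""
    z = (x - mean) / sd
    return 0.5 * math.erfc(-z / math.sqrt(2))

# Figures taken from the comment above.
MEAN_YEARS = 2.4e6     # mean of the assumed lifetime distribution
SD_YEARS = 0.4e6       # standard deviation used in the comment
HORIZON_YEARS = 10     # observation window per machine
N_MACHINES = 200_000   # independent Bernoulli trials

# Probability that a single machine loses data within the horizon.
p_one = normal_cdf(HORIZON_YEARS, MEAN_YEARS, SD_YEARS)

# Binomial(N, p) with zero failures is (1 - p)^N.
# log1p/expm1 avoid rounding error because p_one is tiny.
log_p_none = N_MACHINES * math.log1p(-p_one)
p_none = math.exp(log_p_none)
p_at_least_one = -math.expm1(log_p_none)

print(f"P(one machine fails within 10 years) = {p_one:.3e}")
print(f"P(no machine fails)                  = {p_none:.4f}")
print(f"P(at least one machine fails)        = {p_at_least_one:.3e}")
```

The script agrees with the 9.867e-10, 0.9998, and 0.000197 figures quoted in the comment. Note that a per-machine probability near 9.9e-10 is on the order of one in a billion.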

I missed an excellent paper from the Usenix HotStorage 2010 workshop, "Mean time to meaningless: MTTDL, Markov models, and storage system reliability" by Kevin Greenan, James Plank and Jay Wylie.

They agree with my point that MTTDL is a meaningless measure of storage reliability, and that bit half-life isn't a great improvement on it. They propose instead NOMDL (NOrmalized Magnitude of Data Loss), i.e. the expected number of bytes that the storage will lose in a specified interval divided by its usable capacity. As they point out, it is possible to compute this using Monte Carlo simulation based on distributions of component failures that experiments have shown to fit the real world. These simulations produce estimates that are relatively credible, especially compared to the ludicrous MTTDL estimates I pillory in the article.

NOMDL is a far better measure than MTTDL. Greenan, Plank and Wylie are to be congratulated for proposing it.
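To make the definition concrete, here is a minimal Monte Carlo sketch of a NOMDL-style estimate: the expected bytes lost over a fixed interval, divided by usable capacity. It is not the simulator from the paper. The system (a mirrored pair of disks), the Weibull lifetime parameters, the rebuild time, the sector-error probability, and every name in the code are hypothetical, chosen only to illustrate the shape of the calculation.

```python
import random

# All parameter values below are hypothetical, chosen only for illustration.
PARAMS = dict(
    mission_years=10.0,        # interval over which loss is measured
    capacity_bytes=10**12,     # usable capacity of the mirrored pair
    weibull_scale_years=20.0,  # disk lifetime distribution (scale)
    weibull_shape=1.2,         # disk lifetime distribution (shape)
    rebuild_years=1 / 365,     # time to replace and resync a failed disk
    p_sector_error=0.01,       # chance a rebuild hits an unreadable sector
    sector_bytes=4096,         # bytes lost per unreadable sector
)

def simulate_bytes_lost(rng, mission_years, capacity_bytes,
                        weibull_scale_years, weibull_shape,
                        rebuild_years, p_sector_error, sector_bytes):
    """Simulate one mirrored pair over one mission; return bytes lost."""
    def lifetime():
        return rng.weibullvariate(weibull_scale_years, weibull_shape)

    next_fail = [lifetime(), lifetime()]  # absolute failure time of each disk
    lost = 0
    while True:
        first = 0 if next_fail[0] <= next_fail[1] else 1
        other = 1 - first
        t_fail = next_fail[first]
        if t_fail >= mission_years:
            return min(lost, capacity_bytes)
        rebuild_done = t_fail + rebuild_years
        if next_fail[other] < min(rebuild_done, mission_years):
            return capacity_bytes  # both copies gone: everything is lost
        if rng.random() < p_sector_error:
            lost += sector_bytes   # one sector unreadable during the rebuild
        # The failed disk is replaced by a new one once the rebuild completes.
        next_fail[first] = rebuild_done + lifetime()

def estimate_nomdl(trials, seed=0, **params):
    """Mean bytes lost per mission, divided by usable capacity."""
    rng = random.Random(seed)
    total = sum(simulate_bytes_lost(rng, **params) for _ in range(trials))
    return total / trials / params["capacity_bytes"]

nomdl = estimate_nomdl(50_000, **PARAMS)
print(f"Estimated NOMDL over {PARAMS['mission_years']:g} years: {nomdl:.3e}")
```

Unlike an MTTDL figure, the result distinguishes between losing a single sector and losing the whole volume, because each simulated mission contributes the number of bytes it lost rather than a yes/no loss event. A credible estimate would replace the made-up parameters with failure distributions fitted to field data.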