The Bike Shed

Development

  Download PDF version of this article

B.Y.O.C. (1,342 Times and Counting)

Why can't we all use standard libraries for commonly needed algorithms?

Poul-Henning Kamp


Although seldom articulated clearly, or even at all, one of the bedrock ideas of good software engineering is reuse of code libraries holding easily accessible implementations of common algorithms and facilities. The reason for this reticence is probably because there is no way to state it succinctly, without sounding like a cheap parody of Occam's razor: Frustra fit per plura quod potest fieri per pauciora (it is pointless to do with several where few will suffice).

Obviously, choice of programming language means that "few" will never be "a single one," and until somebody releases a competent implementation under an open source license, we may have several more versions floating around than are strictly necessary, for legal rather than technological reasons.

It also never hurts to have a few competing implementations to inspire improvement; in fact, there seems to be a distinct lack of improvement where a single implementation becomes too "golden."

So much for theory. Does any of this hold in practice?

One of the nice side effects of the "software tools" concept is that programs are data, too. We can apply data-mining methods to program source code, allowing us to investigate such questions.

Cryptography algorithms provide a good example, because they are easier to identify than other algorithms. Magic numbers in crypto algorithms make for good oracular answers to their presence: you are not likely to encounter both 0xc76c51a3 and 0xd192e819 anywhere other than an implementation of SHA-2. Creating an oracle to detect sorting algorithms in source code with (p>0.9) would be a good student project (i.e., likely impossible).

For data mining FOSS (free and open source software) programs, the FreeBSD operating system ships with a handy facility called the Ports Collection, containing strategic metadata for 22,003 pieces of FOSS. A small number of these "ports" are successive versions of the same software (Perl 5.8, Perl 5.10, etc.), but the vast majority are independent pieces of software, ranging from trivialities such as XLogo to monsters such as Firefox and OpenOffice.

A simple command downloads and extracts the source code to as many ports as possible into an easily navigated directory tree:

cd /usr/ports ; make -k extract

You will obviously need both sufficient disk space and patience.a

The results are worse than I suspected, as shown in table 1.b I expect that these numbers will trisect my readers into three somewhat flippantly labeled segments: "huh?," "sigh," and "aghast."

The "huh?" segment wonders what the big deal is: the absence of a standardized system library with these functions means that you have to "bring your own crypto" if you want some.

The "sigh" segment thinks this is the least of our troubles.

The "aghast" segment will see this as a total failure of good software engineering practices, a call to arms for better education, and reason for a stake through the heart of the Open Zombie Group.

And they are all correct, of course, each from its own vantage point.

Fortunately, what this is not, is The Next Big Security Issue, even though I would not be surprised if one or more "security researchers" would claim so from their parents' basement.c If these had been independent implementations, then there would be reason to worry about the security implications, but they are not.

In a few cases, optimized or license-sanitized versions have been written, but overwhelmingly this is just pointless copy-and-paste of identical source code in blatant disregard of Occam's three-quarters-millennia-old advice.

I am a card-carrying member of the "aghast" segment. My membership card is a FreeBSD commit message:

My libmd, which is as unencumbered by copyright issues as it can be, later grew more cryptographic hash algorithms, such as RIPEMD-160 and the SHA family, and it has been adopted by some other operating systems.

I am also in the "sigh" segment, because not all mainstream operating systems have adopted libmd, despite having 16 years to do so, and if they have, they do not agree what should be in it. For example, Solaris seems to leave MD2 out (see http://hub.opensolaris.org/bin/view/Project+crypto/libmd), which begs the question: Which part of "software portability" don't they understand?

I am, sadly, also in the "huh?" segment, because there seems to be no hope. The rational thing to expect would be that somebody from The Open Group reads this article, reproduces my statistics, and decides that yes, there is indeed demand for a "libstdcrypto" filled with the usual bunch of crypto algorithms. That, I am told, is impossible. The Open Group does not write new standards; they just bicker over the usability of ${.CURDIR} in make(1) and probably also the potential market demand for fire that can be applied nasally.

The other possible avenue of hope, is that the ISO-C standardization group would address this embarrassing situation. Before getting your hopes too high, bear in mind that they have still not managed to provide for specification of integer endianness, even though CPUs can do it and hardware and protocols have needed it since the days of the ARPANET.

If the ISO-C crew decided to do it, their process for doing so would undoubtedly consume 5-10 years before a document came out at the other end, by which time SHA-3 would likely be ready, rendering the standard instantly obsolete.

But it is all a pipe dream, if ISO is still allergic to standards with ITAR restrictions. And you can forget everything about a benevolent dictator laying down the wise word as law: Linus doesn't do userland.

To be honest, what I have identified here is probably the absolutely worst-case example.

First, if you need SHA-2, you need SHA-2, and it has to do the right and correct thing for SHA-2. There is little or no room for creativity or improvements, apart from performance.

Second, crypto algorithms are everywhere these days. Practically all communication methods, from good old e-mail over VPNs (virtual private networks) and torrent sites to VoIP (voice over IP), offers strong crypto.

But aren't those exactly the same two reasons why we should not be in this mess to begin with?
Q

NOTES

a. Using cd /usr/ports ; make -k -j 10 extract will do 10 pieces of software in parallel, but will be a bandwidth hog.

b. Sorry, I forgot to include the DES algorithm in the search.

c. The fact that MD5 seems to be more in demandd than its quality warrants is a matter of choice of algorithm, not a matter of implementation of the algorithm chosen.

d. Yes, I may indeed be to blame for that myself, but that is a story for another day; search for "md5crypt" if you cannot wait.

LOVE IT, HATE IT? LET US KNOW

feedback@queue.acm.org

Poul-Henning Kamp (phk@FreeBSD.org) has programmed computers for 26 years and is the inspiration behind bikeshed.org. His software has been widely adopted as "under the hood" building blocks in both open source and commercial products. His most recent project is the Varnish HTTP accelerator, which is used to speed up large Web sites such as Facebook.

© 2011 ACM 1542-7730/11/0200 $10.00

acmqueue

Originally published in Queue vol. 9, no. 2
see this item in the ACM Digital Library


Tweet



Related:

Ivar Jacobson, Pan-Wei Ng, Ian Spence, Paul E. McMahon - Major-league SEMAT: Why Should an Executive Care?
Becoming better, faster, cheaper, and happier


Alex E. Bell - The Software Inferno
Dante's tale, as experienced by a software architect


Ivar Jacobson, Ian Spence, Pan-Wei Ng - Agile and SEMAT - Perfect Partners
Combining agile and SEMAT yields more advantages than either one alone


Jacob Loveless - Barbarians at the Gateways
High-frequency Trading and Exchange Technology



Comments

Displaying 10 most recent comments. Read the full list here

Andrew Stone | Thu, 24 Feb 2011 02:50:05 UTC

Would be interesting to try to look for bugs in implementations by comparing them

Poul-Henning Kamp | Thu, 24 Feb 2011 07:38:18 UTC

Andrew: You're welcome to do the experiment, but I doubt you will find any. I did a few comparisons and I only think I found 2 different implementations of SHA256. This really is the most incredible pointless copy&paste in history.

Siderite | Thu, 24 Feb 2011 09:10:52 UTC

You are assuming that an algorithm (at least) has the same specifications everywhere. I was surprised to see yesterday that in C# Math.Round(4.5) results in 4. Searching the documentation I found that by design the default behavior is based on "banker's rounding". Like the financial bubble wasn't evil enough, now they mess with rounding algorithms. Anyway, silly me for thinking the algorithm should have behaved like any other math.round, including the one in IE's implementation of Javascript.

Mateusz Kierepka | Thu, 24 Feb 2011 09:29:29 UTC

That's why we have .NET & mono ;)

mitza | Thu, 24 Feb 2011 11:23:58 UTC

Well I guess the next generation FOSS will build a better codebase using better data mining... I guess the way C/C++ uses libs is also iin the way so maybe in the future a mono-like approach will make this issue to diminish considerably.

Kellen | Thu, 24 Feb 2011 14:29:57 UTC

Which begs the question, what is it that you, Bruce, think "begs the question" means.

Parrotlover77 | Thu, 24 Feb 2011 15:19:49 UTC

@Siderite Nice slam on the financial market and Microsoft, but so-called "Bankers' Rounding" is actually quite common in many disciplines, including IT. From Wikipedia... A tie-breaking rule that is even less biased is round half to even, namely If the fraction of y is 0.5, then q is the even integer nearest to y. Thus, for example, +23.5 becomes +24, +22.5 becomes +22, 22.5 becomes 22, and 23.5 becomes 24. This method also treats positive and negative values symmetrically, and therefore is free of overall bias if the original numbers are positive or negative with equal probability. In addition, for most reasonable distributions of y values, the expected (average) value of the rounded numbers is essentially the same as that of the original numbers, even if the latter are all positive (or all negative). However, this rule will still introduce a positive bias for even numbers (including zero), and a negative bias for the odd ones. This variant of the round-to-nearest method is also called unbiased rounding (ambiguously, and a bit abusively), convergent rounding, statistician's rounding, Dutch rounding, Gaussian rounding, or bankers' rounding. This is widely used in bookkeeping. It is the default rounding mode used in IEEE 754 computing functions and operators (and in various computing languages such as ANSI/ISO C, C++, and Java, for their float and double types).

Greg | Thu, 24 Feb 2011 15:41:34 UTC

At the risk of having rotten tomatoes thrown at me... Why not submit the crypto library to Boost?

Stu Savory | Fri, 25 Feb 2011 05:33:17 UTC

Reimplementation of crypto algorithms may be due to paranoia. You write your own because you are unsure whether someone else's has a back door in it. Just sayin' ;-)

Poul-Henning Kamp | Fri, 25 Feb 2011 09:22:57 UTC

Stu: Nice theory, but they are not reimplementations, they are copy & paste.
Displaying 10 most recent comments. Read the full list here
Leave this field empty

Post a Comment:







© 2014 ACM, Inc. All Rights Reserved.