The Most Expensive One-byte Mistake
Did Ken, Dennis, and Brian choose wrong with NUL-terminated text strings?
Poul-Henning Kamp
IT both drives and implements the modern Western-style economy. Thus, we regularly see headlines about staggeringly large amounts of money connected with IT mistakes. Which IT or CS decision has resulted in the most expensive mistake?
Not long ago, a fair number of pundits were doing a lot of hand waving about the financial implications of Sony's troubles with its PlayStation Network, but an event like that does not count here. In my school days, I talked with an inspector from The Guinness Book of World Records who explained that for something to be "a true record," it could not be a mere accident; there had to be direct causation starting with human intent (i.e., we stuffed 26 high school students into our music teacher's Volkswagen Beetle and closed the doors).
Sony (probably) did not intend to see how big a mess it could make with the least attention to security, so this and other such examples of false economy will not qualify. Another candidate could be IBM's choice of Bill Gates over Gary Kildall to supply the operating system for its personal computer. The damage from this decision is still accumulating at breakneck speed, with StuxNet and the OOXML perversion of the ISO standardization process being exemplary bookends for how far and wide the damage spreads. But that was not really an IT or CS decision. It was a business decision that, as far as history has been able to uncover, centered on Kildall's decision not to accept IBM's nondisclosure demands.
A better example would be the decision for MS-DOS to invent its own directory/filename separator, using the backslash (\) rather than the forward slash (/) that Unix used or the period that DEC used in its operating systems. Apart from the actual damage being relatively modest, however, this does not qualify as a good example either, because it was not a real decision selecting a true preference. IBM had decided to use the slash for command flags, eliminating Unix as a precedent, and the period was used between filename and filename extension, making it impossible to follow DEC's example.
Space exploration history offers a pool of well-publicized and expensive mistakes, but interestingly, I didn't find any valid candidates there. Fortran syntax errors and space shuttle computer synchronization mistakes do not qualify for lack of intent. Running one part of a project in imperial units and the other in metric is a "random act of management" that has nothing to do with CS or IT.
The best candidate I have been able to come up with is the C/Unix/Posix use of NUL-terminated text strings. The choice was really simple: Should the C language represent strings as an address + length tuple or just as the address with a magic character (NUL) marking the end? This is a decision that the dynamic trio of Ken Thompson, Dennis Ritchie, and Brian Kernighan must have made one day in the early 1970s, and they had full freedom to choose either way. I have not found any record of the decision, which I admit is a weak point in its candidacy: I do not have proof that it was a conscious decision.
As far as I can determine from my research, however, the address + length format was preferred by the majority of programming languages at the time, whereas the address + magic_marker format was used mostly in assembly programs. As the C language was a development from assembly to a portable high-level language, I have a hard time believing that Ken, Dennis, and Brian gave it no thought at all.
Using an address + length format would cost one more byte of overhead than an address + magic_marker format, and their PDP computer had limited core memory. In other words, this could have been a perfectly typical and rational IT or CS decision, like the many similar decisions we all make every day; but this one had quite atypical economic consequences.
Hardware Development Costs
Initially, Unix had little impact on hardware and instruction set design. The CPUs that offered string manipulation instructions—for example, Z-80 and DEC VAX—did so in terms of the far more widespread adr+len model. Once Unix and C gained traction, however, the terminated string appeared on the radar as an optimization target, and CPU designers started to add instructions to deal with them. One example is the Logical String Assist instructions IBM added to the ES/9000 520-based processors in 1992.1
Adding instructions to a CPU is not cheap, and it happens only when there are tangible and quantifiable monetary reasons to do so.
Performance Costs
IBM added instructions to operate on NUL-terminated strings because its customers spent expensive CPU cycles handling such strings. That bit of information, however, does not tell us if fewer CPU cycles would have been required if a ptr+len format had been used.
Thinking a bit about virtual memory systems settles that question for us. Optimizing the movement of a known-length string of bytes can take advantage of the full width of memory buses and cache lines, without ever touching a memory location that is not part of the source or destination string.
One example is FreeBSD's libc, where the bcopy(3)/memcpy(3) implementation will move as much data as possible in chunks of "unsigned long," typically 32 or 64 bits, and then "mop up any trailing bytes" as the comment describes it, with byte-wide operations.2
If the source string is NUL terminated, however, attempting to access it in units larger than bytes risks attempting to read characters after the NUL. If the NUL character is the last byte of a VM (virtual memory) page and the next VM page is not defined, this would cause the process to die from an unwarranted "page not present" fault.
Of course, it is possible to write code to detect that corner case before engaging the optimized code path, but this adds a relatively high fixed cost to all string moves just to catch this unlikely corner case—not a profitable tradeoff by any means.
If we have out-of-band knowledge of the strings, things are different.
Compiler Development Cost
One thing a compiler often knows about a string is its length, particularly if it is a constant string. This allows the compiler to emit a call to the faster memcpy(3) even though the programmer used strcpy(3) in the source code.
Deeper code inspection by the compiler allows more advanced optimizations, some of them very clever, but only if somebody has written the code for the compiler to do it. The development of compiler optimizations has historically been neither easy nor cheap, but obviously Apple is hoping this will change with LLVM (Low-level Virtual Machine), where optimizers seem to come en gros.
The downside of heavy-duty compiler optimization—in particular, optimizations that take holistic views of the source code and rearrange it in large-scale operations—is that the programmer has to be really careful that the source code specifies his or her complete intention precisely. A programmer who worked with the compilers on the Convex C3800 series supercomputers related his experience as "having to program as if the compiler was my ex-wife's lawyer."
Security Costs
Even if your compiler does not have hostile intent, source code should be written to hold up to attack, and the NUL-terminated string has a dismal record in this respect. Utter security disasters such as gets(3), which "assume the buffer will be large enough," are a problem "we have relatively under control."3
Getting it under control, however, takes additions to compilers that would complain if the gets(3) function were called. Despite 15 years of attention, over- and under-running string buffers is still a preferred attack vector for criminals, and far too often it pays off.
Mitigation of these risks has been added at all levels. Long-missed no-execute bits have been added to CPUs' memory management hardware; operating systems and compilers have added address-space randomization, often at high costs in performance; and static and dynamic analyses of programs have soaked up countless hours, trying to find out if the byzantine diagnostics were real bugs or clever programming.
Yet, absolutely nobody would be surprised if Sony's troubles were revealed to start with a buffer overflow or false NUL-termination assumption.
Slashdot Sensation Prevention Section
We learn from our mistakes, so let me say for the record, before somebody comes up with a catchy but totally misleading Internet headline for this article, that there is absolutely no way Ken, Dennis, and Brian could have foreseen the full consequences of their choice some 30 years ago, and they disclaimed all warranties back then. For all I know, it took at least 15 years before anybody realized why this subtle decision was a bad idea, and few, if any, of my own IT decisions have stood up that long.
In other words, Ken, Dennis, and Brian did the right thing.
But That Doesn't Solve the Problem
To a lot of people, C is a dead language, and ${lang} is the language of the future, for ever-changing transient values of ${lang}. The reality of the situation is that all other languages today directly or indirectly sit on top of the Posix API and the NUL-terminated string of C.
When your Java, Python, Ruby, or Haskell program opens a file, its runtime environment passes the filename as a NUL-terminated string to open(3), and when it resolves queue.acm.org to an IP number, it passes the host name as a NUL-terminated string to getaddrinfo(3). As long as you keep doing that, you retain all the advantages when running your programs on a PDP/11, and all of the disadvantages if you run them on anything else.
I could write a straw man API proposal here, suggest representations, operations, and error-handling strategies, and I am quite certain that it would be a perfectly good waste of a nice afternoon. Experience shows that such proposals go nowhere because backwards compatibility with the PDP/11 and the finite number of programs written are much more important than the ability to write the potentially infinite number of programs in the future in an efficient and secure way.
Thus, the costs of the Ken, Dennis, and Brian decision will keep accumulating, like the dust that over the centuries has almost buried the monuments of ancient Rome.
References
1. Computer Business Review. 1992. Partitioning and Escon enhancements for top-end ES/9000s; http://www.cbronline.com/news/ibm_announcements_71.
2. ViewVC. 2007. Contents of /head/lib/libc/string/bcopy.c; http://svnweb.freebsd.org/base/head/lib/libc/string/bcopy.c?view=markup.
3. Wikipedia. 2011. Lifeboat sketch; http://en.wikipedia.org/wiki/Lifeboat_sketch.
LOVE IT, HATE IT? LET US KNOW
Poul-Henning Kamp (phk@FreeBSD.org) has programmed computers for 26 years and is the inspiration behind bikeshed.org. His software has been widely adopted as "under the hood" building blocks in both open source and commercial products. His most recent project is the Varnish HTTP accelerator, which is used to speed up large Web sites such as Facebook.
© 2011 ACM 1542-7730/11/0700 $10.00
![]()
Originally published in Queue vol. 9, no. 7—
see this item in the ACM Digital Library
POUL-HENNING KAMP (phk@FreeBSD.org) is one of the primary developers of the FreeBSD operating system, which he has worked on from the very beginning. He is widely unknown for his MD5-based password scrambler, which protects the passwords on Cisco routers, Juniper routers, and Linux and BSD systems. Some people have noticed that he wrote a memory allocator, a device file system, and a disk encryption method that is actually usable. Kamp lives in Denmark with his wife, his son, his daughter, about a dozen FreeBSD computers, and one of the world's most precise NTP (Network Time Protocol) clocks. He makes a living as an independent contractor doing all sorts of stuff with computers and networks.
For additional information see the ACM Digital Library Author Page for: Poul-Henning Kamp


mramirez | Tue, 02 Aug 2011 17:32:50 UTC
Jan Schaumann | Tue, 02 Aug 2011 18:00:43 UTC
ForkQueue | Tue, 02 Aug 2011 18:06:05 UTC
Rick Hightower | Tue, 02 Aug 2011 18:27:52 UTC
Lucio Maciel | Tue, 02 Aug 2011 18:31:35 UTC
Oren | Tue, 02 Aug 2011 18:47:30 UTC
maht | Tue, 02 Aug 2011 18:53:58 UTC
Ethan Grammatikidis | Tue, 02 Aug 2011 19:45:28 UTC
gets() should have been dropped from C libraries a long time ago, and many other library changes could have been made to make C (and Unix) much nicer to work with, but the computing industry has a bizarre obsession with retaining old mistakes instead of investing the relatively small amount of work needed to correct old code. Speaking of making C nicer, Plan 9 includes a library for operating on a String type. IIRC it's a struct containing length and data. Most of the system uses this String type. I guess I may be right in saying "using C strings for anything non-trivial is stupid," but this illustrates the ridiculous resistance to useful change in the computing world. Where is any len+data string format and the libraries to handle it in POSIX?! The notion of all ${lang} depending on C is a difficult one. It's technically untrue: versions of Lisp exist which have no dependence on C while Go definitely and deliberately does not depend on C. Unfortunately the manner in which operating system "progress" is implemented these days makes this hard. The hardware costs section here strikes me as a little strange. I'm sure the special instructions for null-terminated strings are cheaper to implement in hardware than those for len+data which requires somewhere to store the length. The performance costs section pivots on the notion that C's string type is the One True String Type which it is not. It's merely a tool, and I refer back to Plan 9's String struct.Me | Tue, 02 Aug 2011 21:32:06 UTC
It could be done. Please don't forget in your implementation the requirement to have the library work with mmapped files. The length should not be 1 byte, but a number that can fully represent the address space on the computer. struct str { size_t sz; char *c; }; For allocation, would you create a macro for creating a const struct str? How would you specify such, and how would you find end of string? Would you require manually specify # of characters during allocation, or would you change the compiler to do this for you? How would this work if you had an existing string that you want to hack into and reference (point) a new string to a piece of an old string? Would you instead force a string copy? You may wish to put some thought into dealing with strings that have # Characters > the address space of the CPU.534-252-3526 | Wed, 03 Aug 2011 00:00:12 UTC
Krz | Wed, 03 Aug 2011 00:56:55 UTC
Mike Zorn | Wed, 03 Aug 2011 01:17:42 UTC
Federico Mendez | Wed, 03 Aug 2011 01:41:48 UTC
Tony Ledford | Wed, 03 Aug 2011 03:01:24 UTC
John A. Wills | Wed, 03 Aug 2011 03:21:34 UTC
Rudi Cilibrasi | Wed, 03 Aug 2011 05:42:25 UTC
Bruce | Wed, 03 Aug 2011 05:44:57 UTC
Ice | Wed, 03 Aug 2011 05:49:22 UTC
Teunis | Wed, 03 Aug 2011 06:04:33 UTC
PCM2 | Wed, 03 Aug 2011 06:08:07 UTC
fisted | Wed, 03 Aug 2011 06:13:02 UTC
! | Wed, 03 Aug 2011 06:20:17 UTC
! | Wed, 03 Aug 2011 06:21:00 UTC
F | Wed, 03 Aug 2011 06:37:51 UTC
Arief M Utama | Wed, 03 Aug 2011 06:43:50 UTC
F | Wed, 03 Aug 2011 06:48:52 UTC
Poul-Henning Kamp | Wed, 03 Aug 2011 07:08:31 UTC
TRY | Wed, 03 Aug 2011 09:51:57 UTC
Working on a high-reliability project, I created a (another?) suite of extremely well-tested string routines using a C structure like this: ============ typedef struct { size_t CurLen; // Current lenght of the string size_t MaxLen; // Maximum length of the string (i.e. 1 less than size of allocated data) unsigned char *Data; // Pointer to the allocated data } tString; ============ Since this was for a specific project based on a common platform, cross-platform issues (i.e. different sizes of size_t) were not present. There was a compatibility call: unsigned char * GetString(tString String) that would NUL-terminate the string and return a pointer to it. This routine assumed that no NULs were present in the string -- or if they were, that they were supposed to be there.Cellar | Wed, 03 Aug 2011 10:25:12 UTC
I see parallels between phk and djb. Both like their own universes quite a lot and only barely tolerate strangers in it. This bit of tech-opinion would be more interesting if it wasn't so easy to sidestep the problems caused by this design decision. That is, it's not all that difficult to write your own code such that its basic string handling goes by length+pointer or even two pointers (the C++ stl brings us the equivalent two iterators paradigm). My string handling classes often have a fn(const char *s) { fn(s,s?strlen(s):0); } member for completeness that then usually never gets used. That is, backward compatability is a marginal cost to moving forward coming from within. Of course, the problems start with interfacing old code, but the first order of the day is cleaning (by definition untrusted) input and once that's done conversion can usually be isolated to a relatively thin layer of glue. [p] That doesn't change that there is a lot of rotten code out there, some of which insists on deploying existing facilities for maximum stupidity. Like code that extensively mixes the str* functions with repeated uses of tricks like sprint(ENDOF(buf),"...",...); or even sprintf(buf,"%s....",buf,...); where that same sprintf returns the size of what it just did, allowing to forego yet another useless strlen call hidden in a macro or in sprintf. That return value was, of course, blithely ignored. Revamping this sort of thing takes time and is boring even if it isn't particularly hard to do, so it doesn't happen, and the bad code endures. [p] You could argue that the str* series of functions are poorly thought-out even apart from their use of a NUL terminator, and I'd agree. But this is old history. I'd say there are other things that are less defensible, such as linus sticking to a monolithic design and the resulting failure to defend against possible exploitation vectors that way, though even that pales in comparison to the complete absence of internal bulkheads in even more widely used desktop operating software. That is something that "the user" cannot change. But nobody forces you to use str* functions and in fact you can get by pretty well without using them and not switching languages. Though you can do that too. [p] In fact, plenty of modern "enterprise" software now-a-days is written in a different language, one that doesn't even tell you how it stores its strings internally. That same language managed to bet big on threading and then provided a rather poor implementation. Care to calculate the senseless economic overhead of that? [p] It's typical phk-form (and djb-form, for that matter) to point to someone else and go "neener, neener, you done bad!" regardless of the difference in state of the art between now and twoscore years back. This particular thing wasn't such a bad decision back then. You might as well try and calculate the economic cost of getting the electron flow wrong and the resulting enduring mislabeling.Jan | Wed, 03 Aug 2011 10:31:35 UTC
Ehud Gavron | Wed, 03 Aug 2011 11:54:10 UTC
Marmie | Wed, 03 Aug 2011 13:23:43 UTC
Meh | Wed, 03 Aug 2011 13:32:53 UTC
Christian Sciberras | Wed, 03 Aug 2011 13:44:17 UTC
BradyS14 | Wed, 03 Aug 2011 13:48:51 UTC
TedvG | Wed, 03 Aug 2011 13:56:54 UTC
Sum Guy | Wed, 03 Aug 2011 14:32:36 UTC
Geezer | Wed, 03 Aug 2011 14:42:34 UTC
Ben Craig | Wed, 03 Aug 2011 14:51:39 UTC
brian | Wed, 03 Aug 2011 15:04:41 UTC
Alan Evans | Wed, 03 Aug 2011 15:10:52 UTC
Rob Sherratt | Wed, 03 Aug 2011 15:18:52 UTC
John Blas | Wed, 03 Aug 2011 15:44:42 UTC
Ken Hansen | Wed, 03 Aug 2011 15:59:19 UTC
Richard | Wed, 03 Aug 2011 16:48:11 UTC
Helge Hafting | Wed, 03 Aug 2011 17:23:26 UTC
emile | Wed, 03 Aug 2011 17:25:59 UTC
Brian | Wed, 03 Aug 2011 17:39:08 UTC
Peter Richards | Wed, 03 Aug 2011 17:45:58 UTC
NM | Wed, 03 Aug 2011 18:07:12 UTC
Dan Sutton | Wed, 03 Aug 2011 18:18:59 UTC
austin | Wed, 03 Aug 2011 19:10:13 UTC
David | Wed, 03 Aug 2011 19:11:26 UTC
John Blas | Wed, 03 Aug 2011 19:56:13 UTC
George | Wed, 03 Aug 2011 20:12:02 UTC
Bill | Wed, 03 Aug 2011 22:49:20 UTC
J Williams | Wed, 03 Aug 2011 22:53:29 UTC
Thierry Leroux-Demers | Thu, 04 Aug 2011 01:28:06 UTC
K | Thu, 04 Aug 2011 04:22:34 UTC
Poul-Henning Kamp | Thu, 04 Aug 2011 09:10:20 UTC
Joseph | Thu, 04 Aug 2011 18:47:52 UTC
Adam Diffenderfer | Thu, 04 Aug 2011 18:51:20 UTC
Erik Hoel | Thu, 04 Aug 2011 19:09:18 UTC
Bill Tetzlaff | Thu, 04 Aug 2011 19:12:05 UTC
Bob Wallace | Thu, 04 Aug 2011 19:31:03 UTC
Note re DEC 12, 18, & 36bit CPUs (PDP-1 through DecSystem20xx) and sixbit characters: The choice of bits/word reflects design choices for memory, addressing, instruction OPcode lengths, desired numerical precision, OS intent (i.e., RT, embedded, single user, multitasking/multiuser,...), and probably the availability of a 6bit characterset (itself a superset of 5bit teletype character codes inluding shift, etc...). That the word lengths are multiples of 6bits is applied economy. I first programmed usable code (music synth using a DEC_142 AD/DA in the "9") in machine/assembly using papertape on a DEC PDP-9 with 4K ("four-Kbytes" !!) of 18bit/word core ram. With about 33 instructions, it was a "RISC-architecure" machine (plus ca change...). BTW, the PDP-2 was 24 bit that did not get out of the lab. The PDP-9 was a microprogrammed PDP-7. Unix (aka unics) was born on a PDP-7. Regarding strings discussion and modern systems and the specter of under/overflow exploits in string operations, I still program assuming the need to counter these events if only to keep a program robust in an uncertain I/O environment (read "users").John Montague | Thu, 04 Aug 2011 19:43:28 UTC
Jefferson Braswell | Thu, 04 Aug 2011 20:40:14 UTC
Jefferson Braswell | Thu, 04 Aug 2011 20:43:05 UTC
David E. Siegel | Thu, 04 Aug 2011 20:58:45 UTC
nm | Thu, 04 Aug 2011 21:44:37 UTC
Nick Nussbaum | Fri, 05 Aug 2011 00:15:26 UTC
i | Fri, 05 Aug 2011 01:38:42 UTC
Ken | Fri, 05 Aug 2011 02:59:54 UTC
Ken | Fri, 05 Aug 2011 03:03:59 UTC
pwedd | Fri, 05 Aug 2011 04:35:22 UTC
Kaz Kylheku | Fri, 05 Aug 2011 04:52:43 UTC
Andy Robb | Fri, 05 Aug 2011 09:15:41 UTC
ben | Fri, 05 Aug 2011 11:44:11 UTC
indiferent | Fri, 05 Aug 2011 13:36:41 UTC
RalfG | Fri, 05 Aug 2011 16:09:06 UTC
Some body | Fri, 05 Aug 2011 17:12:14 UTC
Clark Coleman | Fri, 05 Aug 2011 18:10:47 UTC
lupestro | Fri, 05 Aug 2011 21:07:58 UTC
i | Fri, 05 Aug 2011 22:04:59 UTC
Bram van Kampen | Sat, 06 Aug 2011 01:14:10 UTC
hwertz | Sat, 06 Aug 2011 04:59:20 UTC
Diomidis Spinellis | Sat, 06 Aug 2011 08:52:09 UTC
Math | Sat, 06 Aug 2011 13:32:35 UTC
I have a problem with the title and suggestive nature of your text. Oh, yes, 'Ken, Dennis, and Brian did the right thing.' , .... but ... 'the costs of the Ken, Dennis, and Brian decision will keep accumulating.'. This looks like a cheap trick to draw attention. Please let us argue about systems and not about blaming individuals who did a really great job in te past. Was it a mistake? Or just a decision with advantages and disadvantages? There is no such thing as correctness of software in an absolute sense. We can only say that it satisfies a specification or required use. Was it their mistake? C/Unix is a very open community. The original designers did not force the world to use their work without modification. It's popularity was rather spontaneous. Universities used it for teaching and research. So if it was a mistake, then it was collective one. Technically. The C language does not have a string datatype. It only has string litterals. A good C programmer always thinks of character arrays. Apart from this, I do not want to go into technical details and repeat what is mentioned in other reactions or related discussions. But there is a LOT more that needs to be said, too much for the space available here. Computer science is evolving and we learn from the past. But let us make improvements based on formal arguments and not just change because we believe that those old guys where stupid and we can do better.Dr KR Chowdhary | Sat, 06 Aug 2011 14:29:48 UTC
JerryW | Sat, 06 Aug 2011 15:33:19 UTC
Jost Riedel | Sat, 06 Aug 2011 16:08:30 UTC
The decision appears to have been a conscious one. In "Software Tools in Pascal", Krenighan and Plauger write about string implementation: "Two acceptable possibilities spring to mind. One is to make a string a record, with a length and an array for the actual characters: type string = record length : 0..MAXSTR; chars : array [1..MAXSTR] of character; end; The ith character of such a string s is s.chars[i], and the length is just s.length. The other organization is to mark the end of the array by some special value, one that is not a valid character. In that case, the ith character is just s[i], but the length has to be computed. Each of these organizations has good and bad points (wich you should think about for yourself); there is no "right answer". We have decided to use the mark-at-the-end representation."Paul Kimpel | Sat, 06 Aug 2011 18:54:16 UTC
Edward Syrett | Sat, 06 Aug 2011 21:59:37 UTC
Philip Machanick | Mon, 08 Aug 2011 13:19:44 UTC
Poul-Henning Kamp | Mon, 08 Aug 2011 13:24:00 UTC
Alex Measday | Mon, 08 Aug 2011 14:31:11 UTC
Sinan Awad | Mon, 15 Aug 2011 14:27:01 UTC
sysopdr | Tue, 16 Aug 2011 14:19:41 UTC
Poul-Henning Kamp | Tue, 16 Aug 2011 21:14:59 UTC
RLM | Fri, 19 Aug 2011 18:58:00 UTC
PS | Fri, 04 Nov 2011 01:20:58 UTC
Tim Rowe | Thu, 29 Dec 2011 15:21:02 UTC
Jonathan Dickinson | Fri, 30 Dec 2011 08:45:19 UTC
Along the lines of 'fixing it' Microsoft did pretty damn well with the CLR - their string is essentially this: typedef struct { size_t Len; // Length of Data without the null terminator. char* Data; // Null terminated. } String; The clever thing here is when they need to interact with lower-level systems (aka P/Invoke) they can pass the data pointer through unaltered because it is actually null terminated - but systems interacting with that string type never see the null terminator.Bruno Vo_qui | Fri, 30 Dec 2011 15:18:28 UTC
Dana | Tue, 03 Jan 2012 19:24:48 UTC
theraot | Sat, 28 Jan 2012 13:35:28 UTC