Comments

(newest first)

Mike Smith | Wed, 29 Aug 2018 13:56:25 UTC

NUL terminated strings win bigly over using a preceding length when it comes to sharing strings between machine e.g. when they are written to files:-

1. No problems with having to know or figure out how long the length field is
2. No problems figuring out the endianness of the length field

If they had used a length, imagine the time wasted and confusion that would have happened extending the length field from 1 to 2 bytes, then 2 to 4 bytes, then 4 to 8 bytes as the CPUs advanced over the years.

So in my opinion, they had to make a compromised decision and I think they took the right one.

krater | Tue, 16 Jan 2018 19:29:53 UTC

I imagine a today where Ritchie & Co. decided to use address + length to represent strings. 

We have now 1-8 different types of strings, with 1, 2, 4, or 8 byte length field. All incompatible to each other. Additional we have a couple of overhead to convert string type 1 to string type 2. We have overhead to access 2 byte strings on a 32bit machine because 16bit access is slower than 32bit access. 
We have security issues because someone had invented a string type where it's possible to automaticly recognize the lenght of the size field and some programs would handle this strings wrong or some developer are using this strings wrong.

I don't see any difference at all...

cousteau | Fri, 12 Feb 2016 08:56:50 UTC

What about CRLF for newlines?

Certain OSes and many transfer protocols rely on the double byte CRLF as a line terminator. This, as far as I know, is a historical remainder from teleprinters that relied on two separate "instructions" to move the printing head to the beginning of the line and feed paper for 1 line. Needless to say, nowadays there is no point on having 2 separate characters for this as they have a single meaning together and no meaning separately.

The consequences of this might be expensive: A stream processing mechanism must make the conversions on the fly. There are 2 different incompatible ways to open a file. The "line terminator" cannot be determined by a simple character search, but instead requires a substring match (which is even more complicated to do in hardware). Also, the cases of unmatches CR or LF will usually need to be treated specially. Plus, on a file with an average of 50 characters per line, this implies a 2% extra storage size or transmission time. I don't know if this has great economical repercussions, but it's definitely not for free.

Another candidate is UTF-16 (which is a rather inconvenient encoding whose only point is backwards compatibility with UCS-2), and Unicode adapting its size to it, and not the other way around. Because of it it is required for any other Unicode implementation treats the range D800-DFFF specially, plus it resulted in limiting Unicode to a bit more than one million code points, while UTF-8 had no problems with going up to 2 billion; now those extra code points are required to be treated as invalid, complicating UTF-8 validity check.

Robert | Tue, 01 Sep 2015 16:22:42 UTC

Another thing to consider is the sequential access model used by "everything is a file" unix where typical programming idioms (such as those in the K&R book) typically revolve around reading 'until' some sentinal value. A file typically ends up terminated by EOF (represented as a 0 too), and you don't typically know where it's going to come unless you perform a length calculation first. Such a calculation is expensive on magnetic media - especially tapes. You could work around this again with your run-length byte at the start of a string but then this means that whomever is writing the data in the first place has to calculate the length either beforehand or rewind to the start of the string and put in the length after the write.

rh- | Sat, 24 Jan 2015 04:13:55 UTC

"Using an address + length format would cost one more byte of overhead than an address + magic_marker format"
Huh? There I stopped reading it. At the time there was no talking about strings longer than 256 chars, as not many toys had lots of memory space to waste, and the screens to print them was only 40 or 80 chars, and either adding a null terminator or adding a size byte (as in Pascal or ADA) would anyhow add a byte. Talking about longer strings (which would waste two bytes for the size) came much later, and that was not true for long: when the unicode strings came, the null terminator would also waste two bytes. Lack of security? Same idiot forgetting to add the terminator would also forget to set the proper right. Also, when playing with strings, it would be a pain in the ass to always compute the right lengths, the backslash-zero solution is more easy and elegant.

Terry Davis | Tue, 30 Sep 2014 02:12:49 UTC

I had a teacher who was convinced bigendian was correct.  Just goes to show how stupid people can be.  I like NULL terminated.  I am God's chosen.

Keith Thompson | Fri, 11 Apr 2014 19:34:11 UTC

> Should the C language represent strings as an address + length tuple or just as the address with a magic character (NUL) marking the end?

A C string isn't represented as address + marker. A C string is by definition "a contiguous sequence of characters terminated by and including the first null character". You can have a pointer to (equivalently, the address of) a string, but the string itself is not an address; it's a sequence of characters stored in an array. (And no, arrays are not pointers; see section 6 of the comp.lang.c FAQ, http://www.c-faq.com/, if you think they are.)

The point is that the proposed "address + length" representation would store information about the string in two different places.

An alternative might be Pascal-style counted strings; in C terms, you could store the length in the first (sizeof (size_t)) bytes of the char array.

Paul | Fri, 11 Apr 2014 16:18:09 UTC

How is the Therac-25 incident not the most expensive (also one-byte) bug?

theraot | Sat, 28 Jan 2012 13:35:28 UTC

Want to know Why?

This is the best explanation I have so far...

When B was designed there wasn't an standard about the size of a byte or the size of a word. The sizes went form 5 bits to 22 bits, and it became worst later on, for example the Cray-1 used 60 bits and DEC Alpha with 64 bits. And B was meant to be compilable to all the machines. We are talking about B here, so Brian Kernighan is innocent. As for Kenneth Lane Thompson and Dennis MacAlistair Ritchie... they decided that it was easier to have a null termination (a 0 of whatever number of bits it takes) than to have to manage two words, one for a pointer and one for the size.

Aside from that problem, which could have been solved by the compiler and virtual machine* anyway, they opted for immutable strings. Mutable strings seems a good idea for example if you want to get substring that is easy with pointer and size, but this means that you will have different pointers to the same area in memory, therefore you got to wait until you stoped using all the substring to free the memory of the main string, this means complications keeping track of the references, this is expensive when programming. Instead it is easier to make a copy the string, now the new string is stored independently and doesn't impose any limitation to release the memory of the former. This make the live of the programmer easier, so why not use inmmutable strings and make the live easier for the compiler too? It is known that using null terminated strings is less efficient compared to pointer and length strings because I need to copy the string each time I concatenate or substring, although developing the compiler was easier (because of the size of the byte problem I mentioned earlier).

* Remember that virtual machines were a new thing that appeared with BCPL, in which Ritchie participated (although BCPL didn't have strings per se), note that the development of B started in 1971 and was based on BCPL.

C inherited it from B and C++ inherited if from C, and Java, C#... inherited if from C++.

Today when we want to manipulate strings, unless we are doing a single operation, it is better to use a solution based on linked lists or arrays while we are crafting the string, and the retrieve the result as a inmutable string.

Now, it is possible to make inmmutable strings with pointer and size, copying the string anyway to keep the simplicity of the implementation to keep track of the references. For this situation to have the length doesn't represent any benefit, and as mentioner earlier it is harder because of the disparity of architecture of the compuers of the time. Still I know that it is much better to use pointer and length to display the string.

Today the situation is different, we all have bytes of 8 bits, virtual machines are common, we have power to use garbage collectors. Is it time to develop a new mechanism for strings? I don't know, just keep in mind that inmmutable strings are good for thread safety, and computers with multiple cores are common too.

Maybe I should mention that BASIC did use something similar to pointer and lengh, the first data on the destination of the pointer was the length of the string. There is a lost content from msdn (I saw it in the version for visual studio of '98) that explains that microsoft decided to change the implementation of strings for BASIC (I think when they started to call it visual basic) becuase it increased the performance of it by calling C++ libraries. [In fact, in the old msdn was a full discussion about the pros and cons of different ways to handle strings, including things like storing the pointer to start and the pointer to the end, having a pointer after a blocks of constant size of characters to have non contiguous strings and so on... sadly those articles seem to be lost forever].

Dennis Ritchie, rest in peace.

Dana | Tue, 03 Jan 2012 19:24:48 UTC

"...Using an address + length format would cost one more byte of overhead than an address + magic_marker format...

Not really true.   The null-terminated string also adds one byte at the end of the string - the null character.  Address + length also limits the size of the string to 256 characters unless more than one byte is used as the length designator.

Bruno Vo_qui | Fri, 30 Dec 2011 15:18:28 UTC

May I say, top post and comments are talking about 'in-band' string delimitors. But memory may be marked with 'out-band' properties, for instance ecc-read-write-execute and the likes. What about an 'isSameString' out-band?

Jonathan Dickinson | Fri, 30 Dec 2011 08:45:19 UTC

Along the lines of 'fixing it' Microsoft did pretty damn well with the CLR - their string is essentially this:

typedef struct
{
  size_t Len; // Length of Data without the null terminator.
  char* Data; // Null terminated.
} String;

The clever thing here is when they need to interact with lower-level systems (aka P/Invoke) they can pass the data pointer through unaltered because it is actually null terminated - but systems interacting with that string type never see the null terminator.

Tim Rowe | Thu, 29 Dec 2011 15:21:02 UTC

Having been endlessly (!) frustrated by the 255-character string limit of early Pascal and related language, I would suggest that the terminating character approach has more going for it than you suggest.

PS | Fri, 04 Nov 2011 01:20:58 UTC

Mr, you really shouldn't have posted this. None of your arguments are real. The most major reason you've came up with is based on FreeBSD. Wait a minute! Something's fishy...
For a side note: ptr+len means messing up extension capabilities, partially eliminating the dynamic nature and a whole lot of those "expensive clock cycles" wasted on updating the length value during writes.

RLM | Fri, 19 Aug 2011 18:58:00 UTC

Many posters have touched on interoperability issues with the address + length format: endianness, word-length, field-alignment, definition of #bits / char to name a few. The NUL-terminated choice seems to have served Ken, Dennis, and Brian well in terms of simplicity. My suspicion is that, had they chosen the address + length format, we would now be reading about the costs generated by the mistake of them choosing THAT format 40+ years ago. It may also be the case, Mr. Kamp, that you perhaps have too much time on your hands, as do I for spending time on this rebuttal.

Poul-Henning Kamp | Tue, 16 Aug 2011 21:14:59 UTC

Interesting if slightly uninformed point of view :-)  One of the things you can recognize a craftsman on, is that he picks the most appropriate tool for the job, and therefore actually cares about the relative merits of various tools.  In particular, no sensible craftsman picks a tool that needlessly makes the job harder to complete.

sysopdr | Tue, 16 Aug 2011 14:19:41 UTC

OK blaming null terminated strings for all the windows security issues (and all the other stuff) would be like blaming the spoon for all of the overweight people in the US. (a co-worker proposed this.)
Me thinks the author is trying to compensate for something like being a person who has created buffer over-run problem in some software he has written. Do not blame the tools for the errors made by the craftsman.

Sinan Awad | Mon, 15 Aug 2011 14:27:01 UTC

Interesting article, very interesting observations.
I remember I enjoyed strings in pascal, but was annoyed by the fact they were only 256 bytes long :-)

Alex Measday | Mon, 08 Aug 2011 14:31:11 UTC

This is unrelated to the C-string problem.  DEC operating systems (at least VMS and possibly RSX, if I remember correctly) used periods between file names and extensions.  Directories were encased in square brackets, in which periods were used as directory separators.  Consequently, DEC's file naming convention was not impossible for IBM to follow (and DEC also used "/" for command-line arguments).  It was a simple matter to define a "cd" alias that would automatically insert the square brackets for you.  Related to the string problem, VMS descriptors (not used only for strings) had 4 fields: address, length, type, and class.

Poul-Henning Kamp | Mon, 08 Aug 2011 13:24:00 UTC

"For example, read a file into memory using a single file read, [...]"  Sorry, but that triggers my VM-ignorance detector into the yellow field.   In these days and times, you should not read files into memory, you should use the nice Virtual Memory you OS offers you, and mmap(2) the file into your address space.  Needless to say, you should not modify the file, unless that is your actual intention.   Using a ptr+len (or ptr+ptr) format allows that, NUL-terminated strings do not.

Philip Machanick | Mon, 08 Aug 2011 13:19:44 UTC

There's another benefit to nul terminated strings. You can do in-place string manipulations. For example, read a file into memory using a single file read, find all the newlines, replace them by nuls and create an array of pointers to the first char and subsequently all the other positions one past a nul, and you have an array of lines. Quick and easy, no data copying. Another one: process a string a substring at a time by replacing a char partway along by a nul. Also no large-scale copying. In some applications, that sort of consideration still matters. You may argue that creating an array of arrays or a sub-array out of a single glob of memory that wasn't created as such is inherently risky and you'd be right. But this is the sort of thing the language was designed to allow.

It would be nice if you could achieve that sort of efficiency with a language that was type-safe. Theoretically you could achieve much of this with slices. What you are really doing is renaming each substring up to and not including a specific position as a new string. I don't know if any language implements slices in a way comparable with C's ability to reinterpret data in this sort of way.

I'd rather discuss this sort of question than whether the designers of C made a mistake 40 years ago (though it's a nice starting point to stimulate discussion), otherwise we've hardly advanced, have we?

@George | Wed, 03 Aug 2011 20:12:02 UTC, you say:

> I would nominate the inclusion of HTML in SMTP message bodies as a far more expensive mistake.

That's nothing on the way some applications creatively turn text into 10x or 100x the basic information content. Try saving some very plain text in an MS Word document, then generate a PostScript file. But that's a different kind of cost: massive data bloat not a cost in program errors, which I think is the big issue here.

Edward Syrett | Sat, 06 Aug 2011 21:59:37 UTC

I've read many of the comments and am surprised that no one has mentioned the obvious analogy to the physical media used on the target computers (mainframes and minis).  Mainframe unit-record devices sent record-length info over separate control channels; the data was unrestricted (no magic values).  On paper tape, you had to reserved one combination of bits as an end-of-data signal since the tape was conceptually endless.  A less obvious observation:  null-terminated strings led to the development of RISC, since mainframe-style MVCL instructions had no lengths to work with and so were never emitted by C compilers.

Paul Kimpel | Sat, 06 Aug 2011 18:54:16 UTC

I certainly agree with Nick Nussbaum that the problem isn't with NUL termination of strings (or any other representation), but with pointers, stacks, and all the other aspects of memory addressing and management, without having appropriately fine-grained hardware protection. I take issue, however, with his assertion that the hardware for that type of protection wasn't available in the early '70s when C was being developed. As John Blas mentioned, check out the Burroughs B5000 architecture, which had descriptor-based addressing and bounds protection fully implemented and commercially available in 1962. That capability still lives on in the B5000's descendant Unisys MCP systems.

The broader underlying problem, though, was hinted at by Brian -- in most language implementations there is no useful type information available at run time, and most compilers see it as their job to squeeze all of that stuff out of the code in the process of generating a runnable object. There is a broadly held belief that static analysis of a program (by the compiler, or the programmer, or whatever) should be adequate to detect and prevent problems at run time. By now we should have more than ample evidence that simply isn't the case. Programs are dynamic entities, thus they need dynamic analysis (and enforcement) to keep them from doing things they shouldn't. That in turn requires that data type information be included as part of the dynamic environment.

I have yet to hear an argument that divide-by-zero traps are unnecessary, or that programmers should prevent divide-by-zero simply by "being careful," and yet that is precisely the argument that is advanced for preventing the far more serious kinds of type violations that result in address errors. It's nuts.

We need to remember that programmers are humans, and being 100% careful 100% of the time is just not in our makeup -- if it were, we would never have had, say, penicillin. Our current state of software affairs (and what I think is the fundamental issue behind the original post) is that with few exceptions, we design programming languages for computer architectures and computer architectures for circuits, an approach that is most beautifully realized in C, although the outcome could hardly be called beautiful. Our current problems with programming and software are going to persist until we reverse course and start designing the computer architectures for languages, and the languages for people.

Jost Riedel | Sat, 06 Aug 2011 16:08:30 UTC

The decision appears to have been a conscious one.

In "Software Tools in Pascal", Krenighan and Plauger write about string implementation:

"Two acceptable possibilities spring to mind. One is to make a string a record, with a length and an array for the actual characters:

type string =
 record
       length : 0..MAXSTR;
       chars : array [1..MAXSTR] of character;
 end;

The ith character of such a string s is s.chars[i], and the length is just s.length.

The other organization is to mark the end of the array by some special value, one that is not a valid character. In that case, the ith character is just s[i], but the length has to be computed.

Each of these organizations has good and bad points (wich you should think about for yourself); there is no "right answer". We have decided to use the mark-at-the-end representation."

JerryW | Sat, 06 Aug 2011 15:33:19 UTC

Seems like decisions to mix metadata or control information in the same communications medium as data itself lead to the long term problems. Telephone switching control, string termination characters, IP routing information, computer instructions, etc, have all had valid "reasons" for their design choices. All have lead to accumulating expense to deal with the hidden baggage that type of decision brings with it.

Dr KR Chowdhary | Sat, 06 Aug 2011 14:29:48 UTC

I think, having a length with string is a more robust case. It is faster also. Consider the case of sorting a large number of strings which are null terminated. You need to allocate memory for a temporary variable for swapping, whose length should be maximum of all these strings. To find out maximum length of n strings each having average length m, you need to scan m*n characters, which have complexity O(m*n). if length is used along with each string, you can find the maximum length in O(n) time. 

But, ken, Richie (KR), and Thompson must have had a definite reason for choosing the magic character (null), which seems to be SPACE or TIME!

Math | Sat, 06 Aug 2011 13:32:35 UTC

I have a problem with the title and suggestive nature of your text.
Oh, yes, 'Ken, Dennis, and Brian did the right thing.' , .... but ... 'the costs of the Ken, Dennis, and Brian decision will keep accumulating.'. This looks like a cheap trick to draw attention. Please let us argue about systems and not about blaming individuals who did a really great job in te past. 

Was it a mistake? 

Or just a decision with advantages and disadvantages? There is no such thing as correctness of software in an absolute sense. We can only say that it satisfies a specification or required use.

Was it their mistake?    

C/Unix is a very open community. The original designers did not force the world to use their work without modification. It's popularity was rather spontaneous. Universities used it for teaching and research. So if it was a mistake, then it was collective one.
      
Technically.

The C language does not have a string datatype. It only has string litterals. A good C programmer always thinks of character arrays. Apart from this, I do not want to go into technical details and repeat what is mentioned in other reactions or related discussions. But there is a LOT more that needs to be said, too much for the space available here.

Computer science is evolving and we learn from the past. But let us make improvements based on formal arguments and not just change because we believe that those old guys where stupid and we can do better.

Diomidis Spinellis | Sat, 06 Aug 2011 08:52:09 UTC

NULL-terminated strings require one less CPU register to process them.  strchr can be written with a single pointer to the string; strcpy with two pointers for the source and destination.  The length approach requires an additional counter register.

hwertz | Sat, 06 Aug 2011 04:59:20 UTC

     Good article.  One comment, so assembly programmers favored "string + magic marker" format.  C was originally developed in tandem with UNIX as a language to write the kernel in.  Inevitably some assembly would have to be used, so they followed assembly conventions.  Honestly, I don't think C was designed with the pervasive use it now has in mind, I would guess even back then they would have assumed (just like the mainframes of the day) that most user programs would be written in other languages (that indeed may have used length + string format.)

Bram van Kampen | Sat, 06 Aug 2011 01:14:10 UTC

Well UNICODE resolves this issue quite nicely, and RTL Strings Won't be with us for the duration of the universe.

i | Fri, 05 Aug 2011 22:04:59 UTC

Ken, You are optimizing at the wrong point; ...... 3 bytes for a 255 bytes string isn't significant overhead.  The more important is the overhead for *short* strings. In my suggestion You can use the minimum 1 byte overhead up to a string of 254 bytes - I am certain that this choice will save more memory in reality than Your 1 byte = 127 max length + flag bit. ...... Here I'm of course relying on the assumption that the length interval 128 - 254 is more frequent than the interval above 254.

lupestro | Fri, 05 Aug 2011 21:07:58 UTC

Buffer overflows aren't caused by null-terminated strings but by the language not enforcing data structure limits. This in turn happens in C because pointers address memory and are agnostic about language structures like containers. This also was a deliberate choice but it was a different one. The real "mistake" is in selecting, for writing nearly everything, a language that was optimized for writing to-the-iron code over a language that takes responsibility for enforcing the programmer's declared abstractions. Writing in C has its costs. You have more details to manage, more distractions from what you are trying to do. Sometimes it is worth it and sometimes you're better off in Java or .NET or scripting. Use what makes sense.

Clark Coleman | Fri, 05 Aug 2011 18:10:47 UTC

" How about sending a string down to the kernel, like the path name when opening a file? Now the kernel has two PTRs to verify: the PTR to the string itself, and then that string's PTR to the data. Think before you write silly articles." There is only one pointer. The string starts with 16 bits of length info. The third byte starts the data. There is no need for a second pointer. Think before posting silly comments.

Some body | Fri, 05 Aug 2011 17:12:14 UTC

I disagree with many of the points of the article, would rather have seen the addr+len implementation for reasons not mentioned by the OP.

First, as others have pointed out, operations that require the length of the string would be quite a bit faster, especially with long strings. I often find myself managing the length of a string I am building so I can replace strcat() with strcpy(), e.g. (I don't use the 'n' versions because I'm watching the lengths.)

Second, as others might have pointed out, passing the length of a buffer along with the buffer itself would be avoided. Making a bold assumption that if strings were add+len, then perhaps that philosophy might have spilled over to arrays, as well, after seeing the advantages.

The OP pointed out over-running strings by writing over the terminating NULL character. It would seem to be that if the intruder knows where the string is, then they can probably find its length bytes and alter those, especially if the string contains ASCII (URL, SQL, etc.). My limited read of software attack methods includes discovering fixed length buffers, so the length byte pattern is already known.

As for the '\', the [person] that chose that character owned up to it in some publication, stating "I'm the reason why you use '/'" (paraphrasing). I chose the term [person] because of the headaches it causes which I probably don't need to list. Fortunately, the forward slash is accepted by Windows and many of the CLIs, although some require '\'.

Unfortunately, fixing the NULL terminated string convention is NOT as simple as some posters have alluded or out-right stated. There 30+ years of code in production and its behaviour is known (what works, what doesn't). How much regression testing will be needed to provide the same level of trust acquired from years of operation? How will you identify and "fix" code segments optimised for NULL terminated strings?

@cellar - I, for one, use the return code from sprintf().

RalfG | Fri, 05 Aug 2011 16:09:06 UTC

I feel old, because I nodded a lot reading this article ;-). Thanks !

indiferent | Fri, 05 Aug 2011 13:36:41 UTC

PHK wrote: "As the C language was a development from assembly to a portable high-level language, I have a hard time believing that Ken, Dennis, and Brian gave it no thought at all."

Next time do your homework and actually research before posting.

In the words of Mr. Dennis M. Ritchie:

"None of BCPL, B, or C supports character data strongly in the language; each treats strings much like vectors of integers and supplements general rules by a few conventions. In both BCPL and B a string literal denotes the address of a static area initialized with the characters of the string, packed into cells. In BCPL, the first packed byte contains the number of characters in the string; in B, there is no count and strings are terminated by a special character, which B spelled `*e'. This change was made partially to avoid the limitation on the length of a string caused by holding the count in an 8- or 9-bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator."

http://cm.bell-labs.com/who/dmr/chist.html



It's a cleaver idea to stand on the shoulders of giants before and take a glimpse

ben | Fri, 05 Aug 2011 11:44:11 UTC

Have you ever HAD to count the characters in a string?  It's really not that much fun. Uh... Did you count them right? Count again. Null termination frees you from all that. Laziness (minimal effort) should not be under-estimated as a design virtue.

Andy Robb | Fri, 05 Aug 2011 09:15:41 UTC

Specious argument about multi-byte optimisation at end of VM page - surely the end of a page is also on a multi-byte boundary.

Kaz Kylheku | Fri, 05 Aug 2011 04:52:43 UTC

Oh the CPU waste! But that's nothing.

Do you know that by having hair a few inches longer, you've wasted hundreds on shampoo?

The world has a GLUT of unused machine cycles. Hundreds of millions of PC's are idle most of the time.

Cycles only matter when someone or something is waiting, and a delay has consequences.

In an interactive GUI program, I cannot tell the difference between a 1/100th of a second response and a millisecond response.


The nice thing about null termination is that it is self-describing.  Anything that has a "PTR" needs context to be understood: namely, an address space for that "PTR".

If you write a PTR+LEN string to a file, and then read it in another process, the PTR makes no sense. Needless to say, the same applies to communication.

If you put some PTR+LEN strings into shared memory, everyone has to use the same address for the mapping. This is not always practical.

How about sending a string down to the kernel, like the path name when opening a file? Now the kernel has two PTRs to verify: the PTR to the string itself, and then that string's PTR to the data.

Think before you write silly articles.

pwedd | Fri, 05 Aug 2011 04:35:22 UTC

von Neumann - I'll get back to that. Like many issues the magic terminator v length argument has advantages and disadvantages either way. This has been more fully elucidated here than I would attempt. This gets in the way of the real problem viz buffer overflow being used to force data into the executable code memory area and so cause mischief. This isn't due to choice of NULL terminated strings; its due to (rather poorly) sticking to the von Neumann architecture of shared memory being used for data and executable code despite modern chip architecture (since the 286) providing all the tools necessary to efficiently allocate seperate and protected granules of memory to code and data. To be forgiven in MS-DOS; as we all know quick and dirty DOS in those heady Z-80 days. Not to be forgiven in Microsoft's adoption of DEC architecture for NT and its successors - just no excuse then - the hardware technology existed. The buffer overun issue, focus of this article, lays clearly with Microsoft. There are a few fringe operating systems, such as THEOS, that properly utilize the technology but they simply don't have the marketing clout that Microsoft gained by its dealings with IBM.

Ken | Fri, 05 Aug 2011 03:03:59 UTC

Whoops, make that "...the third byte goes from 16,384..."

Ken | Fri, 05 Aug 2011 02:59:54 UTC

"i" says "if > 255 the first length byte contains 255 as a flag and the length is in the following 2 bytes,".

That's absolutely the worst idea I've ever read. You are using the entire byte as a FLAG, anyone who has done anything with computers knows you use a BIT as a flag. The above example requires three bytes when the length is > 255, 5 bytes when the length is > 612, etc.

Instead, use the 128 bit to go to two bytes, however, now the second byte value is 128 times the value from 0 to 127 or a length from 128 to 16,383 bytes (If the first byte has 255 and the second byte value has 127 127*128+127=16,383), the third byte goes from 16,383 to 2,097,279, etc.

Now, who's to say that hackers would not be able to hack the string length technique? Since it isn't generally used, they haven't put their minds to it.

Of course, I haven't looked at how they have managed to hack the nul terminator. They certainly did put their minds to that and came up with hacks.

i | Fri, 05 Aug 2011 01:38:42 UTC

Have a length field that's unlimited expandable: 1 byte for lengths 0 - 254, if > 255 the first length byte contains 255 as a flag and the length is in the following 2 bytes, if length > 65024 these bytes contains 65025 as a flag and the length is in the following 4 bytes - etc in aeternum. ...... The length field is here always minimal compared to the length of data. ...... (For lengths < 255 it takes no more space than a null terminator solution.)

Nick Nussbaum | Fri, 05 Aug 2011 00:15:26 UTC

At the time C made this horrible design mistake Pascal took the pure route of  a byte length preceding the string. That's one of the reasons that Pascal has the market share it does in industry.

As for your memcpy problems, if it really bothers you do your string mem allocations padded out to machine word boundaries and do the memcopy in word lengths. 

It's true that the security holes exist, but it's not just null termination; it's the whole use of pointers and stacks without hardware descriptors to force memory window permissions. Unfortunately the hardware wasn't available then, and pointers in C turned out to be too versatile to ignore in favor of struct arrays.

nm | Thu, 04 Aug 2011 21:44:37 UTC

David wrote:  "Technically speaking NUL terminated strings are NOT part of C. They're part of the UNIX real time library."  Not quite. If you declare char *myName="David", the literal is null terminated. There is no literal syntax for length+charArray format.

David E. Siegel | Thu, 04 Aug 2011 20:58:45 UTC

PHK says he was told that "for something to be 'a true record,' it could not be a mere accident; there had to be direct causation starting with human intent". This filter makes sense for records of achievement -- it doesn't make sense for even such GBWR records as "Oldest person alive", and it surely doesn't make sense for "biggest (or most costly) mistake", for surely no one intentionally sets out to make a mistake.

[para]  The consequences of problems in handling memory-mapped strings and networked strings could not have reasonably been forseen in the early 1970s. Other people in this thread have mentioned that a len+pointer structure involves many possible portability issues (endianness, length of length field, alignment of length field, etc) and a possible high overhead for short strings, particularly in the relativly constrained enviornments of the 70s. 

[Para] Although many OS services use null-terminated strings, an application can choose to use a different string-type for its own data, if this seems more efficient. Building a set of library (or library-like) tools for manipulating such a structure would not be excessivly hard, and they might, if popular, become an alternate standard. Don't curse the darkness of standard libraries, light a new candle, and then we'll see

-DES

Jefferson Braswell | Thu, 04 Aug 2011 20:43:05 UTC

hehe... "this text stream will NOW terminate", it should read.  (Can you imagine if I had to count up my bytes and put that in front of my comment ahead of time, by the way ?)

Jefferson Braswell | Thu, 04 Aug 2011 20:40:14 UTC

Well, the author has his own opinion, which he is simply spinning, in my opinion. I stopped reading midway through when he tried to argue the supposedly unquestionable merits of a pointer+length representation versus pointer+terminator approach. Notwithstanding the negative implications of using variable-length fields in early database storage formats (e.g., Oracle 1 and 2, in order to save *disk* space, but which results in having to scan data rows from the beginning in order to locate the Nth column in a row), the decision to represent variable length text in memory as, in fact, variable length text which is terminated by a null (non-text) character seems entirely flexible and more elegant for a wider variety of uses than a fixed-length text field approach. One of the supremely elegant paradigms of UNIX was and still is the concept of I/O streams accessed by ports or channels -- i.e., stdin, stdout, pipes, etc -- which allowed for an incredibly flexible and robust ability to create complex processes out of simple ones. Standard input was itself canonically closed by a special character, CTRL-D, which mapped to the End of Transmission (EOT) character of the out-of-band control characters in printable ASCII streams, not to mention BiSync and Telex protocols. Can you imagine if the fixed-length text field argument recommended by the author had been applied to UNIX I/O in general ?? You would be filling out cramped forms just to try to mesh the output of one process with the input of another. The prevalence of streaming media today is itself a complete vindication of the use of variable, indefinite-length text representation in memory, not the other way around. 'Nuff said -- I have useful work to do, and this text stream will not terminate EOT ( ! )

John Montague | Thu, 04 Aug 2011 19:43:28 UTC

I nominate the decision by AT&T engineers to periodically set the least significant bit to 1 of a word representing an audio sample in order to fix a digital PBX design flaw that, without ensured 1 bit density, caused loss of bit synchronization.  A quick and cheap fix that had big consequences.  This is why we had 56 Kb modems instead of 64 Kb modems (you could not rely on the 8th bit of a Byte), and why North American PBX vendors lost essentially all business for digital exchange upgrades to analog telephony in areas outside North America (the result of anti-competitive misuse of the standard process by a european coallition when a requirement for "clear channel 64 bit" capability was snuck into the standard defining what constituted an ISDN digital system thereby locking out existing PBX designs of North American vendors -- a big hit to the business of these vendors); these standards when cited in government procurements had the effect of law.

Bob Wallace | Thu, 04 Aug 2011 19:31:03 UTC

Note re DEC 12, 18, & 36bit CPUs (PDP-1 through DecSystem20xx) and sixbit characters:  The choice of bits/word reflects design choices for memory, addressing, instruction OPcode lengths, desired numerical precision, OS intent (i.e., RT, embedded, single user, multitasking/multiuser,...), and probably the availability of a 6bit characterset (itself a superset of 5bit teletype character codes inluding shift, etc...).   That the word lengths are multiples of 6bits is applied economy.
I first programmed usable code (music synth using a DEC_142 AD/DA in the "9") in machine/assembly using papertape on a DEC PDP-9 with 4K ("four-Kbytes" !!) of 18bit/word core ram.  With about 33 instructions, it was a "RISC-architecure" machine  (plus ca change...).  BTW, the PDP-2 was 24 bit that did not get out of the lab.  The PDP-9 was a microprogrammed PDP-7.  Unix (aka unics) was born on a PDP-7. 
Regarding strings discussion and modern systems and the specter of under/overflow exploits in string operations, I still program assuming the need to counter these events if only to keep a program robust in an uncertain I/O environment (read "users").

Bill Tetzlaff | Thu, 04 Aug 2011 19:12:05 UTC

When I was programming at Northwestern University, about ten years before this decision, I encountered both methods in use.  The IBSYS operating system, for the IBM 709/7090 series, that came into use in about 1961, with decisions made a few years earlier, had tape formats for blocked variable length records, which had a word at the beginning of each logical record or segment, with length information.  They were called green words, because they were shown in green in the first presentation at SHARE.  At about the same time those decisions were being made, the IBM 1401 was being designed, which used an extra bit called a word mark, to delineate fields.  It was an extra bit associated with each byte of memory, which was manipulated with special instructions, so it was a yet different way to work.  When writing to tape, a combination groupmark, with a word mark, was used to delineate the end of a record, and was the equivalent of the Unix construction.  PL/I was the first, or close to first, HLL to have variable length character strings, and did it with an explicit length stored.

It did surprise me that later, when media and memory had come down in price, that UNIX used the null character as an economy.

Erik Hoel | Thu, 04 Aug 2011 19:09:18 UTC

How about Tony Hoare's self described "Billion Dollare Mistake" from 1965 - inventing the null reference when he was developing Algol W...

Adam Diffenderfer | Thu, 04 Aug 2011 18:51:20 UTC

"C a dead language"?  When did it die?  Still alive and kicking in most embedded circles.

Joseph | Thu, 04 Aug 2011 18:47:52 UTC

So we should only use strings of length 16? "Using an address + length format would cost one more byte of overhead"

Poul-Henning Kamp | Thu, 04 Aug 2011 09:10:20 UTC

With respect to Dennis Ritchies account, I think you people focus on the wrong bit of it, for me the operative section is: "C treats strings as arrays of characters conventionally terminated by a marker. Aside from one special rule about initialization by string literals, the semantics of strings are fully subsumed by more general rules governing all arrays, and as a result the language is simpler to describe and to translate than one incorporating the string as a unique data type. Some costs accrue from its approach: certain string operations are more expensive than in other designs because application code or a library routine must occasionally search for the end of a string, because few built-in operations are available, and because the burden of storage management for strings falls more heavily on the user. Nevertheless, C's approach to strings works well. "  On a PDP/11 that make sense, as a gold-plated API for all future, it does not.

K | Thu, 04 Aug 2011 04:22:34 UTC

Newlines are stripped from comments? Seriously!?

Thierry Leroux-Demers | Thu, 04 Aug 2011 01:28:06 UTC

If I am not mistaken, .Net's CLR does not use terminator characters but rather the address+length format. With .Net gaining popularity, there may be hope to one day get rid of this problem!

J Williams | Wed, 03 Aug 2011 22:53:29 UTC

> I have not found any record of the decision

http://cm.bell-labs.com/cm/cs/who/dmr/chist.html

"This change was made partially to avoid the limitation on the length of a string caused by holding the count in an 8- or 9-bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator." - Dennis M. Ritchie

I'd say a much more costly choice for a single byte was allowing spaces in filenames on computers that also used spaces to separate command-line arguments! Even now an astounding number of programs don't correctly quote filenames and fall over in a heap when they encounter spaces. And this ghastly problem was duplicated into web URLs - oh the joys of the "%20".

Or, if you can stretch to a few more bytes, then my second vote goes to the hideous "http://www." that we all have to type/read hundreds of times a day. Totally unnecessary, unfriendly and ugly, and just think how much data storage space around the planet has been devoted to this evil character sequence! Think how many pixels have burned fossil fuels throughout the day and night just to display these totally redundant and superfluous characters. When they are the default and overwhelmingly most frequent prefix, why the heck do our browsers still insist on requiring or displaying them? Can't anyone think of just hiding them internally somewhere?

Bill | Wed, 03 Aug 2011 22:49:20 UTC

Going back in time to the Z80, the difference between using the C approach and the length byte of Pascal strings was clear. On those machines, the performance difference was readily apparent. Finding the length of a NUL terminated string is a repeating expense. Length byte + string has no such expense. As to the length limit implied by including a length byte (word, dword, etc.), at the time such things were coming into use, computing was just departing from the Hollerith cards. I sincerely doubt that anyone was playing with individual strings in excess of 64K bytes.

George | Wed, 03 Aug 2011 20:12:02 UTC

I would nominate the inclusion of HTML in SMTP message bodies as a far more expensive mistake.

John Blas | Wed, 03 Aug 2011 19:56:13 UTC

TO: Dan Sutton. Seems to be some inaccuracies. A 3270 is a terminal. The system was a 370. I think you're referring to program IEFBR14 which was, as the name states, a branch via register 14. It was never intended for general use by the user community but as an internal default return mechanism. IBM changed it per user requests to include clearing R15. I don't think it wasn't a bug so much as incorrect use. If I'm not mistaken this was part of OS in general not just VM starting in the 60s; MFT, MVT, VS1, SVS, MVS, etc.

David | Wed, 03 Aug 2011 19:11:26 UTC

"Technically speaking NUL terminated strings are NOT part of C. They're part of the UNIX real time library. (How do you copy a string without strcpy? Remove all of your #includes and see what still works.)"

This happened with

austin | Wed, 03 Aug 2011 19:10:13 UTC

ok im not exactly defending the null byte, god knows it has caused me extra work (especially on reading and writing binary files) but a length at the end? really?[br]
ok assume we have a character array which is a pointer and length combo, lets say the length is one byte, thats 8 bits, and we assume unsigned(cus who needs a negative length) i cannot have a string longer than 2^8-1 or 255 characters.[br]
maybe instead of 1 byte they pick 2 bytes making 16 bits, now you can have 2^16-1 or 65,535 characters maybe enough for some but would piss me off. maybe i need more space, maybe i need a lot of characters in memory, a programming language should make as few assumptions about how it will be used as possible.[br]
if you pick 4 bytes for length you can fit a whole 4billion characters in memory or about 4GB (which may seem like quite a bit today but in the future will seem petty)[br]
[br]
of course this makes your character array a MINIMUM of 5 bytes long, over the existing 1 byte long.
the null byte termination however makes no assumptions of how many characters you could ever need in your array and uses the least amount of space needed. [br]
character arrays are like a gun, you can shoot the target or you can shoot yourself in the foot, and the gun doesnt really care which you do.

Dan Sutton | Wed, 03 Aug 2011 18:18:59 UTC

In the '70s, IBM released a toolkit library for assembler programming on the 3270 mainframe, under VM. One of the routines was an assembler snippet which you were supposed to put at the end of programs to signify a clean exit: it was something like, "set register 17 to 0; quit" -- two assembler instructions. So programmers throughout the '70s were adding the macro for this to the ends of their programs. Turns out, IBM put a bug in it, so that it set the wrong register (a two-instruction program with a bug in it) thus virtually all assembler programs on 3270s written within a 10-year period have this bug in them: IBM was finally forced to modify VM to cope.

NM | Wed, 03 Aug 2011 18:07:12 UTC

Yes indeed to ! above, who pointed to interoperability issues. 

Once you start trying to directly write multiline text into a file, there is tremendous value in having that not depend on platform-specific vagaries of integer encoding (e.g. "endianness")

What I don't think has been mentioned is: it's worse than that. Different machines have different alignment requirements. If you have multiline text, then you not only have to agree on how the integer length is represented, you need a convention as to whether the integer conveying the length of the next line is to be aligned on some sort of word boundary. First of all, doing such alignment adds significant complexity to string manipulation operations, and thus potentially has a runtime overhead. Furthermore, there is no particular choice of alignment that is good for all of the widely used systems into which that text file might be read.

No, the existing C string format doesn't do much to help you with word alignment, but most computer architectures are much more demanding regarding the alignment of integers than of individual characters. Whatever the other drawbacks, the C string format sticks to characters only, and thus avoids such alignment headaches.

Peter Richards | Wed, 03 Aug 2011 17:45:58 UTC

If this was truly a *one-byte* mistake, we'd be stuck in 2011 with a *one-byte* length field and 255-byte strings.  No thanks!  

Call it a 4-byte mistake, and 2^32-byte strings, and now you're talking some sense.

Brian | Wed, 03 Aug 2011 17:39:08 UTC

I won't claim that null-terminated strings are the best way to operate (for all the reasons that were explained so well in the article), but I defend the boys' decision to use them in C.

C is only a high level language in the strictest sense of the term. To me, C has three aspects: it abstracts away the specifics of the instruction set; it provides helpful "macros" for writing things like loops, conditionals, and procedures; and it provides a convenient set of alternative views into the program's memory (chars, unsigned ints, arrays, structs, etc.). It's this third aspect, typing, that is really being called into question here.

C's type system is simple, elegant, and probably the dirtiest thing you'll ever run into. Types only exist at compile time and they do nothing more than tell the compiler how big the block of memory is and what instructions they should use for manipulating it (and maybe that it should throw a few warnings or errors that the programmer is just going to caste away anyway). Defining a string as anything other than a pointer to a character would require the use of a built in composite type that would have totally befouled the beauty of the language.

emile | Wed, 03 Aug 2011 17:25:59 UTC

Anybody given any thought what the consequences of other representations. Prefixing a string with the length has the same kind of issues, address+length simply opens another can of worms. The cost would have come, no matter what rep. was chosen.

Helge Hafting | Wed, 03 Aug 2011 17:23:26 UTC

NUL terminated strings is not much of a problem. Buffer overflows happen because of sloppy programming, not because of string termination. You can crash with pointer+length implementations too, if you write stupid code.

And of course C has various libraries for "safer" string handling. Use those, if you don't want to think about your code.

There were some very good reasons for NOT using pointer+length. For how would you represent the length? A byte would limit you to 255-letter strings - which is too little. In those days, many machines had 16-bit CPUs. A 16-bit quantity limits you to 65535 letters per string. Better, but still a problem in some cases.

The biggest machines in those days had 32-bit numbers. 4 million characters. Strings would only be limited by main memory, nobody had over 4GB in those days. Not even close. Instead, people would complain about the waste - 3 extra bytes compared to nul-termination. Many strings are shorter than 3 bytes - even with the terminating zero.

And then there is the compatibility problems. The 16-bit machines of those days would be very slow handling 32-bit numbers. The length could be machine-specific, but that would lead to other incompatibilities instead. A string written to disk would be unreadable for a different machine. Source code who need to know the string format would be unportable.

And these days, we have 64-bit machines. They actually support strings with over 4 million characters, and we also have that much memory. No artifical length limits, please.

Richard | Wed, 03 Aug 2011 16:48:11 UTC

Why not just fix some of this. Surely it wouldn't be too hard to extend glibc with some sensible string functions (differently named) and we can all move over to them. It could even be done quite compatibly, by writing some macros to express the new functions in terms of the old ones.

Incidentally, functions like strlen() and strncat() would be so much faster that it would easily make up for the trivial increase in memory use. The memory use increase would actually be an extra int at the start of each string: not a single byte, but usually 4 (extra int, + retain the trailing Null for compatibility), or perhaps 8 bytes so as to allow for handling really long strings.

Ken Hansen | Wed, 03 Aug 2011 15:59:19 UTC

I don't get it - where's the expense? That a faster method EVENTUALLY became available in subsequet hardware designs (the boys at AT&T Bell Labs weren't working on Z-80, DEC VAX, or IBM ES/9000 hardware at the time, hence they couldn't benefit from the pointer+length alternative you describe.

I eagerly await your insights into the ASCII vs. EBCDIC dust up, where the federal gav't standardized on ASCII, but IBM refused to change, forcing the Gov't to drop the requirement so they could keep on buying IBM hardware... What did that decision "cost"? Oh, think of the savings in man-years and wasted CPU cycles if all computers could pack numbers like EBCDIC allows Where decimal numbers can be "packed" into fewer bytes, typically (num of digits /2) + 1 or less... What did that "decision" cost?

John Blas | Wed, 03 Aug 2011 15:44:42 UTC

"...there is absolutely no way Ken, Dennis, and Brian could have foreseen the full consequences of their choice some 30 years ago..."
Absolutely false. They didn't bother to look at previous systems. The whole method of using in-band controls dates back to 1950 systems which did data movement using field or group or record marks (check out the IBM 1620 as an example). This was done because of limited ALU capabilities of only handling one character/decimal digit at a time. Later real computer designs like the IBM 360 did not use this method for obvious reasons. Univac and Burroughs never used them to begin with. C was the first language I used that regurgitated using this method with NULs. Technically speaking NUL terminated strings are NOT part of C. They're part of the UNIX real time library. (How do you copy a string without strcpy? Remove all of your #includes and see what still works.) You can see the lack of planning/design of the C language which is only a generic assembler trying to become a real language as an after thought. I worked with Cobol, Fortran, and PL/I before C and none of them even had the concept of in-band string control. C ideas came from PL/I (remember, Ken, etc. worked on the failed Multics project). PL/I allows you to define strings as fixed or variable in size. Use of data vectors allows run time code to know data types, sizes and dimensions. This came from Algol. Check out the Burroughs 5000 architecture.
Why didn't PL/I catch on? IBM defined it and made the language specifications proprietary with license fees. Later someone created an unbelievability
grotesque subset language called PL/M. Another stupid C concept is implementation defined sizes for data types like int, short, long, etc. (where's medium short, extra long and a half, etc.?) PL/I allows you to define these types with sizes, e.g. fixed binary(32). You define variables by sizes needed to solve your problem. Compiler
implementations must handle definitions regardless of target system sizes.

Rob Sherratt | Wed, 03 Aug 2011 15:18:52 UTC

Single biggest mistake? 
Largest long term impact to greatest number of computer users? ....

qwertyuiop
asdfghjkl
zxcvbnm

Alan Evans | Wed, 03 Aug 2011 15:10:52 UTC

So much of c relies on "the efficiencies" of string pointer manipulation that it's easy to see why Kernighan and Ritchie dismissed length bytes (not to defend the decission).

e.g. I can "create" a new string for free as an offset from my first string.

char *str1 = "Hello";
char *str2 = str1 + 2; //"llo"

Couldn't achieve that if a length pointer was involved.

brian | Wed, 03 Aug 2011 15:04:41 UTC

Some how null terminated strings seem just a natrual constuct. There is the information, and its terminated with a null - easy! The null terminator leads to easy string processing etc at the lower levels.

Same sort of philosophy for other items delimited data files such as csv files - its just a natrual way of doing things!

Ben Craig | Wed, 03 Aug 2011 14:51:39 UTC

FYI, Address Space Layout Randomization (ASLR) has almost no runtime cost.  At process startup time, some random number has to be generated.  After that, the processor time required is the same for ASLR enabled vs. disabled.

Geezer | Wed, 03 Aug 2011 14:42:34 UTC

Hmm, maybe I'll rethink scrapping my PDP/11 at the house...

Sum Guy | Wed, 03 Aug 2011 14:32:36 UTC

Dumb article by somebody with flawed insight.  [NEW PARAGRAPH HERE]Back in the day, BITS mattered.  On DEC 18/36 bit operating systems, which typically had a lot of memory for the time, it was common to pack strings in "sixbit" to save ONE BIT per character (over 7 bit ASCII).  Thus, nobody in their right MIND would encode a string as pointer plus length.  That was wasteful of the length portion.  Not to mention, this doesn't allow for expansion of the string without a complete recopy the way null termination does.  So, if you used counted strings you'd need address, current length, and maximum length.  Which is exactly what Windows NT would call a UNICODE_STRING structure.  And nobody in the 1970s would use that much memory for a simple string.  You'd be (justly) fired.
[NEW PARAGRAPH HERE]So... lots of flawed insight here Poul.  I'd guess you weren't around in the 70's when this stuff was written, don't know anybody who was, and don't really understand the issues.
[NEW PARAGRAPH HERE]I can't wait for your next article about why system time storage values are the worst mistake in the history of computing science.

TedvG | Wed, 03 Aug 2011 13:56:54 UTC

it is not the fault of null termination

or any other string length registration
for that matter

it's just sloppy programming

is it not? /o


Ted

BradyS14 | Wed, 03 Aug 2011 13:48:51 UTC

Sounds like the author is footing the bill for the NUL-termination decision.

Will there be a follow up article for the mistake Adam made by eating an apple?

And what's with the layout of this webpage. It's giving me some sort of motion sickness,...probably the lack of serif on the chosen font for the main article.

Christian Sciberras | Wed, 03 Aug 2011 13:44:17 UTC
```
Kill C. Use Pascal/Delphi instead.
```

Meh | Wed, 03 Aug 2011 13:32:53 UTC

I read up until the MS hatred boiled through and then moved on...

Marmie | Wed, 03 Aug 2011 13:23:43 UTC

Encoding strings as "address + length" need not limit the length to 64 KiB. Length could be encoded as a big-endian sequence with the high bit of each byte acting as a terminator flag, so that for small strings (length <= 127) it fits in one byte. Or it could be encoded the way UTF-8 encodes codepoints, which would set a high upper limit for length, but allow to determine the length of "length" by looking at its first byte.     ..    
Magical characters are not only bad for security and efficiency, they also force us to escape and unescape, which inevitably leads to data mangling because there is no simple way of knowing which and how many levels of escaping a string has gone through.

Ehud Gavron | Wed, 03 Aug 2011 11:54:10 UTC

Sorry but it's not all about the DEC PDP/11.  VAX/VMS had execute-only pages but in reality the Alpha was the first DEC CPU to enforce it.  

Still descriptor (pass by pointer a structure containing at the very least a pointer and a length) usage was prevalent in VMS since the 1970s.  It wasn't a "One-byte" mistake.   Each descriptor requires an additional pointer (4 bytes back then), and the size of the length (likely 2-4 bytes).  It more than doubles the size of the meta-structure.  For EACH and EVERY string.  

So for EACH and EVERY string, allocating an additional 6-8 bytes was -- in the 1970s-- a very expensive proposition.  Instead, for most operations the NUL-terminated string works fine, when the programmer is responsible for a)following the API to input the string, and b)following the API to use the string.  For cases where a data structure may contain NULs, a meta-data element with the length IS kept separately, and again it's up to the input API to count and update that element, and the output API to use it.

So, to summarize, it wasn't one-byte, and it wasn't a mistake.  Now TODAY with memory being so relatively cheap and CPU optimization being the buzzword, it is not the best thing.

Pardon me, I have to go load an RK05 disk pack.  Who could ever need more than 2.5MBytes on a single platter.  Shees.

Ehud

Jan | Wed, 03 Aug 2011 10:31:35 UTC

Storing strings as 'length+pointer' pairs would help a bit, but would still not solve all problems like buffer overflows etc. The problem is that there is often a difference between the length of the string and the size of the buffer allocated to store the string. And that, too, is by design, because you often want to have a single buffer that stores different-length strings during its life-time.

Cellar | Wed, 03 Aug 2011 10:25:12 UTC

I see parallels between phk and djb. Both like their own universes quite a lot and only barely tolerate strangers in it. This bit of tech-opinion would be more interesting if it wasn't so easy to sidestep the problems caused by this design decision. That is, it's not all that difficult to write your own code such that its basic string handling goes by length+pointer or even two pointers (the C++ stl brings us the equivalent two iterators paradigm). My string handling classes often have a fn(const char *s) { fn(s,s?strlen(s):0); } member for completeness that then usually never gets used. That is, backward compatability is a marginal cost to moving forward coming from within. Of course, the problems start with interfacing old code, but the first order of the day is cleaning (by definition untrusted) input and once that's done conversion can usually be isolated to a relatively thin layer of glue. [p]

That doesn't change that there is a lot of rotten code out there, some of which insists on deploying existing facilities for maximum stupidity. Like code that extensively mixes the str* functions with repeated uses of tricks like sprint(ENDOF(buf),"...",...); or even sprintf(buf,"%s....",buf,...); where that same sprintf returns the size of what it just did, allowing to forego yet another useless strlen call hidden in a macro or in sprintf. That return value was, of course, blithely ignored. Revamping this sort of thing takes time and is boring even if it isn't particularly hard to do, so it doesn't happen, and the bad code endures. [p]

You could argue that the str* series of functions are poorly thought-out even apart from their use of a NUL terminator, and I'd agree. But this is old history. I'd say there are other things that are less defensible, such as linus sticking to a monolithic design and the resulting failure to defend against possible exploitation vectors that way, though even that pales in comparison to the complete absence of internal bulkheads in even more widely used desktop operating software. That is something that "the user" cannot change. But nobody forces you to use str* functions and in fact you can get by pretty well without using them and not switching languages. Though you can do that too. [p]

In fact, plenty of modern "enterprise" software now-a-days is written in a different language, one that doesn't even tell you how it stores its strings internally. That same language managed to bet big on threading and then provided a rather poor implementation. Care to calculate the senseless economic overhead of that? [p]

It's typical phk-form (and djb-form, for that matter) to point to someone else and go "neener, neener, you done bad!" regardless of the difference in state of the art between now and twoscore years back. This particular thing wasn't such a bad decision back then. You might as well try and calculate the economic cost of getting the electron flow wrong and the resulting enduring mislabeling.

TRY | Wed, 03 Aug 2011 09:51:57 UTC

Working on a high-reliability project, I created a (another?) suite of extremely well-tested string routines using a C structure like this:

============
typedef struct
{
  size_t CurLen; // Current lenght of the string
  size_t MaxLen; // Maximum length of the string (i.e. 1 less than size of allocated data)
  unsigned char *Data; // Pointer to the allocated data
} tString;
============

Since this was for a specific project based on a common platform, cross-platform issues (i.e. different sizes of size_t) were not present. There was a compatibility call: unsigned char * GetString(tString String) that would NUL-terminate the string and return a pointer to it. This routine assumed that no NULs were present in the string -- or if they were, that they were supposed to be there.

Poul-Henning Kamp | Wed, 03 Aug 2011 07:08:31 UTC

Rather than ptr+len, I would use a ptr+ptr format, that automatically ports to any size word/memory/address-space without any need for adjustment.

F | Wed, 03 Aug 2011 06:48:52 UTC

I want to add that I suggest storing the length value little-endian in my scheme, because that way one can "zero-fill" the length value and use a constant size length value within an application, making long-word access to the string data more practical.

Arief M Utama | Wed, 03 Aug 2011 06:43:50 UTC

Once saw a protocol analysis of a problem in GSM's radio network equipment that traced down to this C-string issue.

Fun :)

F | Wed, 03 Aug 2011 06:37:51 UTC

Encode the size of the length value into itself.  One way to do this is to set aside the high bit, giving you seven bits of length value storage in each byte.  The high bit is set for all bytes of the length value, except the last one.  If there is only one byte of length value, its high bit is not set.  So for example, strings of length 127 or less would have one byte of length value.  Strings of length 128 to 16383 would require two bytes of length value, etc.  This way the length value can have arbitrary values, yet still consuming no more space than necessary.  Loading and storing the length values could be efficient with hardware support.  I would suggest storing the length little-endian.

! | Wed, 03 Aug 2011 06:21:00 UTC

Gah, where did my line breaks go? And my name got changed too... :(

! | Wed, 03 Aug 2011 06:20:17 UTC

Length-prefixed strings do have many advantages, but I hate to think of the interoperability issues. These days you're probably safe with a 32-bit length, but there'd still be systems around using 16 bits, and you'd try to talk to them on a network and things would be all crazy. Not to mention endinanness issues.


As well, null-terminated strings have some optimizations you can do by pointing into part of the string. e.g. you can have the strings "foo bar" and "bar" occupying the same memory space by pointing the latter at the middle of the former. It's common to do this when parsing a string, incrementing a pointer to the part you're interested in.
An alternative might be to have length + pointer, pointing to a string somewhere else (which can still be null-terminated for compatibility), instead of length as a prefix. It's worth noting that the Lua language does something like this.


You might also be interested to know that a buffer overflow exploit was indeed one of the earliest tricks hackers used to get into the PS3 system. A device known as PSJailbreak exploited such a vulnerability in the kernel's USB device handling to take over the system. (And getting into the console was sort of the first step of the PSN issues, since 1. Sony reacted very poorly and ticked off a lot of hackers, and 2. PSN was designed with the assumption that anything coming from a PS3 could be trusted.)

fisted | Wed, 03 Aug 2011 06:13:02 UTC

Oddly, what isn't mentioned in a single word is that the addr+len variant would also get us rid of yet another inconvenience regarding NUL-terminated strings.
Namely the fact that you cannot store such a NUL byte in it.

PCM2 | Wed, 03 Aug 2011 06:08:07 UTC

Dennis Ritchie has written about this subject: http://cm.bell-labs.com/who/dmr/chist.html
"C treats strings as arrays of characters conventionally terminated by a marker. Aside from one special rule about initialization by string literals, the semantics of strings are fully subsumed by more general rules governing all arrays, and as a result the language is simpler to describe and to translate than one incorporating the string as a unique data type."
(He admits that C's approach does have its faults, but it was definitely a conscious decision that made sense at the time.)

Teunis | Wed, 03 Aug 2011 06:04:33 UTC

Null termination: No assumption of character size from an era where characters where anywhere from 7 to (I think) 15 bits in size.    No limit on size of data, allowing ease of streaming strings through.   Mentally preparing at least some programmers to building things dependant on "unlimited" strings, such as word processors and WWW.
Platform neutral.   Possibly a factor in the encouragement of multiplatform coding and communication between different types of computers.
the ability to "stream with a terminator" to devices that can't hold the data - but can treat it like a turing machine.  (many embedded environments)

Requiring a length:  What bit/bytesize is length?  What bit/byte order?   What will the maximum string size be?  (640K is enough for anyone!)

Ice | Wed, 03 Aug 2011 05:49:22 UTC

I've certainly lost track of how many varying string classes & libraries I have run into during the last two decades. For each their own, no one is forcing you to use null terminated strings, but they are handy because otherwise one would be automatically limiting the maximum size of string and in the days past using, say, 2-4 extra bytes for the size of each string did matter. Also, if you do have size restrictions, then you run into a whole other set of problems.

Bruce | Wed, 03 Aug 2011 05:44:57 UTC

The only problem I have with this is that the PDP-11 was a 16-bit word architecture which supported only a small subset of instructions having 8-bit boundary address access.

Therefore all malloc() and stack allocations were always done in 16-bit word chunk sizes to ensure that any word accesses were all even byte aligned, so there was no 1 byte saving by using null termination.

Every C compiler under every O/S on a PDP-11 I have ever used HAD to behave this way or have the CPU throw a word access exception.

Rudi Cilibrasi | Wed, 03 Aug 2011 05:42:25 UTC

BCPL had it right all along.  The Amiga kernel was written using BCPL strings and it was a great technical achievement but unfortunately the market could not accept it at that time.  Fast forward three decades and we have reinvented preemptive multitasking and resource efficiency in the mobile market and are about to repopularize length-prefixed strings as std::string in C++ as part of the upcoming C++ Renaissance.

ftp://cm.bell-labs.com/who/dmr/bcpl.html

John A. Wills | Wed, 03 Aug 2011 03:21:34 UTC

Someone somewhere made a decision to replace other programming environments with Unix, which is inferior to most of its predecessors. Having spent today trying to split a PL/SQL package in two because of the multiple functionality and growing unwieldiness of the starting one, and knowing that it will take at least half of tomorrow to finish the job, I am painfully aware of the foolishness of not having firm line numbers and cross-references. Better environments which I have worked on include BS3 on TR440, Cyber on CDC6600, Primos on Prime 500, MPE on HP-3000, ISPF on TSO on MVS on IBM mainframe and most recently a couple of weeks on as Unisys 2200. I think that only the last and MVS are still in use, and I would like to know why. A lot of the shift to Unix happened while I was away from work for 68 months, and I disclaim all responsibility.

Tony Ledford | Wed, 03 Aug 2011 03:01:24 UTC

This was a good article.  I must point out however, that bcopy/memcpy take the number of bytes to be moved as parameter, so there is no danger of accidentally reading past the end of the string.  Since the operation knows how many bytes it needs to move, it can optimize how many max word length units to move before resorting to byte operations.  Also, both methods you describe would suffer problems if the magic byte or length of string are incorrect (i.e. garbage in garbage out).  I know that's nitpicking a minor point to your article.  I think you main point is well made: decisions have consequences that we short-sighted humans just aren't good at foretelling.  I enjoy programming in C/C++ and after all these years, thinking about nul-terminated strings is just second nature I guess.  The other good point of this article is that it made me think differently - I had never considered nul-terminated strings as a liability, I just accepted it - like gravity or something.

Federico Mendez | Wed, 03 Aug 2011 01:41:48 UTC

Great article! And oh boy,... how little I know! I'm beginning with C programming myself, and I was discussing just the other day buffer overflows with my tutor (she didn't know those were exploited by malware...if you can believe that), and I had no idea /0 was such a common victim of it... what's amazing is that this could (as you state) be easily avoided... but I think we're naive to think that nothing is being done just because of backward compatibility with the PDP/11.

Mike Zorn | Wed, 03 Aug 2011 01:17:42 UTC

Several years ago, one of our programmers (in a data-processing shop) decided to use the ASCII string 'EOFEOF' to mark the end of a file of binary data.  Needless to say, it was not many months before a floating-point number matched that string.

The infamous '\', foisted off on an unsuspecting world, causes more than IT damage.  It's intruded into popular culture. Radio announcers are now required (at least by custom) to say "forward slash" when announcing URLs, which grates on the nerves as irritatingly as the proverbial nail on the blackboard.

Thanks, Rick Hightower, for the origin of '\'.

I'm a fan of the NUL terminator.  It lets me end the string that is "War and Peace" with a single byte.  There's a bit of logic, too: NUL = x'00' means, "none", hence "no more" (to this string).

Krz | Wed, 03 Aug 2011 00:56:55 UTC

@Lucio Maciel: Would you really have needed to support a string size >64KB?  The author supposes a net change to the storage requirement of a new string as "one byte longer," implying a two byte length field, since we're dropping the trailing null.  Machines at the time had 16, 18, and 22 bit address spaces, so a 16 bit string size would certainly have been quite sufficient back then.  Moreover, even those PDP machines with 18 and 22 bit addresses provided it to user space with overlays, further restricting you to only 64 KB of contiguous storage at any given point in time.

@Me: Defining the storage requirement as one byte longer does not imply a one byte length, it implies a two byte length, more than enough for the time.  If the decision had been made so, the length field would have likely been expanded with the machine's native int size, just as the pointer grew with the underlying architecture.  It almost goes without saying the address space would have been available to a single string (though whether that's a feature or misfeature is another question entirely).

534-252-3526 | Wed, 03 Aug 2011 00:00:12 UTC
```
Great article!
```

Me | Tue, 02 Aug 2011 21:32:06 UTC

It could be done.  Please don't forget in your implementation the requirement to have the library work with mmapped files.  The length should not be 1 byte, but a number that can fully represent the address space on the computer.

struct str {
        size_t sz;
        char *c;
};

For allocation, would you create a macro for creating a const struct str?  How would you specify such, and how would you find end of string?  Would you require manually specify # of characters during allocation, or would you change the compiler to do this for you?

How would this work if you had an existing string that you want to hack into and reference (point) a new string to a piece of an old string?  Would you instead force a string copy?

You may wish to put some thought into dealing with strings that have # Characters > the address space of the CPU.

Ethan Grammatikidis | Tue, 02 Aug 2011 19:45:28 UTC

gets() should have been dropped from C libraries a long time ago, and many other library changes could have been made to make C (and Unix) much nicer to work with, but the computing industry has a bizarre obsession with retaining old mistakes instead of investing the relatively small amount of work needed to correct old code.

Speaking of making C nicer, Plan 9 includes a library for operating on a String type. IIRC it's a struct containing length and data. Most of the system uses this String type. I guess I may be right in saying "using C strings for anything non-trivial is stupid," but this illustrates the ridiculous resistance to useful change in the computing world. Where is any len+data string format and the libraries to handle it in POSIX?!

The notion of all ${lang} depending on C is a difficult one. It's technically untrue: versions of Lisp exist which have no dependence on C while Go definitely and deliberately does not depend on C. Unfortunately the manner in which operating system "progress" is implemented these days makes this hard.

The hardware costs section here strikes me as a little strange. I'm sure the special instructions for null-terminated strings are cheaper to implement in hardware than those for len+data which requires somewhere to store the length.

The performance costs section pivots on the notion that C's string type is the One True String Type which it is not. It's merely a tool, and I refer back to Plan 9's String struct.

maht | Tue, 02 Aug 2011 18:53:58 UTC

The answer was easy to find  :

http://cm.bell-labs.com/cm/cs/who/dmr/chist.html

Bell-Labs were dumping GE-645 Multics and looking for an alternative. Faced with porting BCPL from Multics to the PDP-7 he dropped some stuff to squeeze it into less memory and thus B was born. 

"In BCPL, the first packed byte contains the number of characters in the string; in B, there is no count and strings are terminated by a special character, which B spelled `*e'. This change was made partially to avoid the limitation on the length of a string caused by holding the count in an 8- or 9-bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator."

Oren | Tue, 02 Aug 2011 18:47:30 UTC

No valid candidates in space exploration history? What about Mariner 1: http://en.wikipedia.org/wiki/Mariner_1 aka "The most expensive hyphen in history"?

Lucio Maciel | Tue, 02 Aug 2011 18:31:35 UTC

"Using an address + length format would cost one more byte of overhead than an address + magic_marker format"
And that would limit the size of the String... Null terminated Strings in C has no size limits because it don't need to embeed its size on the string.

Rick Hightower | Tue, 02 Aug 2011 18:27:52 UTC

MSDOS did not pick the slash direction. They bought dirty DOS (quick and dirty disk operating system) which had it that way and in reality Dirty DOS did not invent it either. Dirty DOS was a 16 bit OS which was backwards compatible with the 8 bit CPM OS which picked the slash direction. I may have the TLAs a bit off but this is the best I can do on my iPad riding public transit to work.

ForkQueue | Tue, 02 Aug 2011 18:06:05 UTC

Why not change and fix the mistake?
http://www.gnu.org/s/libc/resources.html you could fork it (to github) and monkey patch the bug and fix the other design flaws that you see. I'd happily compile my C code with it.

Jan Schaumann | Tue, 02 Aug 2011 18:00:43 UTC

> I have a hard time believing that Ken, Dennis, and Brian gave it no thought at all.

Have you asked them?

mramirez | Tue, 02 Aug 2011 17:32:50 UTC

Run with the same problem once. I ended making a  replacement, where all string functions use EoF (ASCII 26) character as a terminator instead of the NUL (ASCII 0) character.

Sign up for QueueNews

acmqueue app

Join ACM

Comments