January/February 2018 issue of acmqueue


The Bike Shed





Originally published in Queue vol. 9, no. 7



Jez Humble - Continuous Delivery Sounds Great, but Will It Work Here?
It's not magic, it just requires continuous, daily improvement at all levels.

Nicole Forsgren, Mik Kersten - DevOps Metrics
Your biggest mistake might be collecting the wrong data.

Alvaro Videla - Metaphors We Compute By
Code is a story that explains how to solve a particular problem.

Ivar Jacobson, Ian Spence, Pan-Wei Ng - Is There a Single Method for the Internet of Things?
Essence can keep software development for the IoT from becoming unwieldy.


(newest first)

Displaying 10 most recent comments. Read the full list here

krater | Tue, 16 Jan 2018 19:29:53 UTC

I imagine a today where Ritchie & Co. decided to use address + length to represent strings.

We now have 1-8 different types of strings, with 1-, 2-, 4-, or 8-byte length fields, all incompatible with each other. Additionally, we have overhead to convert string type 1 to string type 2. We have overhead to access 2-byte strings on a 32-bit machine because 16-bit access is slower than 32-bit access. We have security issues because someone invented a string type where the width of the length field can be recognized automatically, and some programs handle these strings wrong or some developers use them wrong.

I don't see any difference at all...

cousteau | Fri, 12 Feb 2016 08:56:50 UTC

What about CRLF for newlines?

Certain OSes and many transfer protocols rely on the two-byte CRLF sequence as a line terminator. This, as far as I know, is a historical remnant from teleprinters, which relied on two separate "instructions" to move the printing head to the beginning of the line and to feed the paper by one line. Needless to say, nowadays there is no point in having two separate characters for this, as they have a single meaning together and no meaning separately.

The consequences of this might be expensive: a stream-processing mechanism must make the conversions on the fly. There are two different, incompatible ways to open a file. The "line terminator" cannot be found by a simple character search, but instead requires a substring match (which is even more complicated to do in hardware). Also, unmatched CR or LF characters usually need to be treated specially. Plus, on a file with an average of 50 characters per line, this implies about 2% extra storage size or transmission time. I don't know if this has great economic repercussions, but it's definitely not free.
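A minimal sketch of the on-the-fly conversion described above, assuming the data is already in a buffer. The function name `crlf_to_lf` is made up for illustration, and passing lone CR or LF characters through unchanged is only one of several possible policies.

```c
#include <stddef.h>

/* Rewrites buf in place so that every CR LF pair becomes a single LF.
 * A lone CR or a lone LF is copied through unchanged; a real converter
 * would have to decide how to treat those.
 * Returns the new length, which is never larger than len. */
size_t crlf_to_lf(char *buf, size_t len)
{
    size_t out = 0;
    for (size_t in = 0; in < len; in++) {
        /* Finding the terminator needs a two-byte lookahead,
         * not a single-character comparison. */
        if (buf[in] == '\r' && in + 1 < len && buf[in + 1] == '\n') {
            continue; /* drop the CR; the LF is copied on the next pass */
        }
        buf[out++] = buf[in];
    }
    return out;
}
```

Note that a streaming version would also have to remember a CR seen at the very end of one chunk, since the matching LF might arrive in the next one.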

Another candidate is UTF-16 (a rather inconvenient encoding whose only point is backwards compatibility with UCS-2), together with Unicode adapting its size to it rather than the other way around. Because of it, every other Unicode implementation is required to treat the range D800-DFFF specially. It also limited Unicode to a bit more than one million code points, while UTF-8 had no problem going up to 2 billion; now those extra code points must be treated as invalid, which complicates UTF-8 validity checking.
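To make the special cases concrete, here is a small sketch of the two checks every decoder ends up carrying, plus the surrogate-pair arithmetic that causes them. The names `is_scalar_value` and `utf16_encode` are hypothetical, not taken from any library.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if cp is a Unicode scalar value: at most U+10FFFF and outside
 * the surrogate range U+D800..U+DFFF that UTF-16 reserves for itself. */
bool is_scalar_value(uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Encodes cp as UTF-16 into out. Returns the number of 16-bit code
 * units written (1 or 2), or 0 if cp is not a scalar value. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (!is_scalar_value(cp)) {
        return 0;
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate: bottom 10 bits */
    return 2;
}
```

The 20 bits carried by a surrogate pair, offset by 0x10000, are where the U+10FFFF ceiling comes from.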

Robert | Tue, 01 Sep 2015 16:22:42 UTC

Another thing to consider is the sequential access model used by "everything is a file" Unix, where typical programming idioms (such as those in the K&R book) revolve around reading 'until' some sentinel value. A file typically ends up terminated by EOF (represented as a 0 too), and you don't know where it's going to come unless you perform a length calculation first. Such a calculation is expensive on magnetic media, especially tapes. You could work around this with a length byte at the start of the string, but then whoever is writing the data in the first place has to either calculate the length beforehand or rewind to the start of the string and put in the length after the write.

rh- | Sat, 24 Jan 2015 04:13:55 UTC

"Using an address + length format would cost one more byte of overhead than an address + magic_marker format" Huh? There I stopped reading. At the time nobody was talking about strings longer than 256 chars, as not many toys had lots of memory space to waste, and the screens to print them on were only 40 or 80 chars wide, and either adding a null terminator or adding a size byte (as in Pascal or Ada) would add a byte anyhow. Talk of longer strings (which would waste two bytes for the size) came much later, and that was not true for long: when Unicode strings came, the null terminator would also waste two bytes. Lack of security? Same idiot forgetting to add the terminator would also forget to set the proper right. Also, when playing with strings, it would be a pain in the ass to always compute the right lengths; the backslash-zero solution is simpler and more elegant.

Terry Davis | Tue, 30 Sep 2014 02:12:49 UTC

I had a teacher who was convinced big-endian was correct. Just goes to show how stupid people can be. I like NUL-terminated. I am God's chosen.

Keith Thompson | Fri, 11 Apr 2014 19:34:11 UTC

> Should the C language represent strings as an address + length tuple or just as the address with a magic character (NUL) marking the end?

A C string isn't represented as address + marker. A C string is by definition "a contiguous sequence of characters terminated by and including the first null character". You can have a pointer to (equivalently, the address of) a string, but the string itself is not an address; it's a sequence of characters stored in an array. (And no, arrays are not pointers; see section 6 of the comp.lang.c FAQ, http://www.c-faq.com/, if you think they are.)

The point is that the proposed "address + length" representation would store information about the string in two different places.

An alternative might be Pascal-style counted strings; in C terms, you could store the length in the first (sizeof (size_t)) bytes of the char array.
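A rough sketch of what that layout could look like, assuming the length is copied into the first `sizeof (size_t)` bytes with `memcpy` to avoid alignment questions. The `cstr_*` names are invented for this example and are not part of any standard library.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted-string layout: [size_t length][length bytes].
 * There is no terminator, so the payload may contain '\0' bytes. */
char *cstr_new(const char *data, size_t len)
{
    if (len > SIZE_MAX - sizeof(size_t)) {
        return NULL; /* header + payload would overflow size_t */
    }
    char *s = malloc(sizeof(size_t) + len);
    if (s == NULL) {
        return NULL;
    }
    memcpy(s, &len, sizeof len); /* the length lives in the array itself */
    if (len > 0) {
        memcpy(s + sizeof(size_t), data, len);
    }
    return s;
}

/* O(1) length lookup: read the header back out. */
size_t cstr_len(const char *s)
{
    size_t len;
    memcpy(&len, s, sizeof len);
    return len;
}

/* Pointer to the first payload byte. */
const char *cstr_data(const char *s)
{
    return s + sizeof(size_t);
}
```

With this layout the length and the characters travel together in one object, so a single pointer is still enough to pass the string around; the caller releases it with `free`.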

Paul | Fri, 11 Apr 2014 16:18:09 UTC

How is the Therac-25 incident not the most expensive (also one-byte) bug?

theraot | Sat, 28 Jan 2012 13:35:28 UTC

Want to know Why?

This is the best explanation I have so far...

When B was designed there wasn't a standard for the size of a byte or the size of a word. The sizes went from 5 bits to 22 bits, and it became worse later on, for example the Cray-1 used 60 bits and the DEC Alpha 64 bits. And B was meant to be compilable on all the machines. We are talking about B here, so Brian Kernighan is innocent. As for Kenneth Lane Thompson and Dennis MacAlistair Ritchie... they decided that it was easier to have a null termination (a 0 of whatever number of bits it takes) than to have to manage two words, one for a pointer and one for the size.

Aside from that problem, which could have been solved by the compiler and virtual machine* anyway, they opted for immutable strings. Mutable strings seem like a good idea, for example if you want to take a substring, which is easy with a pointer and a size. But this means that you will have different pointers to the same area in memory, so you have to wait until you have stopped using all the substrings before freeing the memory of the main string. That means the complication of keeping track of the references, which is expensive when programming. Instead it is easier to make a copy of the string; now the new string is stored independently and doesn't impose any limitation on releasing the memory of the former. This makes the life of the programmer easier, so why not use immutable strings and make life easier for the compiler too? It is known that null-terminated strings are less efficient than pointer-and-length strings, because I need to copy the string each time I concatenate or take a substring, although developing the compiler was easier (because of the byte-size problem I mentioned earlier).
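The trade-off in this paragraph can be sketched in C as follows. `struct str_view`, `view_sub`, and `cstr_sub` are hypothetical names chosen for the illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Pointer + length: a substring is just another window onto the same
 * bytes, so the parent must stay alive as long as any view exists. */
struct str_view {
    const char *ptr;
    size_t len;
};

/* O(1), no copy. Returns an empty view if the range is out of bounds. */
struct str_view view_sub(struct str_view s, size_t start, size_t len)
{
    struct str_view v = { s.ptr, 0 };
    if (start <= s.len && len <= s.len - start) {
        v.ptr = s.ptr + start;
        v.len = len;
    }
    return v;
}

/* NUL-terminated: the terminator has to sit right after the substring,
 * so a substring in the middle of a string needs its own copy.
 * The caller owns the result and frees it independently of s. */
char *cstr_sub(const char *s, size_t start, size_t len)
{
    size_t n = strlen(s); /* O(n) just to learn the bounds */
    if (start > n || len > n - start) {
        return NULL;
    }
    char *copy = malloc(len + 1);
    if (copy == NULL) {
        return NULL;
    }
    memcpy(copy, s + start, len);
    copy[len] = '\0';
    return copy;
}
```

The view is cheap but ties the lifetimes of parent and substring together; the copy costs time and memory but can be freed on its own.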

* Remember that virtual machines were a new thing that appeared with BCPL, in which Ritchie participated (although BCPL didn't have strings per se), note that the development of B started in 1971 and was based on BCPL.

C inherited it from B and C++ inherited it from C, and Java, C#... inherited it from C++.

Today when we want to manipulate strings, unless we are doing a single operation, it is better to use a solution based on linked lists or arrays while we are crafting the string, and then retrieve the result as an immutable string.

Now, it is possible to make immutable strings with pointer and size, copying the string anyway to keep the implementation simple and avoid tracking references. In that situation having the length doesn't bring any benefit, and as mentioned earlier it is harder because of the disparity between the architectures of the computers of the time. Still, I know that it is much better to use pointer and length to display the string.

Today the situation is different: we all have bytes of 8 bits, virtual machines are common, and we have the power to use garbage collectors. Is it time to develop a new mechanism for strings? I don't know, just keep in mind that immutable strings are good for thread safety, and computers with multiple cores are common too.

Maybe I should mention that BASIC did use something similar to pointer and length: the first data at the destination of the pointer was the length of the string. There is lost content from MSDN (I saw it in the version for Visual Studio of '98) that explains that Microsoft decided to change the implementation of strings for BASIC (I think when they started to call it Visual Basic) because it increased performance when calling C++ libraries. [In fact, the old MSDN had a full discussion of the pros and cons of different ways to handle strings, including things like storing a pointer to the start and a pointer to the end, having a pointer after blocks of a constant number of characters to allow non-contiguous strings, and so on... sadly those articles seem to be lost forever].

Dennis Ritchie, rest in peace.

Dana | Tue, 03 Jan 2012 19:24:48 UTC

"...Using an address + length format would cost one more byte of overhead than an address + magic_marker format..."

Not really true. The null-terminated string also adds one byte at the end of the string: the null character. Address + length also limits the size of the string to 255 characters unless more than one byte is used as the length designator.
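The byte counting can be written down directly. This is only a sketch with made-up function names; it counts storage for the characters plus the terminator or the length field, and it relies on the fact that a one-byte length field can hold values 0 through 255.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to store n characters followed by a NUL terminator.
 * (Ignores the overflow case where n is already SIZE_MAX.) */
size_t nul_terminated_size(size_t n)
{
    return n + 1;
}

/* Bytes needed to store n characters behind a one-byte length field,
 * or 0 if n does not fit in that field. */
size_t byte_prefixed_size(size_t n)
{
    return n <= UINT8_MAX ? n + 1 : 0;
}
```

For any length that fits, both formats cost the same single extra byte; the difference only shows up once the length field has to grow.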

Bruno Vo_qui | Fri, 30 Dec 2011 15:18:28 UTC

May I say, the top post and comments are talking about 'in-band' string delimiters. But memory may be marked with 'out-of-band' properties, for instance ECC, read, write, execute and the like. What about an out-of-band 'isSameString' property?


© 2018 ACM, Inc. All Rights Reserved.