Drill Bits

  Download PDF version of this article PDF

Literate Executables

Terence Kelly

Drill Bits is about improving real-world code. For example, recent installments enhanced widely used workhorses by adding new features: crash tolerance for a ubiquitous database6 and persistent memory for an indispensable utility.7 This episode casts a wider net. It proposes a simple convention for making compiled binary programs and libraries more useful, more transparent, and more manageable. A tool provided in this column makes it easy to adopt the new convention for your own software, and example code applies the tool to a program that you probably use every day.

 

Principles and Practice

The Free Software movement of the 1980s15 and the Open Source movement of the 1990s3 defined lofty ideals, radical at the time, that have since become mainstream. Nowadays, myriad high-quality programs offer—in principle—the freedom to analyze, recompile, modify, and redistribute those programs.

In practice, however, today's nominally free/open software falls short on the central pragmatic objective of the original movement. Finding the exact source code corresponding to binary executables should be easy for ordinary users but is remarkably difficult even for experienced developers. My informal survey of veteran programmers, including contributors to popular open-source packages, found that none could confidently pinpoint the source code of the programs and libraries that they use every day on major GNU/Linux distributions. Knowing that more or less relevant source is supposedly available, somewhere, is small comfort when you need the right code right now:

 

  $ opaque < input > output
  opaque: elusive.c:476: whatfun:
          Assertion `! confused´ failed.

 

Fortunately, it's not difficult to redress free software's lopsided success. We can keep the hard-won principled gains while restoring convenient and precise source access. There are many ways to tether binaries to source. The arcane and brittle alternatives incorporate dubious current practices: secreting source in the local file system far from the beaten $PATH or entrusting it to a distant server that is doomed eventually to go the way of all flesh. The simple and foolproof approach is to redefine the relationship between compiled binaries and corresponding source code to be that of chicken and egg: Insist that either can produce the other.

Literate executables are compiled programs and libraries that contain, and divulge on demand, the source code from which they were built. It's trivial to pinpoint the exact source corresponding to a literate executable, which in turn makes it easy to analyze, debug, recompile, and modify the software. The mechanics of literacy aren't specialized to source code, so literate executables may also contain their own documentation, sample inputs, test apparatus, or whatever other files we choose.

Literacy is surprisingly cheap and easy. Even to retrofit it onto complex legacy software requires only a wee bit of simple code, as we'll soon see. The main prerequisite is resolve: Convince yourself that if you're going to run free software then you ought to possess the source code and that there's no better place to keep the source than inside the corresponding executables. Folk wisdom says that an automotive owner's manual belongs in the glove box. Literate executables stash the complete blueprints there too.

 

Barnyard Biology

A source-code tarball is the egg in our analogy, inactive yet full of potential. It embodies a genotype that defines its species. Hatched by the build process, the chicken/executable assumes an individual phenotype that depends on the build environment and target architecture. For example, the grep-3.7.tar.xz tarball from ftp.gnu.org contains the source code for the current grep utility. The build process yields different grep binaries on Linux/Intel versus Solaris/SPARC, but their runtime behavior shows a family resemblance that attests to their common ancestry.

The small C program of figure 1, litr8x, implants eggs in chickens. Naturally, we'll arrange for litr8x to emit its own source, and together figure 1 (target program) and figure 2 (compile/implant/run script) illustrate the general technique for endowing literacy. The target program must statically allocate space for the egg—array E on line 6 of figure 1. When suitably invoked, the program will dump this array (lines 19–20), but first an egg must be implanted into it. Preprocessor #defines reserve just enough space and mark the array with a special string (line 6); compilation creates a chicken ready to receive an egg. The litr8x program memory-maps both chicken and egg binaries (lines 22 and 23), locates the marker in the chicken (line 24), and overwrites that location with the egg (line 25).

drill bits 8: literate executables
FIG 1 litr8x.c (comments and #includes omitted)

The shell script of figure 2 compiles and runs litr8x. Line 3 creates a marker string to designate the egg-implant location in the compiled program. This script uses a collision-resistant checksum of the source, but any marker that doesn't occur elsewhere in the compiled program will do. Lines 4 and 5 compile the program, passing the size of the egg (source file) and the marker string as preprocessor #defines.

drill bits 8: literate executables
FIG 2 Shell script to compile and test litr8x.c

Lines 6–8 sidestep an error that would arise if we tried to use the litr8x executable to implant an egg into itself: We can't open() a running executable for writing. So we copy the executable to /tmp/ (line 6), implant using the copy (line 7), and delete the copy (line 8). Thankfully, this rigmarole isn't necessary when litr8x implants programs other than itself.

Finally, lines 9 and 10 check whether the implant succeeded. Line 9 prints the litr8x.c checksum computed earlier on line 3. Line 10 invokes the litr8x executable with no arguments, which should cause it to lay its egg (lines 19 and 20 of figure 1), and pipes the output to the checksum program. If all goes well, the script of figure 2 produces identical checksums:

 

      $ ./run.sh
      source file: d981431fc0d599a0...
      exec output: d981431fc0d599a0...

Of course, directly invoking litr8x with no arguments yields an equally convincing demonstration: It prints its own source (figure 1 plus #includes and comments).

"In the bad old days well-intentioned programmers 'patched' binary object code..."
—Jon Bentley, reprinted in Knuth [8, p. 140]

Prudent coders are rightly apprehensive about patching a compiled binary using a homebrew tool; they audit the tool and its suggested use with exceptional care. Some developers, however, are so appalled on safety grounds that they reject the suggestion out of hand without considering the details. As my grandmother would say of anyone similarly paralyzed by timidity, "They'd be scared to put a wet child in the dryer!" Determine whether litr8x works as advertised; if you conclude that it does, use as directed with confidence.

 

Industrial-Strength Literacy

In addition to the litr8x program, the sample code for this column includes a script that retrofits literacy onto version 3.7 of the grep utility.4 This patch script adds 10 lines of code to file grep.c and adds a new build script that uses litr8x to create a literate executable grep. Literate grep can dump a tarball from which literate grep can be built, forming the cycle depicted in figure 3.

drill bits 8: literate executables
FIG 3 The literate execution cycle

A literate executable can spit out any subset of the tarball within, which in turn may contain whatever we choose. Literate grep, for example, can produce its own documentation as well as its own source code; standard utilities extract the man page:

 

  $ grep --dump-txz | xzcat \
    | tar xOf - grep-3.7.litr8x/doc/grep.in.1 \
    | man -l -

 

A literate executable could just as easily emit an instructional video. Source code for dynamically linked libraries is among the few things that a literate executable probably shouldn't contain, because such libraries are selected at runtime.

Storing odds and ends within executables keeps everything organized. The currently popular alternative is to scatter miscellany over many directories—bin/, lib/, man/, info/, and several others. Scattering sows confusion, especially if several variants of an executable are installed on the same system. The sysadmin's analogue of Occam's razor urges us to minimize the number of locations that must be kept in mutual harmony. Similarly, the SPOT rule precludes inconsistency and omission by mandating a single point of truth. To use a software artifact, you must possess the executable, and therefore the executable is a good place to keep the artifact's related files.

Literacy bloats binaries, but the absolute cost is affordable. On my Fedora laptop, for example, the total size of the source code for all installed software packages is around 7 GiB. All available packages—the whole nine yards plus the kitchen sink, which very few users need or want—are 84 GiB. A good one-terabyte SSD can be had for $100 nowadays, so fast storage for the source of my installed packages costs less than a dollar—probably the smartest buck I'll ever spend. Storing all available source costs less than $10, which is a modest price to pay for quick foolproof source access. Besides treading lightly on storage, the tarballs within literate executables impose zero runtime overhead: Modern operating systems demand-page an executable from storage into memory, so an egg leaves storage only when laid by its chicken.

 

Related Work and Play

Writing succinct self-printing programs, or quines,10 has been a coding challenge and a sophomoric art form since the dawn of compiled languages. Students were amusing themselves with self-printing FORTRAN by the early 1970s.8 Unix creator Ken Thompson indulged in this pastime with his college classmates; years later he began his Turing Award lecture with self-printing C code.16 Increasingly tiny and cryptic quines were a mainstay of the Obfuscated C contest9 until Szymon Rusinkiewicz forever ended this frivolous arms race with the provably smallest self-printing C program: a zero-length file.14 Literate executables repurpose self-printing for general software and for serious ends.

"Nothing will come of nothing."
—King Lear

Donald Knuth proposed literate programming to make code more readable for humans. His WEB system allows authors to intersperse Pascal and prose in one file, from which tools extract .tex for typeset documentation and code for the compiler; the code extractor weaponizes ugliness to discourage inspection and editing.8 Pretty-printed listings may misdirect creative energy toward typesetting flourishes at the expense of sound coding: Douglas McIlroy notes that a prominent early showcase of literate programming is grotesquely undersimplified.1 Norman Ramsey's noweb system deemphasizes pretty-printing and generalizes from WEB's Pascal to any programming language.11,12 Literate programming was used in several serious coding projects, such as David Hanson's C library,5 but didn't catch on as widely as its proponents had hoped. Literate executables borrow literate programming's best idea: keeping things together in one file.

The Reproducible Builds project seeks to "create an independently verifiable path from source to binary code," to ensure that a security audit of source code is relevant to the corresponding executable.13 Reproducible Builds tries to make build systems deterministic and well-defined (i.e., to guarantee a one-to-one mapping from genotype to phenotype). Literate executables work in the opposite direction, from executable to source, and permit one-to-many mappings from source to executable.

 

Drilling Deeper

With hands-on experience using litr8x, you'll learn to stop worrying and love cyclic dependency graphs like the one in figure 3.

 

Bits

Download the example code from https://queue.acm.org/downloads/2022/Drill_Bits_08_example_code.tar.gz. You get the litr8x.c source of figure 1, the compile/run script of figure 2, and the grep-literatizer script.

 

Drills

1. Pinpoint the source code of the grep utility on your computer. Prove to a skeptic that the source you identify was compiled into the executable.

2. Add a safety/sanity check to litr8x.c: Check that the size of the egg (in variable se) matches the size of the implant array (sizeof E, which should equal LITR8X_SIZE).

3. Following the example of my grep-literatizer script, retrofit literacy onto your favorite program or library.

4. Implement literacy for arbitrary Java code.

5. Is line 13 unlucky in figure 1? What if a size_t is four bytes but files can exceed four GiB? Should we static_assert() that size_t and off_t are the same size?

6. Check out NetBSD, where finding source code is allegedly easier than on GNU/Linux.

7. Read about MacOS Application Bundles and related concepts such as Application Directories. How do literate executables compare to the alternative of placing source code etc. in an Application Bundle?

8. Is the marking procedure of figure 2 risky? Might the marker of line 3 be present in the compiled binary somewhere other than the intended implantation site?

9. Prove that if file size is unbounded, then there must exist a text file that contains its own checksum. Create such a file; make it as small as possible.

"Put all your eggs in one basket and guard that basket very carefully."
—Andrew Carnegie

10. Think about security: Can literacy create new vulnerabilities? Can it make the attacker's job harder?

11. Implement in Python or Perl an "egg implanter" program to replace litr8x. Does Python or Perl make the coding easier or the program safer?

12. The example grep-literatizer script is C shell. Tom Christiansen says csh is bad for scripting.2 How would this script improve if rewritten in a Bourne-ish dialect?

13. Use the du utility to compare the storage footprint of a chicken/binary before and after egg implantation (to see a difference, your file system must support sparse files and you might need to fallocate --dig-holes to sparsify the executable between compilation and implantation).

14. Write a tool to remove the egg from a chicken, restoring the latter to be byte-for-byte identical to its pre-implant state. Use fallocate --dig-holes to reduce the de‑egg‑ified binary's storage footprint (requires sparse files).

 

Acknowledgments

Phil Nelson, author of gdbm, bounced around ideas with me before I wrote this paper, then reviewed a draft. Programming Pearls author Jon Bentley provided extensive feedback on a draft. Norman Ramsey, author of noweb, offered feedback on a draft. GNU grep maintainer Paul Eggert offered feedback on the grep-patching script in the example code.

 

References

1. Bentley, J., Knuth, D. E., McIlroy, D. 1986. Programming pearls: A literate program. Communications of the ACM 29(6), 471-483; https://dl.acm.org/doi/pdf/10.1145/5948.315654.

2. Christiansen, T. 1995. Csh programming considered harmful. Internet FAQ Archives; http://www.faqs.org/faqs/unix-faq/shell/csh-whynot/.

3. DiBona, C., Ockman, S., Stone, M., editors. 1999. Open Sources: Voices from the Open Source Revolution. O'Reilly Media.

4. GNU grep. 2022; https://www.gnu.org/software/grep/.

5. Hanson, D. R. 1997. C Interfaces and Implementations. Addison-Wesley.

6. Kelly, T. 2021. Crashproofing the original NoSQL key-value store. acmqueue 19(4); https://queue.acm.org/detail.cfm?id=3487353.

7. Kelly, T., Tan, Z.F., Li, J., Volos, H. 2022. Persistent memory allocation: leverage to move a world of software. acmqueue 20(2). https://queue.acm.org/detail.cfm?id=3534855. (Official gawk 5.2 uses my persistent memory allocator: https://ftp.gnu.org/gnu/gawk/gawk-5.2.0.tar.xz.)

8. Knuth, D. E. 1992. Literate Programming. Stanford Center for the Study of Language and Information. (See p. 11 for self-printing programs and p. 140 for weaponized ugliness.)

9. Libes, D. 1993. Obfuscated C and Other Mysteries. Wiley. (See pp. 313–315 for self-printing C program.)

10. Quine (computing). 2022; https://en.wikipedia.org/wiki/Quine_(computing).

11. Ramsey, N. 1994. Literate programming simplified. IEEE Software 11(5); https://www.cs.tufts.edu/~nr/pubs/lpsimp.pdf. (The author cautions: "Avoid the published version; it is littered with misprints.")

12. Ramsey, N. 2022. Noweb—a simple, extensible tool for literate programming; https://www.cs.tufts.edu/~nr/noweb/.

13. Reproducible Builds. 2022; https://reproducible-builds.org/.

14. Rusinkiewicz, S. 1994. The world's smallest self-replicating program. Guaranteed. https://www.ioccc.org/years.html#1994_smr.

15. Stallman, R. M. 2002. Free Software, Free Society. GNU Press; https://www.gnu.org/philosophy/fsfs/rms-essays.pdf#page=49.

16. Thompson, K. 1984. Reflections on trusting trust [Turing Award lecture]. Communications of the ACM 27(8), 761-763; https://dl.acm.org/doi/pdf/10.1145/358198.358210.

 

Terence Kelly ([email protected]) thanks his grandmother for using the "delicates" cycle.

Copyright © 2022 held by owner/author. Publication rights licensed to ACM.

acmqueue

Originally published in Queue vol. 20, no. 5
Comment on this article in the ACM Digital Library








© ACM, Inc. All Rights Reserved.