Saddling time-tested workhorse software with unexpected new burdens is often the best way to seize an opportunity or trounce a perennial annoyance, but adding radically new functionality to production code isn't easy. A practical case study spotlights principles that help us bolt new tricks onto old dogs.
Retrofitting means adding major new features to deployed artifacts that were not designed with those features in mind. Adding a trailer hitch to the family minivan doesn't count; replacing the rear seats with a jacuzzi does.
Retrofitting is difficult for several reasons. It begins with reading and understanding other people's code, which is much harder than writing your own. Then it requires designing and implementing major new features to harmonize with what's already there. By analogy, builders adding a wing to a house must uphold numerous invariants (building codes) and ensure that the final product looks like part of the original architect's vision, not like a collision with a shantytown. The existing structure might resist the retrofit, and it certainly won't assist.
Large retrofits happen infrequently in most coding shops, which is bad news for programmers striving to hone their skills. There's not enough practice on the job. Fortunately, it's easy to practice the art of retrofitting in the vast world of open-source software. Find a widely used, actively maintained program that would benefit from a medium-sized new feature that you would use. Scratch your own itch. If you implement the feature well, the maintainer might incorporate it into the official distribution.
Zawinski's Law:
Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can.
This episode of Drill Bits walks through such an exercise. The itch is a common data-analytics chore. Our scratching exposes general principles to guide retrofit projects. In the end, we deliver a powerful and general new feature for a venerable open-source utility.
Modern analytics crunches infinite streams of event-driven data that arrive piecemeal. Examples include datacenter telemetry, financial data from stock exchanges, and sensor readings from the Internet of Things. The first stage of analytics often involves smoothing incoming data, for example, by maintaining windowed averages over the latest arrivals on each of myriad streams. Getting stream analytics right requires rock-solid arithmetic, which leads to the starting line of our retrofit project.
Conventional fixed-width binary floating-point arithmetic is ill-suited to stream analytics for two reasons. First, many serious applications require decimal arithmetic. Mundane decimal numbers such as 0.1 do not admit exact binary representation,8 which isn't a mere curiosity or minor annoyance. Binary causes unacceptable errors in finance, accounting, and other domains where decimal inputs must yield correct decimal outputs. In figure 1, for example, C double arithmetic botches the computation of a 5% tax on a 70¢ purchase.5 Mathematically, $0.70 × 1.05 = $0.735, which conventionally rounds to $0.74. The rounding in figure 1 is wrong because in binary the result before rounding is 0.734999.... Misplacing a penny per transaction can lead to multimillion-dollar annual errors for real firms.5 Demanding financial applications have therefore long employed arbitrary-precision arithmetic.9
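The figure 1 pitfall is easy to reproduce. The following C sketch (ours, not the article's tarball code; the function name is illustrative) shows the binary double product of 0.70 and 1.05 landing just below 0.735, so rounding to cents yields $0.73 instead of the correct $0.74:

```c
#include <stdio.h>

/* Compute a 5% tax on a purchase and format the total to cents.
   For price = 0.70 and rate = 1.05 the mathematical result is 0.735,
   but the binary double product is 0.734999999999999986...,
   so the printed total is $0.73 rather than the correct $0.74. */
void print_total(double price, double rate)
{
    double total = price * rate;
    printf("%.17g\n", total);   /* exposes the representation error */
    printf("$%.2f\n", total);   /* prints $0.73 for 0.70 * 1.05 */
}
```

Calling `print_total(0.70, 1.05)` exhibits the misrounding; arbitrary-precision decimal arithmetic avoids it by representing 0.70 and 1.05 exactly.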
The second problem with fixed-width arithmetic is that high-volume data streams can trigger snowballing numerical errors. During the Gulf War of 1991, cumulative snafus in fixed-width arithmetic confused an air-defense system, allowing an Iraqi Scud missile to kill 28 Americans.1 Our example code tarball illustrates similar pitfalls that can afflict practical calculations involving compound interest and polynomial evaluation.
A calculator utility bundled with Linux distributions, GNU bc, offers arbitrary-precision decimal arithmetic behind an ergonomic no-nonsense interface.16 It's hard to beat for quick interactive calculations at the command line, and it supports non-interactive scripts as well. The previous episode of Drill Bits used bc extensively for combinatorial calculations over enormous integers begat by factorials.13 For arithmetic with fractions, bc computes results to any desired precision.

Python and other scripting languages have arbitrary-precision decimal arithmetic libraries, but bc scripts are clearer, smaller, and more convenient for simple chores. The lean bc interpreter is also noticeably faster than Python and has a much smaller memory footprint, which are important advantages where performance is a concern.
A bc script can easily compute a windowed average over a data stream. The script in figure 2 reads non-negative input numbers into variable x, holds the most recent window of w=3 inputs in array a, counts the total number of inputs received in n, and maintains the sum over the window's contents. The if/else line ensures that the denominator used to compute the windowed average, d, is correct both before and after the window fills; this line also subtracts from the sum numbers that "age out" of a full window. Below the script itself, shell variables IN1 and IN2 contain sample inputs. Piping these inputs into the script followed by a negative number yields output to the precision specified by the scale parameter. The script's mod() function is necessary because bc's modulus operator doesn't do what we want when scale is nonzero; auto designates p and r as local variables.
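The same sliding-window logic can be sketched in C. This is our illustrative translation of the figure 2 algorithm, not the article's script (which is bc); note that C's integer % operator needs no workaround analogous to the script's mod() function:

```c
#include <stdio.h>

#define W 3  /* window size, matching w=3 in the figure 2 script */

static double a[W];  /* window of the most recent inputs */
static double sum;   /* running sum over the window's contents */
static long   n;     /* total number of inputs received */

/* Ingest one input number and return the current windowed average. */
double winavg_step(double x)
{
    sum -= a[n % W];        /* subtract value aging out (0 until window fills) */
    a[n % W] = x;           /* overwrite oldest slot with the new arrival */
    sum += x;
    n++;
    long d = n < W ? n : W; /* denominator before and after the window fills */
    return sum / d;
}

/* A driver would read numbers until a negative sentinel, e.g.:
   while (scanf("%lf", &x) == 1 && x >= 0) printf("%g\n", winavg_step(x)); */
```

Feeding the inputs 1, 2, 3, 4 yields averages 1, 1.5, 2, and 3: once the window is full, each new arrival displaces the oldest of the three retained values.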
The script in figure 2 is adequate if all inputs are available at once, but it can't handle an infinite stream of data that arrives piecemeal. In figure 3, the script ingests the two input sequences separately and emits incorrect output as the window fills for the second time. The problem, of course, is that the window vanishes when the first invocation of the script terminates.
A bad solution would be to modify the script in figure 2 to dump the window to persistent storage as the script terminates and read the window back in when the script is next invoked. This approach suffers two shortcomings. First, it can be a gratuitous performance disaster. If input numbers arrive one at a time, it would perform work proportional to the window size for every input number, whereas the inherent computational burden of the task at hand is only constant work per input (O(windowsize) vs. O(1)). A timeless universal principle steers us away from trouble here: Strongly prefer asymptotically efficient solutions.
The second problem with the bad solution is that modifying every script with the same fundamental problem—amnesia—is a gratuitous programming disaster. Innumerable scripts would bloat with different homebrew Band-Aids for a single root problem. A second principle points toward a better solution: Solve a general class of problems exactly once, in the right place.
It's best to solve the amnesia problem once, efficiently and well, by retrofitting a new capability onto the bc interpreter: Enable the interpreter to save sufficient script-execution state between invocations so that scripts can simply pick up where they left off. This new capability will be a pure opt-in: If the user who invokes bc doesn't request it, bc will behave in its traditional forgetful fashion. Scripts themselves will include no code whatsoever to enjoy the new pause/resume capability; the interpreter will do all of the work, and users will activate the new feature (or not) at runtime. Finally, pause/resume won't be for scripts alone; it will also work for interactive bc sessions.
All of which, of course, is easier said than done. The new feature's description might be concise and clear, but the retrofit code might get hairy. Furthermore, an efficient design that performs only the minimum necessary work might be elusive. Fortunately, the bc codebase is reasonably small and tidy, so we weren't too intimidated to dive in and start contemplating different retrofit strategies.
We identified a set of variables that collectively represent the execution state of a bc script or interactive session. Preserving these variables across invocations is both necessary and sufficient to support our new pause/resume capability. Conveniently, the declarations and definitions of the crucial variables were already consolidated in global.h and global.c, but many other source files accessed the variables.
We rejected the blunt expedient of writing the execution-state variables to persistent storage when bc is about to terminate and then reading them back into memory at the next invocation, which suffers the same asymptotic inefficiency noted earlier for the windowed-average script: O(datasize) vs. O(deltasize). The less ham-fisted alternative of explicitly saving only changes to interpreter state would be difficult and error-prone; this approach would scatter change-tracking logic all over the bc code, violating the "solve-once/right-place" principle.
So, where in the bc code should we retrofit pause/resume? A roundabout answer springs from a third general principle: Respect the existing software architecture. The interpreter must somehow persist and later restore enough state to resume execution. Respecting the architecture means doing so in the place where bc already manages persistent state.
But there is no such place! The bc interpreter has no module, no software layer, no location where it manages persistent state. Instead, bc simply manipulates ordinary data structures (C variables and structs) in conventional anonymous/ephemeral DRAM-backed memory, as we would expect in a calculator. Respecting the architecture means that we can't clutter the code that manipulates in-memory data with new persistence logic, and we can't shoehorn a new persistent-storage module into an architecture that has nothing of the kind.
We can, however, swap an existing software layer for a drop-in replacement without dissing the architecture.
The right layer to swap lies immediately below the bc interpreter: the memory allocator. Most of the crucial execution state that must be preserved across invocations resides on the conventional heap managed by standard malloc and free; the exceptions can easily be moved onto the heap. Then, in principle, all that's necessary to retrofit pause/resume onto the interpreter is to replace the conventional ephemeral heap with a persistent heap.
Easier said than done? Yes, but not much.
Our retrofit uses a persistent memory allocator, pma, that has been deployed for several years in widely used software.11,12,14 The pma library provides drop-in pma_* replacements for standard malloc and free. We swap in the new persistent allocator by #define-ing malloc and free to their pma_ counterparts in a header #included in all source modules. Under the hood, pma allocates memory from a file-backed memory mapping. The backing file containing pma's persistent heap is called the heap file.
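The macro swap can be sketched as follows. The #define names match the article's description; the pma_* functions here are stand-in stubs operating in fall-back-to-malloc mode so that the sketch is self-contained (the real pma library maps a heap file instead):

```c
#include <stdlib.h>

/* Stand-in stubs for the pma allocator, shown here in its
   fall-back-to-malloc mode; the real library carves allocations
   out of a file-backed memory mapping. Defined BEFORE the macros
   below, so their bodies call the genuine malloc/free. */
void *pma_malloc(size_t n) { return malloc(n); }
void  pma_free(void *p)    { free(p); }

/* The retrofit's header, #include'd in every source module, redirects
   all allocation calls without touching the call sites themselves: */
#define malloc(n) pma_malloc(n)
#define free(p)   pma_free(p)
```

After this header is in effect, every existing `malloc`/`free` call site in the interpreter transparently allocates from (and returns memory to) the persistent heap.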
To activate persistent-memory mode in our retrofitted bc interpreter, the user sets a special new environment variable, BC_PM_HEAP_FILE, to the name of a heap file. If this envar is set, the pma initialization routine maps the given heap file into memory and prepares for persistent-memory mode. If the envar is not set, the initialization routine places the pma library in "fall-back-to-malloc" mode, which means that pma_malloc passes the buck to conventional malloc and pma_free similarly calls ordinary free. In other words, if the special new envar is not set, bc behaves as though our persistence retrofit had never happened.
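The envar-driven mode selection might look roughly like this sketch (the function and flag names are ours, not the pma library's; mapping the heap file is elided):

```c
#include <stdbool.h>
#include <stdlib.h>

/* true  => pma_malloc/pma_free forward to ordinary malloc/free
   false => allocations come from the file-backed persistent heap */
bool pma_fallback_to_malloc = true;

/* Hypothetical sketch of initialization as the article describes it:
   BC_PM_HEAP_FILE names the heap file and opts in to persistence. */
void pma_init_sketch(void)
{
    const char *heapfile = getenv("BC_PM_HEAP_FILE");
    if (heapfile == NULL) {
        pma_fallback_to_malloc = true;   /* classic, forgetful bc */
    } else {
        pma_fallback_to_malloc = false;
        /* mmap(2) the heap file and prepare the persistent heap here */
    }
}
```

Because the check happens once at startup, unmodified users pay nothing: the interpreter simply never consults the persistent-heap machinery.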
The trickiest aspect of the retrofit is enabling the existing bc code to access data on the persistent heap. Like all modern persistent heaps, pma requires applications to ensure that all allocated persistent memory is reachable from the persistent heap's root pointer. So, if the existing bc code manipulates a variable named "foo," which now resides on the persistent heap, how does the bc code find foo?
A two-step change tackles the problem of access via the root pointer. We first allocate on the persistent heap a new structure to contain all variables that must persist across invocations. The persistent heap's root pointer always points to this new struct. Second, we #include in all bc source files a new header that replaces straightforward accesses to formerly ephemeral variables with indirect accesses to retrofitted persistent variables:
#define foo root_pointer->foo
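Fleshed out slightly, the two-step change looks like this sketch (the struct and pointer names are ours, following the article's foo example):

```c
#include <stdlib.h>

/* Step 1: consolidate every variable that must survive across
   invocations into one struct, allocated on the (persistent) heap
   and always reachable from the heap's root pointer. */
struct persistent_globals {
    long foo;   /* formerly an ordinary ephemeral global variable */
    /* ... all other persistent execution-state variables ... */
};

struct persistent_globals *root_pointer;

/* Step 2: a macro lets existing code that says plain "foo"
   compile unchanged while actually touching the heap-resident copy.
   (The self-referential expansion is safe: the preprocessor does
   not re-expand foo inside its own replacement.) */
#define foo (root_pointer->foo)
```

An existing statement such as `foo = 42;` now silently reads and writes the struct on the persistent heap through the root pointer.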
When pma operates in fall-back-to-malloc mode, the new struct ends up on the conventional heap and everything Just Works.
The net result is that the existing bc source code scarcely changes at all. One of the few changes to the existing code involves the standard library function strdup, which allocates memory on the conventional heap. We replace the handful of strdups with a code snippet that calls pma_malloc and strcpy. In total, the retrofit adds roughly 110 lines of code to bc and modifies roughly 50 lines in an original code base of ≈6 KLOC.
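The strdup replacement is a few lines. In this self-contained sketch the helper name is ours, and pma_malloc is stubbed with ordinary malloc (i.e., fall-back-to-malloc mode):

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for the pma allocator, stubbed with malloc here. */
void *pma_malloc(size_t n) { return malloc(n); }

/* Duplicate a string onto the persistent heap, replacing strdup,
   which would allocate on the conventional heap instead. */
char *pm_strdup(const char *s)
{
    char *copy = pma_malloc(strlen(s) + 1);  /* room for the NUL */
    if (copy != NULL)
        strcpy(copy, s);
    return copy;
}
```

The point is not the copying itself but where the bytes land: a string duplicated with plain strdup would silently vanish when the process exits.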
Arbiters of style might wince at the use of C preprocessor #defines to redirect function calls and variable accesses. For our bc retrofit, however, macros minimize disturbance to the original code and thereby preserve its clarity. They also make it easy to rip out the retrofit later if we don't like it. Over-thinking the problem or over-engineering a solution would not yield superior results for our midsize project.
Returning to windowed averages, figure 4 shows that persistent-memory bc (pm-bc) correctly handles piecemeal inputs by preserving script execution state, including the window of recent arrivals, across invocations. The user activates persistence by making a heap file—initially a big file of zero bytes created by the truncate utility—and passing its name to pm-bc via the special new environment variable. An alternative way to pass the envar is to export it once prior to running pm-bc rather than at each invocation. Users who forget that they have exported the envar, however, may suffer astonishment when pm-bc preserves state across invocations.
Figure 4: pm-bc correctly handles intermittent arrivals
Persistent-memory bc is asymptotically efficient because the pma allocator lays out persistent heaps in memory-mapped files.11 When pm-bc updates its persistent heap, the underlying OS performs work proportional to the number of memory pages that change rather than the size of the persistent heap. More concisely, it performs O(deltasize) work, not O(datasize). And of course the persistence retrofit imposes no overhead whatsoever if the user does not explicitly activate persistence.
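The page-granularity argument can be demonstrated directly with a file-backed mapping. This POSIX sketch (ours; the heap-file name and size are illustrative) creates a mapping, dirties a single page, and flushes; the OS writes back only the dirtied pages, not the whole heap:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HEAP_BYTES (1 << 20)   /* illustrative 1-MiB heap file */

/* Create (or open) a heap file, map it, dirty one page, and flush.
   Returns 0 on success, -1 on any failure. */
int persist_demo(const char *heapfile)
{
    int fd = open(heapfile, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return -1;
    if (ftruncate(fd, HEAP_BYTES) != 0) { close(fd); return -1; }

    char *heap = mmap(NULL, HEAP_BYTES, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (heap == MAP_FAILED) { close(fd); return -1; }

    strcpy(heap, "window state survives here");  /* dirties one page */
    if (msync(heap, HEAP_BYTES, MS_SYNC) != 0) { /* flush dirty pages only */
        munmap(heap, HEAP_BYTES); close(fd); return -1;
    }
    munmap(heap, HEAP_BYTES);
    close(fd);
    return 0;
}
```

However large the mapping, the write-back cost tracks the handful of pages actually modified—O(deltasize) work, exactly the property pm-bc inherits from pma.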
Our pm-bc
retrofit showcases the power of principles. A junior programmer with no prior exposure to bc
or persistent memory (Su) pulled off the entire retrofit in a few weeks with minimal guidance from a persistent memory veteran (Kelly) and the bc
maintainer (Phil Nelson). The maintainer plans to include persistence in a forthcoming official release of GNU bc
. The key to success was the thoughtful application of a handful of timeless principles: Seek asymptotic efficiency, solve problems in the right place with minimal effort, and respect the architecture of existing software. Consider these principles for your own retrofit projects.
Retrofitting creates new functionality by adding to existing code. An alternative strategy is to innovate by deleting code. For example, Ken Thompson created the non-interactive grep utility by stripping down an interactive text editor, ed.18 For some years, Brian Kernighan required students in his advanced programming course at Princeton to replicate Thompson's creation of grep from ed, though starting with a C version, not the original assembly language.15
Perfection is attained not when there is nothing more to add, but when there is nothing more to remove.
— Antoine de Saint-Exupéry
Persistent memory isn't the only way to implement pause/resume. Alternatives range from good old-fashioned SIGSTOP/SIGCONT to more modern options such as CRIU (Checkpoint/Restore In Userspace)6 and DMTCP (Distributed MultiThreaded Checkpointing).7 Unfortunately, these alternatives are ill-suited to persistent scripting because they make it difficult for a resurrected interpreter to handle new inputs.17
Goldberg's classic survey catalogs the virtues and limitations of conventional fixed-width binary floating-point arithmetic.8 Demanding applications that can't accept the tradeoffs of conventional arithmetic might instead employ Boehm's uncompromising library, which evaluates expressions by computing intermediate results to whatever precision is needed to guarantee a desired precision in the final result.2,3 If fixed width isn't a problem but decimal arithmetic is required, the recently standardized decimal floating-point types in C23 might fit the bill.4
Grab the example code tarball at https://queue.acm.org/downloads/2024/Drill_Bits_14_example_code.tar.gz. You get an inner tarball containing pm-bc, the winavg.bc script of figure 2, an equivalent Python script, a shell script to compare the performance of the .bc and .py scripts, and a PDF document describing another gotcha of fixed-width floating-point arithmetic.
Drills: compare the performance of equivalent bc, Python, and Perl scripts; on Unix-like systems, the quick and dirty approach is /usr/bin/time python3 -c 'print(f"hello")'. Further drills extend the bc scripts, consider gdbm's mechanism,10 and contemplate #define YEET throw.
We thank bc maintainer Phil Nelson for valuable advice; Jon Bentley, John Dilley, Alan Karp, and Kevin O'Malley for brainstorming; and Brian Kernighan for historical information about ed and grep. Bentley, Dilley, Karp, Nelson, O'Malley, and Charlotte Zhuang reviewed drafts and provided valuable feedback. Dilley and O'Malley reviewed our code, and Lucas Stevenson re-implemented our "winavg.bc" script in Python.
pm-gawk user manual; see Section 4.1 for asymptotic efficiency (reference 12).
pma: a persistent memory allocator; http://web.eecs.umich.edu/~tpkelly/pma/.
GNU bc arbitrary-precision calculator; https://www.gnu.org/software/bc/.
grep; https://en.wikipedia.org/wiki/Grep.
Terence Kelly ([email protected]) enjoys splashing around in his minivan's jacuzzi.
Ziheng (Aaron) Su enjoys bolting new tricks onto old dogs.
Copyright © 2024 held by owner/author. Publication rights licensed to ACM.
Originally published in Queue vol. 22, no. 6.