Saddling time-tested workhorse software with unexpected new burdens is often the best way to seize an opportunity or trounce a perennial annoyance, but adding radically new functionality to production code isn't easy. A practical case study spotlights principles that help us bolt new tricks onto old dogs.
Retrofitting means adding major new features to deployed artifacts that were not designed with those features in mind. Adding a trailer hitch to the family minivan doesn't count; replacing the rear seats with a jacuzzi does.
Retrofitting is difficult for several reasons. It begins with reading and understanding other people's code, which is much harder than writing your own. Then it requires designing and implementing major new features to harmonize with what's already there. By analogy, builders adding a wing to a house must uphold numerous invariants (building codes) and ensure that the final product looks like part of the original architect's vision, not like a collision with a shantytown. The existing structure might resist the retrofit, and it certainly won't assist.
Large retrofits happen infrequently in most coding shops, which is bad news for programmers striving to hone their skills. There's not enough practice on the job. Fortunately, it's easy to practice the art of retrofitting in the vast world of open-source software. Find a widely used, actively maintained program that would benefit from a medium-sized new feature that you would use. Scratch your own itch. If you implement the feature well, the maintainer might incorporate it into the official distribution.
Zawinski's Law:
Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can.
This episode of Drill Bits walks through such an exercise. The itch is a common data-analytics chore. Our scratching exposes general principles to guide retrofit projects. In the end, we deliver a powerful and general new feature for a venerable open-source utility.
Modern analytics crunches infinite streams of event-driven data that arrive piecemeal. Examples include datacenter telemetry, financial data from stock exchanges, and sensor readings from the Internet of Things. The first stage of analytics often involves smoothing incoming data, for example, by maintaining windowed averages over the latest arrivals on each of myriad streams. Getting stream analytics right requires rock-solid arithmetic, which leads to the starting line of our retrofit project.
Conventional fixed-width binary floating-point arithmetic is ill-suited to stream analytics for two reasons. First, many serious applications require decimal arithmetic. Mundane decimal numbers such as 0.1 do not admit exact binary representation,8 which isn't a mere curiosity or minor annoyance. Binary causes unacceptable errors in finance, accounting, and other domains where decimal inputs must yield correct decimal outputs. In figure 1, for example, C double arithmetic botches the computation of a 5% tax on a 70¢ purchase.5 Mathematically, $0.70 × 1.05 = $0.735, which conventionally rounds to $0.74. The rounding in figure 1 is wrong because in binary the result before rounding is 0.734999.... Misplacing a penny per transaction can lead to multimillion-dollar annual errors for real firms.5 Demanding financial applications have therefore long employed arbitrary-precision arithmetic.9
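The figure 1 pitfall is easy to reproduce. The following C sketch (ours, not the article's tarball code; the function name is illustrative) shows the binary double product of 0.70 and 1.05 landing just below 0.735, so rounding to cents yields $0.73 instead of the correct $0.74:

```c
#include <stdio.h>

/* Compute a 5% tax on a purchase and format the total to cents.
   For price = 0.70 and rate = 1.05 the mathematical result is 0.735,
   but the binary double product is 0.734999999999999986...,
   so the printed total is $0.73 rather than the correct $0.74. */
void print_total(double price, double rate)
{
    double total = price * rate;
    printf("%.17g\n", total);   /* exposes the representation error */
    printf("$%.2f\n", total);   /* prints $0.73 for 0.70 * 1.05 */
}
```

Calling `print_total(0.70, 1.05)` exhibits the misrounding; arbitrary-precision decimal arithmetic avoids it by representing 0.70 and 1.05 exactly.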
The second problem with fixed-width arithmetic is that high-volume data streams can trigger snowballing numerical errors. During the Gulf War of 1991, cumulative snafus in fixed-width arithmetic confused an air-defense system, allowing an Iraqi Scud missile to kill 28 Americans.1 Our example code tarball illustrates similar pitfalls that can afflict practical calculations involving compound interest and polynomial evaluation.
A calculator utility bundled with Linux distributions, GNU bc, offers arbitrary-precision decimal arithmetic behind an ergonomic no-nonsense interface.16 It's hard to beat for quick interactive calculations at the command line, and it supports non-interactive scripts as well. The previous episode of Drill Bits used bc extensively for combinatorial calculations over enormous integers begat by factorials.13 For arithmetic with fractions, bc computes results to any desired precision.

Python and other scripting languages have arbitrary-precision decimal arithmetic libraries, but bc scripts are clearer, smaller, and more convenient for simple chores. The lean bc interpreter is also noticeably faster than Python and has a much smaller memory footprint, which are important advantages where performance is a concern.
A bc script can easily compute a windowed average over a data stream. The script in figure 2 reads non-negative input numbers into variable x, holds the most recent window of w=3 inputs in array a, counts the total number of inputs received in n, and maintains the sum over the window's contents. The if/else line ensures that the denominator used to compute the windowed average, d, is correct both before and after the window fills; this line also subtracts from the sum numbers that "age out" of a full window. Below the script itself, shell variables IN1 and IN2 contain sample inputs. Piping these inputs into the script followed by a negative number yields output to the precision specified by the scale parameter. The script's mod() function is necessary because bc's modulus operator doesn't do what we want when scale is nonzero; auto designates p and r as local variables.
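The same sliding-window logic can be sketched in C. This is our illustrative translation of the figure 2 algorithm, not the article's script (which is bc); note that C's integer % operator needs no workaround analogous to the script's mod() function:

```c
#include <stdio.h>

#define W 3  /* window size, matching w=3 in the figure 2 script */

static double a[W];  /* window of the most recent inputs */
static double sum;   /* running sum over the window's contents */
static long   n;     /* total number of inputs received */

/* Ingest one input number and return the current windowed average. */
double winavg_step(double x)
{
    sum -= a[n % W];        /* subtract value aging out (0 until window fills) */
    a[n % W] = x;           /* overwrite oldest slot with the new arrival */
    sum += x;
    n++;
    long d = n < W ? n : W; /* denominator before and after the window fills */
    return sum / d;
}

/* A driver would read numbers until a negative sentinel, e.g.:
   while (scanf("%lf", &x) == 1 && x >= 0) printf("%g\n", winavg_step(x)); */
```

Feeding the inputs 1, 2, 3, 4 yields averages 1, 1.5, 2, and 3: once the window is full, each new arrival displaces the oldest of the three retained values.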
The script in figure 2 is adequate if all inputs are available at once, but it can't handle an infinite stream of data that arrives piecemeal. In figure 3, the script ingests the two input sequences separately and emits incorrect output as the window fills for the second time. The problem, of course, is that the window vanishes when the first invocation of the script terminates.
A bad solution would be to modify the script in figure 2 to dump the window to persistent storage as the script terminates and read the window back in when the script is next invoked. This approach suffers two shortcomings. First, it can be a gratuitous performance disaster. If input numbers arrive one at a time, it would perform work proportional to the window size for every input number, whereas the inherent computational burden of the task at hand is only constant work per input (O(windowsize) vs. O(1)). A timeless universal principle steers us away from trouble here: Strongly prefer asymptotically efficient solutions.
The second problem with the bad solution is that modifying every script with the same fundamental problem—amnesia—is a gratuitous programming disaster. Innumerable scripts would bloat with different homebrew Band-Aids for a single root problem. A second principle points toward a better solution: Solve a general class of problems exactly once, in the right place.
It's best to solve the amnesia problem once, efficiently and well, by retrofitting a new capability onto the bc interpreter: Enable the interpreter to save sufficient script-execution state between invocations so that scripts can simply pick up where they left off. This new capability will be a pure opt-in: If the user who invokes bc doesn't request it, bc will behave in its traditional forgetful fashion. Scripts themselves will include no code whatsoever to enjoy the new pause/resume capability; the interpreter will do all of the work, and users will activate the new feature (or not) at runtime. Finally, pause/resume won't be for scripts alone; it will also work for interactive bc sessions.
All of which, of course, is easier said than done. The new feature's description might be concise and clear, but the retrofit code might get hairy. Furthermore, an efficient design that performs only the minimum necessary work might be elusive. Fortunately, the bc codebase is reasonably small and tidy, so we weren't too intimidated to dive in and start contemplating different retrofit strategies.
We identified a set of variables that collectively represent the execution state of a bc script or interactive session. Preserving these variables across invocations is both necessary and sufficient to support our new pause/resume capability. Conveniently, the declarations and definitions of the crucial variables were already consolidated in global.h and global.c, but many other source files accessed the variables.
We rejected the blunt expedient of writing the execution-state variables to persistent storage when bc is about to terminate and then reading them back into memory at the next invocation, which suffers the same asymptotic inefficiency noted earlier for the windowed-average script: O(datasize) vs. O(deltasize). The less ham-fisted alternative of explicitly saving only changes to interpreter state would be difficult and error-prone; this approach would scatter change-tracking logic all over the bc code, violating the "solve-once/right-place" principle.
So, where in the bc code should we retrofit pause/resume? A roundabout answer springs from a third general principle: Respect the existing software architecture. The interpreter must somehow persist and later restore enough state to resume execution. Respecting the architecture means doing so in the place where bc already manages persistent state.
But there is no such place! The bc interpreter has no module, no software layer, no location where it manages persistent state. Instead, bc simply manipulates ordinary data structures (C variables and structs) in conventional anonymous/ephemeral DRAM-backed memory, as we would expect in a calculator. Respecting the architecture means that we can't clutter the code that manipulates in-memory data with new persistence logic, and we can't shoehorn a new persistent-storage module into an architecture that has nothing of the kind.
We can, however, swap an existing software layer for a drop-in replacement without dissing the architecture.
The right layer to swap lies immediately below the bc interpreter: the memory allocator. Most of the crucial execution state that must be preserved across invocations resides on the conventional heap managed by standard malloc and free; the exceptions can easily be moved onto the heap. Then, in principle, all that's necessary to retrofit pause/resume onto the interpreter is to replace the conventional ephemeral heap with a persistent heap.
Easier said than done? Yes, but not much.
Our retrofit uses a persistent memory allocator, pma, that has been deployed for several years in widely used software.11,12,14 The pma library provides drop-in pma_* replacements for standard malloc and free. We swap in the new persistent allocator by #define-ing malloc and free to their pma_ counterparts in a header #included in all source modules. Under the hood, pma allocates memory from a file-backed memory mapping. The backing file containing pma's persistent heap is called the heap file.
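The macro swap can be sketched as follows. The #define names match the article's description; the pma_* functions here are stand-in stubs operating in fall-back-to-malloc mode so that the sketch is self-contained (the real pma library maps a heap file instead):

```c
#include <stdlib.h>

/* Stand-in stubs for the pma allocator, shown here in its
   fall-back-to-malloc mode; the real library carves allocations
   out of a file-backed memory mapping. Defined BEFORE the macros
   below, so their bodies call the genuine malloc/free. */
void *pma_malloc(size_t n) { return malloc(n); }
void  pma_free(void *p)    { free(p); }

/* The retrofit's header, #include'd in every source module, redirects
   all allocation calls without touching the call sites themselves: */
#define malloc(n) pma_malloc(n)
#define free(p)   pma_free(p)
```

After this header is in effect, every existing `malloc`/`free` call site in the interpreter transparently allocates from (and returns memory to) the persistent heap.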
To activate persistent-memory mode in our retrofitted bc interpreter, the user sets a special new environment variable, BC_PM_HEAP_FILE, to the name of a heap file. If this envar is set, the pma initialization routine maps the given heap file into memory and prepares for persistent-memory mode. If the envar is not set, the initialization routine places the pma library in "fall-back-to-malloc" mode, which means that pma_malloc passes the buck to conventional malloc and pma_free similarly calls ordinary free. In other words, if the special new envar is not set, bc behaves as though our persistence retrofit had never happened.
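The envar-driven mode selection might look roughly like this sketch (the function and flag names are ours, not the pma library's; mapping the heap file is elided):

```c
#include <stdbool.h>
#include <stdlib.h>

/* true  => pma_malloc/pma_free forward to ordinary malloc/free
   false => allocations come from the file-backed persistent heap */
bool pma_fallback_to_malloc = true;

/* Hypothetical sketch of initialization as the article describes it:
   BC_PM_HEAP_FILE names the heap file and opts in to persistence. */
void pma_init_sketch(void)
{
    const char *heapfile = getenv("BC_PM_HEAP_FILE");
    if (heapfile == NULL) {
        pma_fallback_to_malloc = true;   /* classic, forgetful bc */
    } else {
        pma_fallback_to_malloc = false;
        /* mmap(2) the heap file and prepare the persistent heap here */
    }
}
```

Because the check happens once at startup, unmodified users pay nothing: the interpreter simply never consults the persistent-heap machinery.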
The trickiest aspect of the retrofit is enabling the existing bc code to access data on the persistent heap. Like all modern persistent heaps, pma requires applications to ensure that all allocated persistent memory is reachable from the persistent heap's root pointer. So, if the existing bc code manipulates a variable named "foo," which now resides on the persistent heap, how does the bc code find foo?
A two-step change tackles the problem of access via the root pointer. We first allocate on the persistent heap a new structure to contain all variables that must persist across invocations. The persistent heap's root pointer always points to this new struct. Second, we #include in all bc source files a new header that replaces straightforward accesses to formerly ephemeral variables with indirect accesses to retrofitted persistent variables:
#define foo root_pointer->foo
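Fleshed out slightly, the two-step change looks like this sketch (the struct and pointer names are ours, following the article's foo example):

```c
#include <stdlib.h>

/* Step 1: consolidate every variable that must survive across
   invocations into one struct, allocated on the (persistent) heap
   and always reachable from the heap's root pointer. */
struct persistent_globals {
    long foo;   /* formerly an ordinary ephemeral global variable */
    /* ... all other persistent execution-state variables ... */
};

struct persistent_globals *root_pointer;

/* Step 2: a macro lets existing code that says plain "foo"
   compile unchanged while actually touching the heap-resident copy.
   (The self-referential expansion is safe: the preprocessor does
   not re-expand foo inside its own replacement.) */
#define foo (root_pointer->foo)
```

An existing statement such as `foo = 42;` now silently reads and writes the struct on the persistent heap through the root pointer.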
When pma operates in fall-back-to-malloc mode, the new struct ends up on the conventional heap and everything Just Works.
The net result is that the existing bc source code scarcely changes at all. One of the few changes to the existing code involves the standard library function strdup, which allocates memory on the conventional heap. We replace the handful of strdups with a code snippet that calls pma_malloc and strcpy. In total, the retrofit adds roughly 110 lines of code to bc and modifies roughly 50 lines in an original code base of ≈6 KLOC.
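The strdup replacement is a few lines. In this self-contained sketch the helper name is ours, and pma_malloc is stubbed with ordinary malloc (i.e., fall-back-to-malloc mode):

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for the pma allocator, stubbed with malloc here. */
void *pma_malloc(size_t n) { return malloc(n); }

/* Duplicate a string onto the persistent heap, replacing strdup,
   which would allocate on the conventional heap instead. */
char *pm_strdup(const char *s)
{
    char *copy = pma_malloc(strlen(s) + 1);  /* room for the NUL */
    if (copy != NULL)
        strcpy(copy, s);
    return copy;
}
```

The point is not the copying itself but where the bytes land: a string duplicated with plain strdup would silently vanish when the process exits.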
Arbiters of style might wince at the use of C preprocessor #defines to redirect function calls and variable accesses. For our bc retrofit, however, macros minimize disturbance to the original code and thereby preserve its clarity. They also make it easy to rip out the retrofit later if we don't like it. Over-thinking the problem or over-engineering a solution would not yield superior results for our midsize project.
Returning to windowed averages, figure 4 shows that persistent-memory bc (pm-bc) correctly handles piecemeal inputs by preserving script execution state, including the window of recent arrivals, across invocations. The user activates persistence by making a heap file—initially a big file of zero bytes created by the truncate utility—and passing its name to pm-bc via the special new environment variable. An alternative way to pass the envar is to export it once prior to running pm-bc rather than at each invocation. Users who forget that they have exported the envar, however, may suffer astonishment when pm-bc preserves state across invocations.
Figure 4: pm-bc correctly handles intermittent arrivals
Persistent-memory bc is asymptotically efficient because the pma allocator lays out persistent heaps in memory-mapped files.11 When pm-bc updates its persistent heap, the underlying OS performs work proportional to the number of memory pages that change rather than the size of the persistent heap. More concisely, it performs O(deltasize) work, not O(datasize). And of course the persistence retrofit imposes no overhead whatsoever if the user does not explicitly activate persistence.
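The page-granularity argument can be demonstrated directly with a file-backed mapping. This POSIX sketch (ours; the heap-file name and size are illustrative) creates a mapping, dirties a single page, and flushes; the OS writes back only the dirtied pages, not the whole heap:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HEAP_BYTES (1 << 20)   /* illustrative 1-MiB heap file */

/* Create (or open) a heap file, map it, dirty one page, and flush.
   Returns 0 on success, -1 on any failure. */
int persist_demo(const char *heapfile)
{
    int fd = open(heapfile, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return -1;
    if (ftruncate(fd, HEAP_BYTES) != 0) { close(fd); return -1; }

    char *heap = mmap(NULL, HEAP_BYTES, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (heap == MAP_FAILED) { close(fd); return -1; }

    strcpy(heap, "window state survives here");  /* dirties one page */
    if (msync(heap, HEAP_BYTES, MS_SYNC) != 0) { /* flush dirty pages only */
        munmap(heap, HEAP_BYTES); close(fd); return -1;
    }
    munmap(heap, HEAP_BYTES);
    close(fd);
    return 0;
}
```

However large the mapping, the write-back cost tracks the handful of pages actually modified—O(deltasize) work, exactly the property pm-bc inherits from pma.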
Our pm-bc
retrofit showcases the power of principles. A junior programmer with no prior exposure to bc
or persistent memory (Su) pulled off the entire retrofit in a few weeks with minimal guidance from a persistent memory veteran (Kelly) and the bc
maintainer (Phil Nelson). The maintainer plans to include persistence in a forthcoming official release of GNU bc
. The key to success was the thoughtful application of a handful of timeless principles: Seek asymptotic efficiency, solve problems in the right place with minimal effort, and respect the architecture of existing software. Consider these principles for your own retrofit projects.
Retrofitting creates new functionality by adding to existing code. An alternative strategy is to innovate by deleting code. For example, Ken Thompson created the non-interactive grep utility by stripping down an interactive text editor, ed.18 For some years, Brian Kernighan required students in his advanced programming course at Princeton to replicate Thompson's creation of grep from ed, though starting with a C version, not the original assembly language.15
Perfection is attained not when there is nothing more to add, but when there is nothing more to remove.
— Antoine de Saint-Exupéry
Persistent memory isn't the only way to implement pause/resume. Alternatives range from good old-fashioned SIGSTOP/SIGCONT to more modern options such as CRIU (Checkpoint/Restore In Userspace)6 and DMTCP (Distributed MultiThreaded Checkpointing).7 Unfortunately, these alternatives are ill-suited to persistent scripting because they make it difficult for a resurrected interpreter to handle new inputs.17
Goldberg's classic survey catalogs the virtues and limitations of conventional fixed-width binary floating-point arithmetic.8 Demanding applications that can't accept the tradeoffs of conventional arithmetic might instead employ Boehm's uncompromising library, which evaluates expressions by computing intermediate results to whatever precision is needed to guarantee a desired precision in the final result.2,3 If fixed width isn't a problem but decimal arithmetic is required, the recently standardized decimal floating-point types in C23 might fit the bill.4
Grab the example code tarball at https://queue.acm.org/downloads/2024/Drill_Bits_14_example_code.tar.gz. You get an inner tarball containing pm-bc, the winavg.bc script of figure 2, an equivalent Python script, a shell script to compare the performance of the .bc and .py scripts, and a PDF document describing another gotcha of fixed-width floating-point arithmetic.
Drills: compare the performance of equivalent bc, Python, and Perl scripts; on Unix-like systems, the quick and dirty approach is /usr/bin/time python3 -c 'print(f"hello")'. Further drills extend the bc scripts, consider gdbm's mechanism,10 and contemplate #define YEET throw.
We thank bc maintainer Phil Nelson for valuable advice; Jon Bentley, John Dilley, Alan Karp, and Kevin O'Malley for brainstorming; and Brian Kernighan for historical information about ed and grep. Bentley, Dilley, Karp, Nelson, O'Malley, and Charlotte Zhuang reviewed drafts and provided valuable feedback. Dilley and O'Malley reviewed our code, and Lucas Stevenson re-implemented our "winavg.bc" script in Python.
pm-gawk user manual; see Section 4.1 for asymptotic efficiency (reference 12).
pma: a persistent memory allocator; http://web.eecs.umich.edu/~tpkelly/pma/.
GNU bc arbitrary-precision calculator; https://www.gnu.org/software/bc/.
grep; https://en.wikipedia.org/wiki/Grep.
Terence Kelly ([email protected]) enjoys splashing around in his minivan's jacuzzi.
Ziheng (Aaron) Su enjoys bolting new tricks onto old dogs.
Copyright © 2024 held by owner/author. Publication rights licensed to ACM.
Originally published in Queue vol. 22, no. 6.