
Sharpening Your Tools

Updating bulk_extractor for the 2020s

Simson Garfinkel, Jon Stewart

DF (digital forensics) is a fast-moving field with a huge subject area. A digital investigator must be able to analyze "any data that might be found on any device anywhere on the planet."12 As such, developers must continually update DF tools to address new file formats, new encoding schemes, and new ways that the subjects of investigations use their computers. At the same time, tools must retain the ability to analyze legacy data formats—all of them, in fact.

Most DF tools run on consumer desktop operating systems, adding another layer of complexity: These operating systems are also continually evolving. Analysts must update and upgrade their systems, lest they risk compromise by malware, which decreases productivity and can discredit an analysis in court. This is true even for workstations that are "air gapped" (not connected to the Internet), since malware in evidence can exploit bugs in forensic software.19

Surprisingly, open-source forensic tools distributed as source code face a greater challenge when the underlying operating system is upgraded: Software compatibility layers typically emphasize compatibility for the ABI (application binary interface), not source code. Software compiled from source must cope with upgraded compilers, libraries, and file locations. As a result, older open-source software frequently does not run on modern systems without updating. One way around this problem is to run the old software inside a virtual machine—but older virtual machines won't be protected against modern malware threats.

One advantage of open-source software is that the end user has the source code and is therefore able to update the application (or pay for a programmer to update the application). In practice, many users of DF tools lack the expertise, financial resources, and time to update the collection of open-source tools they rely upon to do their jobs. Instead, that task falls upon tool developers, who must simultaneously cope with essential changes in DF best practices as well as in operating systems, compilers, and libraries, while avoiding inadvertent changes to important functionality. Developers must also resist the urge for aggressive rewrites that add new expansive functionality, lest they succumb to the "second-system effect."5

This article presents our experience updating the high-performance DF tool BE (bulk_extractor),16 a decade after its initial release. Between 2018 and 2022, we updated the program from C++98 to C++17. We also performed a complete code refactoring and adopted a unit test framework.

The new version typically runs with 75 percent more throughput than the previous version, attributable to improved multithreading. This article provides lessons and recommendations for other DF tool maintainers. All developers can benefit from the detailed discussion of how embracing features in the C++17 standard and modern software engineering practices can improve the correctness, reliability, and throughput of forensic software. Businesses and funding agencies can use this experience to help justify the substantial cost of updating and even rewriting DF tools that appear to be working properly. Students can benefit from reading this article and then consulting the BE source code, which can be found on GitHub.

 

Background

A typical DF examination involves five steps: (1) policy and capability development; (2) evidence assessment; (3) evidence acquisition; (4) evidence examination; and (5) documentation and reporting.20 BE assists in the evidence examination stage.

There are many kinds of evidence examination tools. File-extraction tools use metadata to extract individual files from disk images and network streams; file-carving tools attempt to recognize files within bulk data, such as disk images and product files, based solely on content recognition; file-analysis tools understand file formats and attempt to extract information (often known as artifacts), such as text and Microsoft Office file metadata.

BE does not fit neatly into these categories. Instead, it was designed to be a so-called "find evidence button." It is similar to a file-carving tool in that it attempts to recognize known formats in bulk data and use that data in further processing. In addition to recognizing files, such as JPEG images, BE recognizes smaller "features," such as the EXIF (exchangeable image file) metadata within a JPEG image, or even an email address within an EXIF field. BE can also identify other kinds of identity information, such as URLs and credit card numbers: Such information has proven to be quite valuable in investigations. BE also examines every input block to see if it contains directory entry structures for the FAT32 (File Allocation Table 32) and NTFS (New Technology File System) file systems and, if any are found, reports the decoded metadata.

Overall, BE handles dozens of data formats, all at the same time. The program then constructs normalized Unicode histograms of important strings, such as email addresses and Internet search queries. Experience has shown that this "kitchen-sink" approach—throwing every tool at every byte—finds data that other tools miss, data that can be important in investigations. While such analysis is computationally expensive, it is embarrassingly parallel.

BE also exploits an exceedingly simple I/O model (sequential reads) and in-memory analysis. As a result, BE routinely uses all of the cores of a multicore workstation.

Another distinguishing aspect of BE is that it performs recursive reanalysis of data blocks. BE checks every byte to see if it is the start of a stream that can be decompressed or decoded; if so, the resulting bytes are then recursively reanalyzed. Thus, BE's JPEG carver finds not just ordinary JPEGs, but those that are in gzip-compressed data and those that are in Base64 MIME (Multipurpose Internet Mail Extensions) attachments. The combination of decoding data recursively and recognizing interesting data without regard to file-system structure makes BE a powerful tool that complements traditional forensics tools.

Because BE ignores file boundaries, the modules that it uses to recognize content, called scanners, are typically more complex than the format decoders (sometimes called dissectors) in other forensic programs. Each scanner, of course, validates every field of its input before using it in a memory reference. But BE scanners also check for end-of-memory conditions, since a scanner may be operating on a fragment of a decompressed memory block. And since BE processes memory in parallel, with each block in a different thread, all scanners must be reentrant.
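To make this concrete, here is a minimal sketch (ours, not BE's actual code) of the shape of a scanner's inner loop, using a simplified stand-in for the sbuf class described later in this article:

    #include <cstddef>
    #include <cstdint>

    struct sbuf_t {                   // simplified stand-in for BE's search buffer
        const uint8_t *buf;
        size_t bufsize;
        // Bounds-checked accessor: a read past the end returns 0 instead of faulting.
        uint8_t operator[](size_t i) const { return i < bufsize ? buf[i] : 0; }
    };

    // A reentrant scanner: no static or global mutable state, because many
    // threads run it concurrently on different blocks, and every read is
    // bounds-checked because the block may be a truncated decompression fragment.
    void scan_jpeg_sketch(const sbuf_t &sbuf) {
        for (size_t i = 0; i + 4 <= sbuf.bufsize; i++) {
            if (sbuf[i] == 0xff && sbuf[i+1] == 0xd8 &&
                sbuf[i+2] == 0xff && sbuf[i+3] == 0xe0) {
                // Candidate JPEG/JFIF header at offset i: validate the remaining
                // fields before following any embedded length values.
            }
        }
    }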

Some of the program's most important scanners are large lexical analyzers written in GNU flex (fast lexical analyzer generator)10 that scan bulk data for email addresses, phone numbers, MAC (media access control) addresses, IP addresses, URLs, and other kinds of formatted text strings (sometimes called selectors18). The approach of using GNU flex for this purpose was first used by SBook14 to recognize email addresses, phone numbers, and other formatted information in free-text address book entries, meaning that some of the code in BE is now 30 years old. (Related to Flex is the Bison parser generator, largely written by Robert Corbett and made yacc-compatible by Richard Stallman.9)

 

History

The BE approach for bulk data analysis was first deployed to find confidential information on a set of 150 hard drives purchased on the secondary market.17 The program was refined and made multithreaded to keep up with the increased number of hard drives and other storage devices collected during the construction of the Real Data Corpus.15 A user study then revealed the specific requirements of law enforcement: Most importantly, users wanted a tool that would run without user input on either Linux or Windows, produce output in text files, and not crash.13

BE realized these objectives with a modular architecture. The first part, the core engine, applies the content-recognizing scanners to blocks of data and provides for recursive reanalysis. The second part incorporates the program's main loop and all the code necessary for reading disk images; it reads data in overlapping blocks and feeds them to the framework's API. The third part is the scanners themselves, which can be compiled into the executable or loaded at runtime from shared libraries (.so files on Linux and macOS, .DLL files on Windows). The BE framework allows scanners to provide metadata, declare configuration variables, and modify the program's help messages, letting users create and deploy their own proprietary scanners.

 

BE in education

BE has been widely used in digital forensics education, as evidenced by the more than 400 instructional videos on YouTube, many of which showcase the results of student projects that use the tool.

BE is a successful tool in education undoubtedly because it is easy to use; runs on Windows, Mac, and Linux platforms; and finds a variety of forensic artifacts. For advanced students, BE has been used to provide data inputs for a range of projects. BE is also taught in DF professional development training courses.

 

BE in operational use

Since its creation, BE has been used by government agencies and private companies worldwide. In 2011, the program won a U.S. Department of Defense Value Engineering Award.21 The program is part of the Blacklight digital forensics tool7 and has been incorporated into BitCurator,4 which is used by curators in the digital humanities.

Stroz Friedberg's DFIR (Digital Forensics and Incident Response) consulting practice has used BE in some investigations. In a large incident response case, Linux servers with XFS file systems had been attacked, and no popular forensic tools could cope with file-system analysis of XFS. BE was used to triage these servers for relevant indicators of compromise, allowing for rapid progress in the early days of the investigation. In a well-known intellectual property theft case (Waymo vs. Uber), BE was used as one of several processes to scour forensic evidence for relevant material.

In summary, BE has been a powerful tool for more than a decade, with compelling anecdotes of usage, but little is known about how widely or regularly it has been used.

 

Updating BE

BE is a legacy C++ program. The producer-consumer thread pool was developed in 2008, and the underlying scanner-based architecture with recursive reanalysis was in place by 2009. All of this was done with versions of C++ based on the circa-1998 STL (Standard Template Library), well before the ratification of the C++11 standard.

Development of BE largely stopped in 2014. Nevertheless, software maintenance remained an ongoing concern: With each new release of an open-source operating system, the autoconf system typically required some changes so that BE would compile on the new system. By 2018, such changes were coming with alarming frequency. Based on this experience, the time seemed right to embark on an orderly update of BE to create version 2.0.

 

Update goals

We had specific goals for the update:

• Make the program easier to compile and maintain by relying on the C++ standard. The primary reason for the BE upgrade was that the program would no longer compile on modern open-source operating systems.

We were especially eager to rely on the C++ standard library to provide platform independence, because conforming C++ compilers guarantee that conforming code will compile in the future on platforms that do not exist today. They do this by having the build system specify the version of the standard to use when compiling and linking the executable: C++11, C++14, C++17, and so on.

Upgrading the existing code to a modern C++ standard required choosing a specific standard and replacing that code, which had been previously and painstakingly written, debugged, and maintained, with new code that used the C++ standard. This opened up the possibility of introducing bugs into working code, so a better strategy was needed for testing.

We first chose the C++14 standard, as complete C++17 implementations were not widely available when the upgrade started. The project dragged on long enough, however, that C++17 implementations became widely available, and we eventually migrated to C++17 for its std::filesystem support.

In 2021, we considered moving to C++20, but attempts to use specific features met with failure, so BE2 uses C++17.

• Simplify the codebase. Although the internal structure of BE was sound, the implementation was needlessly complicated in places, a result of 10 years' development. One way that we simplified the code was by revising the entire codebase so that pointers are passed only if nullptr is a valid value; otherwise, C++ references are passed exclusively.

• Remove experimental and research code from the codebase. BE was initially developed to support digital forensics research, and it contained a significant amount of experimental, research code. This code was removed for BE2: Experiments can continue, but they will be confined to using the plug-in system.

• Make BE run faster. The final objective was to decrease the amount of time that the program requires to run. Initially, the goal was for BE2 to run faster than BE1 on the same hardware. Further analysis revealed that BE2 could take better advantage of multiple processor cores without a corresponding need for high-performance I/O systems.

BE1's parallelism comes from breaking the disk image into 16MB chunks, called pages, each placed in a work queue for processing. A BE1 worker takes a page from the queue and runs each scanner on it sequentially. To increase parallelism in BE2, work units now specify both a page and a scanner, so that scanners for the same page can run concurrently in different threads. This required implementing reference-count garbage collection for the memory associated with the pages, as sketched below. The revised system allows new work units to be queued when decompressed or decoded blocks of data are recursively processed. The result is that more work can be done in parallel for each disk read—a good engineering decision, since modern high-performance systems typically have more processing capacity for each byte read per second than the systems of a decade ago.

BE can also process a directory of files. In BE1, each file was handled by a single thread. In BE2, all of the files are enumerated in advance and then processed in order, each one split into multiple pages and each page processed in parallel by multiple threads (the second sketch below shows a portable directory scan). Once again, this creates more opportunities for parallelism, which pays off handsomely on systems with a large number of cores.
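The following sketch illustrates the BE2 work-unit design just described; the type and field names are ours, not BE2's actual declarations. Here std::shared_ptr supplies the reference counting, so a page's memory is reclaimed when the last work unit that references it is destroyed:

    #include <cstdint>
    #include <memory>
    #include <vector>

    struct scanner_t;                      // a content recognizer (jpeg, email, ...)

    struct page_t {
        std::vector<uint8_t> data;         // one 16MB chunk of evidence
        uint64_t image_offset;             // where the chunk begins in the image
    };

    struct work_unit_t {
        std::shared_ptr<page_t> page;      // shared among this page's work units
        scanner_t *scanner;                // the single scanner this unit runs
    };

    // Recursive reanalysis fits the same mold: when a scanner decompresses or
    // decodes a block, the result becomes a fresh page_t, and one work_unit_t
    // per scanner is queued for it.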
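And because BE2 targets C++17, the advance enumeration of a directory of files can rest on std::filesystem rather than per-platform #ifdefs. This is a minimal sketch of such a portable directory scan, not BE2's actual code:

    #include <filesystem>
    #include <iostream>
    #include <vector>

    namespace fs = std::filesystem;

    // Recursively collect every regular file beneath root.
    std::vector<fs::path> list_evidence(const fs::path &root) {
        std::vector<fs::path> files;
        for (const auto &entry : fs::recursive_directory_iterator(root)) {
            if (entry.is_regular_file()) files.push_back(entry.path());
        }
        return files;
    }

    int main(int argc, char *argv[]) {
        for (const auto &p : list_evidence(argc > 1 ? argv[1] : "."))
            std::cout << p.string() << " " << fs::file_size(p) << "\n";
    }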

 

Improving code quality

As part of refactoring the codebase, we also improved the quality of the underlying C++ code.

We started by reading most of Bjarne Stroustrup's textbook, The C++ Programming Language.24 More than 1,000 pages long, this book is probably rarely read in its entirety. Moreover, it covers versions only through C++11, and we were using C++14 (and then C++17). However, BE was based on a version of C++ that predates the C++11 standard (as did its developers), and the changes between that version and C++11 are dramatic compared with those that followed. The familiarity that comes from reading such a textbook allows one to make better use of language features that are at once more efficient and safer.

The next step was to improve the efficiency, safety, and speed of BE's fundamental memory management C++ class, the sbuf (search buffer). This class represents a sequence of bytes that are read from evidence or decoded from another sbuf. The sbuf records how the contained memory was allocated (and thus, how it needs to be freed); provides accessor methods that are type-safe, memory-safe, and thread-safe; can make a new sbuf from a disk file, from a slice of another sbuf, or from new memory passed by a codec or decompressor; and provides rich debugging capabilities.

To improve encapsulation and provide for better code reuse, many functions were moved from scanners and the BE framework into the sbuf. Moving functionality into the sbuf made it possible to eliminate virtually all raw memory references throughout the rest of BE, as well as many redundant safety checks. (We considered but eventually decided against implementing parts of BE in the Rust programming language because of the extra complexity that would result.)
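The following condensed sketch conveys the flavor of the sbuf interface; the method names are approximations, and the real class additionally tracks how its memory was allocated (consult the BE2 source on GitHub for the actual interface):

    #include <cstddef>
    #include <cstdint>
    #include <stdexcept>

    class sbuf_t {
        const uint8_t *buf_;
        size_t size_;
    public:
        sbuf_t(const uint8_t *buf, size_t size) : buf_(buf), size_(size) {}
        size_t size() const { return size_; }

        // Type-safe, bounds-checked read of a little-endian 16-bit value.
        uint16_t get16uLE(size_t off) const {
            if (off + 2 > size_) throw std::out_of_range("sbuf: read past end");
            return static_cast<uint16_t>(buf_[off]) |
                   static_cast<uint16_t>(buf_[off + 1]) << 8;
        }

        // A slice is a window into the same memory: no copy, no new allocation.
        sbuf_t slice(size_t off, size_t len) const {
            if (off > size_) throw std::out_of_range("sbuf: slice past end");
            if (len > size_ - off) len = size_ - off;
            return sbuf_t(buf_ + off, len);
        }
    };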

Other improvements in the codebase include:

• Moving common code out of the scanners and into the underlying BE2 framework. For example, rather than each scanner having options for setting its carve mode, the framework understands how to set the mode for any named scanner.

• Simplifying the API, combining functions and methods with nearly identical functionality.

• Adding explicit phases into the API where the scanners allocate and free global memory. Now scanners are expected to deallocate all memory that they allocate during the run, rather than allowing the operating system to discard the memory when the process exits. This allowed the unit tests to find memory leaks that otherwise would have been missed.

• Passing strings by value as a std::string, rather than passing them by reference as const std::string &, as was done previously. This may necessitate a string copy, but the performance impact is not meaningful, especially when weighed against the improved safety against using an invalidated reference. This decision simplified code and eliminated several use-after-free errors. (We decided against the use of C++ smart pointers because of a thread-safety issue.) A sketch of this style appears after this list.

• Defining a clear allocation/deallocation policy for all objects in memory. Special attention was paid to the implementation of C++ move operations, allowing the compiler to use them in place of a copy followed by a destruction, for increased efficiency.

• Replacing code that was #ifdef'ed for Windows, macOS, and Linux with calls to the C++17 library (where possible). In particular, extensive use was made of std::filesystem. These changes made the code smaller and easier to validate.

• Likewise, replacing many preprocessor #define constants with C++ inline static constants whenever possible. This makes the values available to the debugger and makes the code easier to understand.

• Eliminating global variables used to track state. The only global variables that remain are static tables holding the precomputed values of CPU-intensive functions. The code that uses these tables now verifies that they have been initialized and throws an exception if they have not; essentially, they are now singletons. It would be nice to change the memory protection of these variables to read-only, but that cannot be done in a portable manner and would require that the variables have their own memory pages. Instead, read-only behavior is enforced at the language level using const correctness.

• In most cases, removing return codes that must be checked to detect errors. Instead, the C++ exception mechanism is used to signal and catch error conditions.

• Eliminating many explicit mutexes and replacing them with the C++ std::atomic<> template. In contrast, BE1.6 used GCC (GNU Compiler Collection) compiler intrinsics for atomic increment in some locations, but made broad use of mutexes to protect variables shared between threads.

• Removing the legacy POSIX getopt processing and replacing it with cxxopts,8 a command-line option processing module that is reentrant and does not use global variables. This was necessary to allow unit tests of the option processing.

• Enabling all compiler warnings, not simply those enabled with -Wall, which, despite its name, does not enable all warnings.
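As an example of the style these changes produce, here is a hypothetical before-and-after of a small framework function, combining the pass-by-value, inline-constant, and exception conventions from the list above (the names and signatures are illustrative, not BE2's actual API):

    #include <stdexcept>
    #include <string>

    // BE1 style (shown as a comment): a reference parameter plus an error
    // code that every caller had to remember to check:
    //     int feature_recorder_write(const std::string &feature, int *error);

    // BE2 style: the string is passed by value, so the callee owns a copy
    // that no other thread can invalidate, and failure is reported by throwing.
    inline const std::string FEATURE_SUFFIX = ".txt";  // an inline constant, formerly a #define

    void feature_recorder_write(std::string feature) {
        if (feature.empty()) throw std::invalid_argument("empty feature");
        feature += FEATURE_SUFFIX;
        // ... append the feature to the feature file ...
    }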

As a result of these changes:

• The BE2 configuration script now runs in 16 seconds on our reference system, a six-core 2019 Mac mini, instead of 25 seconds for the BE1.6 configuration script. This is not a significant improvement for users, but it is for BE developers. The compile time for both is 32 seconds using make -j12.

• The C++ codebase was reduced by roughly 10,000 lines, or 17 percent, even accounting for the lines added for the new unit tests. The program is now roughly 46,000 lines of C++ and GNU flex code.

 

Dynamic analysis with unit tests and test coverage

As hinted previously, despite its widespread use, BE lacked a modern approach to testing. Specifically, the BE codebase was devoid of systematic unit tests. Instead, the program was occasionally run on test datasets during the development process, and the output was manually compared with that from previous runs: If the two outputs were substantially similar, the program was deemed to be not obviously broken.

For BE2, we decided to implement unit tests for all levels of the BE source code, including low-level data manipulation routines, forensic scanners, result-reporting code, option processing, and end-to-end tests. A review of the C++ unit test frameworks on Wikipedia led to Catch2,6 which has support for test scaffolds, implements a minimal CLI (command-line interface), can test for the presence (or absence) of thrown exceptions, and appears to be well supported and maintained.
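A BE2-style unit test is short. The following sketch, written against the single-header Catch2 v2 distribution and testing a toy function of our own invention, shows the scaffold-free style and the exception checks that made the framework attractive:

    #define CATCH_CONFIG_MAIN        // Catch2 v2: generate main() in this file
    #include "catch.hpp"

    #include <stdexcept>

    static unsigned hex_value(char c) {   // toy function under test
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        throw std::out_of_range("not a hex digit");
    }

    TEST_CASE("hex_value decodes digits and rejects junk") {
        REQUIRE(hex_value('0') == 0);
        REQUIRE(hex_value('f') == 15);
        REQUIRE_THROWS_AS(hex_value('g'), std::out_of_range);  // exception test
    }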

We also enabled AddressSanitizer22 by default on the development system. (We enabled ThreadSanitizer23 and found several thread-sharing errors, but encountered a false positive resulting from a conflict between one of its heuristics and our multithreading paradigm, preventing it from being enabled by default.)

We started with unit tests for the BE2 framework, generally writing them as the new interfaces were designed and implemented, combining the creation of each new test with related refactoring. The code coverage of the unit tests was tracked and systematically increased, then integrated with GitHub's "actions" system, so that the tests would run on every push. The popular CodeCov.io website displayed the code coverage results.

After all of the new and refactored code had unit tests, the code coverage reports revealed which pieces of legacy code were not covered by the newly written unit tests. The target code coverage was 60 percent. In some cases, legacy code was covered by the new tests, because the new code called the old code. For roughly two-thirds of the legacy code, however, there was no test coverage, which we resolved by writing new tests from scratch. At first, this felt like a pointless compliance exercise—after all, BE had been in use for more than a decade, so we assumed that significant bugs, such as memory allocation errors and off-by-one errors, would not be present in the codebase. The act of writing the unit tests, however, forced us to clarify internal documentation, simplify internal implementations, and in some cases, eliminate legacy code that was no longer being used. We even found a few dormant bugs!

 

Removing functionality

In addition to removing experimental functionality, we improved the performance of BE by disabling scanners that were computationally intensive but rarely, if ever, extracted useful forensic data.

For example, we now disable the hiberfil (hibernation file) scanner (xpress decompression) by default, because we lack test vectors that could be used to demonstrate the correctness of our implementation, and because Windows may no longer be using the compression algorithm that BE1 implemented.

We also disabled (by default) scanning for 192-bit AES (Advanced Encryption Standard) keys in memory, because AES is rarely used in its 192-bit mode.

All of the features that are disabled by default can be reenabled with command-line options.

We also removed key functionality that was not being used:

• Internet search engines were used to see if some of the program's more obscure command-line options were being referenced in open-source programs, scripts, or even blog entries that provide tutorials for using BE. Obscure options that were unused by the user community were eliminated.

• To the best of our knowledge, no one (other than the original developer) ever used BE's shared library to let the program's scanner system be called from C++ or Python, so we removed it, although it could be reimplemented in the future.

• The ability to load scanners as shared libraries at startup has not been updated for BE2, although this update is trivial and will be implemented if users request it.

 

Incompatible changes

Despite efforts to retain full compatibility between BE1 and BE2, a few minor incompatible changes were required in the interest of correctness and modernization:

• BE feature files are UTF-8, but some of the information in them is binary and must be escaped. BE1 escaped non-Unicode bytes in octal; BE2 escapes them in hexadecimal (see the sketch after this list).

• A persistent problem is how UTF-16 features should be represented in the UTF-8 feature files. BE1 presented UTF-16 as octal-escaped values, which were hard to read. BE2 converts UTF-16 into UTF-8 in the second ("feature") column but leaves the features as (escaped) UTF-16 in the third ("context") column.

• The ZIP scanner was not properly reporting the location of artifacts within ZIP-decoded data blocks; this was corrected (addressed later in this article).

• We dropped support for MD5 (message-digest algorithm) in BE2, and now use SHA-1 (secure hash algorithm-1).2 (Sadly, the DF community has been slow to move to SHA-256 or SHA-3.)
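The sketch below illustrates the hexadecimal escaping convention from the first item above. It is deliberately simplified, passing only printable ASCII through unescaped, whereas BE2 also preserves well-formed multibyte UTF-8:

    #include <cstdio>
    #include <string>

    // Escape bytes that cannot appear literally in a UTF-8 feature file.
    std::string escape_feature(const std::string &raw) {
        std::string out;
        for (unsigned char ch : raw) {
            if (ch >= 0x20 && ch < 0x7f && ch != '\\') {
                out += static_cast<char>(ch);   // printable ASCII passes through
            } else {
                char buf[5];                    // "\xNN" plus the NUL terminator
                std::snprintf(buf, sizeof buf, "\\x%02X",
                              static_cast<unsigned>(ch));  // BE1 used octal here
                out += buf;
            }
        }
        return out;
    }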

 

Performance tuning

Despite the effort to eliminate all memory copies, an interim version of BE2 was dramatically slower than BE1.6. For example, scanning the 2009-domexusers disk image15 (an NTFS file system from a computer running Windows XP with two user accounts) on the reference Mac mini required approximately 10 minutes with BE1.6 but took 70 minutes with the BE2 development version.

BE has long had the ability to measure each scanner's contribution to runtime. Specifically, it keeps counters (in std::atomic<> variables) of how many times each scanner is called and how many nanoseconds it spends executing. These counters revealed that just three scanners (rar, net, and aes) were responsible for the vast majority of the time spent scanning.
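A sketch of this instrumentation pattern follows (the names and signatures are ours): two lock-free counters per scanner, updated with relaxed atomic increments so the bookkeeping itself stays cheap:

    #include <atomic>
    #include <chrono>
    #include <cstdint>

    struct scanner_stats {
        std::atomic<uint64_t> calls{0};   // how many times the scanner ran
        std::atomic<uint64_t> ns{0};      // total nanoseconds spent in it
    };

    template <typename Scanner, typename Sbuf>
    void timed_scan(Scanner &&scan, const Sbuf &sbuf, scanner_stats &stats) {
        auto t0 = std::chrono::steady_clock::now();
        scan(sbuf);                                      // run the scanner
        auto dt = std::chrono::steady_clock::now() - t0;
        stats.calls.fetch_add(1, std::memory_order_relaxed);
        stats.ns.fetch_add(
            std::chrono::duration_cast<std::chrono::nanoseconds>(dt).count(),
            std::memory_order_relaxed);
    }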

Each of these scanners has a hand-coded loop that moves through the memory image searching for a magic number. During refactoring, the loop had been reimplemented in a way that made a new sbuf for each location. We removed that loop and moved the scanning function into the sbuf class itself, which eliminated the need to create and destroy an sbuf for each character examined.
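The fix amounted to a conventional substring search implemented once, at the sbuf level, as sketched below (our rendering, not BE2's actual method):

    #include <cstddef>
    #include <cstring>

    // Conceptually a member of the sbuf class: return the offset of the first
    // occurrence of needle at or after 'start', or npos if it is absent.
    // No per-offset sbuf objects are constructed.
    size_t sbuf_find(const unsigned char *buf, size_t bufsize,
                     const unsigned char *needle, size_t nlen, size_t start) {
        static const size_t npos = static_cast<size_t>(-1);
        if (nlen == 0 || bufsize < nlen) return npos;
        for (size_t i = start; i + nlen <= bufsize; i++) {
            if (buf[i] == needle[0] && std::memcmp(buf + i, needle, nlen) == 0)
                return i;
        }
        return npos;
    }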

Once these changes were made, the rar scanner was no longer the slowest. Now the slowest scanners were net, aes, and the flex-based email and accts (but not the other flex-based scanners, curiously enough).

We iterated. With microbenchmarks and more testing, we were able to identify many other opportunities for speedup. We cannot stress enough that using the program to measure and improve its own performance turned out to be more effective than using off-the-shelf performance-monitoring tools.

 

Validation

The DF tool community is increasingly turning its attention to developing specifications and tests for various aspects of a tool's intended operation. After all, "a program that has not been specified cannot be incorrect; it can only be surprising."25

We performed two kinds of validation on BE2: correctness and throughput. For correctness, BE2 needed to produce results that were as good as the results of BE1. For throughput, BE2 needed to be at least as fast as BE1.

 

Correctness

Some of the differences between the outputs of BE1 and BE2 turned out to be cases in which BE2 was correct; it appeared that the BE1 output had never been validated in detail. Most of these differences had to do with the locations of recursively analyzed features in the feature file.

All DF tools require the ability to specify the location in evidence at which a feature, such as an email address, was found. With BE, the ability to specify such locations is complicated by the fact that a bytestream might need to be decoded, decompressed, or otherwise transformed. BE1 introduced the concept of a forensic path, which allows the specification of both a location and one or more transformations. For example, the BE forensic path 456536-ZIP-1255117 is read to mean that a feature is located 1,255,117 bytes into an inflated ZIP stream that is itself located 456,536 bytes from the beginning of the disk image.
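A forensic path is thus an alternating sequence of offsets and transformations. This sketch parses one (parsing only; BE's forensic path printer, described below, also applies each transformation to the evidence):

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    int main() {
        std::string path = "456536-ZIP-1255117";
        std::vector<std::string> parts;
        std::stringstream ss(path);
        for (std::string tok; std::getline(ss, tok, '-'); ) parts.push_back(tok);

        // Elements alternate: offset, codec, offset, codec, ..., final offset.
        // Here: seek to 456536, inflate the ZIP stream found there, then seek
        // 1255117 bytes into the inflated data.
        for (size_t i = 0; i < parts.size(); i++)
            std::cout << (i % 2 ? "decode with " : "offset ") << parts[i] << "\n";
    }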

During the development of unit tests, we discovered that some forensic path locations reported by BE1 failed to include six bytes of the ZIP header. This was corrected by adding new code so that the sbuf class computes the offset, rather than hard-coding the calculation into each scanner. This specific error was discovered while writing a unit test for the forensic path printer—the part of BE that reads a forensic path, performs the specified transformations, and produces a hexdump of the evidence. Although this code had been in use in BE for more than 10 years, apparently it had never worked properly for the ZIP scanner, and none of BE's users had ever reported it not working. (The gzip scanner reported forensic paths correctly.)

Many of the code paths in the BE1 codebase were painstakingly developed against specific test cases, but those cases had not been added to the codebase as unit tests. They were added in BE2, ensuring that the tests run automatically on every commit using GitHub's Actions continuous integration facility.

 

Throughput

Measuring the speed with which BE processes a disk image or other form of electronic evidence is straightforward. Explaining variations in speed is significantly harder. The time that BE spends processing evidence is highly dependent upon the contents: A disk image that contains many compressed archives will take longer to process because each compressed run of bytes will be decompressed and recursively reanalyzed. A disk that is filled with JPEGs will analyze quickly, but if carving mode 1 is enabled, each JPEG will need to be copied into a new file. If carving mode is set to 2 (the default), however, only the JPEGs that had to be decompressed or otherwise decoded—the JPEGs typically missed by other carving tools—will be copied.

BE also incorporates many techniques to discard data before applying the full recursive analysis. For example, duplicate data is not analyzed a second time, which has the side benefit of protecting the program from compression bombs. Likewise, pages that consist of a repeating n-gram (e.g., ABCABCABC...) will not be analyzed, unless a scanner indicates in its metadata that such analysis is desired.
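A sketch of the repeating n-gram test appears below; it is our illustration of the idea, not BE's implementation. A page that is a single n-gram repeated end to end, for some small n, carries nothing worth scanning:

    #include <cstddef>
    #include <cstdint>

    // True if buf consists of one repeating n-gram of length at most max_n.
    bool is_repeating_ngram(const uint8_t *buf, size_t len, size_t max_n = 8) {
        for (size_t n = 1; n <= max_n && n <= len / 2; n++) {
            bool repeats = true;
            for (size_t i = n; i < len; i++) {
                if (buf[i] != buf[i - n]) { repeats = false; break; }
            }
            if (repeats) return true;   // e.g., "ABCABCABC..." matches with n == 3
        }
        return false;
    }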

Another factor in performance is the computer on which the program is run. The number of CPU cores, amount of RAM, speed of that RAM, and speed of the I/O system all impact throughput. All of these factors interact with the evidence under examination: A disk image that has a lot of blank and repeated sectors will benefit more from a faster I/O system, while a disk image with a lot of complex data structures will benefit more from additional cores.

Therefore, throughput and benchmark results are best reported using evidence that is ecologically valid,1 such as an actual disk image. Although such media are commonly used in software development and internal benchmarking, they tend not to be publicly released for privacy reasons.

Table 1 shows the performance of BE1.6 and BE2 with three reference disk images from the Digital Corpora collection, running on three different reference computers. The disk images are nps-2009-ubnist1, a 2.1GB disk image of a bootable USB drive running Ubuntu Linux; nps-2009-domexusers, a 42GB disk image of a Microsoft Windows system that was used by several individuals in a lab; and nps-2011-2tb, a 2.0TB disk image containing the entire GovDocs1 corpus and several other of the Digital Corpora reference disk images. These images were made at the Naval Postgraduate School between 2009 and 2011, and are hosted on the Digital Corpora website (digitalcorpora.org).

[Table 1. Throughput of BE1.6 and BE2 on three reference disk images, measured on three reference Macintosh computers.]

Times for nps-2009-ubnist1 and nps-2009-domexusers are averages of three runs. The nps-2009-ubnist1 and nps-2009-domexusers runs read and write to the system SSD (solid-state drive), while the nps-2011-2tb runs read from the system SSD and write to an external USB3 hard drive because of storage considerations. BE1.6 speeds are reported for runs with the standard 30 default scanners enabled: accts, aes, base64, elf, email, evtx, exif, find, gps, gzip, hiberfile, httplogs, json, kml, msxml, net, ntfsindx, ntfslogfile, ntfsmft, ntfsusn, pdf, rar, sqlite, utmp, vcard, windirs, winlnk, winpe, winprefetch, and zip. BE2 speeds are reported both for its default configuration (the standard 29 scanners, with hiberfil and AES192 key searching disabled) and for the BE1.6-equivalent configuration that re-enables hiberfil and AES192 key searching. The Apple M1 Pro 10-core processor has eight performance cores and two efficiency cores. Throughput is normalized to the speed of BE1.6 on the same hardware with the same disk image; a throughput of 200 percent means that the disk image is analyzed in half the time.

Performance is reported for three Apple Macintosh computers. Both BE1.6 and BE2 were compiled on the computer on which the benchmark was run, using the current LLVM compiler provided by Apple. All compilation was done with optimization level -O3, with both AddressSanitizer and ThreadSanitizer disabled. In the default configuration, 30 scanners are enabled for BE1.6, while BE2 disables hiberfil and AES192 key searching; for this reason, BE2 is also reported with the BE1.6-equivalent configuration (the rightmost two columns). As Table 1 shows, BE2 is faster than BE1.6 in nearly every case, although the speedup is more pronounced on modern hardware with more cores.

 

Recommendations and Future Work

This multiyear exercise shows the value of updating tools to current software engineering practices, even when they appear to be working and bug-free. We recommend a scrub of all modern digital forensics tools, as rewriting these tools will likely make them faster and more reliable.

Reading Stroustrup's book was time-consuming preparation for this project, but well worth the investment. A similar benefit was derived from reading the entire Python reference manual prior to embarking on a large-scale Python project. Detailed reading of all developer documentation for implementation languages and tools is a good idea. Organizations investing in DF research and tools should also be prepared to invest for the long term, to provide for maintenance, adaptation, and growth of promising tools, as well as focused attention for developers.

The improvement in code quality that resulted from the pursuit of 60 percent unit test code coverage was stunning. The power of AddressSanitizer in finding a wide variety of bugs was also surprising. Test-driven development3 and test-driven refactoring11 should be adopted as primary tools, and AddressSanitizer should always be enabled during the development process.

The speed of C++ compared with Python is a clear incentive to use this language for speed-critical applications. Given the lack of C++ programmers in the DF community, however, it is clear that BE requires an interface to allow Python scanners to be called. Because Python is not thread-safe, a separate Python interpreter is required for each analysis thread. C++ should be used with well-designed classes to provide memory safety, and Python-based APIs should be provided to access their functionality.

We achieved 61 percent code coverage for the BE2 framework but only 47 percent for the remainder of the BE2 codebase. Clearly there is still room for improvement.

Finally, the increased use of file system-level compression and encryption, combined with the use of the TRIM command on SSDs, means that the bulk data analysis of raw storage devices is likely to yield less data in the future than a systematic extraction of bulk data from resident files. That is, running BE2 with the -r (recursive) option on a mounted file system may one day yield more useful information than running it on the raw device. Ideally, it would be possible to run BE2 on a file system, keep track of the sectors that were scanned, and then process the remaining sectors raw. Another approach would be to perform two passes: one on the mounted files and another on the raw device. Evaluation of these strategies is left as future work.

 

References

1. Andrade, C. 2018. Internal, external, and ecological validity in research design, conduct, and evaluation. Indian Journal of Psychological Medicine 40(5), 498–499; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6149308/.

2. Barker, R., Roginsky, A. 2019. Transitioning the use of cryptographic algorithms and key lengths. National Institute of Standards and Technology, Special Publication 800-131A, revision 2; https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-131Ar2.pdf.

3. Beck, K. 2002. Test-Driven Development: By Example. Addison-Wesley Professional.

4. BitCurator; https://bitcurator.net/.

5. Brooks, F. P. 1975. The Mythical Man-Month—Essays on Software Engineering. Addison-Wesley.

6. Catch2. GitHub; https://github.com/catchorg/Catch2.

7. Cellebrite. Blackbag technology software user license agreements; https://cellebrite.com/en/blackbag-agreements/.

8. cxxopts. GitHub; https://github.com/jarro2783/cxxopts.

9. Donnelly, C., Stallman, R. M. Bison—the Yacc-compatible parser generator. Free Software Foundation; https://www.gnu.org/software/bison/manual/; ftp://ftp.gnu.org/pub/gnu/bison/.

10. Flex—fast lexical analyzer generator, GNU software package; https://github.com/westes/flex.

11. Fowler, M. 2018. Refactoring: Improving the Design of Existing Code, Second Edition. Addison-Wesley Professional.

12. Garfinkel, S. L. 2013. Digital forensics. American Scientist 101(5), 370–377; https://www.americanscientist.org/article/digital-forensics.

13. Garfinkel, S. L. 2013. Digital media triage with bulk data analysis and bulk_extractor. Computers and Security 32(C), 56–72; https://dl.acm.org/doi/10.5555/2748150.2748581.

14. Garfinkel, S. 1992. SBook: Simson Garfinkel's Address Book, Version 2.0, Simson Garfinkel and Associates; https://simson.net/ref/1992/SBook20.pdf.

15. Garfinkel, S. L., Farrell, P., Roussev, V., Dinolt, G. 2009. Bringing science to digital forensics with standardized forensic corpora. In Digital Investigation, Proceedings of the Ninth Annual Digital Forensic Research Workshop 6 (supplement); https://www.sciencedirect.com/science/article/pii/S1742287609000346.

16. Garfinkel, S. L. 2013. Digital media triage with bulk data analysis and bulk_extractor. Computers and Security 32(C), 56–72; https://dl.acm.org/doi/10.5555/2748150.2748581.

17. Garfinkel, S., Shelat, A. 2003. Remembrance of data passed. IEEE Security and Privacy 1(1), 17–27; https://dl.acm.org/doi/abs/10.1109/MSECP.2003.1176992.

18. Liu, J., Moore, R. T. 2015. An overview of the NSA's declassified intelligence oversight board reports. Lawfare; https://www.lawfareblog.com/overview-nsas-declassified-intelligence-oversight-board-reports.

19. Marlinspike, M. 2021. Exploiting vulnerabilities in cellebrite UFED and physical analyzer from an app's perspective. Signal; https://signal.org/blog/cellebrite-vulnerabilities/.

20. National Institute of Justice. 2004. Forensic examination of digital evidence: a guide for law enforcement; https://www.ojp.gov/pdffiles1/nij/199408.pdf.

21. Office of Deputy Assistant Secretary of Defense, Systems Engineering. 2011. Value engineering: a guidebook of best practices and tools; https://www.usace.army.mil/Portals/2/docs/Value%20Engineering/DoD%20SD-24_VE%20Handbook.pdf.

22. Serebryany, K., Bruening, D., Potapenko, A., Vyukov, D. 2012. AddressSanitizer: a fast address sanity checker. In Proceedings of the Usenix Annual Technical Conference, 28; https://dl.acm.org/doi/10.5555/2342821.2342849.

23. Serebryany, K., Iskhodzhanov, T. 2009. ThreadSanitizer: data race detection in practice. In Proceedings of the Workshop on Binary Instrumentation and Applications, 62–71; https://dl.acm.org/doi/10.1145/1791194.1791203.

24. Stroustrup, B. 2013. The C++ Programming Language, Fourth Edition. Addison-Wesley; https://www.stroustrup.com/4th.html.

25. Young, W. D., Boebert, W., Kain, R. 1985. Proving a computer system secure. The Scientific Honeyweller 6(2), 18–27.

 

Simson Garfinkel has current research interests in applying AI to digital forensics and data governance. He is an ACM Fellow and a member of the National Association of Science Writers.

Jon Stewart is a vice president with Aon Cyber Solutions and leads its software development efforts within its Stroz Friedberg digital forensics consulting practice.

Copyright © 2023 held by owner/author. Publication rights licensed to ACM.

Originally published in Queue vol. 21, no. 1




