Freshly brewed coffee steaming from my favorite mug and an open newspaper in my hands, I am ready to enjoy this beautiful Sunday. I take a bite from my croissant and flip to the next page.
Excuse me? Did they recall thousands of cars? What model? My model! The one I just leased for the upcoming three years—great! I had convinced my wife that this car would be the better option for us, but now it may cause us trouble before we even venture on our first trip.
My lazy Sunday is over in a blink. I grab the car keys from the kitchen drawer, jump into the driver's seat, and jet down the newly paved road, heading to our trusted car dealership. Annoyed but also excited—will I get some money back?
At the dealership, the mechanic shares his surprising diagnosis: NOT AFFECTED!
How is that possible? I'm driving the exact model they describe in the recall campaign.
Well, as I learned later, the subsidiary of a particular manufacturing plant was supplied with a brittle component because a contractor had used expired adhesives. Not every vehicle in a model line is assembled at the same plant, though. Two cars of the exact same model may have two different components installed—if both comply with the approved specification. My car was fine, and I drove home with a smile.
This endeavor gave me a new appreciation for the strongly process-driven mechanical engineering industry, which surgically builds reliable systems and meticulously tracks contributors, components, and materials.
If only I knew what applications are deployed where in which datacenter, and what libraries they each depend on. As a threat analyst, all I'd need to do is read my favorite vulnerability feed and perform string matching between the vulnerabilities and components.
String matching can't be that difficult. But what are we matching on? What is the intrinsic identity of a software component? Does it change when developers copy and paste the source code instead of fetching it from a package manager? Is every package-manager request fetching the same artifact from the same upstream repository mirror? Can we trust that the source code published along with the artifact is indeed what's built into the release executable? Is the toolchain kosher?
So, what is software?
We know that hardware engineering is highly constrained by the laws of physics, but is software engineering really free of rules? One definition, posed by Fred Brooks in the all-time classic, The Mythical Man-Month, suggests: "Software is a set of instructions, data, or programs used to operate computers and execute specific tasks. It is the opposite of hardware, which describes the physical aspects of a computer."2,15 So, software does something with hardware—but, like a soul without a body, software is nothing without hardware.
Academics agree that we do not have a good definition of software. A peer-reviewed Springer Nature journal recently published work by a philosopher who concluded that the closest relative of software is text. Its semantics are mapped onto the grammar posed by the target architecture, but without a chip to run on, well, it's nothing but text.11
Just like the spoken word, text can be repeated indefinitely and without any effect on the original. For instance, copying this article will not make it disappear. In the context of software, this paradox is infamously problematic for the detection of data breaches: Stolen data, like a password, is not lost when it's lost and may still be available for uninterrupted usage by its legitimate owner. The same holds true for all software, making it difficult to grasp.
So, software does not obey the laws of the natural sciences—physics, chemistry, or even electrical engineering. It does not fall to the ground; it does not melt under heat; and it certainly does not need to breathe. Instead, software is highly theoretical and obedient to the intricate rules of mathematics. Yes, writing a software program can be learned in summer schools and at coding boot camps, but these settings abstract away the true complexity of software for the sake of simplicity. The true power of software is unleashed at scale, when large amounts of data are processed from globally distributed origins at the speed of light.
In software engineering, when scale changes, everything changes. Scaling software is intuitively a compositional and dynamic exercise, but software itself is also highly compositional and dynamic. A single application can be made up of multiple self-contained components that need to be obtained from different local and remote locations.7 High levels of software reuse connect large numbers of developers from various backgrounds, bringing them together as communities—but communities at scale have their very own (large) problems.
If that's not enough, multiple technical communities coexist and collaborate, and divergence among them is by design, as one community intentionally fills the gaps of another—as one programming language serves the need left by another. Although software does not exist per se, you can say humans inherently drive it. You can even say that it inherits human properties. Many would agree with Melvin Conway's observation that organizational structure is reflected in solution design.4 Subsequently, community-created open source software reflects its community—its contributors' relationships, status, geographic belonging, and political convictions.
Circling back to my initial contemplation about tracking software components, I realize that tracking automobile parts is inherently different from tracking software, because the logistics of physical parts differ fundamentally from the distribution of software, for at least two reasons:
• The logistics of automobile parts move the physical location, and ultimately the possession, of a part from one party to another in a 1:1 mapping, whereas software is commonly published by one party yet consumed by many others in a 1:n mapping, where the publisher does not even know about all copies or users. Once it's out there, it's out there.
• The parties involved in car manufacturing can be expected to be legal entities contractually agreeing on the logistics and the possession of the parts, whereas software is commonly published to the community by individuals, practitioners, and volunteers, who can simply put it out there at any time. And, again, once it's out there, it's out there. (Yes, open-source software supposedly has a license that technically establishes two or more legal entities contractually agreeing, but for free open-source software, that agreement is generally a copyleft agreement that excludes maintenance liability of the publisher; i.e., it says, "It's out here; you can use it at your own risk.")
When you combine both differences, you are led to a state of repeatedly redistributed, unregistered (software) components—similar to black-market car parts with their serial numbers removed—except that they are impossible to herd because of their immaterial nature and their ability to morph. This is a problem that supposedly breaks down to bookkeeping, which may indeed exist for commercially distributed software, but for free open-source software, there's no central register of easy-to-use identifiers that facilitate accurate matching at scale.
Historically, we validate the file hashes of downloads or verify their cryptographic signatures to attest integrity and establish accountability. This process is simple string matching. Accordingly, modern version-control systems on top of the git protocol use (hash-based) gitoids to uniquely identify blobs.6
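To make that concrete, here is a minimal sketch of how such a hash-based identifier can be computed, assuming the conventional git blob construction ("blob <size>", a NUL byte, then the file contents); OmniBOR-style gitoids reportedly follow the same construction, with SHA-1 or SHA-256 variants. The file name in the example is hypothetical.

```python
import hashlib

def gitoid(path: str, algorithm: str = "sha1") -> str:
    """Compute a git-style object ID for the blob stored at `path`."""
    with open(path, "rb") as f:
        content = f.read()
    header = f"blob {len(content)}\0".encode()  # git's blob header
    return hashlib.new(algorithm, header + content).hexdigest()

# A single flipped bit in the file yields a completely uncorrelated digest.
# print(gitoid("openssl-1.0.1f.tar.gz"))            # hypothetical file name
# print(gitoid("openssl-1.0.1f.tar.gz", "sha256"))  # OmniBOR-style variant
```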
Hash-based identifiers are useful to detect changes, but note that a single bit flip in the source code will yield a new hash that's completely uncorrelatable to its parent—let alone that hashes are not human-readable. String matching may be feasible, but the transitive nature of software makes it difficult to scale this approach.
Instead, there are several use cases for which we crave software identifiers that intentionally over-approximate what we are trying to describe with them. My all-time favorite (mainly because I started my career around that time) is the Heartbleed vulnerability from 2014, affecting OpenSSL 1.0.1 through 1.0.1f (fixed in 1.0.1g). Given that the versioning scheme appends the letters a to f, there are at least seven versions of OpenSSL to be aware of. Realistically, though, not every change to the source gets versioned and released; additionally, releases of the same vulnerable source may be compiled by different toolchains, with different optimizers and configurations, for different architectures, etc., all being the same software but having very different hashes. A quick web search shows that the community has already picked up this challenge. Today, there are various software identifiers for various use cases.
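To see why such over-approximation is needed, consider a deliberately naive matching sketch for the Heartbleed range. The hard-coded name and version list stand in for whatever a real identifier scheme would have to supply, and the letter-suffix enumeration is specific to OpenSSL's versioning at the time.

```python
# Naive name-and-version matching for the Heartbleed range (illustrative only).
AFFECTED_NAME = "openssl"
AFFECTED_VERSIONS = ["1.0.1"] + [f"1.0.1{c}" for c in "abcdef"]  # 1.0.1 .. 1.0.1f

def is_affected(name: str, version: str) -> bool:
    return name.lower() == AFFECTED_NAME and version in AFFECTED_VERSIONS

print(is_affected("OpenSSL", "1.0.1f"))       # True
print(is_affected("OpenSSL", "1.0.1g"))       # False: the fixed release
print(is_affected("openssl-fips", "1.0.1f"))  # False, even if it ships the same code
```

The last call hints at the real problem: a rebuilt, renamed, or vendored copy of the same vulnerable source slips straight through this kind of matching.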
One approach is to catalog acknowledged software identifiers in a publicly available directory, like in a phone book.9 The security analyst who discovered Heartbleed could then have assigned it to OpenSSL's entry in the phone book, and everyone looking up OpenSSL after this would find the associated security concern with it.
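Here is a small sketch of that phone-book idea, using CPE 2.3-style strings as keys. The entries are illustrative stand-ins, not authoritative NVD data.

```python
# Illustrative directory mapping CPE 2.3-style identifiers to known CVE IDs.
DIRECTORY = {
    "cpe:2.3:a:openssl:openssl:1.0.1f:*:*:*:*:*:*:*": ["CVE-2014-0160"],  # Heartbleed
    "cpe:2.3:a:openssl:openssl:1.0.1g:*:*:*:*:*:*:*": [],                 # fixed release
}

def lookup(cpe: str) -> list[str]:
    # An identifier that never made it into the directory simply returns nothing.
    return DIRECTORY.get(cpe, [])

print(lookup("cpe:2.3:a:openssl:openssl:1.0.1f:*:*:*:*:*:*:*"))          # ['CVE-2014-0160']
print(lookup("cpe:2.3:a:somevendor:openssl_fork:1.0.1f:*:*:*:*:*:*:*"))  # []
```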
The public-directory approach is bulletproof for registered devices, e.g., FCC-approved commercial hardware, but its assumptions break for software, especially for community-supported open-source software. The ownership of a piece of code published by hobbyists may not be immediately clear, and no publisher can be associated. It is even common for vendors to republish open-source software as is, but sell it with a different license, maintenance contracts, and support.
If an entry is not in the phone book, it cannot be found. If an entry is in the phone book under multiple names, it is not deterministic which entry the vulnerability should be associated with. Further, the directory-based approaches available today have grown historically and miss the right level of granularity to describe the nuances of software.13 In this approach, it is simply unclear which strings to match.
Other approaches implement the idea that anyone with access to the software should be able to infer its identifier. Intuitively, something can always be identified by what it appears to be,10 such as the brand and model of a car. According to this philosophy, the software identifier should reflect the minimal, obvious aspects of the software it describes. It is commonly agreed that these are its name and its version, but also its type. Unfortunately, I'm still not sure whether OpenSSL's type is its programming language, C (in which case someone else will call it C99), or Debian (in which case someone else will surely reference it from the sister ecosystem, RPM, the Red Hat Package Manager). In this approach, it is not clear how to construct the strings themselves.
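To illustrate the ambiguity, here is a sketch that builds package-URL-style strings (pkg:type/namespace/name@version) for one and the same OpenSSL release under different assumed types. The distro-specific version strings are made up for illustration, and which candidate counts as "the" identifier is exactly the open question.

```python
# Building purl-style identifiers for the same OpenSSL release under different
# assumed ecosystems (namespaces and distro version suffixes are illustrative).
def purl(ptype: str, name: str, version: str, namespace: str = "") -> str:
    ns = f"{namespace}/" if namespace else ""
    return f"pkg:{ptype}/{ns}{name}@{version}"

candidates = [
    purl("generic", "openssl", "1.0.1f"),                        # just the project
    purl("deb", "openssl", "1.0.1f-1", namespace="debian"),      # as a Debian package
    purl("rpm", "openssl", "1.0.1f-1.el6", namespace="redhat"),  # as an RPM package
]
print("\n".join(candidates))
```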
Don't get me wrong: Identifying software within one ecosystem is easy. As long as there is a package manager, everything is well defined; that's literally what the package manager does. Things get funky when discussing native software, software that is unpackaged or that adheres to a package format but is published on proprietary mirrors, or software built from source with hot fixes incorporated. "Just don't!", I hear you think loudly, but that is a terrible argument. Stability-oriented large enterprises rely heavily on forking, customizing, refactoring, and debloating external software as needed. Many run entire teams to complement external dependencies with top-tier internal support.
There is a lot of money in solving this problem, which incentivizes vendors to create proprietary identifiers that bind customers to their products, and also to market their products through free unstructured extra information—monetizing the structured versions of the same data. As a consequence, the space is convoluted by inconsistency.
The alternative to commercial identifiers is a vast landscape of open-source identifiers, historically SWIDs (Software Identification Tags) and CPEs (Common Platform Enumerations), recently pURLs (package URLs), but also OmniBOR, which uses gitoids (Git Object ID) and hashes. Maybe the true problem is the lack of agreement on which of these identifiers to use and how.
So, who is supposed to identify a piece of software—the publisher or the consumer?—and which identifier can they use to synchronize their communication? If there's a mapping between the executable and its source, how do you verify that both parties link the same artifacts to the same identifiers?
As illustrated in the introductory example about the automobile industry, it does not matter, as long as both parties comply with the same specification. For example, as the driver, I knew only what brand and model I was driving, but had to consult the dealer to read the serial number and match the string against a list of cars affected by the recall. Similar levels of indirection may facilitate better software identifiers: identifiers that cover all use cases, that both parties can easily get right, and that are backed by an authoritative directory of curated software identifiers.12
In my pursuit to declare which applications are deployed where in my datacenter (and what libraries they depend on), I learned that software does not exist, but software identifiers do. I also discovered that an abundance of solutions falsely suggests that matching these strings at scale should be easy and that the field is overserved, with practitioners frequently citing the comic strip xkcd #927.16 Most solutions are nuanced, however, excelling only in a subset of scenarios, and none of today's solutions fits all the bills.1,3,14
Unlike the automotive field, software engineering is an open-world problem with multiple decentralized players. Abstracting away the "open worldness" and ignoring the general case are treacherous pathways for both the industry as a whole and for corporate adopters internally. We need a solution that works everywhere and that can be used consistently by anyone.
As this solution does not exist, corporations are best advised to employ standard tooling and closely adhere to specifications. Divergence, proprietary tooling, and nonstandard approaches in this space will likely bear high costs in the future. So, focus on the plumbing and keep the interfaces aligned with the outside world.
Furthermore, the business of software supply-chain security is a business of agreement and alignment; however, we find a lot of governance paired with few government resources. The few well-intended government efforts we have cannot keep up with the fast-paced evolution of the software industry.5,8 To develop effective solutions, we need targeted funding and strong partnership between community solutions and academic work. The interplay between these two disciplines is invaluable, and it is inexcusable to ignore it. I hope we will see more incentives and appreciation for engaging with and foundationally improving the existing solutions, rather than adding to the pool or considering the problem solved. It is not.
1. Azhakesan, A., Ombredanne, P. 2024. SCA for containers: the good, the bad, and the truth. Open Source Summit Europe; https://events.linuxfoundation.org/archive/2024/open-source-summit-europe/program/schedule/.
2. Brooks Jr., F. P. 1975. The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley Publishing Company.
3. Container SBOM Clarity Project Public Report. 2024; https://nexb.com/sca-containers/.
4. Conway, M. E. 1967. How do committees invent?; https://www.melconway.com/Home/pdf/committees.pdf.
5. Fox, B. 2024. The fall of the National Vulnerability Database. Darkreading; https://www.darkreading.com/vulnerabilities-threats/fall-of-national-vulnerability-database.
6. GitBOM; https://www.iana.org/assignments/uri-schemes/prov/gitoid.
7. Melara, M. S., Torres-Arias, S. 2023. A viewpoint on software supply chain security: Are we getting lost in translation? IEEE Security & Privacy 21(6), 55–58; https://dl.acm.org/doi/10.1109/MSEC.2023.3316568.
8. NVD News. 2024. National Vulnerability Database, National Institute of Standards and Technology; https://www.nist.gov/itl/nvd/nvd-news.
9. Official Common Platform Enumeration (CPE) Dictionary. National Institute of Standards and Technology; https://nvd.nist.gov/products/cpe.
10. package-url. A minimal specification and implementation of purl, aka a package "mostly universal" URL; https://github.com/package-url.
11. Possati, L. 2020. Towards a hermeneutic definition of software. Humanities and Social Sciences Communications 7(71). Springer Nature; https://www.nature.com/articles/s41599-020-00565-0.
12. purldb is a dataset of purls; https://github.com/aboutcode-org/purldb-data.
13. Springett, S. 2022. New recommendations to improve the NVD. OWASP; https://owasp.org/blog/2022/09/13/sbom-forum-recommends-improvements-to-nvd.
14. Torres-Arias, S., Geer, D., Meyers, J. S. 2023. A viewpoint on knowing software bill of materials quality when you see it. IEEE Security & Privacy 21(6), 50–54; https://www.computer.org/csdl/magazine/sp/2023/06/10315783/1S2UwA5d6mI.
15. Tukey, J. 1958. The teaching of concrete mathematics. American Mathematical Monthly 65(1), 1–9; https://www.tandfonline.com/doi/abs/10.1080/00029890.1958.11989128.
16. xkcd: a webcomic of romance, sarcasm, math, and language; https://xkcd.com/927/.
Dennis Roellke is a Security Architect in the Office of the CTO at Bloomberg, where he provides strategic advice to the company's software supply chain security program. His influence spans multiple departments within the firm, orchestrating a secure software development lifecycle end-to-end in order to provide operational resilience for the company. Prior to this position, Dennis was an embedded systems engineer and worked as a security consultant for three years. He received his Ph.D. from Columbia University, where he studied the intersection of machine learning and cybersecurity.
Copyright © 2025 held by owner/author. Publication rights licensed to ACM.
Originally published in Queue vol. 23, no. 1—