When Should a Black Box Be Transparent?

When is a replacement not a replacement?

Dear KV,

We've been working with a third-party vendor that supplies a critical component of one of our systems. Because of supply chain issues, they are trying to "upgrade" us to a newer version of this component, and they say it's a drop-in replacement for the old one. They keep saying that this component should be seen as a black box, but in our testing, we found many differences between the original and the updated part. These aren't just simple bugs but significant technology changes that underlie the system. It would be nice to treat this component as a drop-in replacement and not worry my pretty little head about this, but what I've seen thus far doesn't inspire confidence. I do see their point that the API is the same, but I somehow don't think this is sufficient. When is a component truly drop-in and when should I be more paranoid?

Dropped In and Out

Dear Dropped,

Your letter brings up two thoughts: one about current events and one about the eternal question of, "When should a black box be transparent?"

We all know that the pandemic has caused incredible amounts of death and destruction, and that the past two years have brought unprecedented attention to the formerly very boring area of supply chains. Still, the sun comes up and the world still spins, which is to say that the world has not ended, yet. Honestly, if it did, it would be a nice break for me. Supply chain issues are both real and the world's latest excuse for everything. If I had kids (and let's all be thankful that I do not), I would expect them to be telling their teachers, "The supply chain ate my homework."

At this point, KV is quite skeptical when a vendor's first excuse is supply chain issues. Of course, that skepticism won't help unless you have a second supplier for whatever you're buying, which you can use to bludgeon your errant vendor.

The eternal question of, "When is a replacement not a replacement?" is one that will plague us in technology forever. The number of people who believe they can treat whatever they're providing as an opaque box with a fixed API is, unfortunately, legion. This belief comes from the physical world, in which a box is a box, and a brick is a brick, and why would you care if your brick is made from a different material anyway?

Here you see the problem: The metaphor breaks down in the physical world as quickly as it would in the realm of software and hardware. Two bricks may both be red, and therefore present an identical look and feel to the external user, but if they're made of different materials, then they have different qualities—for example, in strength, but let's also consider something less obvious, like their weight. The number of bricks that can be stacked on top of each other to build a wall depends on their weight, as well as their strength. If you use heavy but weak bricks, well, you can imagine how this goes, and if you can't, try it—just don't tell your health insurance plan that KV suggested this. And let's say you don't build the wall out of weak and heavy bricks, but years later you replace some damaged bricks with newer, heavier, and weaker bricks. The key here is you wouldn't want to stand near that wall.

A topic KV keeps coming back to, one that may be driving him to drink, is the malleability of software. I keep coming back to this because it is this malleability that often results in the catastrophic failures of software and systems engineering. You mentioned that you saw timing problems with the new component. I can imagine few situations more treacherous than a change in the timing of a critical component. Timing bugs are already some of the hardest to track down and fix, and if the timing is off in a critical component, that's likely to affect the system, so good luck debugging that. May I recommend three measures of gin, one of vodka, a splash of Kina Lillet, shaken over ice, with a slice of lemon? You'll thank me, as you'll be saying evening prayers from now until your ship date slips into infinity. Those who wish to stand on the "API as a contract" quicksand are welcome to do so, but I'm not about to throw them a rope.
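One way to put numbers on a timing change is to record latencies for the same operation on the old and new parts and compare the distributions rather than a single average. The sketch below is only an illustration: the function names, the percentiles chosen, the 10 percent tolerance, and the sample values are all assumptions, not anything the vendor or the letter supplied. Note that it flags shifts in either direction, since a part that got faster can expose races just as readily as one that got slower.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def timing_shift(old_samples, new_samples, pcts=(50, 99), tolerance=0.10):
    """Return {percentile: (old, new)} for each percentile whose relative
    change exceeds the tolerance, whether faster or slower."""
    shifted = {}
    for pct in pcts:
        old = percentile(old_samples, pct)
        new = percentile(new_samples, pct)
        if abs(new - old) > tolerance * old:
            shifted[pct] = (old, new)
    return shifted

# Made-up latencies in microseconds, for illustration only.
old_us = [100, 101, 99, 100, 102, 100, 98, 101, 100, 103]
new_us = [60, 61, 59, 60, 62, 60, 58, 61, 60, 240]

print(timing_shift(old_us, new_us))
# {50: (100, 60), 99: (103, 240)}
```

In this invented data the new part is faster in the typical case but has a much worse tail, which is exactly the kind of difference an unchanged API will never reveal.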

The right answer in these cases is to ask the vendor for as much information as possible to reduce the risk in accepting this so-called replacement. First, ask for the test plans and test output so you can understand whether they tested the component in a way that relates to your use case. Just because they tested the thing doesn't mean they tested all the parts your product cares about. In fact, it's unlikely they did. They may have tested just the parts that connect back to the API, rather than the edge cases that would come up when a component is changed in your system.
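If the vendor's tests do not cover your use case, you can run your own edge cases against both parts and compare what each one actually does. The following is a minimal sketch of that idea, assuming both versions can be driven through the same call; `old_scale` and `new_scale` are hypothetical stand-ins invented for this example, not real vendor code.

```python
def outcome(func, arg):
    """Record what a call does: its return value, or the exception type it raises."""
    try:
        return ("ok", func(arg))
    except Exception as exc:
        return ("raised", type(exc).__name__)

def behavioral_diffs(old_impl, new_impl, inputs):
    """List every input for which the two implementations behave differently."""
    diffs = []
    for arg in inputs:
        before, after = outcome(old_impl, arg), outcome(new_impl, arg)
        if before != after:
            diffs.append((arg, before, after))
    return diffs

# Hypothetical components with the same API but different overflow behavior.
def old_scale(raw):
    if raw < 0:
        raise ValueError("negative reading")
    return min(raw * 2, 255)      # old part saturates at 255

def new_scale(raw):
    if raw < 0:
        raise ValueError("negative reading")
    return (raw * 2) % 256        # new part wraps around instead

edge_cases = [0, 1, 127, 128, 255, -1]
for arg, before, after in behavioral_diffs(old_scale, new_scale, edge_cases):
    print(arg, before, after)
# 128 ('ok', 255) ('ok', 0)
# 255 ('ok', 255) ('ok', 254)
```

Both stand-ins agree on small inputs and on the error case, so a test suite that only exercises the common path would pass them as identical; the difference appears only at the boundary values your product may depend on.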

Second, ask for a complete readout of the differences between the old and new parts. For hardware, this means the underlying technology (e.g., the old part was 90nm and the new one is 45nm), and any voltage changes, as well as the internals. I've seen replacement parts that put whole CPU cores into what were once fixed-function pieces of digital electronics, which is utterly insane, but someone, somewhere, is getting praised for adding "flexibility" to the product rather than being beaten with a rubber truncheon for increasing risk.
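Once you have that readout, it helps to keep it in a form you can compare mechanically, so nothing gets lost in a vendor slide deck. Here is a small sketch, assuming the two spec sheets have been transcribed into dictionaries by hand. Only the 90nm and 45nm figures come from the example above; every other field name and value is invented for illustration.

```python
ABSENT = "<absent>"

def spec_changes(old_spec, new_spec):
    """Return {field: (old_value, new_value)} for every field that differs,
    including fields present on only one of the two parts."""
    changes = {}
    for key in sorted(old_spec.keys() | new_spec.keys()):
        old_val = old_spec.get(key, ABSENT)
        new_val = new_spec.get(key, ABSENT)
        if old_val != new_val:
            changes[key] = (old_val, new_val)
    return changes

# Illustrative spec sheets; apart from process_nm, these values are assumptions.
old_part = {"process_nm": 90, "core_voltage_v": 1.2,
            "embedded_cpu_cores": 0, "package": "QFN-32"}
new_part = {"process_nm": 45, "core_voltage_v": 1.0,
            "embedded_cpu_cores": 1, "package": "QFN-32",
            "firmware_image": True}

for field, (old_val, new_val) in spec_changes(old_part, new_part).items():
    print(f"{field}: {old_val} -> {new_val}")
# core_voltage_v: 1.2 -> 1.0
# embedded_cpu_cores: 0 -> 1
# firmware_image: <absent> -> True
# process_nm: 90 -> 45
```

A field that appears on only one side, such as a firmware image that did not exist before, is often the most important line in the output.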

Lastly, make sure you have a second supplier for any component you deem critical. This ought to go without saying, but the fact that I'm saying it tells you it has been a problem: I've seen plenty of people looking like the walking wounded after an upgrade completely destroyed their product.

Oh, and you did ask when to be paranoid. I mean, clearly the answer is, always.

KV

Kode Vicious, known to mere mortals as George V. Neville-Neil, works on networking and operating system code for fun and profit. He also teaches courses on various subjects related to programming. His areas of interest are computer security, operating systems, networking, time protocols, and the care and feeding of large code bases. He is the author of The Kollected Kode Vicious and co-author with Marshall Kirk McKusick and Robert N. M. Watson of The Design and Implementation of the FreeBSD Operating System. Since 2014 he has been an Industrial Visitor at the University of Cambridge, where he is involved in several projects relating to computer security. He earned his bachelor's degree in computer science at Northeastern University in Boston, Massachusetts, and is a member of ACM, the Usenix Association, and IEEE. His software not only runs on Earth but has been deployed, as part of VxWorks, in NASA's missions to Mars. He is an avid bicyclist and traveler who currently lives in New York City.

Originally published in Queue vol. 20, no. 2