Operations and Life

Make Two Trips

Larry David's New Year's resolution works for IT too.

Thomas A. Limoncelli

During an interview on The Late Show with Stephen Colbert, comedian Larry David explained that his New Year's Resolution was "make two trips" (episode 857, January 8, 2020).

For example, when carrying groceries into the house, it is tempting to carry everything at once. But then you drop the cantaloupe, and now you have to clean up that mess. One trip seemed faster, but once you include the time spent cleaning up, two trips would have been quicker.

This experience led Larry to turn the lesson into a New Year's resolution: When in doubt, make two trips!

This "make two trips" strategy isn't an earth-shattering breakthrough. It won't cure cancer, end world hunger, or fix the climate crisis. However, I have adopted this philosophy, and it has had many benefits.

The immediate benefit is that I am now more likely to have a free hand to open my house door. Pulling keys out of my pocket no longer involves smashing a grocery bag between my chest and the house.

The larger benefit has come from adopting this philosophy in both coding and operations.

 

Make Two Loops

The other day, I was adding a feature to some old code. The code reported results of an earlier calculation with various formatting options that could be enabled or disabled.

The code was quite complex because certain options affected the format in ways that had downstream implications for other options. The code was able to satisfy all the various options and controls in one pass over the data, printing a report along the way.

Since the code was so complex, however, it was difficult to maintain. Updating one feature would introduce bugs in another; fixing those bugs would change line counts that an earlier feature depended on, and that feature had already printed its section based on now-outdated assumptions.

All in all, the code was just one big mess. For example, it maintained parallel sets of counters and statistics along the way until the actual need was determined, as if you bought cat food and planned a pet funeral at the same time, just in case, until Schrödinger's box was finally opened.

The code was difficult to read. It was classic spaghetti code—and as someone of Italian heritage, I don't use that phrase lightly.

I struggled in earnest to add my new feature to this ever-growing complicated loop.

Then I remembered Larry's advice: Make two trips.

The code would be significantly simpler if it made two passes over the data. One pass would collect data, count things that needed to be counted, sum subtotals, and so on. The second pass would take all this information and output the report, and would be much easier because it had all the information it needed from the start. No Schrödinger's cat.

Should I make this change? It seemed like two trips would be less efficient. The original version allocated no new memory because it was outputting the report as it iterated over the data. The new version would need to allocate data structures to store the intermediate result.

It was a classic complexity-versus-memory engineering decision: suffer from complexity, or suffer from potential memory exhaustion.

My first computer was very slow and had only 5K of RAM. This taught me to be stingy with my use of memory. Sometimes I fall back to my old habit of counting every byte of memory even when it doesn't matter. As Knuth taught us, premature optimization is the root of all evil.

I soon realized I was being silly. The quantity of data being reported on was rarely more than a screenful: typically zero, one, or two items, and only occasionally in the thousands. Moreover, the output was typically sent to the terminal (stdout), and terminal I/O would consume far more time than any minor coding efficiency I might lose in the rewrite.

I refactored the code to take two passes. It turned out the new code didn't even perform many allocations. My biggest fear was for naught.
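
To make the shape of the change concrete, here is a minimal sketch of the two-pass structure. The records and field names are invented for illustration; the real report was far more involved. The point is simply that the first pass only collects, and the second pass only prints.

```python
# A minimal sketch of the two-pass approach. The records and field names
# are invented for illustration; the real report was far more involved.
from collections import Counter

def collect_stats(records):
    """Pass 1: walk the data once, gathering every count and subtotal
    the report will need."""
    stats = {"total": 0, "by_status": Counter()}
    for rec in records:
        stats["total"] += 1
        stats["by_status"][rec["status"]] += 1
    return stats

def print_report(records, stats):
    """Pass 2: format the output. Every total is already known, so no
    section has to guess what a later section will discover."""
    print(f"{stats['total']} item(s) found")
    for status, count in sorted(stats["by_status"].items()):
        print(f"  {status}: {count}")
    for rec in records:
        print(f"  - {rec['name']} ({rec['status']})")

records = [
    {"name": "job-1", "status": "ok"},
    {"name": "job-2", "status": "failed"},
]
print_report(records, collect_stats(records))
```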

Now, with much simplified code, I could add new features easily. It was cleaner and easier to test. This gave me the confidence to clean up some minor nits and make the entire report much more readable.

Not to brag, but the new format is pretty slick. Don't thank me. Thank Larry!

 

Pre-walk, then Walk

Rather than making multiple trips with fewer bags each trip, sometimes it can be useful to carry no bags at all on the first trip. This is the pre-walk—just exploring the path to scout for problems.

A pre-walk finds problems early. When moving large furniture, the pre-walk lets you clear the path, spot the troublesome narrow turn in the hallway, and relocate the pet sitting where the new couch will be placed. All of those tasks are easier with two free hands.

Sometimes I use a pre-walk to open the wine-rack door before I return with new bottles.

Who hasn't checked into a hotel room only to find the room is a mess, occupied, or otherwise unusable? Then you have to take all your bags back down to the reception desk and back up to your new room. Now I pre-walk to the room to check it out first.

 

Canary Deployments

A canary deployment is the distributed computing version of the pre-walk. Rolling out new software to thousands of instances is risky. Done serially, it can take a long time. Done in parallel, a bad release could be pushed everywhere at once, creating a large outage very quickly.

The canary strategy says to roll out new software to a single replica to validate a release prior to rolling it out to the n-1 remaining instances. The first trip is slow because it waits for the software to initialize and start reporting a heartbeat or other health check. If that succeeds, the second trip upgrades all other replicas in parallel, confident that they will start, too.
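
The orchestration logic looks roughly like the following sketch. The deploy_to() and is_healthy() helpers are hypothetical stand-ins for whatever your deployment tooling actually provides; the two-trip structure is the point.

```python
# A rough sketch of canary-then-parallel rollout logic. deploy_to() and
# is_healthy() are hypothetical stand-ins for real deployment tooling.
import time
from concurrent.futures import ThreadPoolExecutor

def deploy_to(replica, release):
    print(f"deploying {release} to {replica}")  # placeholder deployment step

def is_healthy(replica):
    return True  # placeholder heartbeat/health check

def rollout(release, replicas, canary_timeout=300):
    canary, rest = replicas[0], replicas[1:]

    # Trip 1: upgrade a single canary replica and wait for its health check.
    deploy_to(canary, release)
    deadline = time.time() + canary_timeout
    while not is_healthy(canary):
        if time.time() > deadline:
            raise RuntimeError("canary never became healthy; aborting rollout")
        time.sleep(5)

    # Trip 2: upgrade the remaining n-1 replicas in parallel.
    with ThreadPoolExecutor(max_workers=20) as pool:
        list(pool.map(lambda r: deploy_to(r, release), rest))

rollout("v2.3.1", [f"replica-{i}" for i in range(10)])
```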

How could a well-tested software package possibly fail to run? It is not just the functionality of the software that is being tested here. It is also the deployment system itself.

For example, I remember a time when a canary died, preventing what could have been a major outage. The host orchestrating the rollout had run out of disk space. Instead of copying the installation package to all machines, it was copying a zero-length file. Because the software divided the work into two trips, the only outage was the canary, not the entire application. Since there were hundreds of replicas, one being down would hardly be noticed.

In another instance, the software had rolled out successfully to the QA (quality assurance) environment, yet the canary died when the exact same bits were deployed in the production environment. How could that be? It turned out the release required a new entry in the configuration file. That field had been added to the configuration file used in the QA environment but not the one used in production. This error was caught early in the canary stage. Another major outage prevented.

Canary deployments in Kubernetes use this strategy: typically, a small percentage of replicas is upgraded first, and the remainder are upgraded only if no errors are detected.

The strategy is also used in mass software deployments. For example, web browsers such as Chrome have a canary release, which lets some users opt in to receive the newest release earlier than most customers. Even when a new release is rolled out to the remaining users, it often goes to a random sample of them first. The same idea applies at the feature level: risky new features are sometimes shipped in a disabled state, then enabled for a random sample of canary users before being enabled for everyone.
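
One common way to pick that random sample, and keep it stable from request to request, is to hash the user ID into a bucket and compare the bucket against the rollout percentage. Here is a minimal sketch, with an invented feature name and percentage:

```python
# A minimal sketch of a percentage-based feature-flag canary. The feature
# name and percentage are invented for illustration.
import hashlib

def in_canary(user_id: str, feature: str, percent: float) -> bool:
    """Deterministically place roughly `percent` of users in the canary
    group; a given user always gets the same answer for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map the hash to [0, 1]
    return bucket < percent / 100.0

enabled = in_canary("user-42", "new-report-format", percent=5)
print("new report format enabled:", enabled)
```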

Drug research trials contain a canary-like phase. Phase 1 trials measure safety by risking only a small number of lives. Later phases test efficacy on larger and larger groups.

 

Make Two Projects

The most brilliant "make two trips" example I experienced involved migrating a legacy application to the cloud. I can't take credit for this; it was a co-worker's idea.

The project involved moving an application from a legacy datacenter to a cloud provider. We could not simply "lift and shift" the entire system to the cloud. The application was tightly coupled to another application that was staying behind.

Therefore, the project involved decoupling the two applications. The project required modifications to both systems, splitting a database between the two, and other complexities too numerous to mention. Oh, and it had to be done with limited downtime.

The first two attempts at this project failed. The first failed because of a lack of resources. A year later, the second attempt was better resourced but still failed. It was unclear exactly why. Conventional wisdom held that the people involved just ran out of time or didn't have enough support.

If that was true, the third attempt simply required more resources and more time. Right? Wrong.

After studying the architecture, the code, and the past failures, an engineer made an important discovery: The project was impossible! The reason for the failure was that the previous team had tried to do something that couldn't succeed.

The migration had many steps. Some could be rolled back in the event of failure, but the riskiest could not. There was no undo button. For the steps without an undo, there wasn't even a way to test ahead of time to ensure that an undo wouldn't be needed.

In a sufficiently complex system, the only way to know which problems might happen is to try the process, let it fail, roll back, and try again. If there is no way to roll back, and you know the process can't possibly work on the first try, it is essentially impossible.

Could you just risk it and hope it works on the first try? No. Hope is not a strategy.

My co-worker then had a brilliant idea: Make two trips.

She concluded that if the current architecture made the project impossible, we first needed a project that changed the architecture. Then the migration project would be possible.

Both projects (trips) were complex, but neither was impossible.

The second project still carried the largest risk. All steps now had rollback plans, except that one step contained a point of no return: once certain database changes were made, they could not be undone. Before that point, however, there was an opportunity for extensive testing to ensure it was safe to commit to the new database.

Three time windows were announced for this downtime. During the first window, problems were found that would take days to fix. Rollback was performed, as the point-of-no-return step hadn't yet been executed.

The team had two more chances. The next window found some new failures. Most were fixed in real time, although some were deemed minor enough that they could be fixed later. The third window was not needed, much to the relief of everyone on the project, as well as the customers.

The two projects took a year to complete. It wouldn't have been possible as one trip; as two trips, it was a success.

Dividing the project had the side benefit of providing a better experience for customers. This was unexpected but welcome. The original plan required customers to make changes on their side on a specific day, during a specific window of downtime. Expecting hundreds of customers to make a change at the same time was unrealistic. The fact that there was no way for the customers to test their changes until after the downtime event, which was after the point of no return, made this plan even more precarious.

The two-phase approach eliminated this problem. All customer-visible changes naturally fell during the first project and all user-visible downtime fell during the second project. The changes could now be made and tested any time during a much longer time span. That made it easier for the customers and greatly reduced the complexity of coordinating such changes.

Conveying all of this to the customers became easier, too. Explaining a complex change combined with downtime is difficult. Now the change could be explained in isolation. The downtime would happen months later and was easy to explain using the standard downtime announcement process that customers are used to. Explaining two isolated (as far as the customers are concerned) events is easier than explaining a combined, complex mega-event.

The full story of this project can be found on the Stack Overflow blog:

Part 1: https://stackoverflow.blog/2023/08/30/journey-to-the-cloud-part-i-migrating-stack-overflow-teams-to-azure/

Part 2: https://stackoverflow.blog/2023/09/05/journey-to-the-cloud-part-ii-migrating-stack-overflow-for-teams-to-azure/

 

Conclusion

Spread the good word of making two trips. It will not only enhance your team's productivity but also reduce the amount of explaining you have to do when you use the technique. The other day my partner gave me a frustrated look and asked why I had left the car door open. "I'm making two trips," I replied as I walked by. She relaxed. It was all the explanation needed.

While you shouldn't take all your life lessons from Larry David, the "make two trips" strategy is a great tool to have in your toolbox.

Whether your project is as simple as carrying groceries into the house or as complex as a multiyear engineering project, "make two trips" can simplify the project, reduce the chance of error, improve the probability of success, and lead to easier explanations.

Thanks, Larry!

 

Acknowledgments

Thanks to Benjamin Dumke-von der Ehe, Jessica Hilt, Jeremy Peirce, George V. Reilly, Tom Reingold, Mandy Riso, Margret Treiber, and many others for feedback on early drafts.

Thomas A. Limoncelli is a technical product manager at Stack Overflow Inc. who works from his home in New Jersey. His books include The Practice of Cloud Administration (https://the-cloud-book.com), The Practice of System and Network Administration (https://the-sysadmin-book.com), and Time Management for System Administrators (https://TomOnTime.com). He blogs at EverythingSysadmin.com and posts at @YesThatTom. He holds a B.A. in computer science from Drew University.

Copyright © 2024 held by owner/author. Publication rights licensed to ACM.

Originally published in Queue vol. 22, no. 2