A friend of mine asked for advice about a problem he was having at his job. Onboarding each new customer required significant effort. The process was automated, but it took a long time to execute. The delay was visible to any new customer who had just signed an expensive contract and didn't want to wait days or weeks for their new service.
Before I could offer my friend some advice, I took a minute to explain something about my dishwasher.
Most people use their dishwasher as explained in the manual. Dirty dishes collect in its racks over a span of time. Once it's full, someone adds soap and starts the load. Later, the clean dishes are emptied. Repeat.
I do it a little differently.
I load the soap as soon as the clean dishes are put away. Later, once it's time to start the wash cycle, I need only press the start button.
I prefer this approach for a few reasons. It reduces the chance I'll spill the soap powder since I tend to start the wash cycle late at night just before going to bed when I'm both sleepy and in a rush. That combination tends to make me sloppy. Thus, spilled soap.
With my system, a full soap dispenser is an indicator that the dishes are dirty. Have you ever had a family member or housemate ask, "Are the dishes in the dishwasher clean?" It isn't always obvious. Using the soap dispenser as the signal is both more accurate and convenient than a CLEAN/DIRTY magnet or some other mechanism requiring human intervention.
My soap-loading technique isn't revolutionary, and I don't think I'm going to win the Turing Award for this innovation. But it does demonstrate a point about process design: You can eliminate delays in starting a process by front-loading tasks whenever possible.
Front-loading is interesting because it changes when you do tasks but not their order. The process still involves a loop: load dishes, add soap, press the start button, empty the dishes, repeat. You've only changed your mental model of where the loop starts.
Now that you understand my amazing dishwasher technique, let's see how my friend might be able to add his soap ahead of time. He showed me a diagram of all the steps required to onboard a new customer. Three or four of these steps were generic enough to be done ahead of time.
This would save hours, which was significant and worth the engineering effort. Performing these steps ahead of time would also improve quality. In the old system, when people were rushing to fulfill the customer request, the automation performed only cursory quality-assurance checks. With the new design, any step done ahead of time would benefit from a longer, more rigorous testing cycle. The old system discouraged new tests. The new system encouraged more testing. For example, a new design could be run through disaster-recovery ("failover") tests, which can often take hours.
This then led to another design idea: Why not prebuild many instances and then hand them out as customer contracts are signed? The wait time visible to customers could be reduced from days to minutes.
Books such as The Phoenix Project (Gene Kim, et al.) advocate delaying variation to the end of the process. Auto manufacturers follow this approach. All cars of a particular model start out exactly the same. Variations such as interior colors and audio/entertainment packages are added at the end. Fast-food restaurants follow this approach as well. Burger King advertises that special orders don't upset them, but the only variations they offer are ones that can be accomplished just before the sandwich is wrapped.
Saving variations to the end makes it easier to manage defects. A generic unit with a defect can be moved to the side, repaired, and then put back on the assembly line. In the meantime, another generic item can take its place. Once a bespoke customization for a particular customer has been added, that flexibility is lost. In extreme cases, it's easier to simply throw the burger away.
Realizing this, my friend split the process into two systems: a slow, generic cluster builder and a fast customization engine. The first system focused on creating generic clusters, testing them, and then registering them in an inventory. It built a stockpile of clusters ready to be handed out. There was no need to rush this phase. Quality was more important than speed. We'll call this the "slow phase."
During the slow phase, you can take the time to do extensive testing. When failures are found, you can stop the process and take whatever time is necessary to study the problem, understand the failure, and fix it properly. Major problems can be resolved by deleting the cluster and starting over. Minor problems can be fixed before they become major problems.
This is similar to how the auto industry stops a production line to fix a small problem before it becomes a big problem. This is known as "pulling the Andon cord," referring back to a time when a physical cord was pulled to stop the line.
During the customization phase, meanwhile, my friend's process involved waiting for customer orders, picking a generic cluster from the stockpile, and then customizing it for the customer. Let's call this the "fast phase."
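To make the shape of this concrete, here's a minimal sketch in Python of how the two phases might decouple through nothing more than a shared inventory. (All the names and stub bodies here are my own invention, not my friend's actual system.)

```python
import queue

# The stockpile that connects the two phases.
inventory: queue.Queue = queue.Queue()

def run_extensive_tests(cluster: str) -> None:
    pass  # stand-in for hours of QA and failover testing

def customize(cluster: str, customer: str) -> None:
    pass  # stand-in for per-customer configuration

def slow_phase(count: int) -> None:
    """Build, test, and register generic clusters ahead of demand."""
    for i in range(count):
        cluster = f"cluster{i}"       # generic: no customer identity yet
        run_extensive_tests(cluster)  # no rush; quality over speed
        inventory.put(cluster)        # register in the stockpile

def fast_phase(customer: str) -> str:
    """When a contract is signed, grab a prebuilt cluster and customize it."""
    cluster = inventory.get()         # minutes, not days
    customize(cluster, customer)
    return cluster
```

The inventory is the whole trick: the slow side fills it at whatever pace quality demands, and the fast side drains it at whatever pace sales demands.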
Some customers, for example, require larger capacity than others. Originally, my friend's company believed that no work could start until the sales order was signed because that's when capacity became known. Given this assumption, the entire slow/fast design was not possible. But then someone observed that nearly all customers require the same capacity, with only a few outliers requiring larger capacity. So, the decision was made to use the slow phase to build standard clusters that then could be grown during the fast phase if necessary.
Another potential blocker was that the customer name was deeply embedded (or "tattooed") in the cluster configuration—which is to say the cloud provider had no way to rename clusters once they'd been built. This, too, was believed to be a blocker to the slow/fast design. But then the company decided to build all new clusters with generic names (cluster1, cluster2, cluster3, ...) and then assign customer-specific aliases during the fast phase. The introduction of aliases required only minor changes to downstream processes. For example, some third-party tools do not pay attention to aliases and thus need to be passed the actual name.
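The alias layer is tiny. Continuing the hypothetical names from the sketch above, the mapping is the only new state, and anything that ignores aliases can simply be handed the actual name:

```python
aliases: dict = {}  # customer-facing alias -> actual (tattooed) cluster name

def assign_alias(alias: str, actual: str) -> None:
    aliases[alias] = actual

def resolve(name: str) -> str:
    """Return the actual cluster name, for tools that ignore aliases."""
    return aliases.get(name, name)  # unaliased names pass through unchanged

assign_alias("acme-prod", "cluster1")
print(resolve("acme-prod"))  # -> cluster1
print(resolve("cluster2"))   # -> cluster2 (passes through)
```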
Let's popularize the slow/fast pattern. I've seen it in many deployment and service-delivery systems, from small ones such as VDI (virtual desktop infrastructure) deployments up to systems larger than the one described here. Sadly, what all of these have in common is that the slow/fast design was always part of a second-generation rewrite.
It's a shame we don't think to build systems this way from the start. I suppose this is because first-generation systems are built in haste. There's no time for architectural navel-gazing when you're tasked with automating a process after a flurry of orders has made it impossible to give each one individualized attention.
However, I think the true reason we don't think to use the slow/fast pattern is that it hasn't yet achieved enough popularity to be at the front of our minds. It isn't taught at the university level, it isn't discussed much in online forums, and—even when the pattern is used—it is often hidden from end users.
Which is to say ACM members could play a large role in popularizing this pattern.
The timing of when we do things is not set in stone. It only feels that way.
There's no rule that dishwasher soap must be loaded immediately before you start the wash cycle. But it's such a common practice that people tend to act as if such a rule exists.
The day you gather trash from bins around your house does not need to be the same day you put your trash bins at the curb. I find it easier to collect the trash on the weekends when I'm doing other chores.
A retail store does not start the day by making preparations for customers. The night before is when the facility is cleaned and the new merchandise is put out on display. Ideally, the morning shift simply opens the doors and is ready for normal business.
Notice that if you sleep late, people call you lazy. But if you go to sleep super early, you sleep just as much and yet people call you wise.
Examining the order of steps can even help you realize that something can be postponed until much, much later. In the best case, it might even be postponed long enough that it's never needed at all.
Before paperless billing, I used to fastidiously file away each utility, bank, and credit card statement in a filing cabinet. I had a separate folder for each utility, bank, credit-card company, and so on. Each folder contained past statements lovingly stored in chronological order. It was a lot of work, but I was sure that someday it would prove useful. Maybe I'd win a court case since I'd be able to swiftly calculate the exact amount I'd spent on groceries during the month of July 10 years earlier. I was young, optimistic, and stupid.
One day I realized that all my meticulous filing was eating up a considerable amount of time. In fact, I could reduce the time it took to process my monthly bills by 80 percent by simply not being so fastidious about how I stored old statements. Instead, I just piled the statements into a single folder, starting a new folder once the current one was full. I wouldn't bother to organize the statements until I actually needed some specific information.
This approach is what's referred to as "lazy evaluation" or "call-by-need" in programming languages. The win here is that, if the need never arises, we've saved a lot of time. In my case, the need to go through that folder never arose. And then, eventually, paperless billing eliminated the need for a filing cabinet altogether.
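In code, the same idea looks something like this. Here's a minimal Python sketch of lazy evaluation using my filing analogy (the class and its data are invented for illustration):

```python
from functools import cached_property

class StatementPile:
    """An unsorted pile; organizing is deferred until actually needed."""

    def __init__(self, statements: list):
        self.statements = statements  # filed in arrival order, i.e., not at all

    @cached_property
    def organized(self) -> list:
        # Runs only on first access. If we never look, we never sort.
        return sorted(self.statements)

pile = StatementPile(["2014-07 electric", "2013-02 bank", "2015-11 visa"])
# No sorting has happened yet; it happens only if this line ever runs:
print(pile.organized[0])
```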
There's also a different possible outcome. Sometimes we examine optional tasks only to discover they aren't actually optional. In this case, managing the optionality (yes, I just invented that word) turns out to be wasted work and complexity that can be eliminated.
I encountered this very situation recently when I was preparing to optimize some complex code in an open source project. There was an expensive string operation that the code avoided until it was sure the result would be required.
Avoiding the operation was good, but I thought I could do better: I would memoize (cache) the result, so that—if the value was needed a second time—I wouldn't have to repeat the operation. This would involve some complex cache-invalidation logic, as the language didn't support lazy evaluation. But, as we all know, cache invalidation is one of the two most difficult problems in computer science. I dreaded the bugs this might introduce to the system.
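For the record, the design I was dreading looked roughly like this. (All names are hypothetical; expensive_normalize stands in for the actual string operation.)

```python
def expensive_normalize(s: str) -> str:
    return s.strip().lower()  # stand-in for the expensive string operation

class Record:
    def __init__(self, raw: str):
        self._raw = raw
        self._cached = None  # sentinel: None means "not computed yet"

    @property
    def raw(self) -> str:
        return self._raw

    @raw.setter
    def raw(self, value: str) -> None:
        self._raw = value
        self._cached = None  # invalidate; forget this line and you have a bug

    @property
    def normalized(self) -> str:
        if self._cached is None:
            self._cached = expensive_normalize(self._raw)  # compute on demand
        return self._cached
```

The setter's invalidation line is exactly the kind of thing that gets forgotten when someone later adds a second way to mutate the object.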
This proved to be a good time to stop coding and start doing some analysis. Of all the inputs, what percentage actually required the expensive operation, and how often was the result accessed two or more times?
To my surprise and delight, the result was required for 100 percent of the inputs and was always used at least once. With that discovery, I knew I could simply do the operation for each input string upon arrival and then store both the original and the processed result. I then could also expose both as public attributes—with no need for the complexity of memoization and cache invalidation.
And yes, here was another opportunity to move a task up to an earlier point in a process. Moving the task up meant there was no need to test whether it had been done. The result was less code.
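Contrast the earlier sketch with the eager version (same hypothetical names). The cache, the sentinel value, and the invalidation logic all disappear:

```python
def expensive_normalize(s: str) -> str:
    return s.strip().lower()  # same stand-in as before

class Record:
    """Eager version: compute once at construction, expose both results."""
    def __init__(self, raw: str):
        self.raw = raw
        self.normalized = expensive_normalize(raw)  # always needed, so do it now
```

Less state, fewer code paths, nothing to invalidate.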
Which goes to show that rethinking the order and timing of tasks within a process can actually lead to significant improvements in efficiency and quality.
Whether this means speeding up your morning routine with a simple trick or overhauling a complex business process, the principles remain the same. By front-loading what you can, delaying what isn't critical, dividing work between slow and fast, and reducing complexity by reexamining optional work, you're able not only to optimize tasks but also to pave the way for smoother, more efficient days.
Thomas A. Limoncelli is a senior site reliability engineer at Stack Overflow Inc. He works from his home in New Jersey. His books include The Practice of Cloud Administration (https://the-cloud-book.com), The Practice of System and Network Administration (https://the-sysadmin-book.com), and Time Management for System Administrators (https://TomOnTime.com). He is @YesThatTom on BlueSky and blogs at YesThatBlog.com. He holds a B.A. in computer science from Drew University.
Copyright © 2025 held by owner/author. Publication rights licensed to ACM.
Originally published in Queue vol. 23, no. 1