Operations and Life


What do Trains, Horses, and Home Internet Installation have in Common?

Avoid changes mid-process

Thomas A. Limoncelli

I was filled with anticipation. After a long wait, my new home ISP connection was finally being installed, and the installation technician had arrived.

But I was disappointed when he showed me the work order. One of the options I had ordered was missing.

He nodded and confirmed the error. "Sorry about that, sir. They usually don't mess that up on orders. I can fix it, but..."

He paused. He looked around and waved me over like he was going to tell me a secret, which I thought was a bit odd considering that we were the only two people in the room.

"Look, I've been doing this job for many years. Can I give you a little tip?"

"Of course!" I replied, eager to learn something new.

"You see," he said, leaning in. Then, in a very serious tone, he explained, "I could call into the office and they'll add that feature. Sure. However, our team isn't really good at making changes in the middle of an installation. Once the train has left the station, it's not easy to change the color of the train."

"I see." I nodded knowingly, pretending I knew what he meant.

"But, it's OK. Let me tell you a better way." He looked around again as if he was confirming we were still alone, then leaned in a little more. "That feature can be requested post-installation. If you wait until I'm done here, and then call it in, it will be handled by the post-install team. They have a process for adding that feature. My team? We're the installation team."

"And the train's already left the station?"

"Exactly."

At first, I thought he was just trying to shirk his responsibilities and pass the buck to someone else. His advice, however, made a lot of sense. The installation team probably generated configurations ahead of time, planned out how and when those changes needed to be activated, and so on. The entire day is planned ahead. Bureaucracies usually have a "happy path" that works well, and any deviation requires—who knows what? Managers getting involved? Error-prone manual steps? Ad hoc database queries? There's no way I could know. The point was clear, however: Don't change horses midstream—or the color of the train.

I assured him I understood and would do as he suggested.

The next day, I called the ISP and spoke with the post-installation team. I didn't even tell them the feature was missing from the original order. I just asked for the addition as if I'd had their service for months.

The installation tech was right. The feature was added and all was well.


Installing and Fixing are Different Skill Sets

This experience reminded me of a wise thing my manager did back in the 1990s when I was on a system administration team at a large research facility. One of our biggest responsibilities was installing and maintaining the fleet of desktop workstations used by the researchers.

Installing new machines was a common task for us. Our manager noticed that sometimes we did it in minutes, and sometimes it took hours, even days.

He investigated why there was so much variation. He found one ticket in our helpdesk system that had been open for more than a week and asked the system administrator what was taking so long.

The sysadmin explained that there were many reasons, starting with the machine having nonstandard hardware. It had a second Ethernet port so that it could be directly connected to both the usual network and the researcher's lab network. Our automated configuration script didn't work in that situation, so he had to remove the extra Ethernet card, do the installation, and then reinstall the card.

That added a few hours, but it didn't explain why the ticket had remained open for more than a week. The sysadmin further explained that the user also needed certain software installed. The installation was complex and required some planning and a few tries to get it right.

Now that the software worked, the user realized it stored some data on the local disk, which needed to be backed up. So, he was writing a Perl script to do the backups.

This simple desktop workstation installation was turning into a major project.

"Anything else?" asked the manager.

The sysadmin pulled out a scrap of paper where he had a list of other tasks, including other software to install and researching how to connect this machine to some specialized lab equipment.

It became obvious to the manager that this had become the project that would never end.

The manager put his foot down and created a new policy that defined the limits of what would be done as part of an installation. Everything else had to be on separate tickets, to be accomplished after the fact.

We now had one standard process with a limited scope and a predictable time estimate.
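To make the idea of a fixed installation scope concrete, here is a minimal sketch of such a policy expressed as code. The task names in `INSTALL_SCOPE` and the `plan_install` helper are hypothetical illustrations, not the team's actual list or tooling; the point is only that anything outside the agreed scope is routed to a separate follow-up ticket instead of stretching the install.

```python
from dataclasses import dataclass, field

# Hypothetical list of tasks that count as a standard installation.
# Anything not listed here is deliberately out of scope.
INSTALL_SCOPE = frozenset({"unbox", "image_os", "join_network", "standard_software"})


@dataclass
class InstallPlan:
    install_tasks: list[str] = field(default_factory=list)
    follow_up_tickets: list[str] = field(default_factory=list)


def plan_install(requests: list[str]) -> InstallPlan:
    """Split requested work into in-scope install tasks and separate tickets."""
    plan = InstallPlan()
    for request in requests:
        if request in INSTALL_SCOPE:
            plan.install_tasks.append(request)
        else:
            # Out of scope: becomes its own ticket, handled after installation.
            plan.follow_up_tickets.append(request)
    return plan


plan = plan_install(["unbox", "image_os", "lab_network_card", "backup_script"])
print(plan.install_tasks)      # ['unbox', 'image_os']
print(plan.follow_up_tickets)  # ['lab_network_card', 'backup_script']
```

The extra Ethernet card and the backup script from the story end up as follow-up tickets, so the install itself stays short and predictable.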

He also trained the installers on the importance of asking people to open separate helpdesk tickets for special requests. Technicians want to be helpful and feel obligated to try to do special requests right away. They need to know when the best way to help someone is not to do the work on the spot but to ask them to open a ticket.

He even suggested a script to use so that they had a verbal tool to say no without sounding disagreeable: "Gosh, that sounds great! I'd love to do that for you, but I'm on the install team. That kind of thing is done by the other side of my team. Please open a ticket and they'll take care of it right away."

Customers receive much better service when there is a clear delineation between what can and can't happen as part of an installation.


More Examples

I've applied this philosophy in other areas of operations. There are many "setup" or "installation" tasks in IT, and many of them are susceptible to dragging on as users ask for changes and additions.

A recent example in my work involved setting up CI/CD (continuous integration/continuous deployment) pipelines for developers. Usually developers can do this in a self-service manner, but frequently the initial creation requires administrative privileges. If there is no well-defined endpoint, the SRE (site reliability engineer) or DevOps engineer will be finessing and refining the pipeline forever. Instead, I recommend setting up the initial pipeline with some kind of "hello world" example to show that the pipeline works, but putting the responsibility on the developer to swap in their code and evolve the pipeline from then on.

I'm sure if you think about it, you'll find more examples in your environment.


Unexpected Benefits

I'd like to highlight two unexpected downstream benefits of limiting installation to just installation. First, it allows the manager to be more proactive about fixing problems. A repeatable task should take about the same amount of time every time it is done, so large variations are a red flag.

In my earlier example, installing a new machine should take about an hour, including unboxing it. If the task takes more than an hour, something has probably gone wrong. Instead of waiting for the user to complain about delays, a manager can intervene and resolve the issue. Cool managers do this in a subtle way, like pretending to stop by for a social visit on their way to get coffee, and nobody will be the wiser when they just happen to be in the right place at the right time to debug the cause of the delay. Users will think the manager has a psychic ability to know when an employee is having difficulties. I'm OK with that.
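Because a scoped install has a predictable duration, spotting the stuck ones can be as simple as comparing each open ticket's age against the expected time. The sketch below assumes the one-hour figure from the example above; the ticket IDs and the `overdue_installs` function are hypothetical, and no particular helpdesk system or API is implied.

```python
from datetime import datetime, timedelta

# Assumption: a standard install takes about an hour, as in the example above.
EXPECTED_INSTALL_TIME = timedelta(hours=1)


def overdue_installs(open_tickets: dict[str, datetime], now: datetime,
                     expected: timedelta = EXPECTED_INSTALL_TIME) -> list[str]:
    """Return IDs of open install tickets that have exceeded the expected time."""
    return sorted(
        ticket_id
        for ticket_id, opened_at in open_tickets.items()
        if now - opened_at > expected
    )


now = datetime(2023, 6, 1, 12, 0)
tickets = {
    "INST-101": datetime(2023, 6, 1, 11, 30),  # open 30 minutes: on track
    "INST-102": datetime(2023, 6, 1, 9, 0),    # open 3 hours: worth a coffee visit
    "INST-103": datetime(2023, 5, 24, 9, 0),   # open a week: definitely stuck
}
print(overdue_installs(tickets, now))  # ['INST-102', 'INST-103']
```

A fixed threshold only works because the scope is fixed; if installs could absorb arbitrary extra requests, no single number would be meaningful.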

Another benefit of streamlining installation is that it simplifies the process. A high number of possible variations makes tasks more error-prone, and that complexity may lead people to think a task can't be automated. Once the process is simplified, how to automate it becomes more obvious.

This also allows an organization to be more strategic about automation. How something is automated at installation time may be very different from how it would be done post-installation. Greenfield projects are typically easier than mutating an existing system. Automating the same change in both places may require two codebases whose changes must be coordinated to stay in sync. Instead, a strategic decision can be made to automate in only one place. This saves coding and code maintenance, and may avoid confusion when troubleshooting.

Recently, thinking in these terms helped a project I was involved with that was facing a catch-22 dilemma. Certain customers wanted a customization that was difficult to deliver as part of the initial setup. They wanted to be able to upload their own SSL (Secure Sockets Layer) encryption keys via their customer portal—but we couldn't set up their customer portal without having their SSL encryption keys. The initial plan was to create a special "pre-service portal" for this situation. That would have required months of development and security testing.

Instead, we declared that such customizations were permitted only as a post-installation change. In this way, we could leverage the existing customer control panel, saving considerable engineering effort. The customization could be done after the newly built system was made available to the customer but before the customer made the system available to employees.

This installation-only approach may also lead to organizational improvements. It is often beneficial to have one team focused on new installations and another on post-installation work. Each team can become specialized. Each can optimize their processes in ways that are best for them. The management techniques required for the two teams may be different; thus, having two managers—each with different skills—may better support the organization. This would also make it easier to hire for the roles because the managers would not have to be generalists.

The agile scrum methodology gives similar advice, recommending no changes to the plan once the iteration has begun. If interruptions are to be expected (for example, when applying scrum to SRE or operational work), it recommends designating one person per iteration to handle interruptions. That way you have a plan for the unplanned.
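One simple way to choose that designated person is to rotate the role. The round-robin rule below is an assumption made for illustration, not something scrum prescribes, and the team member names are hypothetical.

```python
def interrupt_handler(team: list[str], iteration: int) -> str:
    """Return who handles interruptions during the given iteration.

    Round-robin rotation: iteration 0 goes to team[0], iteration 1 to
    team[1], and so on, wrapping around when everyone has had a turn.
    """
    if not team:
        raise ValueError("team must not be empty")
    return team[iteration % len(team)]


team = ["alice", "bob", "carol"]  # hypothetical team members
print([interrupt_handler(team, i) for i in range(5)])
# ['alice', 'bob', 'carol', 'alice', 'bob']
```

Everyone else on the team works the iteration's plan uninterrupted, while unplanned requests flow to whoever `interrupt_handler` names for that iteration.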

Having a well-defined scope of work prevents scope creep. It prevents tasks from never ending. It also looks more professional to have a fixed plan plus a well-defined process to handle special requests and changes.

That installation tech gave me some pretty good advice.


Thomas A. Limoncelli is a technical product manager at Stack Overflow Inc. who works from his home in New Jersey. His books include The Practice of Cloud System Administration (https://the-cloud-book.com), The Practice of System and Network Administration (https://the-sysadmin-book.com), and Time Management for System Administrators (https://TomOnTime.com). He blogs at EverythingSysadmin.com and posts at @YesThatTom. He holds a B.A. in computer science from Drew University.

Copyright © 2023 held by owner/author. Publication rights licensed to ACM.


Originally published in Queue vol. 21, no. 6





© ACM, Inc. All Rights Reserved.