Operations and Life

  Download PDF version of this article PDF

Split Your Overwhelmed Teams

Two teams of five is not the same as one team of ten.

Thomas A. Limoncelli.

 

Queue welcomes the return of columnist Thomas Limoncelli! You may remember Tom from his column, "Everything Sysadmin" which ran from 2015 to 2020. His new column, "Operations and Life" begins with this issue and will focus on DevOps and IT operations from a uniquely personal perspective.

 

A friend came to me asking for advice. His SRE team was suffering from low morale. People were burning out. Attrition was high. People leaving often cited high stress levels as the reason. There was concern that their backfill replacements wouldn't last.

Then Todd (not his real name) told me the biggest shocker: The team is overworked, yet his boss won't let him hire more people.

The way he explained this gave me the impression that he thought this was the first time in the history of the world that a request for more staff was denied. I broke the news to him as gently as I could.

After wiping the tears from his eyes, I tried to refocus the conversation the best way I knew how. I asked the most powerful question you can ask an engineer: "What problem are you trying to solve?"

He responded, "How to convince my boss to hire more people!"

"Yes," I replied, "but what problem are you trying to solve?"

He thought for a moment and said, "Low morale? High stress? The fact that everyone is so overloaded that nothing gets done?"

Yes! Now we're talking. Hiring more people is a solution, not a problem. By restating the problem as one of morale and stress, we open the door to many possible solutions.

 

The Problem

I learned a lot about the team by discussing the problem.

His team was responsible for managing six systems: the physical infrastructure (wires, cables, and physical computers), as well as cloud infrastructure, plus a series of applications that ran on top of that infrastructure.

Onboarding new people took many months as they learned all the various systems, technologies and processes, policies, and procedures. Todd said they had tried different ways to train people, manage documentation, and so on. All that helped, but it was still a lot of information to keep in your head.

It was clear to me that the main cause of the team's stress was being responsible for so many wildly different systems. The team members weren't stressed by risky, high-stakes work, as you might expect. There weren't a lot of outages. However, people were stressed because they felt incompetent. Six major areas of responsibility meant that no matter how good you became at one thing, there were other areas that you always felt embarrassingly ignorant about.

In simple terms, the team was feeling overwhelmed. This stresses people out and leads to low morale.

Todd said, "But that's all in their heads!" and I agreed. Overwhelmed is a feeling, not a physical problem. If it were a physical problem, it could be surgically removed.

Psychologists' term for this is cognitive load: the amount of working memory resources used. Like a computer running slowly because it is running too many programs at the same time, your brain is overwhelmed.

When I talked with members of the team, I often heard phrases such as "I feel stupid" or "At my last job, I was the expert, but I've been here a year and I still don't know what I'm doing."

These were highly intelligent, experienced engineers, yet they frequently put themselves down. They weren't just being humble; they really felt inadequate.

One team member told me something that explained this better than any book on management theory could. He said that with six areas of functional responsibility, he never did a single task long enough to get good at it. At previous jobs, he learned a task and then applied that learning to literally hundreds of projects. He pointed out that it is normal to feel inadequate while learning a new skill, but every time he got to use that skill, he "got the dopamine hit of a job well done."

On this team, with six major areas of responsibility, there was constant context switching. Every task would bring this team member back to the "feeling stupid" phase. There wasn't enough repetition to get to the "feeling of accomplishment" phase. Nobody wants a job that makes them "feel stupid" every day.

Yes, if he waited long enough, he would be assigned a task that let him use a skill he learned earlier. Skills fade if you don't use them, however, so he would go back to "feeling stupid," compounded by the fact that he would be kicking himself for "not taking good enough notes."

Engineers often say they enjoy learning new things, but I believe what they really enjoy is demonstrating that new knowledge. To achieve a feeling of accomplishment, they need to be able to demonstrate the new skill soon after learning it. Constant context switching among six areas of responsibility meant the dopamine hits were few and far between.

It wasn't always this way. Years earlier. the team was half the size and had half the responsibilities, and everyone was happier. There was no talk of burnout. Everything just worked better.

My suggestion was simple: Split the team in two and give each half as many responsibilities.

 

Pushback and Concerns

Todd initially pushed back against this idea. Why split the team when he could just have certain people specialize? I pointed out he had already tried that, and it wasn't working. Without the hard boundary that comes from having distinct teams, the engineers felt obligated to stay informed and be involved in all the decisions of the team. The specialists must attend meetings outside their specialty and hold the cognitive load of knowing all other functions.

No. To make this work, they needed the hard limits that come from organizational boundaries.

After some discussion, we decided to try reorganizing as two teams of five, each being responsible for three related areas. This would lessen the cognitive load, reduce context switching, and increase the likelihood of the kind of repetition that would lead to more "demonstration of skill" and less "feeling stupid."

 

How to Split a Team

The book Team Topologies (Manuel Pais and Matthew Skelton) has advice about how to divide teams using the concept of a fracture plane. This is a natural seam that allows the system to be split easily into two or more parts. These typically fall along lines of business domains, regulatory compliance, change cadence, team location, risk isolation, types of technology, or user personas.

In this case, there was an obvious natural seam along business functions: application versus the platform on which the application ran. Basically, this was splitting the team along architectural layers.

The split would have several benefits:

* Fewer communication paths. Instead of 10 people trying to communicate with 10 people, most communication would happen within each five-person team, with the tech leads of each team bridging the two.

* Easier to achieve consensus. It is easier to get five people to agree than 10. The people making the decision would be more focused on the issue. It would remove from the decision-making process people that are unaffected or simply don't care.

* Meetings that are more efficient. The more people invited to a meeting, the more difficult it is to schedule; the more likely someone will be late; the more likely someone isn't paying attention and needs to ask for information to be repeated when asked for an opinion.

* Fewer meeting hours per person. Smaller teams would mean team members would be invited to fewer pro forma meetings.

* Increased likelihood of repetition. As described earlier, we work better when we can learn a task and demonstrate that skill frequently. This requires a certain level of repetition in the type of work.

* More clearly defined boundaries and separation of duties. With one large team, the "big ball of mud" design pattern seems to come more naturally. What's a little layering violation among friends? On the other hand, two separate teams are forced to be more intentional about boundaries in technical areas, as well as in policies and processes. Separation into two teams would become a forcing function for well-defined interfaces between technical systems (APIs), as well as personal interfaces (rules of engagement).

* Lighter-weight leadership responsibilities. Instead of one overworked tech lead, now there would be two tech leads, each with a more manageable domain of responsibilities.

* More leadership opportunities. Some of the attrition was the result of a perceived lack of room for advancement. More teams would mean more opportunities.

 

Shared Oncall

One concern Todd had was how this would affect the on-call rotation. Being on call for one week out of 10 was great. The team liked being on call for no more than one or two weeks per quarter. With a five-person team, however, a team member could be on call four times each quarter, or 20 percent. That frequency is detrimental to project productivity.

Instead, the team decided to have a shared on-call rotation. They would cross-train. Each team makes a procedure book that covers any first-on-the-scene tasks for most alerts and issues. If you are on call for the other team's responsibilities, you would be trained in basic tasks but could escalate to the other team. To reduce cognitive load, you are not expected to memorize every task. Instead, all alerts and common issues are documented with a checklist.

You follow the steps of the checklist. If you get to the end of the checklist and the alert is still firing, you can escalate to the responsible team.

The checklists are treated like software: You could file a bug if you find something wrong, confusing, or incomplete. Their code-review process (github pull requests) would be used to suggest changes. If one team feels they are getting too many escalations, they could improve the documentation or the process. Thus, a feedback loop exists to ensure that documentation and training stay fresh and useful. Training new members basically involves reviewing the most-common checklists.

The increased likelihood of repetition not only helps morale, but also has some unexpected other benefits. For example, repeating the same or similar tasks provides more opportunity to make potential improvements to the process. If you do a task once a year and see how the task could be improved, there's little incentive to make those improvements. It isn't worth it. If you must do a task many times each day, you're going to make time to improve the process. Repetition leads to better, more efficient processes.

 

Challenges and Opportunities

There are also many challenges with the split-team approach. People have to let go of old responsibilities. When you are good at something, it can be difficult to leave the task to someone else. Generalists have to pick one team or the other and give up tasks where they had developed expertise and comfort.

At the team level, many existing practices they had grown comfortable with changed. Meeting schedules were reworked. Policies and procedures changed.

On the other hand, there have been unexpected new opportunities. Managing two smaller teams can be less overwhelming than managing one large team. There has even been discussion about hiring a different manager for each team. Each kind of team might need a different kind of manager. For example, one team might do better with a manager who has operational management expertise, while the other team might need a manager with more hardware experience.

 

Summary

This team's low morale and high stress were a result of the members feeling overwhelmed by too many responsibilities. The 10-by-10 communication structure made it difficult to achieve consensus, there were too many meetings, and everyone was suffering from the high cognitive load. By splitting into two teams, each can be more nimble, which the manager likes, and have a lower cognitive load, which the team likes. There is more opportunity for repetition, which lets people develop skills and demonstrate them. Altogether, this helps reduce stress and improve morale.

If your team is suffering from low morale and high stress, look at the cognitive load on the team, review its sources, and look for substantive changes that will have the desired impact. The solution might not be splitting the team, but that could be exactly what is needed.

 

Thomas A. Limoncelli is an internationally recognized author, speaker, and DevOps advocate. He is a Technical Product Manager for SRE at Stack Overflow, Inc. He previously worked at small and large companies including Google, Bell Labs/Lucent, and AT&T. His books include Time Management for System Administrators (O'Reilly), The Practice of System and Network Administration (3rd edition) and The Practice of Cloud System Administration (Pearson).

Copyright © 2022 held by owner/author. Publication rights licensed to ACM.

acmqueue

Originally published in Queue vol. 20, no. 5
Comment on this article in the ACM Digital Library





More related articles:

João Varajão, António Trigo - Assessing IT Project Success: Perception vs. Reality
This study has significant implications for practice, research, and education by providing new insights into IT project success. It expands the body of knowledge on project management by reporting project success (and not exclusively project management success), grounded in several objective criteria such as deliverables usage by the client in the post-project stage, hiring of project-related support/maintenance services by the client, contracting of new projects by the client, and vendor recommendation by the client to potential clients. Researchers can find a set of criteria they can use when studying and reporting the success of IT projects, thus expanding the current perspective on evaluation and contributing to more accurate conclusions.


Abi Noda, Margaret-Anne Storey, Nicole Forsgren, Michaela Greiler - DevEx: What Actually Drives Productivity
Developer experience focuses on the lived experience of developers and the points of friction they encounter in their everyday work. In addition to improving productivity, DevEx drives business performance through increased efficiency, product quality, and employee retention. This paper provides a practical framework for understanding DevEx, and presents a measurement framework that combines feedback from developers with data about the engineering systems they interact with. These two frameworks provide leaders with clear, actionable insights into what to measure and where to focus in order to improve developer productivity.


Jenna Butler, Catherine Yeh - Walk a Mile in Their Shoes
Covid has changed how people work in many ways, but many of the outcomes have been paradoxical in nature. What works for one person may not work for the next (or even the same person the next day), and we have yet to figure out how to predict exactly what will work for everyone. As you saw in the composite personas described here, some people struggle with isolation and loneliness, have a hard time connecting socially with their teams, or find the time pressures of hybrid work with remote teams to be overwhelming. Others relish this newfound way of working, enjoying more time with family, greater flexibility to exercise during the day, a better work/life balance, and a stronger desire to contribute to the world.


Bridget Kromhout - Containers Will Not Fix Your Broken Culture (and Other Hard Truths)
We focus so often on technical anti-patterns, neglecting similar problems inside our social structures. Spoiler alert: the solutions to many difficulties that seem technical can be found by examining our interactions with others. Let’s talk about five things you’ll want to know when working with those pesky creatures known as humans.





© ACM, Inc. All Rights Reserved.