Research for Practice


OS Scheduling

Better scheduling policies for modern computing systems

Kostis Kaffes with Introduction by Peter Alvaro

Research for Practice combines the resources of the ACM Digital Library, the largest collection of computer science research in the world, with the expertise of the ACM membership. Research for Practice columns have a common goal of sharing the joy and utility of reading computer science research between academics and their counterparts in industry.


In any system that multiplexes resources, the problem of scheduling what computations run where and when is perhaps the most fundamental. Yet, like many other essential problems in computing (e.g., query optimization in databases), academic research in scheduling moves like a pendulum, with periods of intense activity followed by periods of dormancy when it is considered a "solved" problem.

It is exciting to witness this pendulum swing back, and we are very lucky to have Kostis Kaffes curate a recent selection of outstanding research in scheduling. His choices highlight breakthroughs related to performance, extensibility, and policy choice. The first paper challenges the putative tradeoff between low latency (typically achieved by provisioning dedicated cores) and high utilization (which requires core reallocation) by making allocation decisions possible at single-microsecond granularity. The second enables the creation of arbitrary scheduling policies by factoring apart the creation and manipulation of policy, which can be handled completely by userspace agents, from the fixed kernel mechanisms that communicate events to agents and apply scheduling decisions. Finally, with flexible policies, microsecond-granularity load balancing, and fast core allocation all on the table, the final selection addresses the choice of policy on an application-by-application basis.

Kostis has put together a magnificent and cohesive selection, with an introduction that will put readers encountering scheduling research for the first time at ease. Enjoy.

—Peter Alvaro


Research for Practice: OS Scheduling

Kostis Kaffes

Scheduling has long been one of the most fundamental operations in systems and networks. It involves assigning tasks to CPUs and switching between them—decisions critical both for application performance and system efficiency. For a long time, operating system (OS) scheduling focused on fairness. Two recent developments have led to a renaissance in OS scheduling research, however. First, the emergence of cloud computing lent importance to hard-to-optimize metrics such as tail latency, and to timescales—i.e., the microsecond (μs) scale—that legacy schedulers never considered. Second, the end of Moore's law made specialization of the system stack, including scheduling, necessary for continued performance improvements.

This edition of Research for Practice presents three papers that look at ways to bring OS scheduling into the modern age. The first paper focuses on building the fastest possible scheduler, while the second one aims to make it easy to implement new policies that are compatible with existing applications and operating systems. The third paper explores the best scheduling policies for different types of applications. Ultimately, all three papers provide valuable contributions to the ongoing efforts to develop better scheduling policies for modern computing systems.


Microsecond-Scale Core Reallocation

Amy Ousterhout, Joshua Fried, Jonathan Behrens, Adam Belay, and Hari Balakrishnan (MIT CSAIL). Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads. Proceedings of the 16th Usenix Symposium on Networked Systems Design and Implementation, 2019.

The first paper, by Ousterhout et al., answers the fundamental questions of how fast core allocation can take place in an operating system and whether such reallocation is beneficial for application performance. The system presented in the paper, called Shenango, challenges the widely held belief that allocating cores across applications at microsecond timescales is not feasible because of high overhead and the potential for cache pollution. The authors show that rapid core reallocation is indeed possible and can yield significant performance benefits.

The key to achieving microsecond-scale core reallocation in the Shenango operating system is the use of a dedicated scheduling core that makes allocation decisions every 5 μs. To determine when to allocate or reclaim cores from an application, Shenango monitors the length of each application's thread-run queues and network-packet queues, using the derivative of their length as a congestion signal. This algorithm runs entirely on the dedicated core, which also manages the steering of incoming network packets to their respective target application cores.
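The derivative-based congestion signal can be sketched as follows. This is a minimal illustration of the idea, not Shenango's actual implementation: the class name, the decision labels, and the reclaim-when-empty rule are assumptions made for clarity.

```python
# Illustrative sketch of derivative-based congestion detection, loosely
# modeled on Shenango's core-allocation idea. Names and rules are
# simplified assumptions, not the paper's actual algorithm.

class CoreAllocator:
    def __init__(self):
        self.prev_len = {}  # app -> combined queue length at the previous interval

    def decide(self, app, runq_len, pktq_len):
        """Called every scheduling interval (5 us in Shenango).
        Returns 'grant' to add a core, 'reclaim' to take one away,
        or 'keep' to leave the allocation unchanged."""
        total = runq_len + pktq_len
        derivative = total - self.prev_len.get(app, 0)
        self.prev_len[app] = total
        if derivative > 0:   # queues growing: the app is congested
            return "grant"
        if total == 0:       # nothing queued: a core can be reclaimed
            return "reclaim"
        return "keep"

alloc = CoreAllocator()
print(alloc.decide("memcached", runq_len=3, pktq_len=2))  # queues grew -> grant
print(alloc.decide("memcached", runq_len=0, pktq_len=0))  # queues drained -> reclaim
```

Using the derivative rather than the absolute queue length lets the allocator react to a building backlog before latency has already degraded, which is what makes the 5-μs interval pay off.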

The authors demonstrated the efficacy of this approach by showing how fine-grain core reallocation improved the performance of latency-sensitive and batch applications co-located on the same system. By allocating cores based on the instantaneous input packet rate, Shenango reduced latency for the former and increased throughput for the latter by more than six times when using a 5-μs core reallocation interval compared with a 100-μs interval.

Subsequent work has shown that Shenango's microsecond-scale scheduler can also help mitigate interference in other system resources, such as last-level caches and memory bandwidth, and provide fine-grain feedback to the network to prevent overload.


A Framework for Deploying OS Scheduling to Linux

Jack Tigar Humphries (Google), Neel Natu (Google), Ashwin Chaugule (Google), Ofir Weisse (Google), Barret Rhoden (Google), Josh Don (Google), Luigi Rizzo (Google), Oleg Rombakh (Google), Paul Turner (Google), and Christos Kozyrakis (Stanford University and Google). ghOSt: Fast & Flexible User-space Delegation of Linux Scheduling. Proceedings of the 28th ACM Symposium on Operating Systems Principles, 2021.

Building a highly efficient scheduler such as Shenango is an interesting academic exercise, but it is not immediately applicable to a production environment with constraints such as backward compatibility with existing applications and operating systems such as Linux. This is why a team of Google engineers built a framework, called ghOSt, that makes it easy to implement different scheduling policies and deploy them to the Linux kernel.

The key observation behind ghOSt's design is that scheduling does not need to take place in the kernel. Taking inspiration from microkernels, ghOSt delegates OS scheduling to user-space agents, either global or per CPU. It consists of a bare-bones kernel scheduling class that only communicates scheduling events (e.g., a thread became runnable) to these agents, which make the scheduling decision of what to run where and then communicate it back to the kernel. Thus, developers enjoy the agility of user-space development without being burdened by the limitations of kernel code and long deployment cycles. Furthermore, ghOSt allows applications to share hints with the scheduling agents over shared memory that enable the latter to make more informed scheduling decisions.
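The division of labor can be sketched as an event loop: the kernel side only forwards events, and the policy lives entirely in a user-space agent that sends decisions back. The message format and the FIFO policy below are illustrative assumptions, not ghOSt's actual API.

```python
# Sketch of ghOSt-style delegation: the kernel scheduling class forwards
# events; a user-space agent holds all policy (a trivial FIFO here).
from collections import deque

class UserspaceAgent:
    """Global agent implementing an arbitrary policy (FIFO in this sketch)."""
    def __init__(self, cpus):
        self.runnable = deque()   # threads waiting to be placed
        self.idle_cpus = set(cpus)

    def on_event(self, event):
        # The kernel communicates scheduling events; the agent returns
        # (thread, cpu) decisions for the kernel to apply.
        kind, payload = event
        if kind == "THREAD_RUNNABLE":
            self.runnable.append(payload)
        elif kind == "CPU_IDLE":
            self.idle_cpus.add(payload)
        decisions = []
        while self.runnable and self.idle_cpus:
            cpu = self.idle_cpus.pop()
            decisions.append((self.runnable.popleft(), cpu))
        return decisions

agent = UserspaceAgent(cpus=set())
print(agent.on_event(("THREAD_RUNNABLE", "tid-7")))  # [] - no idle CPU yet
print(agent.on_event(("CPU_IDLE", 0)))               # [('tid-7', 0)]
```

Because the policy is ordinary user-space code, swapping FIFO for priorities, deadlines, or application-specific hints is just a change to this class, with no kernel rebuild or reboot.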

The biggest challenge that ghOSt had to overcome was the communication latency between the kernel component and the user-space agents, which can take up to 5 μs. This can cause (1) race conditions (e.g., the user-space agent schedules a thread to a CPU that has been removed from the thread's CPU mask) and (2) underutilization as a CPU remains idle waiting for the agent's scheduling decision.

ghOSt avoids race conditions by implementing a transaction API over shared memory that allows the agent to atomically commit scheduling decisions. To alleviate the second problem, the authors propose using custom eBPF programs that run locally on each core and schedule tasks temporarily until the agent's decision is received. The same techniques are also applicable when offloading other operating system functionality to user space (e.g., memory management).
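The transactional commit can be sketched with a sequence number that the kernel bumps on every relevant state change; a commit succeeds only if the sequence the agent observed is still current. This mirrors the idea in the paper, but the structures and names below are illustrative, not ghOSt's real shared-memory layout.

```python
# Sketch of a ghOSt-style transactional commit. The kernel bumps a
# per-thread sequence number on any state change (e.g., a CPU-mask update);
# a stale agent decision then fails to commit and must be retried.

class SchedTxn:
    def __init__(self):
        self.seqnum = {}  # thread id -> kernel-maintained sequence number

    def read_state(self, tid):
        # Agent snapshots the thread's state (just the seqnum here).
        return self.seqnum.get(tid, 0)

    def kernel_update(self, tid):
        # e.g., the thread's CPU mask changed; invalidates in-flight commits.
        self.seqnum[tid] = self.seqnum.get(tid, 0) + 1

    def commit(self, tid, observed_seq, cpu):
        # Atomic in the real system; sketched as a sequence check here.
        if self.seqnum.get(tid, 0) != observed_seq:
            return False  # state changed under us: the agent must retry
        # ... place tid on cpu's run queue ...
        return True

txn = SchedTxn()
seq = txn.read_state("tid-1")
txn.kernel_update("tid-1")          # concurrent change in the kernel
print(txn.commit("tid-1", seq, 3))  # False: the stale decision is rejected
```

The agent never blocks the kernel: a failed commit simply tells it to re-read the thread's state and decide again, which keeps the 5-μs communication latency from turning into a correctness problem.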


Choosing the Best Scheduling Policy Option

Sarah McClure (UC Berkeley), Amy Ousterhout (UC Berkeley), Scott Shenker (UC Berkeley and ICSI), Sylvia Ratnasamy (UC Berkeley). Efficient Scheduling Policies for Microsecond-scale Tasks. Proceedings of the 19th Usenix Symposium on Networked Systems Design and Implementation, 2022.

After the introduction of ghOSt, which allows for easy development and deployment of custom scheduling policies, the question arises as to which policy each application should use. To answer this question, McClure et al. conducted a comprehensive analysis of the options available.

The authors divided the scheduling process into two distinct decisions: allocating cores across applications and load-balancing tasks across CPUs within each application. Surprisingly, they found that the second decision is relatively simple; work-stealing is the best load-balancing policy in terms of both latency and efficiency, regardless of task service time distribution, number of cores, core-allocation policy, and load-balancing overheads.
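Work-stealing itself is simple to sketch: each core runs tasks from its own queue and, when idle, steals from the tail of a busy peer's queue. The deterministic steal-from-longest victim choice below is a simplification for illustration, not the policy evaluated in the paper.

```python
# Minimal work-stealing sketch: a core pops from the front of its own
# deque and, when idle, steals from the back of the longest peer deque.
from collections import deque

def step(queues):
    """Run one scheduling step over per-core task deques.
    Returns the number of tasks completed this step."""
    completed = 0
    for core, q in enumerate(queues):
        if q:
            q.popleft()          # run a local task
            completed += 1
        else:
            # Steal from the core with the longest queue (deterministic sketch).
            victim = max(range(len(queues)), key=lambda c: len(queues[c]))
            if victim != core and queues[victim]:
                q.append(queues[victim].pop())
    return completed

queues = [deque(["a", "b", "c", "d"]), deque()]  # core 1 starts idle
print(step(queues))  # 1: core 0 ran "a"; core 1 stole "d"
print(step(queues))  # 2: both cores ran a task
```

The appeal of work-stealing is exactly this locality: cores touch shared state only when they would otherwise sit idle, which is why it holds up across service-time distributions and core counts.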

In contrast, the core-allocation decision is much more nuanced. For example, contrary to past work, the authors found that proactively reclaiming cores from applications based on average delay or utilization performs better for small tasks than waiting for a CPU to become idle. They also discovered that when serving small tasks, it is better to assign a fixed number of CPUs to each application instead of allocating them dynamically.
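The proactive-reclamation idea can be sketched as a rule on average queueing delay: grant a core when delay exceeds a target, and reclaim one when delay shows ample headroom, rather than waiting for a core to go idle. The threshold and hysteresis below are made-up parameters for illustration, not values from the paper.

```python
# Illustrative sketch of delay-based proactive core allocation, in the
# spirit of McClure et al.'s finding that reclaiming on average delay
# beats waiting for idleness on small tasks. Parameters are assumptions.

def proactive_decision(avg_delay_us, cores, delay_target_us=10.0):
    """Return the new core count for an application."""
    if avg_delay_us > delay_target_us:
        return cores + 1  # delay too high: grant a core
    if avg_delay_us < delay_target_us / 2 and cores > 1:
        return cores - 1  # ample headroom: reclaim proactively
    return cores          # within band: leave the allocation alone

print(proactive_decision(15.0, cores=4))  # 5: congested, grant
print(proactive_decision(2.0, cores=4))   # 3: underloaded, reclaim
```

The gap between the grant and reclaim thresholds provides hysteresis, so the allocator does not oscillate when delay hovers near the target.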

This analysis opens up new areas of research, such as the development of new hardware that implements a scalable global queue, which in simulation was shown to perform even better than work-stealing. Furthermore, this study did not consider the presence of preemption, suggesting a need for further examination of how preemption affects scheduling decisions.



These three papers make significant contributions to an ongoing effort to develop better scheduling policies for modern computing systems. The papers highlight the need for better, more efficient, and more flexible OS schedulers; open up new areas of research; and demonstrate the importance of continued development and innovation in OS scheduling policies.


Peter Alvaro is an associate professor of computer science at the University of California Santa Cruz, where he leads the Disorderly Labs research group. His research focuses on using data-centric languages and analysis techniques to build and reason about data-intensive distributed systems in order to make them scalable, predictable, and robust to the failures and nondeterminism endemic to large-scale distribution. He earned his Ph.D. at UC Berkeley, where he studied with Joseph M. Hellerstein. He is a recipient of the National Science Foundation Career Award, Facebook Research Award, Usenix ATC 2020 Best Presentation Award, SIGMOD 2021 Distinguished PC Award, and UCSC Excellence in Teaching Award.

Kostis Kaffes is an incoming assistant professor at Columbia University and a software engineer. He received his Ph.D. in electrical engineering from Stanford University, advised by Christos Kozyrakis. His research interests include computer systems, cloud computing, and scheduling.

Copyright © 2023 held by owner/author. Publication rights licensed to ACM.


Originally published in Queue vol. 21, no. 2