January/February 2018 issue of acmqueue

January/February 2018

Designing Cluster Schedulers for Internet-Scale Services

Diptanu Gon Choudhury and Timothy Perrett

Embracing failures for improving availability

Despite the apparent ubiquity of cluster schedulers, operating and implementing scheduling software is an exceedingly tricky task with many nuanced edge cases. This article highlights some of these cases, based on the authors' real-world experience designing, building, and operating a variety of schedulers for large Internet companies.

Engineers looking to build scheduling systems should consider all failure modes of the underlying infrastructure they use and how operators of those systems can configure remediation strategies. They should also help keep tenant systems as stable as possible while the owners of those systems troubleshoot.

Web Services

Everything Sysadmin:
Manual Work is a Bug

  Thomas A. Limoncelli

A.B.A: Always be automating

As you work, you have a choice. Will each manual task create artifacts that allow you to accelerate future work, or will you squander these opportunities and accept the status quo? By constantly documenting and creating code-snippet artifacts, you accelerate future work. That one-shot task that could never happen again does happen again, and next time it moves faster. Even tasks that aren't worth automating can be improved by documenting them, as documentation is automation.

Everything Sysadmin

Canary Analysis Service

  Štěpán Davidovič with Betsy Beyer

Automated canarying quickens development, improves production safety, and helps prevent outages.

Google has deployed a shared centralized service called CAS (Canary Analysis Service) that offers automatic (and often autoconfigured) analysis of key metrics during a production change. CAS is used to analyze new versions of binaries, configuration changes, data-set changes, and other production changes. CAS evaluates hundreds of thousands of production changes every day at Google.

Web Services


November/December 2017

Continuous Delivery Sounds Great, but Will It Work Here?

  Jez Humble

It's not magic, it just requires continuous, daily improvement at all levels.

Continuous delivery is a set of principles, patterns, and practices designed to make deployments predictable, routine affairs that can be performed on demand at any time. This article introduces continuous delivery, presents both common objections and actual obstacles to implementing it, and describes how to overcome them using real-life examples. Continuous delivery is not magic. It's about continuous, daily improvement at all levels of the organization.


Containers Will Not Fix Your Broken Culture
(and Other Hard Truths)

  Bridget Kromhout

Complex socio-technical systems are hard; film at 11.

We focus so often on technical anti-patterns, neglecting similar problems inside our social structures. Spoiler alert: the solutions to many difficulties that seem technical can be found by examining our interactions with others. Let's talk about five things you'll want to know when working with those pesky creatures known as humans.

Business and Management

The Soft Side of Software:
How Is Your Week Going So Far?

  Kate Matsudaira

Praise matters just as much as money.

None of us hears "thank you" or "awesome job" enough at work. Being the person who praises others is a wonderful thing, especially when you follow this formula for making your praise ridiculously effective.

Business and Management, The Soft Side of Software

DevOps Metrics

  Nicole Forsgren and Mik Kersten

Your biggest mistake might be collecting the wrong data.

Delivering value to the business through software requires processes and coordination that often span multiple teams across complex systems, and involves developing and delivering software with both quality and resiliency. As practitioners and professionals, we know that software development and delivery is an increasingly difficult art and practice, and that managing and improving any process or system requires insights into that system. Therefore, measurement is paramount to creating an effective software value stream. Yet accurate measurement is no easy feat.


Kode Vicious:
Popping Kernels

Choosing between programming in the kernel or in user space

For our next product, management wants to move nearly all the functions into user space, believing that by having a safer programming environment, the team can create more features more quickly and with fewer errors. You talk about kernel programming from time to time; do you also think that the kernel is not for "mere mortals" and that most programmers should stick to working in the safer environment of user space?

Kode Vicious

Monitoring in a DevOps World

  Theo Schlossnagle

Perfect should never be the enemy of better.

Long dead are the systems that age like fine wine. Today's systems are born in an agile world and remain fluid to accommodate changes in both the supplier and the consumer landscape. A legitimate response to "adapt or die" is "I'll do DevOps!" This highly dynamic system stands to challenge traditional monitoring paradigms.

Development, Performance


September/October 2017

Research for Practice:
Cluster Scheduling for Data Centers

  Malte Schwarzkopf

Expert-curated Guides to the Best of CS Research

This installment of Research for Practice features a curated selection from Malte Schwarzkopf, who takes us on a tour of distributed cluster scheduling, from research to practice, and back again. With the rise of elastic compute resources, cluster management has become an increasingly hot topic in systems R&D, and a number of competing cluster managers including Kubernetes, Mesos, and Docker are currently jockeying for the crown in this space.

Research for Practice

Everything Sysadmin:
Operational Excellence in April Fools' Pranks

  Thomas A. Limoncelli

Being funny is serious work.

Successful pranks require care and planning. Write a design proposal and a project plan. Involve operations early. If this is a technical change to your website, perform load testing, preferably including a "dark launch" or hidden launch test. Hide the prank behind a feature flag rather than requiring a new software release. Perform a retrospective and publish the results widely.

Everything Sysadmin

Bitcoin's Underlying Incentives

  Yonatan Sompolinsky, Aviv Zohar

The unseen economic forces that govern the Bitcoin protocol

Incentives are crucial for the Bitcoin protocol's security and effectively drive its daily operation. Miners go to extreme lengths to maximize their revenue and often find creative ways to do so that are sometimes at odds with the protocol. Cryptocurrency protocols should be placed on stronger foundations of incentives. There are many areas left to improve, ranging from the very basics of mining rewards and how they interact with the consensus mechanism, through the rewards in mining pools, and all the way to the transaction fee market itself.


Kode Vicious:
Reducing the Attack Surface

Sometimes you can give the monkey a less dangerous club.

Dear KV, I've told the QA and factory teams that there is no way we should leave this code in our shipping product because of the risks that the code would pose if an attacker could access it. They say the code is now too important to the product and have asked us to secure access to it in some way. Networked access to the device is provided only over a TLS (Transport Layer Security) link, and management now thinks we ought to provide a secure shell link to the CLI as well. Personally, I would rather just rip out all this code and pretend it never existed. Is there a middle path that will make the system secure but allow the QA and factory teams to have what they are now demanding?

Kode Vicious, Security

Titus: Introducing Containers to the Netflix Cloud

  Andrew Leung, Andrew Spyker, and Tim Bozarth

Approaching container adoption in an already cloud-native infrastructure

Over the years Netflix has helped craft many cloud-native patterns, such as loosely coupled microservices and immutable infrastructure, that have become industry best practices. The all-in migration to the cloud has been hugely successful for Netflix. Despite already having a successful cloud-native architecture, Netflix is investing in container technology.

While only a fraction of Netflix's internal applications use Titus, we believe our approach has enabled Netflix to quickly adopt and benefit from containers. Though the details may be Netflix-specific, the approach of providing low-friction container adoption by integrating with existing infrastructure and working with the right early adopters can be a successful strategy for any organization looking to adopt containers.

Distributed Development

The Soft Side of Software:
Views from the Top

  Kate Matsudaira

Try to see things from a manager's perspective.

Leadership is hard. None of us comes to work to do a bad job, and there are always ways we can be better. So, when you have a leader who isn't meeting your expectations, maybe try reframing the situation and looking at things a little differently from the top down.

Business and Management, The Soft Side of Software

Abstracting the Geniuses Away from Failure Testing

  Peter Alvaro and Severine Tymon

Ordinary users need tools that automate the selection of custom-tailored faults to inject.

This article presents a call to arms for the distributed systems research community to improve the state of the art in fault tolerance testing. Ordinary users need tools that automate the selection of custom-tailored faults to inject. We conjecture that the process by which superusers select experiments can be effectively modeled in software. The article describes a prototype validating this conjecture, presents early results from the lab and the field, and identifies new research directions that can make this vision a reality.

Failure and Recovery, Quality Assurance


July/August 2017

Research for Practice:
Private Online Communication; Highlights in Systems Verification

  Albert Kwon, James Wilcox

Expert-curated Guides to the Best of CS Research

First, Albert Kwon provides an overview of recent systems for secure and private communication. While messaging protocols such as Signal provide privacy guarantees, Albert's selected research papers illustrate what is possible at the cutting edge: more transparent endpoint authentication, better protection of communication metadata, and anonymous broadcasting. These papers marry state-of-the-art cryptography with practical, privacy-preserving protocols, providing a glimpse of what we might expect from tomorrow's secure messaging systems.

Second, James Wilcox takes us on a tour of recent advances in verified systems design. It's now possible to build end-to-end verified compilers, operating systems, and distributed systems that are provably correct with respect to well-defined specifications, providing high assurance of well-defined, well-behaved code. Because these system components interact with low-level hardware like the instruction set architecture and external networks, each paper introduces new techniques to balance the tension between formal correctness and practical applicability. As programming language techniques advance and more of the modern computing stack continues to crystallize, expect these advances to make their way into production systems.

Research for Practice

Network Applications Are Interactive

  Antony Alappatt

The network era requires new models, with interactions instead of algorithms.

The miniaturization of devices and their prolific interconnectedness over high-speed wireless networks are completely changing how commerce is conducted. These changes will profoundly affect how enterprises operate. Software is at the heart of this digital world, but the software toolsets and languages were conceived for the host-based era. The issues that already plague software practice will be more profound with such an approach. It is time for software to be made simpler, more secure, and more reliable.


Escaping the Singularity:
XML and JSON Are Like Cardboard

  Pat Helland

Cardboard surrounds and protects stuff as it crosses boundaries.

Semi-structured representations of data are not the cheapest format. They typically contain a lot of extra stuff, like angle brackets. JSON, XML, and other semi-structured representations allow for wonderful flexibility and dynamic interpretation. The efficiencies and savings gained from flexibility more than make up for the overhead.

Data, Escaping the Singularity

The Soft Side of Software:
Breadth and Depth

  Kate Matsudaira

We all wear many hats, but make sure you have one that fits well.

When people ask me where they should focus their time (should I keep learning one technology or spend time learning a new one?), I ask them this question: What is the one thing you could be the best in the world at?

Business and Management, The Soft Side of Software

Cache Me If You Can

  Jacob Loveless

Building a decentralized web-delivery model

The world is more connected than it ever has been before, and with our pocket supercomputers and IoT (Internet of Things) future, the next generation of the web might just be delivered in a peer-to-peer model. It's a giant problem space, but the necessary tools and technology are here today. We just need to define the problem a little better.

Networks, Web Services

Bitcoin's Academic Pedigree

  Arvind Narayanan and Jeremy Clark

The concept of cryptocurrencies is built from forgotten ideas in research literature.

We've seen repeatedly that ideas in the research literature can be gradually forgotten or lie unappreciated, especially if they are ahead of their time, even in popular areas of research. Both practitioners and academics would do well to revisit old ideas to glean insights for present systems. Bitcoin was unusual and successful not because it was on the cutting edge of research on any of its components, but because it combined old ideas from many previously unrelated fields. This is not easy to do, as it requires bridging disparate terminology, assumptions, etc., but it is a valuable blueprint for innovation.

Education, Networks, Security

Kode Vicious:
Cold, Hard Cache

On the implementation and maintenance of caches

Dear KV, Our latest project at work requires a large number of slightly different software stacks to deploy within our cloud infrastructure. With modern hardware, I can test this deployment on a laptop. The problem I keep running up against is that our deployment system seems to secretly cache some of my files and settings and not clear them, even when I repeatedly issue the command to do so. I've resorted to repeatedly using the find command so that I can blow away the offending files. What I've found is that the system caches data in many places so I've started a list. All of which brings me to my question: Who writes this stuff?!

Kode Vicious, Networks

