The March/April 2018 issue of acmqueue is out now



March/April 2018


Escaping the Singularity
Consistently Eventual


  Pat Helland

For many data items, the work never settles on a value.

Applications are no longer islands. Not only do they frequently run distributed and replicated over many cloud-based computers, but they also run over many hand-held computers. This makes it challenging to talk about a single truth at a single place or time. In addition, most modern applications interact with other applications. These interactions settle out to impact understanding. Over time, a shared opinion emerges just as new interactions add increasing uncertainty. Many business, personal, and computational "facts" are, in fact, uncertain. As some changes settle, others meander from place to place.

Data and Databases, Escaping the Singularity


Algorithms Behind Modern Storage Systems

  Alex Petrov

Different uses for read-optimized B-trees and write-optimized LSM-trees

The amounts of data processed by applications are constantly growing. With this growth, scaling storage becomes more challenging. Every database system has its own tradeoffs. Understanding them is crucial, as it helps in selecting the right one from so many available choices.

Every application is different in terms of read/write workload balance, consistency requirements, latencies, and access patterns. Familiarizing yourself with database and storage internals facilitates architectural decisions, helps explain why a system behaves a certain way, helps troubleshoot problems when they arise, and helps you fine-tune the database for your workload.

It's impossible to optimize a system in all directions. In an ideal world there would be data structures guaranteeing the best read and write performance with no storage overhead but, of course, in practice that's not possible.

This article takes a closer look at two storage system design approaches used in a majority of modern databases and describes their use cases and tradeoffs.
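
As a rough illustration of the write-optimized side of that tradeoff, here is a minimal Python sketch of an LSM-style store. It is a toy under stated assumptions, not the article's code: the class name `TinyLSM`, the `memtable_limit` parameter, and the in-memory lists standing in for on-disk sorted files are all hypothetical, and compaction is left out entirely. Writes touch only a small in-memory table that is periodically frozen into a sorted run, while reads may have to check the memtable and then every run from newest to oldest.

```python
import bisect


class TinyLSM:
    """Toy LSM-style key-value store (illustrative only; names are hypothetical)."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}   # recent writes, unsorted
        self.segments = []   # immutable sorted runs, oldest first

    def put(self, key, value):
        # A write only touches the in-memory table, which keeps it cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Freeze the memtable as a sorted run (a stand-in for an on-disk file).
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key, default=None):
        # A read checks the memtable, then every run from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            # (key,) sorts just before (key, value), so this finds the entry.
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return default
```

A B-tree makes the opposite choice: it updates a single sorted structure in place, so a lookup follows one path instead of consulting several runs, at the cost of more work per write.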

Storage


Kode Vicious:
Every Silver Lining Has a Cloud


Cache is king. And if your cache is cut, you're going to feel it.

Clearly, your management has never heard the phrase, "You get what you pay for." Or perhaps they heard it and didn't realize it applied to them. The savings in cloud computing come at the expense of a loss of control over your systems, which is summed up best in the popular nerd sticker that says, "The Cloud is Just Other People's Computers."

Some providers now have something called Metal-as-a-Service, which I really think ought to mean that an '80s metal band shows up at your office, plays a gig, smashes the furniture, and urinates on the carpet, but alas, it's just the cloud providers' way of finally admitting that cloud computing isn't really the right answer for all applications. For systems that require deterministic performance guarantees to work well, you really have to think very hard about whether or not a cloud-based system is the right answer, because providing deterministic guarantees requires quite a bit of control over the variables in the environment. Cloud systems are not about giving you control; they're about the owner of the systems having the control.

Distributed Computing, Kode Vicious


C Is Not a Low-level Language

  David Chisnall

Your computer is not a fast PDP-11.

In the wake of the recent Meltdown and Spectre vulnerabilities, it's worth spending some time looking at root causes. Both of these vulnerabilities involved processors speculatively executing instructions past some kind of access check and allowing the attacker to observe the results via a side channel. The features that led to these vulnerabilities, along with several others, were added to let C programmers continue to believe they were programming in a low-level language, when this hasn't been the case for decades.

There is a common myth in software development that parallel programming is hard. This would come as a surprise to Alan Kay, who was able to teach an actor-model language to young children, with which they wrote working programs with more than 200 threads. It comes as a surprise to Erlang programmers, who commonly write programs with thousands of parallel components. It's more accurate to say that parallel programming in a language with a C-like abstract machine is difficult, and given the prevalence of parallel hardware, from multicore CPUs to many-core GPUs, that's just another way of saying that C doesn't map to modern hardware very well.

Languages


 


January/February 2018

Research for Practice:
Prediction-Serving Systems


  Dan Crankshaw and Joseph Gonzalez

Expert-curated Guides to the Best of CS Research

This installment of Research for Practice features a curated selection from Dan Crankshaw and Joey Gonzalez, who provide an overview of machine learning serving systems. What happens when we wish to actually deploy a machine learning model to production, and how do we serve predictions with high accuracy and high computational efficiency? Dan and Joey's selection provides a thoughtful set of cutting-edge techniques spanning database-level integration, video processing, and prediction middleware. Given the explosion of interest in machine learning and its increasing impact on seemingly every application vertical, it's possible that systems such as these will become as commonplace as relational databases are today.

Artificial Intelligence, Research for Practice


Kode Vicious:
Watchdogs vs. Snowflakes


Taking wild-ass guesses with your distributed job-control system

That a system can randomly jam doesn't just indicate a serious bug in the system; it is also a major source of risk. You don't say what your distributed job-control system controls, but let's just say I hope it's not something with significant, real-world side effects, like a power station, jet aircraft, or financial trading system. The risk, of course, is that the system will jam, not when it's convenient for someone to add a dummy job to clear the jam, but during some operation that could cause data loss or return incorrect results. I rather suspect that having a system like this jam while coordinating, for example, the balancing of electrical power across a power grid would have spectacular and perhaps fatal results.

Distributed Computing, Kode Vicious


Thou Shalt Not Depend on Me

Tobias Lauinger, Abdelberi Chaabane, and Christo B. Wilson

A look at JavaScript libraries in the wild

Many websites use third-party components such as JavaScript libraries, which bundle useful functionality so that developers can avoid reinventing the wheel. But what happens when libraries have security issues? Chances are that websites using such libraries inherit these issues and become vulnerable to attacks.

Given the risk of using a library with known vulnerabilities, it is important to know how often this happens in practice and, more importantly, who is to blame for the inclusion of vulnerable libraries.

We set out to answer these questions and found that with 37 percent of websites using at least one known vulnerable library, and libraries often being included in quite unexpected ways, there clearly is room for improvement in library handling on the web. To that end, this article makes a few recommendations about what can be done to improve the situation.

Programming Languages


The Soft Side of Software
How to Come Up with Great Ideas


  Kate Matsudaira

Think like an entrepreneur.

I started my career working in big companies but always dreamed of starting my own. I would read online forums and articles about successful entrepreneurs. I was enamored with the idea of doing a startup. The problem was I didn't have any ideas. Fast forward 10 years and I have so many ideas that choosing the right one is the challenge. I am constantly coming up with ideas and opportunities that could turn into a product, or a whole company. There is no shortage of things that I could do. The key is you have to learn to think like an entrepreneur.

Business and Management, The Soft Side of Software


Designing Cluster Schedulers for Internet-Scale Services

Diptanu Gon Choudhury and Timothy Perrett

Embracing failures for improving availability

Despite the apparent ubiquity of schedulers, implementing and operating scheduling software is an exceedingly tricky task with many nuanced edge cases. This article highlights some of these cases based on the authors' real-world experience designing, building, and operating a variety of schedulers for large Internet companies.

Engineers building scheduling systems should consider every failure mode of the underlying infrastructure and how operators can configure remediation strategies, while keeping tenant systems as stable as possible as their owners troubleshoot them.

Web Services


Everything Sysadmin:
Manual Work is a Bug


  Thomas A. Limoncelli

A.B.A: Always be automating

As you work, you have a choice. Will each manual task create artifacts that allow you to accelerate future work, or will you squander these opportunities and accept the status quo? By constantly documenting and creating code-snippet artifacts, you accelerate future work. That one-shot task that could never happen again does happen again, and next time it moves faster. Even tasks that aren't worth automating can be improved by documenting them, as documentation is automation.

Everything Sysadmin


Canary Analysis Service

  Štěpán Davidovič with Betsy Beyer

Automated canarying quickens development, improves production safety, and helps prevent outages.

Google has deployed a shared centralized service called CAS (Canary Analysis Service) that offers automatic (and often autoconfigured) analysis of key metrics during a production change. CAS is used to analyze new versions of binaries, configuration changes, data-set changes, and other production changes. CAS evaluates hundreds of thousands of production changes every day at Google.
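
To make the idea of analyzing a key metric during a production change concrete, here is a minimal Python sketch. It is not how CAS works, which this summary does not describe; the function name `canary_verdict`, the `tolerance` parameter, and the simple mean-comparison rule are assumptions chosen for illustration. It compares one "higher is worse" metric, such as error rate, between the existing version (control) and the new version (canary).

```python
from statistics import mean


def canary_verdict(control, canary, tolerance=0.10):
    """Return "PASS" or "FAIL" for one metric where higher is worse.

    control and canary are lists of samples (for example, error rates) from
    the old and new versions. tolerance is the allowed relative increase.
    All names and the threshold rule are assumptions for illustration.
    """
    if not control or not canary:
        raise ValueError("both populations need at least one sample")
    baseline = mean(control)
    observed = mean(canary)
    # Fail when the canary is worse than the baseline by more than the tolerance.
    limit = baseline * (1 + tolerance)
    return "PASS" if observed <= limit else "FAIL"
```

A production system would need a more careful statistical comparison and many metrics per change; the sketch shows only the shape of the decision.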

Web Services


 


November/December 2017

Continuous Delivery Sounds Great, but Will It Work Here?

  Jez Humble

It's not magic, it just requires continuous, daily improvement at all levels.

Continuous delivery is a set of principles, patterns, and practices designed to make deployments predictable, routine affairs that can be performed on demand at any time. This article introduces continuous delivery, presents both common objections and actual obstacles to implementing it, and describes how to overcome them using real-life examples. Continuous delivery is not magic. It's about continuous, daily improvement at all levels of the organization.

Development


Containers Will Not Fix Your Broken Culture (and Other Hard Truths)


  Bridget Kromhout

Complex socio-technical systems are hard; film at 11.


We focus so often on technical anti-patterns, neglecting similar problems inside our social structures. Spoiler alert: the solutions to many difficulties that seem technical can be found by examining our interactions with others. Let's talk about five things you'll want to know when working with those pesky creatures known as humans.

Business and Management


The Soft Side of Software
How Is Your Week Going So Far?


  Kate Matsudaira

Praise matters just as much as money.

None of us hears "thank you" or "awesome job" enough at work. The person who praises other people is an amazing person to be, especially when you follow this formula for making your praise ridiculously effective.

Business and Management, The Soft Side of Software


DevOps Metrics

  Nicole Forsgren and Mik Kersten

Your biggest mistake might be collecting the wrong data.

Delivering value to the business through software requires processes and coordination that often span multiple teams across complex systems, and involves developing and delivering software with both quality and resiliency. As practitioners and professionals, we know that software development and delivery is an increasingly difficult art and practice, and that managing and improving any process or system requires insights into that system. Therefore, measurement is paramount to creating an effective software value stream. Yet accurate measurement is no easy feat.

Development


Kode Vicious:
Popping Kernels


Choosing between programming in the kernel or in user space

For our next product, management wants to move nearly all the functions into user space, believing that by having a safer programming environment, the team can create more features more quickly and with fewer errors. You talk about kernel programming from time to time; do you also think that the kernel is not for "mere mortals" and that most programmers should stick to working in the safer environment of user space?

Kode Vicious


Monitoring in a DevOps World

  Theo Schlossnagle

Perfect should never be the enemy of better.

Long dead are the systems that age like fine wine. Today's systems are born in an agile world and remain fluid to accommodate changes in both the supplier and the consumer landscape. A legitimate response to "adapt or die" is "I'll do DevOps!" This highly dynamic system stands to challenge traditional monitoring paradigms.

Development, Performance


 


September/October 2017


Research for Practice:
Cluster Scheduling for Data Centers


  Malte Schwarzkopf

Expert-curated Guides to the Best of CS Research

This installment of Research for Practice features a curated selection from Malte Schwarzkopf, who takes us on a tour of distributed cluster scheduling, from research to practice, and back again. With the rise of elastic compute resources, cluster management has become an increasingly hot topic in systems R&D, and a number of competing cluster managers including Kubernetes, Mesos, and Docker are currently jockeying for the crown in this space.

Research for Practice


Everything Sysadmin:
Operational Excellence in April Fools' Pranks


  Thomas A. Limoncelli

Being funny is serious work.

Successful pranks require care and planning. Write a design proposal and a project plan. Involve operations early. If this is a technical change to your website, perform load testing, preferably including a "dark launch" or hidden launch test. Hide the prank behind a feature flag rather than requiring a new software release. Perform a retrospective and publish the results widely.
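
The feature-flag advice can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the column: the `FeatureFlags` class, the flag name `april_fools_prank`, and the April 1 date check are all hypothetical. The point is that the prank code ships ahead of time in the off state and is switched on by changing a flag, with no new release.

```python
from datetime import date


class FeatureFlags:
    """In-memory flag store; a real system would load flags from configuration."""

    def __init__(self):
        self._flags = {}

    def set(self, name, enabled):
        self._flags[name] = bool(enabled)

    def is_enabled(self, name):
        # Unknown flags default to off, so a missing flag never launches anything.
        return self._flags.get(name, False)


def homepage_banner(flags: FeatureFlags, today: date) -> str:
    # The prank shows only when the flag is on and the date is April 1.
    if flags.is_enabled("april_fools_prank") and (today.month, today.day) == (4, 1):
        return "prank banner"
    return "regular banner"
```

Turning the flag off again is the rollback plan: the same switch that launches the prank also removes it.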

Everything Sysadmin


Bitcoin's Underlying Incentives

  Yonatan Sompolinsky and Aviv Zohar

The unseen economic forces that govern the Bitcoin protocol

Incentives are crucial for the Bitcoin protocol's security and effectively drive its daily operation. Miners go to extreme lengths to maximize their revenue and often find creative ways to do so that are sometimes at odds with the protocol. Cryptocurrency protocols should be placed on stronger foundations of incentives. There are many areas left to improve, ranging from the very basics of mining rewards and how they interact with the consensus mechanism, through the rewards in mining pools, and all the way to the transaction fee market itself.

Networks


Kode Vicious:
Reducing the Attack Surface


Sometimes you can give the monkey a less dangerous club.

Dear KV, I've told the QA and factory teams that there is no way we should leave this code in our shipping product because of the risks that the code would pose if an attacker could access it. They say the code is now too important to the product and have asked us to secure access to it in some way. Networked access to the device is provided only over a TLS (Transport Layer Security) link, and management now thinks we ought to provide a secure shell link to the CLI as well. Personally, I would rather just rip out all this code and pretend it never existed. Is there a middle path that will make the system secure but allow the QA and factory teams to have what they are now demanding?

Kode Vicious, Security


Titus: Introducing Containers to the Netflix Cloud

  Andrew Leung, Andrew Spyker, and Tim Bozarth

Approaching container adoption in an already cloud-native infrastructure

Over the years Netflix has helped craft many cloud-native patterns, such as loosely coupled microservices and immutable infrastructure, that have become industry best practices. The all-in migration to the cloud has been hugely successful for Netflix. Despite already having a successful cloud-native architecture, Netflix is investing in container technology.

While only a fraction of Netflix's internal applications use Titus, we believe our approach has enabled Netflix to quickly adopt and benefit from containers. Though the details may be Netflix-specific, the approach of providing low-friction container adoption by integrating with existing infrastructure and working with the right early adopters can be a successful strategy for any organization looking to adopt containers.

Distributed Development


The Soft Side of Software
Views from the Top


  Kate Matsudaira

Try to see things from a manager's perspective.

Leadership is hard. None of us comes to work to do a bad job, and there are always ways we can be better. So, when you have a leader who isn't meeting your expectations, maybe try reframing the situation and looking at things a little differently from the top down.

Business and Management, The Soft Side of Software


Abstracting the Geniuses Away from Failure Testing

  Peter Alvaro and Severine Tymon

Ordinary users need tools that automate the selection of custom-tailored faults to inject.

This article presents a call to arms for the distributed systems research community to improve the state of the art in fault tolerance testing. Ordinary users need tools that automate the selection of custom-tailored faults to inject. We conjecture that the process by which superusers select experiments can be effectively modeled in software. The article describes a prototype validating this conjecture, presents early results from the lab and the field, and identifies new research directions that can make this vision a reality.

Failure and Recovery, Quality Assurance


 



