

September/October 2018


Case Study
CodeFlow: Improving the Code Review Process at Microsoft


A discussion with Jacek Czerwonka, Michaela Greiler, Christian Bird, Lucas Panjer, and Terry Coatta

People may associate code reviews with debugging, but that's not as central to the code-review process as you might think. The real win comes in the form of improved long-term code maintainability.

Case Studies, Workflow


Benchmarking "Hello, World!"

  Richard L. Sites

Six different views of the execution of "Hello, World!" show what is often missing in today's tools

Too often a service provider has a performance promise to keep but few tools for measuring the existence of laggard transactions, and none at all for understanding their root causes. As more and more software moves off the desktop and into data centers, and more and more cell phones use server requests as the other half of apps, observation tools for large-scale distributed transaction systems are not keeping up. Know what each tool you use is blind to, know what information you need to understand a performance problem, and then look for tools that can actually observe that information directly.

Development, Performance


 


July/August 2018


Using Remote Cache Service for Bazel

  Alpha Lam

Save time by sharing and reusing build and test output

Bazel is an actively developed open-source build and test system that aims to increase productivity in software development. It has a growing number of optimizations to improve the performance of daily development tasks. The remote cache service is a new development that significantly reduces the time spent running builds and tests. It is particularly useful for large code bases and for development teams of any size.
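As a minimal illustration of what opting in to a shared cache looks like in practice (the cache endpoint below is hypothetical; the flags are Bazel's standard remote-caching options):

    # .bazelrc: point builds and tests at a shared remote cache
    build --remote_cache=https://build-cache.example.com
    # read-only clients (e.g., developer workstations) can skip uploading
    # their local results and let CI populate the cache
    build --remote_upload_local_results=false

With a cache configured, any build or test action whose inputs have not changed is served from the cache instead of being re-executed locally.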

Development


Kode Vicious
A Chance Gardener


Harvesting open-source products and planting the next crop

It is a very natural progression for a company to go from being a pure consumer of open source, to interacting with projects via patch submissions, to becoming a direct contributor. No one would expect a company to contribute directly to all the open-source projects it consumes, as most companies consume far more software than they could ever produce; that is the bounty of the open-source garden. It ought to be the goal of every company consuming open source to contribute something back, however, so that its garden continues to bear fruit instead of rotting vegetables.

Kode Vicious, Open Source


Why SRE Documents Matter

  Shylaja Nukala, Vivek Rau

How documentation enables SRE teams to manage new and existing services

SRE (site reliability engineering) is a job function, a mindset, and a set of engineering approaches for making web products and services run reliably. SREs operate at the intersection of software development and systems engineering to solve operational problems and engineer solutions to design, build, and run large-scale distributed systems scalably, reliably, and efficiently. A mature SRE team likely has well-defined bodies of documentation associated with many SRE functions. If you manage an SRE team or intend to start one, this article will help you understand the types of documents your team needs to write and why each type is needed, allowing you to plan for and prioritize documentation work along with other team projects.

Web Development


How to Live in a Post-Meltdown and -Spectre World

  Rich Bennett, Craig Callahan, Stacy Jones, Matt Levine, Merrill Miller, and Andy Ozment

Learn from the past to prepare for the next battle.

The scope of vulnerabilities such as Meltdown and Spectre is so vast that it can be difficult to address. Even for an organization like Goldman Sachs, with dedicated threat, vulnerability-management, and infrastructure teams, this is an incredibly complex situation; navigating it is likely even harder for a small or medium-sized business without dedicated triage teams. We rely heavily on vendor coordination for clarity on patch dependencies and still have to move forward with less-than-perfect answers at times.

Good cyber-hygiene practices remain foundational—the nature of the vulnerability is different, but the framework and approach to managing it are not. In a world of zero days and multidimensional vulnerabilities such as Spectre and Meltdown, the speed and effectiveness of the response to triage and prioritizing risk-reduction efforts are vital to all organizations. More high-profile and complex vulnerabilities are sure to follow, so now is a good time to take lessons learned from Spectre and Meltdown and use them to help prepare for the next battle.

Security


The Soft Side of Software
How to Get Things Done When You Don't Feel Like It


  Kate Matsudaira

Five strategies for pushing through

If you want to be successful, then it serves you better to rise to the occasion no matter what. That means learning how to push through challenges and deliver valuable results.

Business and Management, The Soft Side of Software


Tracking and Controlling Microservice Dependencies

  Silvia Esparrachiari, Tanya Reilly, and Ashleigh Rentz

Dependency management is a crucial part of system and software design.

Dependency cycles will be familiar to you if you have ever locked your keys inside your house or car. You can't open the lock without the key, but you can't get the key without opening the lock. Some cycles are obvious, but more complex dependency cycles can be challenging to find before they lead to outages. Strategies for tracking and controlling dependencies are necessary for maintaining reliable systems.

Dependencies can be tracked by observing the behavior of a system, but preventing dependency problems before they reach production requires a more active strategy. Implementing dependency control ensures that each new dependency can be added to a DAG (directed acyclic graph) before it enters use. This gives system designers the freedom to add new dependencies where they are valuable, while eliminating much of the risk that comes from the uncontrolled growth of dependencies.
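To make the idea concrete, here is a minimal sketch (in Python, with hypothetical service names; the article does not prescribe any particular implementation) of a dependency-control check that rejects a new edge if it would close a cycle in the dependency graph:

    # Minimal sketch: keep the service dependency graph a DAG by refusing
    # any new edge that would close a cycle. Names are illustrative only.
    from collections import defaultdict

    graph = defaultdict(set)  # service -> services it depends on

    def would_create_cycle(src, dst):
        """True if adding src -> dst would make src reachable from dst."""
        stack, seen = [dst], set()
        while stack:
            node = stack.pop()
            if node == src:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(graph[node])
        return False

    def add_dependency(src, dst):
        if would_create_cycle(src, dst):
            raise ValueError(f"{src} -> {dst} would create a dependency cycle")
        graph[src].add(dst)

    add_dependency("frontend", "auth")
    add_dependency("auth", "quota")
    # add_dependency("quota", "frontend")  # rejected: frontend -> auth -> quota -> frontend

A real dependency-control system would apply the same kind of check when a dependency is declared, before the new edge ever reaches production.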

Development, Web Services



 


May/June 2018


Kode Vicious
The Obscene Coupling Known as Spaghetti Code


Teach your junior programmers how to read code

Communication is just a fancy word for storytelling, something that humans have probably been doing since before we acquired language. Unless you are an accomplished surrealist, you tell a story by starting at the beginning, then over the course of time exposing the reader to more of the details, finally arriving at the end where, hopefully, the reader experiences a satisfying bit of closure. The goal of the writer (or coder) is to form in the mind of the reader the same image the writer had. That is the process of communication, and it doesn't matter whether it's prose, program, or poetry—at the end of the day, if the recipient of our message has no clue what we meant, then all was for naught.

Development, Kode Vicious


Corp to Cloud: Google's Virtual Desktops

  Matt Fata, Philippe-Joseph Arida, Patrick Hahn, and Betsy Beyer

How Google moved its virtual desktops to the cloud

Over one-fourth of Googlers use internal, data-center-hosted virtual desktops. This on-premises offering sits in the corporate network and allows users to develop code, access internal resources, and use GUI tools remotely from anywhere in the world. Among its most notable features, a virtual desktop instance can be sized according to the task at hand, has persistent user storage, and can be moved between corporate data centers to follow traveling Googlers.

Until recently, our virtual desktops were hosted on commercially available hardware on Google's corporate network using a homegrown open-source virtual cluster-management system called Ganeti. Today, this substantial and Google-critical workload runs on GCP (Google Cloud Platform). This article discusses the reasons for the move to GCP and how the migration was accomplished.

Distributed Computing


Mind Your State for Your State of Mind

  Pat Helland

The interactions between storage and applications can be complex and subtle.

Applications have had an interesting evolution as they have moved into the distributed and scalable world. Similarly, storage and its cousin databases have changed side by side with applications. Many times, the semantics, performance, and failure models of storage and applications do a subtle dance as they change in support of changing business requirements and environmental challenges. Adding scale to the mix has really stirred things up. This article looks at some of these issues and their impact on systems.

Storage


Research for Practice
Knowledge Base Construction in the Machine-learning Era


  Alex Ratner and Chris Ré

Three critical design points: joint learning, weak supervision, and new representations

This installment of Research for Practice features a curated selection from Alex Ratner and Chris Ré, who provide an overview of recent developments in Knowledge Base Construction (KBC). While knowledge bases have a long history dating to the expert systems of the 1970s, recent advances in machine learning have led to a knowledge base renaissance, with knowledge bases now powering major product functionality including Google Assistant, Amazon Alexa, Apple Siri, and Wolfram Alpha. Ratner and Ré's selections highlight key considerations in the modern KBC process, from interfaces that extract knowledge from domain experts to algorithms and representations that transfer knowledge across tasks.

AI, Research for Practice


The Soft Side of Software
The Secret Formula for Choosing the Right Next Role


  Kate Matsudaira

The best careers are not defined by titles or resume bullet points.

When you are searching for the next step in your career, don't just think about the surface-level benefits. Drill down on your biggest goals and do a little thinking about whether or not each job will help you get closer to those goals. The smarter you are about what you choose next, the closer you will get to the things you truly want from your life and your work.

Business and Management, The Soft Side of Software


The Mythos of Model Interpretability

  Zachary C. Lipton

In machine learning, the concept of interpretability is both important and slippery.

Supervised machine-learning models boast remarkable predictive capabilities. But can you trust your model? Will it work in deployment? What else can it tell you about the world? Models should be not only good, but also interpretable, yet the task of interpretation appears underspecified. The academic literature has provided diverse and sometimes non-overlapping motivations for interpretability and has offered myriad techniques for rendering models interpretable. Despite this ambiguity, many authors proclaim their models to be interpretable axiomatically, absent further argument. Problematically, it is not clear what common properties unite these techniques.

This article seeks to refine the discourse on interpretability. First it examines the objectives of previous papers addressing interpretability, finding them to be diverse and occasionally discordant. Then, it explores model properties and techniques thought to confer interpretability, identifying transparency to humans and post hoc explanations as competing concepts. Throughout, the feasibility and desirability of different notions of interpretability are discussed. The article questions the oft-made assertions that linear models are interpretable and that deep neural networks are not.

AI


Everything Sysadmin
GitOps: A Path to More Self-service IT


  Thomas A. Limoncelli

IaC + PR = GitOps

GitOps lowers the cost of creating self-service IT systems, enabling self-service operations where previously they could not be justified. It improves the ability to operate the system safely, permitting regular users to make big changes. Safety improves as more tests are added. Security audits become easier as every change is tracked.

Everything Sysadmin, Systems Administration


 


March/April 2018


Research for Practice
FPGAs in Data Centers


  Gustavo Alonso

Expert-curated Guides to the Best of CS Research

As Moore's Law has slowed and the computational overheads of datacenter workloads have continued to rise, FPGAs offer an increasingly attractive point in the trade-off between power and performance. Gustavo's selections highlight early successes and practical deployment considerations that inform the ongoing, high-stakes debate about the future of datacenter- and cloud-based computation substrates.

Performance, Research for Practice


Workload Frequency Scaling Law — Derivation and Verification

  Noor Mubeen

Workload scalability has a cascade relation via the scale factor.

Many processors expose performance-monitoring counters that help measure the "productive performance" associated with workloads. Productive performance is typically represented by the scale factor, a term that refers to the extent of stalls compared with stall-free cycles within a time window. The scale factor of a workload is also influenced by the clock frequency selected by frequency-selection governors. Hence, in a DVFS (dynamic voltage/frequency scaling) system, the utilization, power, and performance outputs are also functions of the scale factor and its variations. Some governance algorithms treat the scale factor in ways that are native to their governance philosophy.

This article presents equations that relate to workload utilization scaling at the level of a single DVFS subsystem. A relation between frequency, utilization, and scale factor is established. Verifying these equations turns out to be tricky, since the utilization inherent to a workload also varies, seemingly in an unspecified manner, at the granularity of governance samples. Thus, a novel approach called histogram ridge trace is applied. Quantifying the scaling impact is critical when treating DVFS as a building block. Typical applications include DVFS governors and other layers that influence the utilization, power, and performance of the system. The scope here, though, is limited to demonstrating well-quantified and verified scaling equations.
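To make the relation concrete, here is a simple first-order model offered purely as an illustration (an assumption for exposition, not the equation derived in the article): treat only the stall-free fraction of busy cycles as frequency-sensitive. If s is the stalled fraction of a workload's busy time at frequency f1 and utilization u1, then running the same work at frequency f2 gives roughly

    u_2 \approx u_1 \left[ (1 - s)\,\frac{f_1}{f_2} + s \right]

A fully stall-free workload (s = 0) sees its utilization scale inversely with frequency, while a fully stalled one (s = 1) does not scale at all; the article's derivation and its histogram-ridge-trace verification address the realistic cases in between.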

Performance


Escaping the Singularity
Consistently Eventual


  Pat Helland

For many data items, the work never settles on a value.

Applications are no longer islands. Not only do they frequently run distributed and replicated over many cloud-based computers, but they also run over many hand-held computers. This makes it challenging to talk about a single truth at a single place or time. In addition, most modern applications interact with other applications. These interactions settle out to impact understanding. Over time, a shared opinion emerges just as new interactions add increasing uncertainty. Many business, personal, and computational "facts" are, in fact, uncertain. As some changes settle, others meander from place to place.

Data and Databases, Escaping the Singularity


Algorithms Behind Modern Storage Systems

  Alex Petrov

Different uses for read-optimized B-trees and write-optimized LSM-trees

The amount of data processed by applications is constantly growing, and with this growth, scaling storage becomes more challenging. Every database system has its own tradeoffs, and understanding them is crucial to choosing the right system from the many available options.

Every application is different in terms of read/write workload balance, consistency requirements, latencies, and access patterns. Familiarizing yourself with database and storage internals facilitates architectural decisions, helps explain why a system behaves a certain way, helps troubleshoot problems when they arise, and fine-tunes the database for your workload.

It's impossible to optimize a system in all directions. In an ideal world there would be data structures guaranteeing the best read and write performance with no storage overhead but, of course, in practice that's not possible.

This article takes a closer look at two storage system design approaches used in a majority of modern databases and describes their use cases and tradeoffs.

Storage


Kode Vicious
Every Silver Lining Has a Cloud


Cache is king. And if your cache is cut, you're going to feel it.

Clearly, your management has never heard the phrase, "You get what you pay for." Or perhaps they heard it and didn't realize it applied to them. The savings in cloud computing comes at the expense of a loss of control over your systems, which is summed up best in the popular nerd sticker that says, "The Cloud is Just Other People's Computers."

Some providers now have something called Metal-as-a-Service, which I really think ought to mean that an '80s metal band shows up at your office, plays a gig, smashes the furniture, and urinates on the carpet, but alas, it's just the cloud providers' way of finally admitting that cloud computing isn't really the right answer for all applications. For systems that require deterministic performance guarantees to work well, you really have to think very hard about whether or not a cloud-based system is the right answer, because providing deterministic guarantees requires quite a bit of control over the variables in the environment. Cloud systems are not about giving you control; they're about the owner of the systems having the control.

Distributed Computing, Kode Vicious


C Is Not a Low-level Language

  David Chisnall

Your computer is not a fast PDP-11.

In the wake of the recent Meltdown and Spectre vulnerabilities, it's worth spending some time looking at root causes. Both of these vulnerabilities involved processors speculatively executing instructions past some kind of access check and allowing the attacker to observe the results via a side channel. The features that led to these vulnerabilities, along with several others, were added to let C programmers continue to believe they were programming in a low-level language, when this hasn't been the case for decades.

There is a common myth in software development that parallel programming is hard. This would come as a surprise to Alan Kay, who was able to teach an actor-model language to young children, with which they wrote working programs with more than 200 threads. It comes as a surprise to Erlang programmers, who commonly write programs with thousands of parallel components. It's more accurate to say that parallel programming in a language with a C-like abstract machine is difficult, and given the prevalence of parallel hardware, from multicore CPUs to many-core GPUs, that's just another way of saying that C doesn't map to modern hardware very well.

Languages


 


January/February 2018

Research for Practice
Prediction-Serving Systems


  Dan Crankshaw and Joseph Gonzalez

Expert-curated Guides to the Best of CS Research

This installment of Research for Practice features a curated selection from Dan Crankshaw and Joey Gonzalez, who provide an overview of machine-learning serving systems. What happens when we wish to actually deploy a machine-learning model to production, and how do we serve predictions with high accuracy and high computational efficiency? Dan and Joey's picks provide a thoughtful selection of cutting-edge techniques spanning database-level integration, video processing, and prediction middleware. Given the explosion of interest in machine learning and its increasing impact on seemingly every application vertical, it's possible that systems such as these will become as commonplace as relational databases are today.

AI, Research for Practice


Kode Vicious
Watchdogs vs. Snowflakes


Taking wild-ass guesses with your distributed job-control system

That a system can randomly jam doesn't just indicate a serious bug in the system; it is also a major source of risk. You don't say what your distributed job-control system controls, but let's just say I hope it's not something with significant, real-world side effects, like a power station, jet aircraft, or financial trading system. The risk, of course, is that the system will jam, not when it's convenient for someone to add a dummy job to clear the jam, but during some operation that could cause data loss or return incorrect results. I rather suspect that having a system like this jam while coordinating, for example, the balancing of electrical power across a power grid would have spectacular and perhaps fatal results.

Distributed Computing, Kode Vicious


Thou Shalt Not Depend on Me

Tobias Lauinger, Abdelberi Chaabane, and Christo B. Wilson

A look at JavaScript libraries in the wild

Many websites use third-party components such as JavaScript libraries, which bundle useful functionality so that developers can avoid reinventing the wheel. But what happens when libraries have security issues? Chances are that websites using such libraries inherit these issues and become vulnerable to attacks.

Given the risk of using a library with known vulnerabilities, it is important to know how often this happens in practice and, more importantly, who is to blame for the inclusion of vulnerable libraries.

We set out to answer these questions and found that with 37 percent of websites using at least one known vulnerable library, and libraries often being included in quite unexpected ways, there clearly is room for improvement in library handling on the web. To that end, this article makes a few recommendations about what can be done to improve the situation.

Programming Languages


The Soft Side of Software
How to Come Up with Great Ideas


  Kate Matsudaira

Think like an entrepreneur.

I started my career working in big companies but always dreamed of starting my own. I would read online forums and articles about successful entrepreneurs. I was enamored with the idea of doing a startup. The problem was I didn't have any ideas. Fast forward 10 years and I have so many ideas that choosing the right one is the challenge. I am constantly coming up with ideas and opportunities that could turn into a product, or a whole company. There is no shortage of things that I could do. The key is you have to learn to think like an entrepreneur.

Business and Management, The Soft Side of Software


Designing Cluster Schedulers for Internet-Scale Services

Diptanu Gon Choudhury and Timothy Perrett

Embracing failures for improving availability

Despite the apparent ubiquity of cluster schedulers, operating and implementing scheduling software is an exceedingly tricky task with many nuanced edge cases. This article highlights some of these cases based on the authors' real-world experience designing, building, and operating a variety of schedulers for large Internet companies.

Engineers looking to build scheduling systems should consider all the failure modes of the underlying infrastructure they use, how operators can configure remediation strategies, and how to keep tenant systems as stable as possible while their owners troubleshoot.

Web Services


Everything Sysadmin
Manual Work is a Bug


  Thomas A. Limoncelli

A.B.A: Always be automating

As you work, you have a choice. Will each manual task create artifacts that allow you to accelerate future work, or will you squander these opportunities and accept the status quo? By constantly documenting and creating code-snippet artifacts, you accelerate future work. That one-shot task that could never happen again does happen again, and next time it moves faster. Even tasks that aren't worth automating can be improved by documenting them, as documentation is automation.

Everything Sysadmin


Canary Analysis Service

  Štěpán Davidovič with Betsy Beyer

Automated canarying quickens development, improves production safety, and helps prevent outages.

Google has deployed a shared centralized service called CAS (Canary Analysis Service) that offers automatic (and often autoconfigured) analysis of key metrics during a production change. CAS is used to analyze new versions of binaries, configuration changes, data-set changes, and other production changes. CAS evaluates hundreds of thousands of production changes every day at Google.

Web Services


 



