The Morning Paper

Sort By:

Back under a SQL Umbrella:
Unifying serving and analytical data; using a database for distributed machine learning

Procella is the latest in a long line of data processing systems at Google. What’s unique about it is that it’s a single store handling reporting, embedded statistics, time series, and ad-hoc analysis workloads under one roof. It’s SQL on top, cloud-native underneath, and it’s serving billions of queries per day over tens of petabytes of data. There’s one big data use case that Procella isn’t handling today though, and that’s machine learning. But in ’Declarative recursive computation on an RDBMS... or, why you should use a database for distributed machine learning,’ Jankov et al.

November 6, 2019

Topic: Databases


GAN Dissection and Datacenter RPCs:
Visualizing and understanding generative adversarial networks; datacenter RPCs can be general and fast.

Image generation using GANs (generative adversarial networks) has made astonishing progress over the past few years. While staring in wonder at some of the incredible images, it’s natural to ask how such feats are possible. "GAN Dissection: Visualizing and Understanding Generative Adversarial Networks" gives us a look under the hood to see what kinds of things are being learned by GAN units, and how manipulating those units can affect the generated images. February saw the 16th edition of the Usenix Symposium on Networked Systems Design and Implementation. Kalia et al. blew me away with their work on fast RPCs (remote procedure calls) in the datacenter.

May 2, 2019

Topic: Networks


How Do Committees Invent? and Ironies of Automation:
The formulation of Conway’s law and the counterintuitive consequences of increasing levels of automation

The Lindy effect tells us that if a paper has been highly relevant for a long time, it’s likely to continue being so for a long time to come as well. My first choice is "How Do Committees Invent?" Author Melvin E. Conway provides a lot of great material that led up to the formulation of the law that bears his name. My second choice is Lisanne Bainbridge’s "Ironies of Automation." It’s a classic treatise on the counterintuitive consequences of increasing levels of automation.

April 15, 2020

Topic: Development


Putting Machine Learning into Production Systems:
Data validation and software engineering for machine learning

Breck et al. share details of the pipelines used at Google to validate petabytes of production data every day. With so many moving parts it’s important to be able to detect and investigate changes in data distributions before they can impact model performance. "Software Engineering for Machine Learning: A Case Study" shares lessons learned at Microsoft as machine learning started to pervade more and more of the company’s systems, moving from specialized machine-learning products to simply being an integral part of many products and services.

October 7, 2019

Topic: AI


SageDB and NetAccel:
Learned models within the database system; network-accelerated query processing

The CIDR (Conference on Innovative Data Systems Research) runs once every two years, and luckily for us 2019 is one of those years. I’ve selected two papers from this year’s conference that highlight bold and exciting directions for data systems.

February 28, 2019

Topic: Development


The Way We Think About Data:
Human inspection of black-box ML models; reclaiming ownership of data

The two papers I’ve chosen for this issue of acmqueue both challenge the way we think about and use data, though in very different ways. In "Stop Explaining Black-box Machine-learning Models for High-stakes Decisions and Use Interpretable Models Instead," Cynthia Rudin makes the case for models that can be inspected and interpreted by human experts. The second paper, "Local-first Software: You Own Your Data, in Spite of the Cloud," describes how to retain sovereignty over your data.

February 18, 2020

Topic: Data


Time Protection in Operating Systems and Speaker Legitimacy Detection:
Operating system-based protection from timing-based side-channel attacks; implications of voice-imitation software

Timing-based side-channel attacks are a particularly tricky class of attacks to deal with because the very thing you’re often striving for can give you away. There are always more creative new instances of attacks to be found, so you need a principled way of thinking about defenses that address the class, not just a particular instantiation. That’s what Ge et al. give us in "Time Protection, the Missing OS Abstraction." Just as operating systems prevent spatial inference through memory protection, so future operating systems will need to prevent temporal inference through time protection. It’s going to be a long road to get there.

July 9, 2019

Topic: Security