Vol. 16 No. 1 – January-February 2018

Designing Cluster Schedulers for Internet-Scale Services

Embracing failures for improving availability

Engineers building scheduling systems should consider every failure mode of the underlying infrastructure they use, and how operators of those schedulers can configure remediation strategies. The scheduler should also help keep tenant systems as stable as possible while the owners of those systems troubleshoot.

by Diptanu Gon Choudhury, Timothy Perrett

Manual Work is a Bug

A.B.A: always be automating

Every IT team should have a culture of constant improvement, or movement along the path toward the goal of automating whatever the team feels confident in automating, in ways that are easy to change as conditions change. As the needle moves to the right, team members learn from each other's experiences, and the system becomes easier to create and safer to operate. A good team has a structure in place that makes the process frictionless and collaborative.

by Thomas A. Limoncelli

Canary Analysis Service

Automated canarying quickens development, improves production safety, and helps prevent outages.

It is unreasonable to expect engineers working on product development or reliability to have statistical knowledge; removing this hurdle led to widespread CAS adoption. CAS has proven useful even for basic cases that don't need configuration, and has significantly improved Google's rollout reliability. Impact analysis shows that CAS has likely prevented hundreds of postmortem-worthy outages, and the rate of postmortems among groups that do not use CAS is noticeably higher.

by Štěpán Davidovič, Betsy Beyer