Designing Cluster Schedulers for Internet-Scale Services:
Embracing failures for improving availability
Engineers looking to build scheduling systems should consider all failure modes of the underlying infrastructure they use and consider how operators of scheduling systems can configure remediation strategies, while aiding in keeping tenant systems as stable as possible during periods of troubleshooting by the owners of the tenant systems.
Manual Work is a Bug:
A.B.A: always be automating
Every IT team should have a culture of constant improvement - or movement along the path toward the goal of automating whatever the team feels confident in automating, in ways that are easy to change as conditions change. As the needle moves to the right, the team learns from each other’s experiences, and the system becomes easier to create and safer to operate. A good team has a structure in place that makes the process frictionless and collaborative
Canary Analysis Service:
Automated canarying quickens development, improves production safety, and helps prevent outages.
It is unreasonable to expect engineers working on product development or reliability to have statistical knowledge; removing this hurdle led to widespread CAS adoption. CAS has proven useful even for basic cases that don’t need configuration, and has significantly improved Google’s rollout reliability. Impact analysis shows that CAS has likely prevented hundreds of postmortem-worthy outages, and the rate of postmortems among groups that do not use CAS is noticeably higher.
Thou Shalt Not Depend on Me:
A look at JavaScript libraries in the wild
Most websites use JavaScript libraries, and many of them are known to be vulnerable. Understanding the scope of the problem, and the many unexpected ways that libraries are included, are only the first steps toward improving the situation. The goal here is that the information included in this article will help inform better tooling, development practices, and educational efforts for the community.
How to Come up with Great Ideas:
Think like an entrepreneur.
No matter what your profession, learning to think more innovatively and spark new ideas can help you. I have included some points and inspiration that have helped me, but the real key is changing your behavior and taking action.
Watchdogs vs. Snowflakes:
Taking wild-ass guesses
That a system can randomly jam doesn’t just indicate a serious bug in the system; it is also a major source of risk. You don’t say what your distributed job-control system controls, but let’s just say I hope it’s not something with significant, real-world side effects, like a power station, jet aircraft, or financial trading system. The risk, of course, is that the system will jam, not when it’s convenient for someone to add a dummy job to clear the jam, but during some operation that could cause data loss or return incorrect results. I rather suspect that having a system like this jam while coordinating, for example, the balancing of electrical power across a power grid would have spectacular and perhaps fatal results.
Prediction-Serving Systems:
What happens when we wish to actually deploy a machine learning model to production?
This installment of Research for Practice features a curated selection from Dan Crankshaw and Joey Gonzalez, who provide an overview of machine learning serving systems. What happens when we wish to actually deploy a machine learning model to production, and how do we serve predictions with high accuracy and high computational efficiency? Dan and Joey’s selection provides a thoughtful selection of cutting-edge techniques spanning database-level integration, video processing, and prediction middleware. Given the explosion of interest in machine learning and its increasing impact on seemingly every application vertical, it’s possible that systems such as these will become as commonplace as relational databases are today