
Failure and Recovery



Originally published in Queue vol. 15, no. 5
see this item in the ACM Digital Library



Pat Helland, Simon Weaver, Ed Harris - Too Big NOT to Fail
Embrace failure so it doesn't embrace you.

Steve Chessin - Injecting Errors for Fun and Profit
Error-detection and correction features are only as good as our ability to test them.

Michael W. Shapiro - Self-Healing in Modern Operating Systems
A few early steps show there's a long (and bumpy) road ahead.

Paul P. Maglio, Eser Kandogan - Error Messages
Computer users spend a lot of time chasing down errors - following the trail of clues that starts with an error message and that sometimes leads to a solution and sometimes to frustration. Problems with error messages are particularly acute for system administrators (sysadmins) - those who configure, install, manage, and maintain the computational infrastructure of the modern world - as they spend a lot of effort to keep computers running amid errors and failures.


(newest first)

Tyler Neely | Tue, 13 Mar 2018 16:45:17 UTC

One thing I've been saying a lot recently, and am only getting more fanatical about: using raw sockets today is like using strcpy in the days after "Smashing the Stack for Fun and Profit". The success of systems like Jepsen, Namazu, etc. is horrifying because of how few bugs they find per CPU cycle spent. We are building things on our laptops in ways that are fundamentally untestable on our laptops.

Simulation lets you find race conditions in a few seconds on your laptop that would take Jepsen days, weeks, or forever to stumble on. We can run our algorithms in accelerated time by manipulating clocks, without needing to bring resource-intensive I/O into the mix.

The catch is that you have to design your implementation from the beginning (or face an expensive redesign) to support "steppability".

If you can design your system so that it implements this two-method interface, it is quite easy to test by plugging a few instances into a fake network simulator:

receive(msg, from_peer, time) -> [(to_peer, msg)]
tick(time) -> [(to_peer, msg)]

Then define a set of input messages for particular participants in the system, and use QuickCheck et al. to generate "weather" events that slow or block communication between sets of peers at specific, omnisciently chosen time intervals. We can be exhaustive if our state machines are simple, but even QuickCheck will give us great results. For each outbound message, deterministically assign a pseudorandom arrival time, or "never". Stick these in a priority queue and iterate through it, along with periodic calls to "tick" for algorithms with periodic maintenance tasks, such as Raft's leader election.
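A minimal sketch of such a simulator in Python, under stated assumptions: the names `Steppable`, `PingNode`, and `simulate`, along with the delay range, drop rate, and tick period, are hypothetical choices for illustration and do not come from any particular library. All network "weather" derives from one seed, so a failing run can be replayed exactly.

```python
import heapq
import random
from typing import Dict, List, Protocol, Tuple

Outbound = List[Tuple[str, str]]  # list of (to_peer, msg)


class Steppable(Protocol):
    """The two-method interface described above."""

    def receive(self, msg: str, from_peer: str, time: float) -> Outbound: ...
    def tick(self, time: float) -> Outbound: ...


class PingNode:
    """Toy participant (hypothetical): pings peers on tick, answers pings."""

    def __init__(self, peers: List[str]) -> None:
        self.peers = peers
        self.pongs_seen = 0

    def receive(self, msg: str, from_peer: str, time: float) -> Outbound:
        if msg == "ping":
            return [(from_peer, "pong")]
        if msg == "pong":
            self.pongs_seen += 1
        return []

    def tick(self, time: float) -> Outbound:
        return [(peer, "ping") for peer in self.peers]


def simulate(nodes: Dict[str, Steppable], seed: int, end_time: float = 100.0,
             tick_every: float = 10.0, drop_rate: float = 0.2):
    """Run the nodes on a fake network and return the delivery trace."""
    rng = random.Random(seed)      # every delay and drop comes from this seed
    queue, seq, trace = [], 0, []  # heap entries: (time, seq, kind, to, frm, msg)

    def push(time, kind, to, frm, msg):
        nonlocal seq
        heapq.heappush(queue, (time, seq, kind, to, frm, msg))
        seq += 1                   # unique tie-breaker keeps ordering deterministic

    def send(now, frm, outbound):
        for to, msg in outbound:
            if to not in nodes or rng.random() < drop_rate:
                continue           # an arrival time of "never"
            push(now + rng.uniform(0.1, 5.0), "msg", to, frm, msg)

    for name in sorted(nodes):     # schedule the first tick for every node
        push(tick_every, "tick", name, "", "")

    while queue:
        now, _, kind, to, frm, msg = heapq.heappop(queue)
        if now > end_time:
            break                  # simulated time is over; no real sleeping
        if kind == "tick":
            out = nodes[to].tick(now)
            push(now + tick_every, "tick", to, "", "")
        else:
            out = nodes[to].receive(msg, frm, now)
            trace.append((now, frm, to, msg))
        send(now, to, out)
    return trace
```

Running `simulate` twice with the same seed and fresh nodes yields the same trace, which is what makes a property-testing tool useful here: it only has to search over seeds and fault schedules, and any failure it finds is reproducible.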

Provide a set of invariants to be verified after each call to receive or tick: for example, that there are never two leaders capable of influencing a majority of a cluster at any instant.
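One way to express such invariants is sketched below in Python. Everything here is an assumption made for illustration: `NodeState`, `InvariantViolation`, and `check_all` are hypothetical names, and "at most one leader per term" is a deliberately simplified stand-in for the stronger "no two leaders able to influence a majority" property.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List


class InvariantViolation(Exception):
    """Raised when the simulated cluster reaches a forbidden state."""


@dataclass
class NodeState:
    """Hypothetical snapshot of the fields an invariant needs to inspect."""
    role: str  # "leader", "follower", or "candidate"
    term: int


Invariant = Callable[[Dict[str, NodeState]], None]


def at_most_one_leader_per_term(cluster: Dict[str, NodeState]) -> None:
    # Simplified check: count the leaders in each term and reject duplicates.
    leaders = Counter(s.term for s in cluster.values() if s.role == "leader")
    for term, count in leaders.items():
        if count > 1:
            raise InvariantViolation(f"{count} leaders in term {term}")


def check_all(cluster: Dict[str, NodeState], invariants: List[Invariant],
              step: int) -> None:
    """Call after every receive/tick so a failure names the offending step."""
    for invariant in invariants:
        try:
            invariant(cluster)
        except InvariantViolation as err:
            raise InvariantViolation(
                f"step {step}: {invariant.__name__}: {err}") from err
```

Because the check runs after every single step, the first violation points at the exact message delivery or tick that caused it, instead of surfacing much later as corrupted state.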

This is extremely powerful, and then we can just plug the same implementation into a high-performance I/O layer for production. It is a write-once test: you just provide invariants for new implementations. This lets us test real implementations with techniques that approach or match the confidence we can get from model-checking tools like TLA+, provided the number of possible states is not cost-prohibitive to evaluate (the same goes for TLA+, though).
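To illustrate the "same implementation, different I/O layer" point, here is a rough Python sketch of the glue a production deployment might use. `Driver` and `EchoNode` are hypothetical names; the `send` callable is assumed to wrap whatever transport is in use, and `clock` defaults to the standard library's `time.monotonic`. The node itself never touches a socket or a real clock.

```python
import time
from typing import Callable, List, Tuple

Outbound = List[Tuple[str, str]]  # list of (to_peer, msg)


class EchoNode:
    """Toy steppable node (hypothetical): echoes messages, heartbeats on tick."""

    def __init__(self) -> None:
        self.known_peers: List[str] = []

    def receive(self, msg: str, from_peer: str, time: float) -> Outbound:
        if from_peer not in self.known_peers:
            self.known_peers.append(from_peer)
        return [(from_peer, f"echo:{msg}")]

    def tick(self, time: float) -> Outbound:
        return [(peer, "heartbeat") for peer in self.known_peers]


class Driver:
    """Feeds real events into a steppable node and ships out what it returns."""

    def __init__(self, node, send: Callable[[str, str], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.node, self.send, self.clock = node, send, clock

    def on_message(self, msg: str, from_peer: str) -> None:
        # Called by the transport whenever a message arrives.
        self._flush(self.node.receive(msg, from_peer, self.clock()))

    def on_timer(self) -> None:
        # Called by a periodic timer in the production event loop.
        self._flush(self.node.tick(self.clock()))

    def _flush(self, outbound: Outbound) -> None:
        for to_peer, msg in outbound:
            self.send(to_peer, msg)
```

The simulator and this driver are interchangeable callers of the same two methods, so whatever was verified under simulation is the code that runs in production; only the source of time and the delivery of messages differ.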


© 2018 ACM, Inc. All Rights Reserved.