Resilience Engineering: Learning to Embrace Failure

A discussion with Jesse Robbins, Kripa Krishnan, John Allspaw, and Tom Limoncelli

GAMEDAY EXERCISES CASE STUDY

It’s very nearly the holiday shopping season and something is very wrong at a data center handling transactions for one of the largest online retail operations in the country. Some systems have failed, and no one knows why. Stress levels are off the charts while teams of engineers work around the clock for three days trying to recover.

http://queue.acm.org/detail.cfm?id=2371297

 

Related:

Scale Failure

Automating Software Failure Reporting

Improving Performance on the Internet

 

Disks from the Perspective of a File System

Disks lie. And the controllers that run them are partners in crime.

MARSHALL KIRK MCKUSICK

Most applications do not deal with disks directly, instead storing their data in files in a file system, which protects us from those scoundrel disks. After all, a key task of the file system is to ensure that the file system can always be recovered to a consistent state after an unplanned system crash (for example, a power failure). While a good file system will be able to beat the disks into submission, the required effort can be great and the reduced performance annoying. This article examines the shortcuts that disks take and the hoops that file systems must jump through to get the desired reliability.

http://queue.acm.org/detail.cfm?id=2367378

 

Related:

Building Systems to Be Shared, Securely

The Five-Minute Rule 20 Years Later: and How Flash Memory Changes the Rules

GFS: Evolution on Fast-forward