

Originally published in Queue vol. 12, no. 7



Yonatan Sompolinsky, Aviv Zohar - Bitcoin's Underlying Incentives
The unseen economic forces that govern the Bitcoin protocol

Antony Alappatt - Network Applications Are Interactive
The network era requires new models, with interactions instead of algorithms.

Jacob Loveless - Cache Me If You Can
Building a decentralized web-delivery model

Theo Schlossnagle - Time, but Faster
A computing adventure about time through the looking glass


Comments (newest first)

Greg Weiss | Wed, 21 Oct 2015 15:23:14 UTC

In the early 2000s I designed real-time distributed conference-calling systems based on SIP softswitches running in datacenters, with engineering attempting to replicate telco-grade reliability. At the application level we attempted, but never honestly achieved, four nines on commodity hardware running on N+N machines, maintaining state in replicated SQL databases across two datacenters. I have experienced a number of these failure scenarios: the should-never-go-down dual-loop fiber WAN link going down; datacenter-wide power outages due to generator failure and UPS failure combined with datacenter power-circuit engineering mistakes; slow systems (sometimes) and slow network connectivity (hardware issues where a switch didn't go down but just hung, absorbing traffic without response) triggering failure detection and failover processes that led to dual-primary split brains; DNS hangs/failures affecting one or both sites; NTP failures causing bad dates at one site, with unintended consequences; NTP failures and slow clock skew arising across sites; and so on. The rules and practices around designing a network, and an application, to avoid and deal with these things are a little like the fire code: they accumulate over time based on the painful experience of disaster.

I will make a few observations.

1) If one doesn't personally receive alerts whenever failover triggering happens, one doesn't necessarily develop an intuition for how frequent these partition events are; resiliency is otherwise a non-event. If you silently and successfully fail over every two weeks without knowing it, your notion of how reliable your system is may cloak real problems. And if one doesn't appropriately log and analyze failover-trigger events over time, one doesn't have good perspective. To learn from mistakes, one might be wise to adopt the telco-industry practice of producing a Reason For Outage document for every failure, the corpus of which can be periodically reviewed.

2) If one doesn't have some post-event integrity checking, one doesn't necessarily recognize the consequences of a failover. Ironically, the more resilient one makes a system, the less obvious it can be that some integrity issue remains once the application handles the most critical scenarios gracefully.

3) If you haven't taken the time and expense to inject hard failures and hanging behavior into each component of your system, you are almost certainly not going to get reliability. In practice, though, one often doesn't have the luxury of testing failures of network-wide infrastructure such as colo-wide or site-wide switches, and for certain devices it is hard to simulate the "hanging" failure case, which is different from the "down" failure case (a minimal sketch of that distinction follows this list).

4) Slow-systems-triggering-failover events due to the database or garbage collection being busy can often be papered over, but not fully resolved, by overprovisioning or tuning, and they will tend to crop up later in other usage scenarios.

5) Ultimately it's worth trying to rethink one's design to avoid as much network communication as possible.
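To make the "hanging" versus "down" distinction in observation 3 concrete, here is a minimal sketch in Python. It is not from the comment author, and the port numbers and helper names are illustrative only: a "down" dependency refuses connections and fails fast, while a "hanging" one accepts the connection and then goes silent, so only a client-side deadline turns the silence into a detectable failure.

```python
# fault_sim.py -- toy illustration of "down" vs. "hanging" dependencies.
# Hypothetical ports and names; a sketch, not a production fault-injection tool.
import socket
import threading
import time

HANG_PORT = 9009   # a listener that accepts connections but never replies
DOWN_PORT = 9010   # nothing listens here, so connections are refused

_open_conns = []   # keep accepted sockets alive so they stay silently open

def hanging_server():
    """Accept connections and then go silent, like a switch that hangs
    and absorbs traffic without response."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", HANG_PORT))
    srv.listen(5)
    while True:
        conn, _ = srv.accept()
        _open_conns.append(conn)   # never read or write; the client sees only silence

def probe(port, timeout):
    """Try to talk to a dependency and report how the failure shows up."""
    start = time.time()
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout) as s:
            s.sendall(b"ping\n")
            s.recv(1)              # a healthy peer would answer; these won't
            return "got a reply"
    except ConnectionRefusedError:
        return f"refused after {time.time() - start:.2f}s -- the 'down' case fails fast"
    except socket.timeout:
        return f"timed out after {time.time() - start:.2f}s -- the 'hanging' case needs a deadline"

if __name__ == "__main__":
    threading.Thread(target=hanging_server, daemon=True).start()
    time.sleep(0.2)  # let the listener come up
    print("down:", probe(DOWN_PORT, timeout=2.0))
    print("hang:", probe(HANG_PORT, timeout=2.0))
    # With no timeout at all, the probe of HANG_PORT would block indefinitely,
    # which is why failure detectors must treat silence, not just errors,
    # as a failure signal.
```

The point of the sketch is that the two failures surface completely differently at the client: the refused connection produces an immediate error, while the hang produces nothing at all until a timeout fires, which is exactly the case that is hard to reproduce when testing only with powered-off or unplugged devices.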


© 2018 ACM, Inc. All Rights Reserved.