




Originally published in Queue vol. 12, no. 7



Aleksander Kuzmanovic - Net Neutrality: Unexpected Solution to Blockchain Scaling
Cloud-delivery networks could dramatically improve blockchains' scalability, but clouds must be provably neutral first.

Jim Waldo - A Hitchhiker's Guide to the Blockchain Universe
Blockchain remains a mystery, despite its growing acceptance.

Yonatan Sompolinsky, Aviv Zohar - Bitcoin's Underlying Incentives
The unseen economic forces that govern the Bitcoin protocol

Antony Alappatt - Network Applications Are Interactive
The network era requires new models, with interactions instead of algorithms.


Comments (newest first)

Greg Weiss | Wed, 21 Oct 2015 15:23:14 UTC

Having designed realtime distributed conference-calling systems in the early 2000s, based on SIP softswitches running in datacenters and engineered to replicate telco-grade reliability (at the application level we attempted, but never honestly achieved, four nines on commodity hardware running N+N machines, maintaining state in SQL databases replicated across two datacenters), I have experienced a number of these failure scenarios: the should-never-go-down dual-loop fiber WAN link going down; datacenter-wide power outages due to generator failure and UPS failure combined with mistakes in datacenter power-circuit engineering; slow systems (sometimes) and slow network connectivity (hardware issues where a switch didn't go down but just hung, absorbing traffic without response) triggering failure detection and failover processes that led to dual-primary split brains; DNS hangs and failures affecting one or both sites; NTP failures causing bad dates at one site, with unintended consequences; and NTP failures and slow clock skew arising across sites. The rules and practices around designing a network, and an application, to avoid and deal with these things are a little like the fire code: they accumulate over time based on the painful experience of disaster.

I will make a few observations.

1) If one doesn't personally receive alerts whenever failover is triggered, one doesn't necessarily recognize intuitively how frequent these partition events are; resiliency is otherwise a non-event. If you silently and successfully fail over every two weeks without knowing it, your notion of how reliable your system is may cloak real problems. And if one doesn't appropriately log and analyze failover-trigger events over time, one doesn't have good perspective. To learn from mistakes, one might be wise to adopt the telco-industry practice of writing a Reason For Outage document for every failure, the corpus of which can be reviewed periodically.

2) If one doesn't have some post-event integrity checking, one doesn't necessarily recognize the consequences of a failover. Ironically, the more resilient one makes a system, the less obvious it can be that some integrity issue remains once the most critical scenarios have been mitigated.

3) If you haven't taken the time and expense to inject hard failures and hanging behavior into each component of your system, you are almost certainly not going to achieve reliability. In practice, though, one often doesn't have the luxury of testing failures of network-wide infrastructure such as colo-wide or site-wide switches, and for certain devices it is hard to simulate the "hanging" failure case, which is different from the "down" failure case.

4) Slow-systems-triggering-failover events caused by a busy database or garbage collection can often be papered over, but not fully resolved, by overprovisioning or tuning, and will tend to crop up later in other usage scenarios.

5) Ultimately it is worth trying to rethink one's design to avoid as much network communication as possible.
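The distinction in observation 3 between a "hung" dependency and a "down" one can be sketched with plain TCP sockets. This is a minimal illustration, not anything from the original comment: the port numbers and helper names are hypothetical. A hung server accepts connections and then goes silent, so a health probe must use a read timeout to tell the two failure modes apart; a probe without one would block forever on the hung case.

```python
import socket
import threading

def hung_server(port):
    """Simulate a 'hung' dependency: accept TCP connections,
    then never respond (like a switch absorbing traffic)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(5)

    def loop():
        while True:
            conn, _ = srv.accept()  # accept the connection, then go silent

    threading.Thread(target=loop, daemon=True).start()
    return srv

def probe(port, timeout=0.5):
    """Health probe that distinguishes 'down' (connection refused)
    from 'hung' (connection succeeds but the read times out)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        try:
            s.connect(("127.0.0.1", port))
        except (ConnectionRefusedError, socket.timeout):
            return "down"
        try:
            s.sendall(b"PING\n")
            s.recv(16)          # a healthy server would answer here
            return "up"
        except socket.timeout:
            return "hung"
    finally:
        s.close()
```

Probing a `hung_server` yields "hung", while probing a port with no listener yields "down"; a failure detector that treats both the same way is exactly what turns a hung switch into a dual-primary split brain.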


© 2019 ACM, Inc. All Rights Reserved.