Check out Pat's
Scattered Thoughts on Distributed Systems

pathelland.substack.com

Escaping the Singularity

November 18, 2019
Volume 17, issue 5

Download PDF version of this article PDF

Escaping the Singularity
It's Not Your Grandmother's Database Anymore!

Space Time Discontinuum

Combining data from many sources may cause painful delays.

Pat Helland

Increasingly, calculations are based on timed events originating at many sources. By comparing, contrasting, joining, and noodling over these inputs, you can derive some interesting results. If the inputs to these calculations come from disparate computers (or sensors), you can't always be sure of how quickly the information will propagate. Hence, you can't be sure when you'll get an answer to the requested calculation.

If you can't promise when you'll get the answer, what're ya gonna do? You can wait to get the perfect answer, or you can give an imperfect answer more promptly by basing it on partial knowledge.

How can you meet your SLAs (service-level agreements)? Sometimes, it's not an easy and seamless continuum of options but more of a discontinuum.

Events and Time

In many systems, events come with a timestamp. These may come from temperature sensors, motion detectors, factory floors, your cable box, your security system, automated tollbooths on the freeway, and much more. One source of information that is increasingly important is the monitoring of events in data centers. These events may be used by an automated management system or by humans in complex error spelunking. It is common to try to see which events happened close in time to another particular event. Patterns in time proximity are essential to controlling these complex environments.

Events and Space

Now, all this is cool except when the information you want is "over there." In a distributed system, if the stuff is not "here," it's "over there." When something is far away and over there, it may take a loooonnng time to get information in and out. It can be kinda' like when the only road up the canyon gets washed out in a flood. I refer to remote nodes as being "over yonder" whether there're in the same rack in the data center or thousands of miles away.

Space, Distance, and Traffic Jams

If it's over there with an open queuing network, it will usually get here in a timely fashion. Almost all modern networks are open queuing networks. In such a network, there's no practical limit to the amount of stuff that may try to share the network. Highway 101 through San Francisco is an open queuing network. It works great until it doesn't.

Right now, I'm writing this on an airplane. I'm glad the modern fly-by-wire systems use closed queuing networks. This type of network carefully manages the messages allowed in. By ensuring the network never ingests more work, it can ensure a bounded time for the network to digest the work it already has. Assuming no hardware faults (or a bounded number of hardware faults), the closed queuing network can do what is requested within a specified period of time. I don't want a traffic jam on the fly-by-wire system controlling the flaps on the plane. I'm grateful for a closed queuing network.

In data centers, on the other hand, you don't see closed queuing networks. Instead, you see a freeway that usually has a lot of lanes. It mostly works OK.

Reviewing Our Queuing

Delays in propagating events are not just caused by the network. It's common for events to get pushed into some queuing system as they are published. The published events are then consumed by one or more subscribers. These subscribers typically enqueue the event in yet another queue (ad nauseam) as the event rattles its way through the data center or Internet trying to seek its destination like a fly buzzing around a light on the back porch.

Misconstruing Our Queuing

Each of these queuing systems presents a whole new opportunity for delay. It's almost as if the event gets frozen in time and, hopefully, gets defrosted within the desired time window. Each stage of these queuing systems is very likely to propagate the event quickly. Each time you add a "very likely to propagate the event quickly," you also add a "sometimes it will be slow."

Not only can queuing systems be slow, they rarely have end-to-end guarantees. Unless the final point of consumption is coordinated with the originating sender, the event will sometimes go kablooey and never arrive.

Service Levels

Service levels are a key component of engineering a system for use by a customer. These may deal with system performance, system availability, or quality of the returned results:^4,5

• SLA (service-level agreement). This is a contract with the customer. It may list financial compensation if you don't meet the numbers. It may be an understanding with your boss that you'll miss part of your bonus if things go awry.

• SLO (service-level objective). This is your internal target for the metric in question. You always want your SLO to be better than your SLA (or you're not trying to please your customer).

• SLI (service-level indicator). This is what you actually measure against the metric. Hopefully, your SLI is better than your SLO, which is better than your SLA.

To ensure that many different sources of events are combined, you should leave some room in your SLA to allow for slowness in getting the source data from all those sources. The more sources, the more likely you will blow your SLA as one or more of them blow their SLAs.

Managing Tail Latencies in Idempotent Operations

One of my favorite papers is "The Tail at Scale," by Jeff Dean and Luis André Barroso.¹ In this paper, the authors outline how a service requiring a whole bunch of inputs can use timeout-based retries to dramatically improve the probability of getting all the inputs needed to meet the desired 99.9 percent SLA for response. In this case, sufficient replicas exist to get the answer without loss of quality, simply by retrying and consuming more resources. Key to this is that it is OK to retry the laggard requests because they are idempotent. It does not cause harm to do them two or more times.

Knowing You Can't Know

The more complex the set of inputs, the more likely you won't see everything in a timely fashion. The more complex the store-and-forward queuing, the more likely stuff will arrive too late or not at all. The more distant the sources of your inputs, the more challenges you may have.

As we've seen, sometimes it can be effective to retry the request for input. In particular, in some systems, retrying can ensure that all the inputs are available quickly.

In other systems, the inputs are not simply fetched but are rattling their way through queues similar to Highway 101 through San Francisco. In these environments, the processing probably has to simply cut off with what it has and do the best it can. This means you can't guarantee the stuff is ready when you want it.

So, if you know you can only probably know, what's the plan?

Approximating Queries

There's some fun new work describing analytics with approximate answers. By expressing sampling operators, some systems can provide really good answers based on a small subset of all the inputs one would normally examine.^2,3 In the cited systems, there's more focus on sampling for performance when everything is working and timely. Still, it's quite similar to what you would do to build systems that return answers based on what's available in a timely fashion.

Returning Partial Answers

Many systems work to give some answer in a timely fashion even if they're wounded. Many sophisticated websites will dribble out partial answers to the browser as they become available: The text for a product description may arrive before the product image; the product image may arrive before the user reviews. This decoupling yields a faster overall result and, in general, a more satisfied user.

When dealing with answers relying on data from a distance, it's important to consider how to decouple results and, where possible, return a degraded answer quickly when and if that best meets the needs of the business. What does it take to keep on going?

As the Black Knight from Monty Python says, "'Tis but a scratch" (https://www.youtube.com/watch?v=ZmInkxbvlCs; thank you to Peter Vosshall for raising the Black Knight in discussion when we were both younger many years ago).

The Disconcerting Discontinuum

Back when you had only one database for an application to worry about, you didn't have to think about partial results. You also didn't have to think about data arriving after some other data. It was all simply there.

Now, you can do so much more with big distributed systems, but you have to be more sophisticated in the tradeoff between timely answers and complete answers. The best systems will adapt and interpret their problems as, "‘Tis but a scratch!"

References

1. Dean, J., Barroso, L. A. 2013. The tail at scale. Communications of the ACM 56(2), 74-80; https://dl.acm.org/citation.cfm?id=2408794.

2. Hall, A., Tudorica, A., Buruiana, F., Hofmann, R., Ganceanu, S., Hofmann, T. 2016. Trading off accuracy for speed in PowerDrill. International Conference on Data Engineering; https://ai.google/research/pubs/pub45682/.

3. Kandula, S., Lee, K., Chaudhuri, S., Friedman, M. 2019. Experiences with approximating queries in Microsoft's production big-data clusters. Proceedings of the VLDB Endowment 12(2), 2131-2142; https://www.microsoft.com/en-us/research/publication/experiences-with-approximating-queries-in-microsofts-production-big-data-clusters/.

4. Moguls, J. C., Isaacs, R., Welch, B. 2017. Thinking about availability in large service infrastructures. Proceedings of the 16^th Workshop on Hot Topics in Operating Systems (HotOS'17), 12-17; https://dl.acm.org/citation.cfm?id=3102980.3102983.

5. Moguls, J. C., Wilkes, J. 2019. Nines are not enough: meaningful metrics for clouds. Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS'19), 136-141; https://dl.acm.org/citation.cfm?id=3321432.

The Calculus of Service Availability
You're only as available as the sum of your dependencies.
Ben Treynor, Mike Dahlin, Vivek Rau, Betsy Beyer
https://queue.acm.org/detail.cfm?id=3096459

Toward Higher Precision
An introduction to PTP and its significance to NTP practitioners
Rick Ratzel and Rodney Greenstreet
https://queue.acm.org/detail.cfm?id=2354406

A Lesson in Resource Management
Waste not memory, want not memory—unless it doesn't matter
Kode Vicious
https://queue.acm.org/detail.cfm?id=2523428

Pat Helland has been implementing transaction systems, databases, application platforms, distributed systems, fault-tolerant systems, and messaging systems since 1978. For recreation, he occasionally writes technical papers. He currently works at Salesforce.

Originally published in Queue vol. 17, no. 5—
Comment on this article in the ACM Digital Library