March 29, 2022
Volume 20, issue 1

Download PDF version of this article PDF

Distributed Latency Profiling through Critical Path Tracing

CPT can provide actionable and precise latency analysis.

Brian Eaton, Jeff Stewart, Jon Tedesco, and N. Cihan Tas

For complex distributed systems that include services that constantly evolve in functionality and data, keeping overall latency to a minimum is a challenging task. Critical path tracing (CPT) is a new applied mechanism for gathering critical path latency profiles in large-scale distributed applications. It is currently enabled in hundreds of Google services, providing valuable day-to-day data for latency analysis.

Fast turnaround time is an essential feature for an online service. For services that are constantly supporting increasing data and functionality, keeping overall latency to a minimum is an ongoing issue. Determining the root causes of high latency in a distributed system is challenging. The goal is to answer a key optimization question: Given a distributed system and workload, which subcomponents can be optimized to reduce latency?

Low latency is an important feature for many Google applications such as Search,⁴ and latency-analysis tools play a critical role in sustaining low latency at scale. The systems change constantly because of code and deployment changes, as well as shifting traffic patterns. Parallel execution is essential, both across service boundaries and within individual services. Different slices of traffic have different latency characteristics. Latency-analysis tools struggle to make sense of the system in practice.

CPT provides detailed and actionable information about which subcomponents of a large and complex distributed system are contributing to overall latency. This article presents results and experiences as observed by using CPT in a particular application, Google Search.

The Need for Critical Path Tracing

The critical path describes the ordered list of steps that directly contribute to the slowest path of request processing through a distributed system, such that optimizing these steps reduces overall latency. Individual services have many subcomponents, and CPT relies on software frameworks¹⁵ to identify which subcomponents are on the critical path. When one service calls another, critical path information is propagated from the callee back to the caller via remote procedure call (RPC) metadata. The caller then merges critical paths from its dependencies into a unified critical path for the entire request.

The unified critical path is logged with other request metadata. Log analysis is used to select requests of interest, and then critical paths from those requests are aggregated to create critical path profiles. The tracing process is efficient, allowing large numbers of requests to be sampled. The resulting profiles give detailed and precise information about the root causes of latency in distributed systems.

The overall goal is latency profiling for distributed systems. How can developers get actionable information about the root causes of latency?

An example system

Consider the distributed system in figure 1, which consists of three services, each of which has two subcomponents. The purpose of the system is to receive a request from the user, perform some processing, and return a response. Arrows show the direction of requests where responses are sent back in the opposite direction.

Distributed Latency Profiling through Critical Path Tracing

The system divides work across many subcomponents. Requests first arrive at Service A, which hands request processing off to subcomponent A1. A1 in turn relies on subcomponents B1 and A2, which have their own dependencies. Some subcomponents can be invoked multiple times during request processing (e.g., A2, B2, and C2 all call into C1).

Even though the system architecture is apparent from the figure, the actual latency characteristics of the system are hard to predict. For example, is A1 able to invoke A2 and B1 in parallel, or is there a data dependency such that the call to B1 must complete before the call to A2 can proceed? How much internal processing does A2 perform before calling into B2? What about after receiving the response from B2? Are any of these requests repeated? What is the latency distribution of each processing step? And how do the answers to all of these questions change depending on the incoming request?

Without good answers to these questions, efforts to improve overall system latency will be poorly targeted and might go to waste. For example, in figure 1, to reduce the overall latency of A1 and its downstream subcomponents, you must know which of these subcomponents actually impact the end-to-end system latency. Before deciding to optimize, you need to know whether 𝐴1 → 𝐴2 actually matters.

Analysis with RPC telemetry

RPC telemetry is commonly used for latency analysis. Services export coarse-grain information about how many times an RPC is made and the latency characteristics of those RPCs. Monitoring services collect this information and create dashboards showing system performance. Coarse-grain slicing (e.g., time range, source, and destination information) is commonly used to enhance RPC telemetry-based analysis.^1,9,14

RPC telemetry works well when there are a few RPC services that are always important for latency. Monitoring for those services can quickly identify which are causing problems. Service owners can be identified, and they can work to improve performance.

RPC telemetry struggles with parallelism, however. Referring to the figure, assume that A2 does CPU-bound work while waiting for responses from B2, C1, and C2. Improving latency for B2, C1, and C2 will not improve overall performance, because A2 is not actually blocked while waiting for their responses.

Repeated RPCs can also make monitoring confusing. Perhaps A2 is making hundreds of requests to C1. RPC telemetry can show the average latency as very fast, but a single slow RPC out of hundreds might be the root cause of slowness across the entire system.⁷ RPC telemetry also cannot tell whether those hundreds of requests are happening in parallel or are serialized. This is important information for understanding how those requests impact latency.

Heterogeneous workloads are also a challenge for RPC telemetry, because data points from different workloads are typically mixed together. Information about low-volume but important requests is lost. Problems with diverse workloads increase as the number of important slicing dimensions grows. If an A/B experiment shows a latency regression, latency should be analyzed for each arm of the experiment to identify the root cause. If a latency problem appears in the 99th percentile (i.e., within the top 1 percent of the requests that take the most time), only those very slow requests should be analyzed.

This type of workload slicing is difficult with RPC telemetry. Most RPC telemetry systems rely on aggregating data along a few dimensions for efficiency reasons. For example, each process might aggregate RPC data by caller, and then data collection further aggregates the data across processes. Collecting data across a large number of dimensions (e.g., separately for every A/B experiment) is usually not feasible. Dimensions of interest for latency analysis also need to be defined in advance and shared across all services. For example, imagine a caller wants to analyze latency based on the client device model. If downstream services are not aware of the device model information, they cannot provide RPC telemetry sliced by that dimension.

RPC telemetry struggles with identifying important subcomponents within services. For example, both A1 and A2 (subcomponents within Service A) make requests into B2. Telemetry for Service A and Service B will typically mix these requests together, even though they might have different latency characteristics. This can make it difficult to tell which requests need to be optimized.

The last major issue with RPC telemetry is the streetlight effect: RPC telemetry sheds light in one particular area of the system, so you spend time optimizing that part of the system. Meanwhile, latency problems not caused by RPCs are lost in the dark.

Analysis with CPU profilers

CPU profiling complements RPC telemetry well. Once RPC telemetry has identified a problematic service, CPU profiling can help figure out how to make that service faster. CPU samples with function call stacks are collected and aggregated, providing insights into expensive code paths. Profiles are typically collected for a single service at a time and might include the type of request in-flight and hardware profile counters.^2,10,13

CPU profiling excels at identifying specific expensive subcomponents within services. In cases where CPU time contributes to overall latency, CPU profiling can help identify where to optimize. Many of the same issues that impact RPC telemetry, however, also cause problems for CPU profiles. Lack of information about parallelism means that you can't tell whether CPU-bound work is happening in parallel with RPCs, in parallel with other CPU-bound work, or is actually blocking request progress. Heterogeneous workloads cause problems as well: Small but important slices of traffic are lost in the noise. Joining CPU profiles with information about parallelism and request metadata from distributed tracing can address these limitations, but deployment of that technology is not widespread.³

The streetlight effect also impacts CPU profiling: It makes it more likely you will focus on code that uses a lot of CPU, even if that code is not contributing to overall system latency.

Analysis with distributed tracing

The last common tool in the latency profiling toolkit is distributed tracing. This approach follows individual requests through a system, collecting timing points and additional data as those requests are processed. Traces are aggregated and analyzed to yield application insights.^5,16,17

Unlike RPC telemetry and CPU profiling, distributed tracing handles parallelism and heterogeneous workloads well. Information about all cross-service requests is collected, including timing points. Visualization shows exactly when work for each service began and ended, and which services were running in parallel versus serial.

Most distributed tracing includes tracing for RPC boundaries by default, but leaves out service subcomponent information. Developers can add tracing for subcomponents as needed.

Workload slicing to find traces for particularly important requests is also possible, although again, developers need to manually tag traces that are important. Distributed tracing even allows automated analysis to identify which services contribute to total latency.^5,18

The major obstacle to using distributed tracing for detailed latency analysis is cost. In Google Search, a single service might have a few or dozens of RPC dependencies but can easily have 100 times that number of important subcomponents. Instrumenting subcomponents by default increases the size of traces by orders of magnitude.

Sample rates for distributed tracing systems are another issue. System operators need to make tradeoffs between the amount of data they collect with distributed tracing and the overhead they impose by doing so. As tracing becomes more detailed, it becomes more expensive. Adaptive sampling techniques^16,17 are sometimes used to increase the sampling rate for interesting requests. If the task is to investigate the 99th percentile latency for a 1 percent experiment, finding a single relevant example requires 10,000 traced requests. As shown later in this article, 10,000 to 100,000 examples might be necessary to gain statistical confidence about where a regression occurred. Together, this means that it might be necessary to gather full traces for 10⁸ to 10⁹ requests, with each trace being 100 times larger than the default expected by the distributed tracing system.

Critical path tracing

CPT is designed to fill some of the gaps in these systems. The term critical path in project management¹² refers to the many interdependent steps that must be completed to finish a project. In this article, critical path describes the ordered list of steps that directly contribute to the slowest path of request processing through a distributed system. Aggregating these traces into critical path profiles can identify latency bottlenecks in the overall system.

This methodology is used extensively today for latency analysis in hundreds of Google services, including Search, and has several benefits.

Software frameworks are used to instrument service subcomponents automatically.¹⁵ Support has been added to most of the commonly used frameworks in Google Search, and developers using those frameworks get fine-grain CPT without additional coding effort. (The impact of automatic instrumentation on overall critical path accuracy is discussed later.)

Framework instrumentation code automatically identifies the critical path for request execution. Only the critical path is retained for analysis; other trace information is discarded. This reduces tracing cost by orders of magnitude.

Critical path traces are logged alongside other request and response metadata, such as A/B experiment information and which features appeared on the Search results page. This allows standard log-analysis techniques to use business criteria to find traces from requests of interest. Centralized logging provides additional cost savings. Other distributed tracing systems log traces separately for each machine involved in a request. Reconstructing a complete request requires joining traces from hundreds of machines. Logging all critical path subcomponents together avoids the overhead of joining the traces.

Together, these cost reductions allow detailed traces with high sampling rates.

Tracing a Request

This section describes the work needed to gather a fine-grain critical path trace for a single request.

Critical path definition

For latency profiling, the key input used is the critical path: the set of steps that blocked progress on producing the response. When multiple steps proceed in parallel, the slowest step is the only one on the critical path.

The execution of the request can be modeled as a directed graph of named nodes (e.g., subcomponent names). Each node in the graph does some of its own computation. Each edge in the graph is a dependency where a node must wait for the completion of one of its dependencies before computation can proceed. The critical path is the longest-duration path through the nodes, starting at the request entry point and finishing at the node that computes the response. The length of the critical path is the total latency for processing the request.

Consider a distributed system such as that in figure 1 and suppose the subcomponents execute in parallel according to scenarios as summarized in figure 2. Figure 2a shows an example critical path calculation where the requests to B1 and A2 happen sequentially: A1 does some computation, then blocks, waiting for B1 to complete. A1 then proceeds with additional computation before waiting for A2 to complete. This example has a critical path of {A1=5 ms, B1=20 ms, A1=8 ms, A2=2 ms}, with a total critical path of 35 ms.

Figure 2b shows how this changes when B1 and A2 execute in parallel. The new critical path becomes {A1=5 ms, B1=20 ms, A1=8 ms}, for a total critical path of 33 ms. A2 has been eliminated from the critical path. In this scenario, optimization efforts should focus on A1 and B1.

Parallel execution also applies when the parent node overlaps with child nodes, as in figure 2c. In this example, A1 sends an RPC to B1 but does not immediately block to wait for a response. Instead, A1 continues with other computations in parallel. For this case, whichever node finishes last to the critical path is assigned: {A1=3 ms, B1=14 ms, A1=10 ms}

Identifying service subcomponents

Infrastructure-first distributed tracing systems typically collect data at RPC boundaries by default and allow developers to add trace annotations for more detailed tracing.^16,17 With CPT, you can take advantage of the standardization provided by software frameworks to collect more detailed information by default.¹⁵

As an example, the Dagger framework encourages authors to write code as ProducerModules. (Dagger is a fully static, compile-time dependency injection framework for Java, Kotlin, and Android. It is an adaptation of an earlier version created by Square and now maintained by Google; https://dagger.dev.) Each ProducerModule declares what inputs it requires, and what outputs it produces. Dagger then coordinates the execution of ProducerModules to process a request. Referring back to figure 1, the following code snippet shows a Dagger implementation of subcomponent A1:

@ProducerModule public abstract class A1ProducerModule { @Produces @A1Output static A1Output runA1(@A2Output a2, @B1Output b1, @B2Output b2) { ... Code reads information from a2, b1, and b2, and calculates A1Output... } }

The collection of ProducerModules creates a graph of subcomponents that the framework executes to process the request. For this example, the framework knows which of A2, B1, and B2 was the last to block execution to produce A1's output. Since the framework is aware of the subcomponent dependencies, it can record the critical path.

For Google Search, subcomponent-level traces are collected from several software frameworks in multiple programming languages. Framework-level implementation is essential for scalability, since it allows relatively small teams of developers to provide detailed critical path traces for code written by thousands of other people. Since each framework instruments and reports the critical path automatically, most developers are not aware that critical path tracing is happening. A few services that are not implemented using frameworks have implemented service-specific CPT as well. Traces from these services are coarser grained. The system degrades gracefully when requests are made to services that do not provide any traces. The size of the blind spot in the critical path is reported correctly and includes which service is responsible for that part of the critical path.

See the "Adoption" section later for metrics showing the overall accuracy of critical path traces for Google Search.

Propagation and merging

Propagating child-node paths to parent nodes allows more detailed views of the request critical path.

In figure 2b, the initial critical path is {A1=5 ms, B1=20 ms, A1=8 ms}. The B1 child node takes up the majority of execution time; however, B1's internal execution details are lacking. To add this detail, B1 first computes its own critical path (e.g., {B1=4 ms, B2=12 ms}) and returns that critical path to A1.

A1 then merges the critical path from B1 into the overall critical path. Note that the total critical path returned from B1 is only 16 ms, but A1 observed 20 ms of execution time in B1. This is normal and could occur as a result of network delays, queuing delays, or simply execution paths that are not instrumented to return critical path information. To account for the discrepancy, A1 adds a 4-ms "unaccounted" node to the critical path. Similarly, each subcomponent adds a "self" node to the critical path for denoting the latency incurred internally and not caused by other child subcomponents. The notation 𝑆1/𝑆2 denotes the latency incurred by 𝑆2 as the child subcomponent of 𝑆1.

The merged critical path for the request is {A1=5 ms, B1/self=4 ms, B1/B2=12 ms, B1/unaccounted=4 ms, A1=8 ms}.

This process happens recursively. If B2 returns a more detailed path, the B2 entries in the overall path also become more detailed.

Propagation and merging occur both within a single service and across RPC service boundaries. The RPC transports support RPC request and response metadata channels for conveying horizontal concerns, such as the trace identifiers in Dapper.¹⁷ Services participating in the critical path protocol use standard fields in the response metadata to propagate their critical path to callers. Framework-level code then merges the critical path from RPCs into the per-service critical path.

Propagation requires error correction. Measurement errors are inevitable in large-scale distributed systems: Networks take time, programs stall, bugs exist or will be added over time, and sometimes you cannot be sure exactly what happened. The goal is to keep the measurement error small enough that you can still draw good conclusions from the latency profiles.

Overcounting is the most obvious problem to detect. It occurs when the sum of the critical path elements exceeds the observed wall time for the request by more than a specified margin of error. This typically indicates a bug in the framework that generated the critical path (e.g., because of reporting multiple operations as having happened in serial when they actually occurred in parallel). Google's implementation handles overcounting by allowing it to be merged into the overall critical path. This allows discovering which services are overcounting so they can be fixed. See "The case of the phantom blind spot" later in this article for an example.

Undercounting is expected. The reported critical path for an RPC will be less than the observed wall time because of serialization costs, network transmission time, and scheduling delays. An "unaccounted" child node reflects the differences.

Blind spots are also a common problem, especially when analyzing new code paths that have only recently showed up on the critical path. Blind spots look like large blocks of time on the critical path that are all attributed to a single subcomponent. One common reason for a blind spot would be making an RPC to a slow service that has not implemented CPT.

Triggering

Depending on the framework implementation, CPT can incur significant overhead. In practice, not every request has to be traced, so sampling is used to amortize the cost. (The tradeoffs between overhead and precision are discussed later in the article.) Sampling requires cross-service coordination to avoid creating unnecessary blind spots in the traces.

Each service can make an independent sampling decision, and then it conveys that decision on outbound RPCs in request metadata. As with Facebook's Mystery Machine,⁵ when downstream services see that the caller has opted in to CPT, the downstream services enable tracing as well. Even if the caller has not opted in to sampling, downstream services are free to track and log their own critical paths. Callers that have not requested sampling will ignore the resulting traces.

Google Search also implemented a consistent sampling mechanism that greatly increases the CPT sampling overlap among multiple services in a distributed fashion. This system is used when the servers are locally deciding if CPT should be enabled for a particular request, and it minimizes blind spots when there are disconnected call flows (such as CPT-capable servers behind CPT-incapable ones). This mechanism uses the Dapper¹⁷ trace identifier as the source of distributed agreement and turns this identifier into multiple sampling buckets. Each service then makes a sampling decision by checking if the requests fall in the corresponding buckets for their sampling strategy.

System operators can opt in to tracing for specific requests, instead of relying on random sampling. This is useful for cases where a human operator needs to collect traces for debugging purposes. When identifying a particularly slow type of request, the operator collects many samples for that request to get data for a profile. Since these requests are not random samples, by default they are excluded from the aggregated analysis, described later.

Limitations of CPT

The critical path as defined here is useful for addressing culprit-finding problems but has some limitations.

In general, any latency optimization efforts should focus on the subcomponents that are on the critical path. Resource contention with off-critical path subcomponents, however, can also slow down critical path execution.

Considering figure 2b again, imagine that A2 is holding a mutex or is running intensive computation and causing CPU starvation in critical path code. The critical path will show part of the resource-starvation issue: Nodes waiting on the blocked resource will appear on the critical path. The culprit node (A2), however, will not appear on the critical path. Profilers focused on CPU and lock contention are well-suited for identifying these bottlenecks.

The critical path also lacks visibility into the drag and slack¹² of the overall execution. Drag and slack are measures of the potential change in the critical path based on optimizing a single step. A single subcomponent that blocks all others has large drag: Improving that one subcomponent is likely to improve the overall execution time.

When multiple subcomponents run in parallel, they have large slack: Even if one slows down, it probably won't impact the overall time. Causal profiling⁶ uses an experiment-driven approach to identify headroom by automatically injecting latency into different subcomponents to determine their drag and slack. Quartz³ aims to identify subcomponents with high drag via CPU profiling. The Mystery Machine⁵ identifies subcomponent slack by reconstructing system-dependency graphs via log analysis.

Streaming is an important technique for improving latency, and, unfortunately, CPT for streaming APIs is not well-defined. Services can return a stream of results, allowing work to begin on the early results before later results are ready. Bidirectional streaming, where client and server send multiple messages back and forth, is a more complex programming model but is also useful for latency in some situations. The critical path must be carefully defined for these operations: Should segments end when the first message is returned, or the last? What if the caller of the API is also streaming? Google Search's current implementation of CPT deals with this challenge by defining the critical path segment as ending when the last message of the stream is received, but this definition is misleading in situations where earlier messages are more important for latency.

As is true of most data-collection systems, CPT provides more precise information as more data is collected and aggregated. Aggregating data can also hide critical details. The "Aggregation and Visualization" section discusses the tradeoffs that occur when critical path traces are aggregated.

Even with these caveats, CPT helps focus effort on areas where it is most likely to make a difference.

Adoption

Prior to adoption of CPT, latency analysis in Google Search was primarily based on log analysis and RPC telemetry. Adoption costs were low, since RPC monitoring was enabled by default, but granularity of the measurements was insufficient. Log analysis could correctly identify small changes in latency; for example, regressions of a few milliseconds could easily be detected via A/B testing. Identifying which service had caused the regression was challenging, since many services operated in parallel. Identifying which subcomponent within a service had caused the regression was virtually impossible, since there was no systematic collection of subcomponent contributions to end-to-end latency.

Broad deployment of CPT throughout Google Search has resulted in much better visibility of the root causes of latency.

Figure 3 shows the granularity of CPT in Google Search by comparing the granularity of individual critical path elements with overall request latency.

About 50 percent of request latency is attributable to subcomponents with a latency of under 10 ms. This section of the curve shows software written in frameworks that support fine-grain tracing. When latency regressions occur in these subcomponents, supplemental tools such as CPU profilers are frequently unnecessary. CPT alone is precise enough to flag the root cause of the regression.

The right side of the granularity curve represents latency incurred in blind spots: CPT is too coarse-grained to immediately flag root causes. In some cases, the latency regression has occurred in an RPC service using a framework that has not yet implemented fine-grain CPT. CPT is sufficient to demonstrate that the root cause is somewhere in the given RPC service, but it cannot automatically identify the culprit subcomponent. In other cases, CPT has correctly identified a slow subcomponent on the critical path. CPU profiling of these subcomponents typically finds opportunities to reduce latency.

Operational costs

Operational overhead of CPT is low enough that profile data collection is enabled by default in many applications, including Google Search. For Search, traces are continuously collected on 0.1 percent of requests. Traced requests see a 1.7 percent increase in mean latency, adding .002 percent to overall mean request latency. Overhead remains low at the tail, with 99^th percentile latency overhead for traced requests being 2.0 percent.

CPU overhead is more difficult to calculate. Frameworks were rewritten to incorporate CPT at runtime. Those changes typically incurred less than 0.1 percent of the overall CPU cost, but framework overhead depends on the workload. The CPU overhead was deemed acceptable when compared with the value of the latency profiles.

Network overhead is significant, largely because of an overly verbose wire format. For a critical path with 𝑁 elements and an average of 𝑀 subcomponents per element, the wire format uses 𝑂(𝑁 ∗ 𝑀) memory. A graph representation would encode the same information at lower cost. In practice, the network overhead is mitigated both by sampling only 0.1 percent of requests and via compression.

Aggregation and Visualization

A single critical path trace is interesting but might not present a realistic view of overall system performance. Single requests can be outliers. Merging multiple profiles creates a statistically significant view of system performance.

Individual critical paths are logged along with other metadata about the request. Standard log-analysis techniques are used to select a random sample of requests, meeting whatever criteria are of interest. For example, critical path analysis for Google Search routinely includes slicing by A/B experiment arms, filtering to select requests where certain UI elements appeared on the Search results page, and filtering to select requests from certain time ranges and geographic regions. One productive technique is simply to select only requests that were extremely slow. Analyzing these requests frequently reveals systemic issues that amount to significant optimization opportunities.

Note that unlike distributed tracing systems,¹⁷ a single log entry contains a critical path that spans multiple services. This is important for correctness reasons: Back-end services can record their own critical paths, but they cannot have a priori knowledge of whether their work will be on the critical path of their callers. CPT implements a critical-path-merging mechanism where the callers incorporate critical paths from their back ends into their own. Compared with the alternative of joining information from the back ends' logs, this is a much simpler and faster approach.

Aggregation algorithm

The techniques Google Search uses for aggregating critical path information are similar to the mechanisms generally used in CPU profiling. The process creates a profile for visualization and analysis by merging critical paths from sampled requests into a single "average critical path," similar to the Mystery Machine.⁵ Google Search uses pprof⁸ for visualization, but CPT is flexible and can be used in conjunction with other visualization tools.

Consider figure 1 as an example system where two requests are received. Table 1 shows how these two requests are merged to create an average critical path. In Request 1, A1 calls A2 and B1 in parallel, and only A2 is on the critical path. In Request 2, A2 and B1 are again called in parallel, but B1 is slower and on the critical path. These are aggregated into a single profile by merging subcomponents with identical paths and then dividing by the number of samples to keep metrics relative to average request latency.

This aggregation can be thought of as the average critical path of the overall system. Interpretation of the average critical path needs to bear in mind that subcomponents are not sequential, and the average might not reflect any real critical path that the system could have taken. If two subcomponents are called in parallel, it is not possible for both to be on the critical path for a single request. Both will still likely appear on the average critical path.

Subcomponent time on the average critical path reflects both how often the subcomponent is on the critical path and how long it takes when present. The difference becomes important when considering system subcomponents that are usually fast but occasionally slow. A system subcomponent that always takes 5 ms on the critical path will show up in an aggregate profile as taking 5 ms. A system subcomponent that is only on the critical path 1 percent of the time but takes 500 ms when it does appear will also show up in an aggregate profile as taking 5 ms.

As with other statistical analysis, looking at the distribution of the data in the sample pool can be helpful. In addition to pprof format profiles, aggregation tools collect latency histographs for critical path segments. At Google we consult this data early on when investigating latency problems. Optimizing something that occurs rarely but is very slow is a different problem from optimizing something that occurs frequently. In cases where the latency distribution for a particular subcomponent is highly skewed, additional sampling is helpful, to focus on cases where the subcomponent is slow. The "Case Studies" section later in this article discusses several incidents where focused analysis made root causes more obvious.

Precision of total critical path

How many samples are needed before we are confident that a critical path change of a certain size is not a result of random variation in request latency? The central limit theorem can calculate confidence intervals for various sample sizes. Table 2 shows estimates of the width of the 95th percentile confidence intervals for Google Search profiles.

In practice, profile visualization becomes quite slow with millions of requests in a sample. We default to looking at samples of 100,000 requests as a compromise between precision and usability of profiling tools, and increase sampling beyond 100,000 requests as needed.

Individual subcomponent precision

Aggregated profiles contain thousands of subcomponents. This creates a situation where even 95th percentile confidence intervals can yield a large number of false positives—cases where a subcomponent appears to have had a statistically significant change in latency, but where the difference is actually caused by random variation. False positives waste engineering time since investigating the latency change will not find a root cause.

One way of measuring the false positive rate is to create two profiles using identical sampling criteria, and then compare the resulting profiles using pprof's difference-base view. The difference-base view highlights large differences between profiles. When you know in advance that the profiles should be the same, any differences larger than a given threshold are false positives.

To quantify the latency-profiling false positive rate for Google Search, we repeated the pairwise comparison experiment 100 times at various sample sizes and with various thresholds for considering a profile difference important enough to investigate. Table 3 shows the 95th percentile confidence intervals for the expected number of false positives. As an example, if the profile comparison is based on 1,000 requests per profile, you can be 95 percent confident that the average number of components with false positives larger than 5 ms is between 0.2 and 0.4.

In Google Search, it's extremely rare for profiles to show large differences in a single subcomponent because of random variation. An additional finding from the analysis is that not all subcomponents are equally likely to show random differences. Components that have larger mean times and larger standard deviations also have larger false positive rates. (The central limit theorem predicts this outcome. Even though subcomponents have non-normal latency distributions, estimates of mean latency become more precise as standard deviation shrinks and sample sizes grow.)

This finding has important implications for latency investigations in Search. Identifying the root causes of latency differences in A/B experiments is straightforward, because two profiles with identical sampling criteria can be compared. Any subcomponent-level differences above 1 ms were probably caused by the experiment, not by random variation. Root-cause analysis is much more difficult when comparing profiles taken on different workloads. Under those circumstances, larger sample sizes do not make the profile differences clearer. Any large differences between the profiles are probably a result of the different sampling criteria, not random variation.

Visualization

Once profiles have been aggregated, a variety of visualizations can be used to help engineers understand system latency. Top-down, bottom-up, call graphs, and flame graphs¹¹ are all helpful. Figure 4 shows an example CPT visualization as a flame graph, where a latency critical path is represented as columns of boxes, and each box represents a subcomponent. (The color of each box is insignificant and optimized for easy viewing.) Profile comparison is essential for diagnosing regressions, since they show only subcomponents with large changes. This is particularly helpful for analyzing regressions in A/B experiments.

Case Studies

In the following examples CPT profiles are used to debug latency issues.

The case of the nonblocking RPC

One frequent strategy for identifying optimization opportunities is to look for outlier requests. Some high-latency threshold will be chosen (e.g., slow enough to be a very bad experience for end users). Latency profiles are then taken for requests that exceeded that threshold. Sometimes the resulting profiles simply prove that a series of unusual events can cause high latency. In this case, the outlier requests weren't outliers: They were part of a pattern that had gone undetected for years.

One latency analysis turned up a single common query that had latency that was several hundred milliseconds longer than normal. The latency profile showed a blocking call to an API that was rarely used in Search. The profile also identified the subcomponent that was making this call, which was sufficient to show that the behavior was caused by a particular type of Search feature that occurred in less than 10 percent of requests. Additional profiles were taken, this time including all requests that showed this RPC on the critical path.

The first identified request was actually not as much of an outlier as originally thought: Almost all requests that triggered this Search feature were incurring a large latency penalty. This RPC had triggered a transient alert 18 months earlier, but it was not severe enough to require a thorough investigation.

Contacting the team that owned the relevant code was helpful: The original intent of using this API was for reporting purposes, and in fact this code was never intended to block request processing. Refactoring the code to make an actual nonblocking RPC resolved the issue and resulted in significant latency reductions on the affected slice of Search traffic.

Conclusion: Putting events in context is helpful. As discussed earlier, RPC telemetry fell short due to a lack of support for traffic slicing. CPT's traffic-slicing capabilities helped identify and resolve the problem.

The case of the code that didn't run

When investigating average Search latency, a common pattern appeared in one particular service. Thousands of subcomponents were occasionally blocked on a child identified on the critical path as a framework, indicating the critical path delay occurred in the framework rather than application code. Even though most of these blocked subcomponents were fast, a few were slow; cumulatively, they represented a significant opportunity to cut latency.

Initial investigations failed to identify the root cause. CPU profiling the framework code showed no obvious problems. Investigating the impacted subcomponents that presented high delay in framework code also showed no obvious patterns, other than that they seemed slow. A few CPU optimizations were implemented, but none had significant latency impact.

Eventually an engineer hit on the idea that the problem was not the framework using too much CPU; the issue was that the framework should be using more. A few years earlier, someone had identified that per-request lock contention was causing significant latency problems. Dozens of CPUs might be working on the same request in parallel and were competing to grab locks that protected per-request state. Experiments tested different levels of parallelism per request, and it turned out that limiting a single request to using no more than two CPUs at the same time was optimal. That configuration used less CPU because of decreased lock contention, and it reduced Search latency.

The new discovery was that the two-CPU request rule was optimal only as a default. Certain subcomponents had large amounts of CPU-bound work to perform and could benefit from additional parallelism without increasing lock contention. A new framework feature was added, allowing subcomponents to request that additional CPUs be assigned to their work. The optimization worked and is now routinely applied when subcomponents show high amounts of framework time.

Conclusion: When optimizing for latency, latency profiling should be preferred. CPU profiling, RPC telemetry, and distributed tracing were all unable to detect this particular issue, whereas latency profiling successfully discovered the unexpected latency pattern and motivated further investigation.

The case of the phantom blind spot

While researching blind spots in CPT, a concerning pattern appeared. A subcomponent labeled as deferredwork had large blind spots and was contributing significantly to critical path time. Researching the code showed that the deferredwork subcomponent was under active development. It was intended to generate information that should be added to the bottom of the Search results page. One common optimization in Search is to generate information that should be displayed at the top of the results page first and then stream that information to the client. Was the deferredwork subcomponent actually blocking the top of the page as well?

That explanation didn't match the data: The deferredwork subcomponent was routinely showing up as taking more than a second on the critical path. Routinely adding one second to Search latency would have been causing alerts to fire. Comparing the total request latency to the sum of the critical path latency showed the real issue: The critical path time was much larger than the actual request latency. The framework team maintaining the service with the problem investigated, and it confirmed that the deferredwork code was triggering a critical path tracing bug.

Conclusion: Better framework support for streaming behavior is an area for future development.

The case of the slow home page

Monitoring showed that the frequency of very slow home-page loads had increased. Investigation showed that the problem had begun a few days prior, and the initial assessment was that the problem was significant but not severe. The proximate cause had been a configuration change rollout. Each rollout batches together hundreds of independent changes and runs them through a canary process to make sure they are safe. The canary had passed, these changes had been live for days without causing major issues, so no rollback was performed.

Concern grew, however, as the problem was investigated. The issue was actually more serious than the monitoring indicated, because the problem was not random. Most users were not impacted at all, but users on certain types of devices in one geographic area were experiencing high latency. The urgency of the investigation increased.

None of the changes included in the configuration rollout seemed likely to have caused the problem, so the root cause was hard to identify. Normally a latency regression in a geographic area is a result of networking issues. A Dapper trace that captured the problem, however, showed the opposite. RPCs were fast, but one service involved was using far too much CPU serving these requests. The Dapper trace was too coarse-grained to capture which subcomponents were burning CPU. CPU profiling wasn't helpful either: Requests triggering the problem were rare enough that they were lost in the noise.

At this point in debugging, the problem mysteriously vanished with another configuration rollout. Investigation continued, now using latency profiling to try to figure out what had happened. Two latency profiles were taken, one capturing normal home-page requests, and another capturing extremely slow home-page requests.

Comparing the profiles found the culprit: A customization system responsible for short-lived region-specific home-page changes was blocking request processing. Reviewing the configuration changes again showed that the rollout that triggered the problem had included a new home-page customization and that the scheduled end of the home-page customization had coincided with the end of the latency problems.

The team responsible for home-page customization took a look at the issue and quickly identified the root cause: A "simple" configuration had been expanded by an automated system into an unexpectedly complex set of configuration rules. Processing those rules had required seconds of CPU time on every request where they triggered.

Conclusion: Ability to focus on the impacted traffic is key to latency profiling. Home-page requests from a specific category of devices in a specific geographic region were lost in the noise of CPU profiles. Dapper traces could show the issue but were unable to focus on the subcomponents causing the problem. The ability to gather fine-grain latency profiles focusing on just the impacted requests made the difference.

Future Work

The approach outlined here is practical and scales well but has limitations. One area for future work is increasing adoption of fine-grain tracing and reducing the size of blind spots in profiles. This would likely require increasing adoption of frameworks-based programming models.

Joining critical path tracing with CPU profiling could be a powerful tool. This work will be quite challenging, since sampling CPU profilers by necessity take a sample before request processing has completed. The critical path is not known until the request is complete, which might be long after the sample has been collected. Finding a way to join data from these systems would allow CPU profilers to focus developer attention on code that is already known to be on the critical path.

Adding frameworks-level support for CPT in streaming protocols would be helpful. The approach described in this article is defined only for request/response RPC protocols that create a directed graph of calling nodes, and only when all the critical work for those RPCs is done between the arrival of the request and the transmission of the response. Streaming protocols that produce incremental progress, or conversational protocols with multiple requests and responses in a session, are harder to analyze. Avoiding blind spots in profiles where streaming protocols are used would be helpful.

CPT focuses on server-side latency analysis. Adding support for client-side components of latency (as was done with Facebook's Mystery Machine⁵) would improve the ability to understand which contributors to end-to-end latency have the most impact on the end-user experience.

The frameworks integration that CPT uses to identify service subcomponents has profiling applications outside of latency. We have built proof-of-concept distributed cost profilers within Google Search that allow downstream systems to trace expensive traffic to specific subcomponents in callers. These cost profiles are detailed and useful but have not been generalized for use outside of Search.

Finally, adding metadata to trace collection seems both possible and helpful. For example, frameworks could collect slack information for each critical path node. In cases where a node is blocked, frameworks could collect information about the contended resource causing the delay. Continued research in this area should yield additional techniques to help profile for latency.

Conclusions

In large, real-world distributed systems, existing tools such as RPC telemetry, CPU profiling, and distributed tracing are valuable to understand the subcomponents of the overall system, but are insufficient to perform end-to-end latency analyses in practice. Issues such as highly parallel execution flows, heterogeneous workloads, and complex execution paths within subsystems make latency analysis difficult. In addition, these systems and workloads change frequently.

No one team or person has a detailed understanding of the system as a whole.

CPT in such systems addresses these challenges to provide actionable, precise latency analysis. Integration at the frameworks level¹⁵ provides detailed views of application code without requiring each developer to implement their own tracing. Common wire protocols allow consistent mechanisms for triggering, aggregation, and visualization. Efficient implementation allows a high sampling rate, giving precise profiles even for relatively rare events. Scalable and accurate fine-grain tracing has made CPT the standard approach for distributed latency analysis for many Google applications, including Google Search.

Acknowledgements

This article would not be possible without a number of contributors from Google. Fabian Gura, David German, and numerous others in Search built the instrumentation and tools that implement this methodology in practice in Google Search. Steven Procter designed, implemented, and launched the first frameworks-based CPT algorithm, demonstrating that collecting fine-grain traces at scale was possible. Several frameworks teams collaborated to scale the effort to support thousands of developers. Michael Levin's work guided the statistical precision analysis.

We are grateful to Jennifer Klein, Łukasz Milewski, Ali Sheikh, Paul Haahr, and Brian Stoler for providing critical feedback about the readability and contents of the article. Salim Virji coordinated the publishing process. Without his support, the critical path for publishing this article would have been a lot longer.

References

1. Amazon Web Services. Amazon CloudWatch: Observability of your AWS resources and applications on AWS and on-premises; https://aws.amazon.com/cloudwatch/.

2. Amazon Web Services. What is Amazon CodeGuru profiler?; https://docs.aws.amazon.com/codeguru/latest/profiler-ug/what-is-codeguru-profiler.html.

3. Anderson, T.E., Lazowska, E.D. 1990. Quartz: A tool for tuning parallel program performance. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 115–125; https://dl.acm.org/doi/10.1145/98457.98518.

4. Arapakis, I., Bai, X., Cambazoglu, B.B. 2014. Impact of response latency on user behavior in web search. In Proceedings of the 37th international ACM SIGIR Conference on Research and Development in Information Retrieval, 103–112; https://dl.acm.org/doi/10.1145/2600428.2609627.

5. Chow, M., Meisner, D., Flinn, J., Peek, D., Wenisch, T.F. 2014. The Mystery Machine: End-to-end performance analysis of large-scale Internet services. In Proceedings of the 11th Usenix Symposium on Operating Systems Design and Implementation, 217–231; https://dl.acm.org/doi/10.5555/2685048.2685066.

6. Curtsinger, C., Berger, E.D. 2015. Coz: Finding code that counts with causal profiling. In Proceedings of the 25th Symposium on Operating Systems Principles, 184–197; https://dl.acm.org/doi/10.1145/2815400.2815409.

7. Dean, J., Barroso, L.A. 2013. The tail at scale. Communications of the ACM 56(2), 74–80; https://dl.acm.org/doi/10.1145/2408776.2408794.

8. GitHub. pprof; https://github.com/google/pprof.

9. Google. Cloud monitoring; https://cloud.google.com/monitoring.

10. Google. Cloud Profiler; https://cloud.google.com/profiler.

11. Gregg, B. 2016. The flame graph. Communications of the ACM 59(6), 48–57; https://dl.acm.org/doi/10.1145/2909476.

12. Kelley., J.E. 1961. Critical path planning and scheduling: Mathematical basis. Operations Research 9(3), 296–435; https://www.jstor.org/stable/167563.

13. Microsoft. Profile production applications in Azure with application insights; https://docs.microsoft.com/en-us/azure/azure-monitor/app/profiler-overview.

14. Microsoft. Azure Monitor; https://azure.microsoft.com/en-au/services/monitor/.

15. Nokleberg, C., Hawkes, B. 2021. Best practice: Application frameworks. acmqueue 18(6), 52–77; https://queue.acm.org/detail.cfm?id=3447806.

16. Pandey, M., Lew, K., Arunachalam, N., Carretto, E., Haffner, D., Ushakov, A., Katz, S., Burrell, G., Vaithilingam, R., Smith, M. 2020. Building Netflix's distributed tracing infrastructure. The Netflix Tech Blog; https://netflixtechblog.com/building-netflixs-distributed-tracing-infrastructure-bb856c319304.

17. Sigelman, B.H., Barroso, L.A., Burrows, M., Stephenson, P., Plakal, M., Beaver, D., Jaspan, S., Shanbhag, C. 2010. Dapper, a large-scale distributed systems tracing infrastructure. Google Technical Report; https://static.googleusercontent.com/media/research.google.com/en//archive/papers/dapper-2010-1.pdf.

18. Yang, C.-Q., Miller, B. 1988. Critical path analysis for the execution of parallel and distributed programs. In Proceedings of the 8th International Conference on Distributed Computing Systems. IEEE Computer Society Press, 366–375.

Brian Eaton joined Google's Information Security team as a software engineer in 2007, focusing on usable security and authentication. Beginning in 2014, he joined Web Search to lead engineering productivity and performance teams, developing latency profiling and the first microservices deployments in Web Search. He now works on Search Quality and user trust, helping users understand the origins and context of information they find on the Web. He holds a BS degree from the University of California, Santa Cruz, and is a semi-professional foosball player.

Jeff Stewart worked at Google for over 17 years. For more than 10 years, he improved Web Search, leading teams to improve compute efficiency and end user latency. He led the initiative to enable https on search traffic by default. Jeff helped launch Gmail. He moved to a new employer in 2021. He earned both his bachelor's degree in Electrical and Computer Engineering and his master's degree in Information Networking from Carnegie Mellon University in Pittsburgh, Pennsylvania. He holds a black belt in American Kenpo Karate. He lives near Boston.

Jon Tedesco has worked at Google for 9 years in web Search infrastructure. During that time he has led several teams to build latency and compute analysis tools for the Search stack, and to develop new components of Search infrastructure focusing on performance analysis and optimization. He holds MS and BS degrees in computer science from the University of Illinois at Urbana-Champaign.

N. Cihan Tas is a site reliability software engineer on the Stadia team at Google, currently focusing on the optimization of datacenter networks. Prior to Google, he worked at Siemens Corporate Research in the area of wireless networks, broadly ranging from vehicle-to-vehicle networks to smart-grid communication networks. He holds a BS degree from Bilkent University, and MS and PhD degrees from University of Maryland, College Park, all in computer science. N. Cihan is an ACM Member.

Originally published in Queue vol. 20, no. 1—
Comment on this article in the ACM Digital Library