
Workload Frequency Scaling Law - Derivation and Verification

Workload scalability has a cascade relation via the scale factor.

Noor Mubeen, Intel

Many processors expose performance-monitoring counters that help measure the 'productive performance' associated with workloads. Productive performance is typically represented by the scale factor, a term that refers to the extent of stalls relative to stall-free cycles within a time window. The scale factor of a workload is also influenced by the clock frequency selected by frequency-selection governors. Hence, in a dynamic voltage/frequency scaling (DVFS) system such as Intel Speed Shift [1], the utilization, power, and performance outputs are also functions of the scale factor and its variations. Some governance algorithms treat the scale factor in ways that are native to their governance philosophy.

This article presents equations that relate to workload utilization scaling at a per-DVFS-subsystem level. A relation is established between frequency, utilization, and scale factor (which itself varies with frequency). Verifying these equations turns out to be tricky, since the workload-inherent utilization also varies, seemingly in an unspecified manner, at the granularity of governance samples. Thus, a novel approach called histogram ridge trace is applied. Quantifying the scaling impact is critical when treating DVFS as a building block. Typical applications include DVFS governors and other layers that influence the utilization, power, and performance of the system. The scope here, though, is limited to demonstrating well-quantified and verified scaling equations.

 

Workload Scaling

Intel has three architecture-independent registers that are relevant to this topic:

Aperf. A running counter that counts at the actual clock rate of execution at that instant. This actual clock frequency may vary over time based on governance and/or other algorithms. This register counts only during the active state (C0).

Mperf. A running counter for activity that counts at a fixed TSC (time stamp counter) clock rate. It counts only during active state (C0).

Pperf. This counter is similar to Aperf, except that it does not count when the activity stalls as a result of some dependency, likely gated on another IP's clock domain (e.g., memory).

The deltas of these counters over a given time window are commonly interpreted as follows:

 

• Utilization. U = ΔMperf /ΔTSC

• Scale factor. S = ΔPperf /ΔAperf

Ideally, if an activity is free of stalls, then Pperf = Aperf (i.e., the scale factor S will be 1). In that case, the time taken to complete an activity would simply scale with the inverse of the actual frequency in that window. In most real workloads the scale factor varies and is often less than 1. Establishing accurate equations relating utilization, scale factor, and frequency allows DVFS governors to dispense well-informed frequency-change decisions.
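For illustration, the following Python sketch computes these two metrics from raw counter deltas. It assumes the deltas for one sampling window have already been read (for example, by a driver or logging tool); the variable names are placeholders, not part of any Intel interface.

def window_metrics(d_aperf, d_mperf, d_pperf, d_tsc):
    """Derive utilization and scale factor from the counter deltas of one window."""
    utilization = d_mperf / d_tsc       # U = dMperf / dTSC (C0 residency)
    scale_factor = d_pperf / d_aperf    # S = dPperf / dAperf (stall-free share of active cycles)
    return utilization, scale_factor

# Example: a window with 40 percent C0 residency where 75 percent of the active cycles were stall-free.
u, s = window_metrics(d_aperf=8.0e6, d_mperf=9.6e6, d_pperf=6.0e6, d_tsc=24.0e6)
print(f"U = {u:.2f}, S = {s:.2f}")      # U = 0.40, S = 0.75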

At the outset, it may seem that the workload-scaling problem is similar to the well-known Amdahl's law, which deals with workload speedup as associated with parallel-compute resources. However, Amdahl's law cannot be applied here since the "parallelizability factor" analogy to "scale factor" doesn't hold up - the former is independent of the resource being scaled (i.e., parallel processors), while the latter is a function of the resource being scaled (i.e., frequency). Utilization scalability over frequency has a cascade relation via the scale factor.

 

Terminology and Assumptions

This article makes use of the following terminology:

• Utilization or C0 activity percentage refers to the active (non-idle) clocked state [2] within a time window. Load and utilization, used interchangeably here, refer to the same attribute.

• Scalable activity refers to the part (or percentage) of the operation whose execution-time scales inversely with frequency.

• Stall in activity refers to the part (or percentage) of the operation that stalls during its completion. While a longer delay would actually cause C-state demotion, the instruction stalls referred to here are much shorter and occur with the CPU in the C0 active state.

Consider a frequency governor on a DVFS system, updating frequency at every periodic time window T. Consider the present instant at time t0 ("NOW"), as depicted in figure 1. Let the just-completed window T of the workload have tact (or ta) as its scalable activity (i.e., tact represents the cumulative portion of activity that keeps the processor busy in execution without any dependent delays or stalls). Let tstall (or ts) be the sub-duration depicting the effective stalls experienced (inter-subsystem dependency stalls). Also let toff (or to) be the cumulative duration in which the DVFS subsystem is in a deeper C-state, at significantly lower power than the C0 state. The fundamental difference between toff and tstall is that in the latter case the subsystem is still active (C0) while experiencing very short dependency stalls, whereas in the former case the delays are large enough to put the subsystem into a deeper C-state momentarily.

[Figure 1: The governance time window T, ending at the present instant t0, decomposed into scalable-activity (tact), stall (tstall), and off (toff) durations.]

In the given window, each productive (stall-free) clock of load execution takes the inverse of the frequency f, so the scalable active time is the productive cycle count divided by the frequency:

ta0 = ΔP0 / f0

Or, in general, in any window:

ta = ΔP / f

Intel's productive performance register [3] counts the productive (non-stall) cycles, proportional to tact. Defining scalability S as the ratio of the delta in productive count (Pperf) to active count (Aperf) results in the following general equations:

S = ΔP / ΔA

S = ta / (ta + ts)

Also, the load l (C0 percentage) in this window can be defined as:

l = (ta + ts) / T = ΔMperf / ΔTSC

where T = ΔTSC / Ftsc is the window duration.

Causality

Consider one specific window T between t-1 and t0. Here, the governor has dispensed a frequency f0. If the governor had dispensed a different frequency f'0 in the same window, the scalable part of the work (i.e., ta) would have been some different t'a, and thereby t0 would shift to t'0. For practical purposes, the stall cycles can be assumed to be unrelated to the selected frequency, since the stalls are not explicitly dependent on the local subsystem's clock frequency. Hence ts remains independent as far as its causality with frequency is considered. Similarly, within the same window, the number of productive (stall-free) instructions to be executed remains fixed. Specifically, we can generalize and state that:

Stall time and productive cycle count are invariant of frequency causality.

Note also that frequency and load causality are considered within a time window, not across windows. In the rest of the derivation, a reference to a change in frequency is not a change along the flow of time, but another possible causality had the frequency been something else.
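As a quick numeric illustration of these invariances (the numbers are made up), consider one 20-ms window with 2 ms of dependency stalls and a fixed productive cycle count, evaluated counterfactually at two frequencies; the active time, load, and scale factor all change while the stall time and productive count do not:

# Illustrative numbers only: fixed stall time and productive cycle count,
# two counterfactual frequencies for the same window.
T, t_s, d_pperf = 20e-3, 2e-3, 4.0e6

for f in (1.0e9, 2.0e9):
    t_a = d_pperf / f                  # scalable active time shrinks as frequency rises
    load = (t_a + t_s) / T             # l = (ta + ts) / T
    scale = t_a / (t_a + t_s)          # s = ta / (ta + ts)
    print(f"f = {f/1e9:.0f} GHz: load = {load:.2f}, scale factor = {scale:.2f}")

# f = 1 GHz: load = 0.30, scale factor = 0.67
# f = 2 GHz: load = 0.20, scale factor = 0.50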

 

Utilization (load) to Frequency Change Estimation

An important aspect of workload scalability is the extent of change in load caused by a change in frequency.

As shown in figure 2, the sampled delta values of Pperf and Aperf, ΔP0 and ΔA0, within the window ΔTSC can be associated with the active, stall, and off time within T:

[Figure 2: Sampled counter deltas ΔP0 and ΔA0 within the window ΔTSC, associated with the active, stall, and off time in T.]

ΔA0 = f0 (ta0 + ts0)

ΔP0 = f0 ta0

Now, assume the DVFS governor had chosen some target f'0 instead of the initial actual frequency f0. The requirement is to find a relation that accurately represents the changed final target load l'0 as a result of this causality, and from it to derive the estimated load l1 at time t1.

 

Let the Aperf count in the causality window hypothetically transition from the present ΔA0 (at frequency f0) to ΔA'0 (at frequency f'0):

 

ΔA'0 = f'0(t'a + t's)

 

Since the stall time is not an explicit function of the local subsystem clock, the time spent in stalls remains the same. In other words,

stall time is invariant of frequency causality.

t's = ts

 

 

When we solve this together with the following two general equations for load and scale factor at time tn:

 

ΔAn = fn ln ΔTSC / Ftsc

 

and the scale factor as

sn = ΔPn / ΔAn

solving yields

l'0 = s'0 l'0 + l0 - s0 l0

 

or

l'0 = l0 (1 - s0) / (1 - s'0)

Now,

s'0 = ΔP'0 / ΔA'0 = ΔP'0 / (f'0 (t'a0 + t's0))

Again, due to stall time invariance,

s'0 = ΔP'0 / (f'0 t'a0 + f'0 ts0) = ΔP'0 / (ΔP'0 + f'0 ts0)

Since this is not an iteration to the next time window, the productive cycles to be executed remain the same. In other words,

The productive cycle count is invariant of frequency causality.

 

ΔP'0 = ΔP0

 

This results in the scaling of scale factor equation:

s'0 = s0 f0 / (s0 f0 + (1 - s0) f'0)

This intermediate equation defines the change (scaling) of the scale factor itself over frequency. Solving further, we get the scaling of load equation:

l'0 = l0 (1 - s0 + s0 f0 / f'0)

In the causality-based derivation above, there was no workload-inherent change within the window. In other words, if the causality from f0 to f'0 were an actual change to f1 in the adjacent future time window at t1, and the equation above were used to estimate the target load l1, then any divergence (l1actual - l'0) from the expected target load is clearly workload-inherent change accumulated over the time window t0 to t1. Since l1actual can be measured, any divergence from the estimated load can be calculated. The estimated scaled load is thus given by:

l1 ≈ l0 (1 - s0 + s0 f0 / f1)

Similarly, the estimated scaled scale factor is approximately given by:

s1 ≈ s0 f0 / (s0 f0 + (1 - s0) f1)

Verification on an Actual Platform

The last two equations above can be called the workload scaling equations. Verifying them empirically on an actual platform, however, isn't as straightforward. At the granularity of a single governance cycle, the parameters on the right-hand side (l0, s0, f0, f1) can be determined; but, as described, the computed result cannot be compared directly with the load measured after the end of that cycle, primarily because the incoming load itself fluctuates from sample to sample. It is possible, though, to verify the equations statistically by taking a series of histograms at different frequencies and tracing their ridge. Using this histogram ridge trace method, the generality of the equations is verified.
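As a concrete sketch (not taken from any Intel tool; the trace format and names are assumptions for illustration), the two workload scaling equations and the per-sample divergence described above can be expressed as follows:

import numpy as np

def scaled_scale_factor(s0, f0, f1):
    """Scaling of scale factor: s1 ~ s0*f0 / (s0*f0 + (1 - s0)*f1)."""
    return s0 * f0 / (s0 * f0 + (1.0 - s0) * f1)

def scaled_load(l0, s0, f0, f1):
    """Scaling of load: l1 ~ l0 * (1 - s0 + s0*f0/f1)."""
    return l0 * (1.0 - s0 + s0 * f0 / f1)

def load_divergences(trace):
    """trace: iterable of (load, scale_factor, frequency) tuples, one per governance window.
    Returns the per-window gap between measured and estimated load, i.e., the
    workload-inherent change accumulated across each window."""
    gaps = []
    for (l0, s0, f0), (l1_actual, _s1, f1) in zip(trace, trace[1:]):
        gaps.append(l1_actual - scaled_load(l0, s0, f0, f1))
    return np.array(gaps)

# Example: a window measured at 1.2 GHz with 50 percent load and scale factor 0.8,
# re-estimated for a 2.4 GHz target.
print(scaled_load(0.50, 0.80, 1.2e9, 2.4e9))         # 0.30: the load does not simply halve
print(scaled_scale_factor(0.80, 1.2e9, 2.4e9))       # ~0.67: stalls take a larger share of active time

Note that only when the scale factor is 1 does doubling the frequency halve the load; stalls limit the benefit otherwise.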

 

To begin, a workload is run completely at a fixed frequency.

Workload #1 is a random browser-based graphics load (figure 3; https://webglsamples.org/multiple-views/multiple-views.html).

[Figure 3: Workload #1, a browser-based WebGL graphics load (multiple-views sample).]

While capturing the data, it is preferable to stop unwanted background services that are irrelevant to the workload under analysis. A precise, lightweight data-logging tool is also preferred; on Linux systems, the Intel PSST [4] can log with a fixed, low, and precise overhead. The data is captured for, say, a duration of 100 seconds with a sampling poll period of 20 ms. The data captured with every sample includes:

• CPU load.

• Scale factor.

• Clamped frequency value.

The entire run is repeated at every CPU frequency, while keeping the other DVFS sources fixed (e.g., Gfx frequency at a moderate but fixed value).

The CPU scale-factor and utilization logs are captured. In this example, time-domain analysis is not practical for assessing scalability and utilization. Instead, a histogram analysis identifies the bucket in which the maximum number of load points lies for a given operating frequency. The peak of the histogram represents the load value with the highest probability of occurrence at that frequency. In the example data in figure 5, about 500 samples (or 80 percent of the total) occurred at a 33 percent load. Such inferences are obviously hard to draw from the time chart in figure 4, even when the frequency is fixed.
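The histogram-peak extraction itself is straightforward. A minimal sketch, assuming the per-sample load logs for each fixed-frequency run have already been collected into arrays (the names and bin count here are illustrative), might look like this:

import numpy as np

def histogram_peak(samples, bins=50, value_range=(0.0, 100.0)):
    """Return the center of the histogram bucket that holds the most samples."""
    counts, edges = np.histogram(samples, bins=bins, range=value_range)
    i = int(np.argmax(counts))
    return 0.5 * (edges[i] + edges[i + 1])

def ridge_trace(loads_by_freq):
    """loads_by_freq: {frequency: array of per-sample C0 load, in percent}.
    Returns the peak-histogram load for each tested frequency (the ridge)."""
    return {f: histogram_peak(samples) for f, samples in sorted(loads_by_freq.items())}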

[Figure 4: Time chart of sampled CPU load over the run at a fixed frequency.]

[Figure 5: Histogram of load samples at a fixed frequency; the peak bucket, with about 500 samples, is at 33 percent load.]

The experiment is iterated through all other frequencies, one at a time, to get a range of peak histogram values, as shown in figure 6.

[Figure 6: Histogram ridge trace of utilization (peak-histogram load) across CPU frequencies.]

With the same data from the set of experiments above, a similar histogram ridge trace for scale-factor can be plotted as shown in figure 7.

[Figure 7: Histogram ridge trace of scale factor across CPU frequencies.]

Scaling of scale factor

A plotting tool is used to create a map-view of the peak histogram points of scale factor from figure 7, which are plotted as red dots in figure 8. On the same map-view, the scaling of scale factor equation is plotted for different possible target frequencies, based on any one initial seed value for utilization and scale factor. The correlation between the equation and the full data set is shown in figure 8.

[Figure 8: Peak histogram points of scale factor (red dots) overlaid with the scaling of scale factor equation.]

 

Scaling of load

The map-view of peak histogram points of utilization (figure 6) is plotted with red dots in figure 9. On the same map view, the scaling of load equation is plotted for different possible target frequencies based on one initial seed value for utilization and scale factor.

[Figure 9: Peak histogram points of utilization (red dots) overlaid with the scaling of load equation.]

As summarized by the results shown in these figures, the equations' independent estimates fit the experiment's statistical results almost exactly. The same methodology was applied to multiple other workloads (with different scale factors and load levels) and verified in the same way.
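A rough sketch of such an overlay, assuming the ridge points produced earlier and any one measured seed point (f_seed, l_seed, s_seed), with matplotlib used for plotting:

import numpy as np
import matplotlib.pyplot as plt

def plot_load_ridge_vs_equation(ridge, f_seed, l_seed, s_seed):
    """ridge: {frequency: peak-histogram load}. Overlays the measured ridge (red dots)
    with the scaling of load equation seeded from a single measured point."""
    freqs = sorted(ridge)
    peaks = [ridge[f] for f in freqs]
    f = np.asarray(freqs, dtype=float)
    predicted = l_seed * (1.0 - s_seed + s_seed * f_seed / f)   # scaling of load equation
    plt.plot(freqs, peaks, "ro", label="histogram ridge (measured)")
    plt.plot(f, predicted, "b-", label="scaling of load equation")
    plt.xlabel("frequency")
    plt.ylabel("load (C0 percent)")
    plt.legend()
    plt.show()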

 

Conclusion

This article elaborated on the causalities of utilization and scale factor with respect to possible frequency-selection choices within a time window. From these, generic workload scaling equations were derived, quantifying the utilization impact. The equations were verified by applying the histogram ridge trace method at the level of a discrete DVFS block. Although only the CPU core example was detailed here, similar agreement with the equations was observed on other DVFS blocks (graphics sub-blocks), other workloads, and other operating systems (Windows and Linux). The equation-estimated curve correlates very accurately with the actual ridge trace curve. This implies that the utilization impact can be "predicted" in a statistically accurate way every cycle, and any divergence from the prediction can be attributed to workload-inherent utilization change in that cycle and treated as appropriate by the solution using it.

 

Acknowledgments

The author is thankful for all the help from colleagues at Intel - specifically, valuable input from principal engineers Harinarayanan Seshadri and Rajeev Muralidhar and software engineer B. M. Shravan K.

 

References

1. Howse, B. 2015. Examining Intel's new Speed Shift tech on Skylake: more responsive processors. AnandTech; examining-intel-skylake-speed-shift-more-responsive-processors.

2. Kidd, T. 2014. Power management states: P-states, C-states, and Package C-states. Intel Developer Zone; power-management-states-p-states-c-states-and-package-c-states.

3. Intel. Intel 64 and IA-32 Architectures, Software Developer's Manual, Volume 3B: System Programming Guide, Part 2; 64-ia-32-architectures-software-developer-vol-3b-part-2-manual.pdf.

4. Power Shaping and Stress Tool (PSST) for Intel Platforms. https://github.com/intel/psst

 

Related articles

Power-Efficient Software
Eric Saxe
Power-manageable hardware can help save energy, but what can software developers do to address the problem?
https://queue.acm.org/detail.cfm?id=1698225

Maximizing Power Efficiency with Asymmetric Multicore Systems
Alexandra Fedorova, et al.
Asymmetric multicore systems promise to use a lot less energy than conventional symmetric processors. How can we develop software that makes the most out of this potential?
https://queue.acm.org/detail.cfm?id=1658422

Energy Management on Handheld Devices
Marc A. Viredaz, et al.
Whatever their origin, all handheld devices share the same Achilles' heel: the battery.
https://queue.acm.org/detail.cfm?id=957768

 

Noor Mubeen is a software architect at Intel client R&D group, focusing on power, thermal, and energy management breakthroughs. In the past, he has architected thermal management on Intel mobile devices. He has over 17 years' experience spanning a breadth of domains including networking, storage file systems, and embedded devices. He is an open-source GNU/Linux enthusiast.

Copyright © 2018 held by owner/author. Publication rights licensed to ACM.


Originally published in Queue vol. 16, no. 2









