The Morning Paper

Time Protection in Operating Systems and Speaker Legitimacy Detection

Operating system-based protection from timing-based side-channel attacks; implications of voice-imitation software

Adrian Colyer

Timing-based side-channel attacks are a particularly tricky class of attacks to deal with because the very thing you're often striving for—improved performance—can give you away. There are always more creative new instances of attacks to be found, so you need a principled way of thinking about defenses that address the class, not just a particular instantiation. That's what Ge et al. give us in "Time Protection, the Missing OS Abstraction." Just as operating systems prevent spatial inference through memory protection, so future operating systems will need to prevent temporal inference through time protection. It's going to be a long road to get there.

The second paper chosen for this edition comes from NDSS'19 (Network and Distributed System Security Symposium) and studies the physiological and social implications of the ever-improving abilities of voice-imitation software. It seems people may be especially vulnerable to being fooled by fake voices. "The crux of voice (in)security: a brain study of speaker legitimacy detection," by Neupane et al., is a fascinating study with implications far beyond just the technology.

Time protection: the missing OS abstraction

Ge et al., EuroSys'19 (European Conference on Computer Systems)

https://ts.data61.csiro.au/publications/csiro_full_text//Ge_YCH_19.pdf

 

Ever since the prominent emergence of timing-based microarchitectural attacks (e.g., Spectre, Meltdown, and friends), I've been wondering what to do about them. When a side channel is based on observing improved performance, a solution that removes the improved performance can work but is clearly undesirable. In today's paper choice, for which the authors won a best paper award at EuroSys'19 in March, Qian Ge et al. set out a principled basis for protecting against this class of attacks. Just as today's systems offer memory protection, they call this time protection. The paper sets out what can be done in software given today's hardware, and along the way also highlights areas where cooperation from hardware will be needed in the future.

 

Timing channels, and in particular microarchitectural channels, which exploit timing variations due to shared use of caches and other hardware, remain a fundamental OS security challenge that has eluded a comprehensive solution to date... We argue that it is time to take temporal isolation seriously, and make the OS responsible for time protection, the prevention of temporal inference, just as memory protection prevents spatial inference.

 

If padding all the things to make execution consistently as slow as the slowest path isn't a desirable solution, then the other avenue left to explore is the elimination of the sharing of hardware resources that are the underlying cause of timing channels.

 

Microarchitectural channels

Microarchitectural timing channels result from competition for hardware resources that are functionally transparent to software... the [ISA] abstraction leaks, as it affects observable execution speed, leading to timing channels.

 

Microarchitectural state of interest includes data and instruction caches, TLBs (translation lookaside buffers), branch predictors, instruction- and data-prefetcher state machines, and DRAM (dynamic RAM) row buffers. There are also stateless interconnects, including buses and on-chip networks.

A covert cache-based channel (for example) can be built by the sender modulating its footprint in the cache through its execution, and the receiver probing this footprint by systematically touching cache lines and measuring memory latency (i.e., by observing its own execution speed). (Side channels are similar, except that the sender does not actively cooperate.)
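To make the mechanism concrete, here is a minimal sketch of the receiver side of such a prime-and-probe channel. It is illustrative only, not code from the paper: it assumes an x86 machine with the rdtscp instruction, the buffer size and timing threshold are made up, and it ignores the eviction-set construction a real last-level-cache attack would need.

```c
/* Illustrative prime+probe receiver (not from the paper). Assumes x86
 * and GCC/Clang; the threshold and buffer size are machine-specific
 * guesses. A real LLC attack would also need proper eviction sets. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>            /* __rdtscp */

#define LINE  64                          /* cache-line size (bytes)  */
#define SPAN  (256 * 1024)                /* region to prime/probe    */

static uint8_t probe_buf[SPAN] __attribute__((aligned(4096)));

/* Time one load: a slow load suggests the line was evicted, i.e. the
 * sender touched the corresponding cache set in the meantime. */
static inline uint64_t time_load(volatile uint8_t *p)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;
    uint64_t t1 = __rdtscp(&aux);
    return t1 - t0;
}

int main(void)
{
    /* Prime: pull every line of the buffer into the cache. */
    for (size_t i = 0; i < SPAN; i += LINE)
        probe_buf[i] = 1;

    /* ... the sender runs here, evicting some of these lines ... */

    /* Probe: lines that reload slowly reveal the sender's footprint. */
    for (size_t i = 0; i < SPAN; i += LINE) {
        uint64_t dt = time_load(&probe_buf[i]);
        if (dt > 150)                     /* guessed threshold (cycles) */
            printf("line %zu looks evicted (%llu cycles)\n",
                   i / LINE, (unsigned long long)dt);
    }
    return 0;
}
```

The sender's half is symmetric: it encodes bits by choosing which cache sets to touch, and the receiver decodes them from the pattern of slow reloads.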

A covert channel can be built over a stateless interconnect in a similar manner by the sender encoding information in its bandwidth consumption, and the receiver sensing the available bandwidth.

 

Threat scenarios

Hardware support is not available to prevent interconnects being used as covert communication channels, but security can still be improved in many use cases. The paper focuses on two key use cases:

• A confined component running in its own security domain, connected to the rest of the system by explicit (e.g., IPC [inter-process communication]) input and output channels. "To avoid the interconnect channel, we have to assume the system either runs on a single core (at least while the sensitive code is executing), or co-schedules domains across the core, such that at any time only one domain executes."

• Preventing side-channel attacks between VMs (virtual machines) hosted on public-cloud infrastructure. Hyperthreading either must be disabled, or all hyperthreads of a core must belong to the same VM.

These two threats can be mitigated by the introduction of time protection at the operating-system level:

Time protection: a collection of OS mechanisms which jointly prevent interference between security domains that would make execution speed in one domain dependent on the activities of another.

 

The five requirements of time protection

Enforcement of a system's security policy must not depend on correct application behaviour. Hence time protection, like memory protection, must be a mandatory (black-box) OS security enforcement mechanism. In particular, only mandatory enforcement can support confinement.

 

Time protection is based on preventing resource sharing. There are two strategies for this: some classes of resource (e.g., cache) can be partitioned across domains; those that are instead time-multiplexed have to be flushed during domain switches. Assuming that a core is not pinned to a single domain, we have the first requirement:

 

Requirement 1: When time-sharing a core, the OS must flush on-core microarchitectural state on domain switch, unless the hardware supports partitioning such state.

 

Spatial partitioning of physical memory frames can be achieved using page coloring. This ensures that a particular page can only ever be resident in a specific section of the cache, referred to as the color of the page. Typically, the LLC (last-level cache) and L2 caches can be colored this way, but the smaller L1 caches and other on-core state, such as the TLB and branch predictor, cannot. So, these on-core resources must be flushed on a domain switch.
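The arithmetic behind coloring is simple: the physical-address bits that select a cache set overlap the frame number, so frames that agree on those bits compete for the same sets. A small sketch follows, with made-up cache geometry rather than the platforms from the paper.

```c
/* Illustrative page-coloring arithmetic; the cache geometry below is an
 * example, not taken from the paper. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE   4096u
#define LLC_SIZE    (8u * 1024 * 1024)   /* assume an 8 MiB LLC */
#define LLC_WAYS    16u

/* Number of distinct colors = (cache size / associativity) / page size. */
#define WAY_BYTES   (LLC_SIZE / LLC_WAYS)
#define NUM_COLORS  (WAY_BYTES / PAGE_SIZE)

/* The color of a physical frame is given by the set-index bits that lie
 * above the page offset. */
static unsigned page_color(uint64_t paddr)
{
    return (unsigned)((paddr / PAGE_SIZE) % NUM_COLORS);
}

int main(void)
{
    printf("colors available: %u\n", NUM_COLORS);
    /* Two frames compete for the same LLC sets only if they share a
     * color; an OS gives each security domain a disjoint set of colors. */
    printf("frame 0x12345000 -> color %u\n", page_color(0x12345000ull));
    printf("frame 0x12365000 -> color %u\n", page_color(0x12365000ull));
    return 0;
}
```

By handing each security domain frames of disjoint colors, the OS guarantees that domains can never evict each other's lines from the colored caches.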

The code and data of the kernel itself can also be used as a timing channel. To protect against this:

 

Requirement 2: Each domain must have its own private copy of OS text, stack and (as much as possible) global data.

 

All dynamically allocated kernel memory is provided by userland, and hence will be colored. This leaves a small amount of global kernel data uncolored...

 

Requirement 3: Access to any remaining OS shared data must be sufficiently deterministic to avoid information leakage.

 

Even when we do flush caches, the latency of the flush can itself be used as a channel: since a flush forces a write-back of all dirty lines, its duration depends on the prior domain's activity.

 

Requirement 4: State flushing must be padded to its worst-case latency.
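A minimal userspace illustration of what requirement 4 means in practice: the flush is wrapped so that its observable cost is always the (assumed) worst case, never the data-dependent actual cost. The function names and the worst-case budget are invented for the example; this is not seL4 code.

```c
/* Illustrative "pad to worst-case latency" wrapper (names and the
 * worst-case budget are invented; this is not seL4 code). */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>            /* __rdtscp */

#define WORST_CASE_CYCLES  2000000ull     /* assumed worst-case cost */

/* Stand-in for the real state flush, whose cost varies with how much
 * dirty state the previous domain left behind. */
static void flush_state(void)
{
    volatile uint64_t sink = 0;
    for (int i = 0; i < 100000; i++)
        sink += (uint64_t)i;
}

static void padded_flush(void)
{
    unsigned aux;
    uint64_t start = __rdtscp(&aux);

    flush_state();

    /* Spin out the remainder of the worst-case budget, so the latency
     * the next domain observes is independent of the actual flush cost. */
    while (__rdtscp(&aux) - start < WORST_CASE_CYCLES)
        ;
}

int main(void)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    padded_flush();
    uint64_t t1 = __rdtscp(&aux);
    printf("observed cost: %llu cycles (~ worst case, by construction)\n",
           (unsigned long long)(t1 - t0));
    return 0;
}
```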

 

Finally, since interrupts can also be used for a covert channel:

 

Requirement 5: When sharing a core, the OS must disable or partition any interrupts other than the preemption timer.

 

Implementation in seL4

The authors demonstrate how to satisfy these five requirements in an adapted version of seL4 (https://sel4.systems/). Each domain is given its own copy of the kernel, using a kernel clone mechanism that creates a copy of the kernel image in user-supplied memory, including a stack and replicas of almost all kernel data. Two kernels share only the minimum static data required for handing over the processor. The Kernel_SetInt system call allows IRQs (interrupt requests) to be associated with a kernel, such that kernels cannot trigger interrupts across partition boundaries (see §4.2).

Domain switches happen implicitly on a preemption interrupt. When this happens, the stack needs to be switched, and then all on-core microarchitectural state is flushed. The kernel defers returning until a configured time has elapsed (requirement 4). Kernel cloning ensures that kernels share very little data. For what remains, requirement 3 is satisfied by carefully prefetching all shared data (touching each cache line) before returning to userland. All interrupts are masked before switching the kernel stack, and after the switch only those associated with the new kernel are unmasked.
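Pulling those steps together, a hypothetical outline of the switch path might look as follows. This is a compilable sketch with empty placeholder functions and a fake cycle counter, not seL4's actual implementation; none of the names are real seL4 APIs.

```c
/* Hypothetical outline of a time-protected domain switch. All
 * hardware-touching steps are empty placeholders so the sketch
 * compiles; the names are not seL4 APIs. */
#include <stdint.h>
#include <stdio.h>

#define WCET_SWITCH 100000u       /* assumed worst-case switch cost */

static uint64_t ticks;                         /* fake cycle counter */
static uint64_t read_cycles(void) { return ticks++; }

static void mask_all_interrupts(void)  { /* requirement 5 */ }
static void switch_kernel_stack(int d) { (void)d; /* per-domain kernel */ }
static void flush_on_core_state(void)  { /* req. 1: L1, TLB, BP, ... */ }
static void prefetch_shared_data(void) { /* req. 3: touch shared lines */ }
static void unmask_domain_irqs(int d)  { (void)d; /* this kernel's IRQs */ }

static void switch_domain(int next)
{
    uint64_t start = read_cycles();

    mask_all_interrupts();         /* no cross-partition interrupts   */
    switch_kernel_stack(next);     /* each domain has its own kernel  */
    flush_on_core_state();         /* scrub time-multiplexed state    */
    prefetch_shared_data();        /* deterministic access to what    */
                                   /* remains shared                  */

    /* Requirement 4: pad the whole switch to its worst-case latency. */
    while (read_cycles() - start < WCET_SWITCH)
        ;

    unmask_domain_irqs(next);
    printf("switched to domain %d after %llu ticks\n",
           next, (unsigned long long)(read_cycles() - start));
}

int main(void) { switch_domain(1); return 0; }
```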

 

Evaluation

The evaluation addresses two main questions: Do the time-protection mechanisms outlined here actually protect against covert and side channels as intended? How much performance overhead do they add?

 

Preventing information leaks

Information leakage is quantified using mutual information as the measure of the size of a channel. Experiments are conducted on both x86 and Arm v7. (Note that in the Arm v8 architecture, cores contain microarchitectural state that cannot be scrubbed by architected means and thus contain unclosable high-bandwidth channels.)
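For reference, mutual information is the standard information-theoretic measure of how much the receiver's observations reveal about the sender's behavior; the definition below is general, not specific to the paper:

```latex
I(X;Y) \;=\; \sum_{x}\sum_{y} p(x,y)\,\log_2\!\frac{p(x,y)}{p(x)\,p(y)} \quad \text{bits}
```

A channel with (near-)zero mutual information between the sender's input X and the receiver's timing observations Y leaks essentially nothing.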

Compare the top and bottom plots on the following figure. The top graph shows mutual information through an LLC covert channel without protection, and the bottom plot shows the mutual information with the time-protection enhancements in place.

[Figure: mutual information through an LLC covert channel, without protection (top) and with time protection (bottom)]

Without protection, the kernel channel can transmit 395 bits per second. With protection the channel disappears.

The following table shows the mutual information capacity of raw (unprotected) caches, the results of a full flush, and the results with time protection enabled.

[Table: mutual information capacity of unprotected caches, a full cache flush, and time protection]

The residual L2 channel on Haswell is closed by a full flush but not by the time-protection mechanisms. Disabling the data prefetcher substantially reduces the channel; the remaining small channel "likely results from the instruction prefetcher, which cannot be disabled."

 

Performance overhead

Across a set of IPC microbenchmarks, the overhead of time protection is remarkably small on x86 and within 15 percent on Arm.

[Table: IPC microbenchmark overheads of time protection on x86 and Arm]

The Arm cost is attributed to kernel clone operations; with the four-way associativity of Arm v8 cores the expectation is that the overhead will be significantly reduced.

The following table further shows the impact on domain switching:

[Table: domain-switch costs with a full flush vs. time protection]

...the results show that our implementation of time protection imposes significantly less overhead than the full flush, despite being as effective in removing timing channels...

 

The overall cost of cloning is a fraction of the cost of creating a process.

 

What next?

Time protection is obviously at the mercy of hardware, and not all hardware provides sufficient support for full temporal isolation. We have seen this with the x86 L2 channel in Table 3, which we could not close... The results reinforce the need for a new, security-oriented hardware-software contract...:
• the OS must be able to partition or flush any shared hardware resource
• concurrently accessed resources must be partitioned
• virtually addressed state must be flushed

 

The most obvious weakness of current hardware in this regard is in the interconnects.

The ultimate aim of the authors is to produce a verified seL4 with time protection.

 

The crux of voice (in)security: a brain study of speaker legitimacy detection

Neupane et al., NDSS'19 (Network and Distributed System Security Symposium)

https://www.ndss-symposium.org/wp-content/uploads/2019/02/ndss2019_08-3_Neupane_paper.pdf

 

The key results of this paper are easy to understand, but the implications are going to take a long time to unravel. Speech morphing (voice morphing) is the process of transforming a speaker's voice to sound like a given impersonation target. This capability is now available off the shelf—this paper uses the Carnegie Mellon University Festvox voice converter (http://festvox.org/)—and is getting better all the time. The ability to impersonate someone's voice takes threats such as social-engineering attacks to a whole new level...

 

...voice imitation is an emerging class of threats, especially given the advancement in speech synthesis technology seen in a variety of contexts that can harm a victim's reputation and her security/safety. For instance, the attacker could publish the morphed voice samples on social media, impersonate the victim in phone conversations, leave fake voice messages to the victim's contacts, and even launch man-in-the-middle attacks against end-to-end encryption technologies that require users to verify the voices of the callers, to name a few instances of such attacks.

 

So, voice should sit alongside images and video as a source that can't be trusted in this new post-reality world. It seems especially powerful when combined with another medium (e.g., voice impersonation paired with video). But we already knew that. It turns out, though, that voice may be a particularly devastating attack vector because deep down, inside our brains, we genuinely can't tell the difference between a real voice and a morphed voice impersonating it.

Previous studies have investigated brain activity when users are looking at real and fake websites and images, and found that "subconscious neural differences exist when users are subject to real vs. fake artifacts, even though users themselves may not be able to tell the two apart behaviorally." The work reported by Neupane et al. began with the hypothesis that there should be similar neural differences between listening to the original and fake voices of a speaker. Despite the researchers' best efforts, though, no such difference could be detected.

 

Our key insight is that there may be no statistically significant differences in the way the human brain processes the legitimate speakers vs synthesized speakers, whereas clear differences are visible when encountering legitimate vs different other human speakers... Overall, our work ... reveals users' susceptibility to voice synthesis attacks at a biological level.

 

However much people are taught to be on the lookout for voice impersonation, it seems they're not going to be able to detect it reliably, especially as voice-synthesis techniques continue to improve. To defend against voice-impersonation attacks, we're going to need machine assistance. But if voice synthesis is trained in an adversarial style, will even that be possible?

If there's a silver lining to be found here, then perhaps it's this: Since people can't tell the difference between a real and synthetic voice, the current morphing technology may be ready to serve those who have actually lost their voices.

 

Analyzing neural activity using fNIRS

Neural activity is studied using fNIRS (functional near-infrared spectroscopy). fNIRS is a noninvasive imaging method (using lots of external probes, as shown in the image) that measures the relative concentration of oxyHb (oxygenated hemoglobin) and deoxyHb (deoxygenated hemoglobin) in the brain cortex. It provides better temporal resolution than fMRI—without requiring the subject to lie supine in a scanner—and better spatial resolution than EEG.

[Figure: fNIRS probe cap used to capture brain activity]

 

Experiment setup

Voice samples were collected from the Internet for Oprah Winfrey and Morgan Freeman, both of whom have distinctive and easily recognizable voices. Then 20 American speakers were recruited via Amazon Mechanical Turk to record the speech of these two celebrities in their own voices, imitating as best they could the original speaking style, pace, and emotion. One male and one female speaker from among these 20 were then selected, and their speech was fed through the Festvox voice converter to generate morphed versions of the voices of Freeman and Winfrey.

Then 20 experiment participants were recruited: 10 male and 10 female; all English speaking; and all in the age range 19-36. Each participant was familiarized with the voice of a victim speaker for one minute, and then played 12 randomly selected speech samples: four in the original speaker's voice, four in a morphed voice designed to impersonate the original speaker, and four in a different speaker's voice. The participants were asked to identify the legitimate and fake voices of the victim speakers but were not explicitly told about voice-morphing technology.

The experiment was conducted four times for each participant: twice with one of the celebrity voices as the victim, and twice with a voice with which they were only "briefly familiar" (i.e., encountered for the first time during the experiment).


While all this was going on, the fNIRS probe-cap captured brain activity in the following regions of the brain:

[Figure: brain regions monitored by the fNIRS probe cap]

 

Key results

With the original voices, participants were correctly able to identify the speaker as the victim voice 82 percent of the time. The morphed voices were (incorrectly) identified as the authentic victim's voice 58 percent of the time. As a baseline, the different voice impersonating the original speaker was (incorrectly) identified as the authentic victim's voice 33 percent of the time.

When comparing the neural activation in the brain of the original speaker speech samples and the morphed voice impersonating that speaker, no statistically significant differences were observed. At this point you might be thinking that perhaps the experiment just didn't monitor the part of the brain where the difference shows (I certainly was). That's still a possibility, but comparing the neural activation between an original speaker speech sample and a different voice impersonating the speaker did show statistically significant differences (i.e., the probes were at least measuring brain activity in some of the areas that matter).

[Figure: neural activation for original vs. morphed vs. different speaker voices]

Differences were also observed in brain activation between familiar voices (the celebrities) and the "briefly familiar" voices (those they were exposed to only as part of the experiment). Famous speakers led to higher activation in the frontopolar area and middle temporal gyrus.

Having failed to detect any statistically significant difference in brain activity for real and morphed voices using traditional statistical techniques, the authors then tried training a machine-learning classifier on the brain activation data to see if it could learn to distinguish. The best-performing model achieved only 53 percent accuracy (i.e., there genuinely seems to be nothing in the captured brain activity that can tell the difference between the two cases).

A final thought:

 

Since this potential indistinguishability of real vs. morphed lies at the core of human biology, we posit that the problem is very severe, as the human detection of synthesized attacks may not improve over time with evolution. Further, in our study, we use an off-the-shelf, academic voice morphing tool based on voice conversion, CMU Festvox, whereas with the advancement in the voice synthesizing technologies (e.g., newer voice modeling techniques such as those offered by Lyrebird and Google WaveNet), it might become even more difficult for users to identify such attacks. Also, our study participants are mostly young individuals and with no reported hearing disabilities, while older population samples and/or those having hearing disabilities may be more prone to voice synthesis attacks.

 

Bear in mind that participants in the study, while not primed on voice-morphing technology, were explicitly asked to try to distinguish between real and impersonated voices. In a real-world attack people are unlikely to be so attuned to the possibility of impersonation.

Adrian Colyer is a venture partner with Accel in London, where it's his job to help find and build great technology companies across Europe and Israel. (If you're working on an interesting technology-related business he would love to hear from you: you can reach him at [email protected].) Prior to joining Accel, he spent more than 20 years in technical roles, including CTO at Pivotal, VMware, and SpringSource.

Copyright © 2019 held by owner/author. Publication rights licensed to ACM.

Reprinted with permission from https://blog.acolyer.org


Originally published in Queue vol. 17, no. 3