The Morning Paper


GAN Dissection and Datacenter RPCs

Visualizing and understanding generative adversarial networks; datacenter RPCs can be general and fast.

Adrian Colyer

For this edition of "The Morning Paper," I've chosen two papers from very different areas.

Image generation using GANs (generative adversarial networks) has made astonishing progress over the past few years. While staring in wonder at some of the incredible images, it's natural to ask how such feats are possible. "GAN Dissection: Visualizing and Understanding Generative Adversarial Networks" gives us a look under the hood to see what kinds of things are being learned by GAN units, and how manipulating those units can affect the generated images.

February saw the 16th edition of the Usenix Symposium on Networked Systems Design and Implementation. Kalia et al. blew me away with their work on fast RPCs (remote procedure calls) in the datacenter. Through a carefully considered design, they show that RPC performance with commodity CPUs and standard lossy Ethernet can be competitive with specialized systems based on FPGAs (field-programmable gate arrays), programmable switches, and RDMA (remote direct memory access). It's a fabulous reminder to ensure we're making the most of what we already have before leaping to more expensive solutions.

- Adrian Colyer, The Morning Paper

GAN dissection: visualizing and understanding generative adversarial networks

GAN dissection: visualizing and understanding generative adversarial networks, Bau et al., arXiv'18

 

Today's paper choice gives us a fascinating look at what happens inside a GAN. In addition to the paper, the code is available on GitHub, and video demonstrations can be found on the project home page.

We're interested in GANs that generate images.

 

To a human observer, a well-trained GAN appears to have learned facts about the objects in the image: for example, a door can appear on a building but not on a tree. We wish to understand how a GAN represents such a structure. Do the objects emerge as pure pixel patterns without any explicit representation of objects such as doors and trees, or does the GAN contain internal variables that correspond to the objects that humans perceive? If the GAN does contain variables for doors and trees, do those variables cause the generation of those objects, or do they merely correlate? How are relationships between objects represented?

 

The basis for the study is three variants of progressive GANs trained on LSUN scene datasets. To understand what's going on inside these GANs, the authors develop a technique involving a combination of dissection and intervention.

Given a trained segmentation model (i.e., a model that can map pixels in an image to one of a set of predefined object classes), we can dissect the intermediate layers of the GAN to identify the level of agreement between individual units and each object class. The segmentation model used in the paper was trained on the ADE20K scene dataset and can segment an input image into 336 object classes, 29 parts of large objects, and 25 materials.

Dissection can reveal units that correlate with the appearance of objects of certain classes, but is the relationship causal? Two different types of intervention help us to understand this better. First, we can ablate those units (switch them off) and see if the correlated objects disappear from an image in which they were previously present. Second, we can force the units on and see if the correlated objects appear in an image in which they were previously absent.
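Mechanically, the intervention is simple: pick a layer, overwrite the chosen units' activations, and run the rest of the generator forward. Here is a minimal numpy sketch of that step (function and parameter names are mine, not those of the released code):

```python
import numpy as np

def intervene(feature_map, unit_ids, mode="ablate", level=1.0):
    """Ablate (zero out) or insert (force on) a set of units in one
    layer's feature map; the layers above then render the result.

    feature_map: array of shape (units, height, width) for one image.
    unit_ids:    indices of the units tied to an object class.
    level:       activation value used when forcing units on.
    """
    fmap = feature_map.copy()
    if mode == "ablate":
        fmap[unit_ids] = 0.0        # switch the units off
    else:
        fmap[unit_ids] = level      # force the units on
    return fmap

layer4 = np.random.rand(512, 8, 8)           # stand-in for a real GAN layer
no_trees = intervene(layer4, [12, 73, 205])  # ablate hypothetical "tree" units
```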

Figure 1 in the paper provides an excellent overview. Here we can see (a) a set of generated images of churches and (b) the results of dissection identifying GAN units matching trees. When we ablate those units (c) the trees largely disappear, and when we deliberately activate them (d) trees reappear.


The same insights can be used for human-guided model improvements. Figure 1 also shows generated images with artifacts (f). If we identify the GAN units that cause those artifacts (e) and ablate them, we can remove the unwanted artifacts from the generated images (g).


 

Characterizing units by dissection

For dissection we take an upsampled and thresholded feature map of a unit and compare it to the segmentation map of a given object class.


The extent of agreement is captured using an IoU (intersection-over-union) measure: we take the intersection of the thresholded activation region with the pixels labeled as belonging to the class, and divide it by their union. The closer the score is to one, the more tightly the unit's activations track that object class.
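A minimal sketch of the per-unit, per-class score (the paper additionally aggregates over many images and chooses each unit's threshold automatically; names here are illustrative):

```python
import numpy as np

def unit_class_iou(unit_fmap, class_mask, threshold):
    """IoU between a unit's upsampled feature map and a class mask.

    unit_fmap:  float array (H, W), upsampled to output resolution.
    class_mask: bool array (H, W), True where the segmenter sees the class.
    """
    unit_on = unit_fmap > threshold                       # thresholded region
    intersection = np.logical_and(unit_on, class_mask).sum()
    union = np.logical_or(unit_on, class_mask).sum()
    return intersection / union if union else 0.0
```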

The paper shows example units with high IoU scores for the classes table and sofa.

 

Finding causal relationships through intervention

We can say that a given hidden unit causes the generation of object(s) of a given class if ablating that unit causes the object to disappear and activating it causes the object to appear. Averaging effects over all locations and images provides the ACE (average causal effect) of a unit on the generation of a given class.
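Written out (roughly, and in my paraphrase of the paper's notation), the ACE of a unit u on a class c contrasts generation with the unit forced on against generation with the unit ablated:

\[
\delta_{u \to c} \;=\; \mathbb{E}\big[\, f(c \mid x_{\text{ins}}) \,\big] \;-\; \mathbb{E}\big[\, f(c \mid x_{\text{abl}}) \,\big]
\]

where \(x_{\text{ins}}\) and \(x_{\text{abl}}\) are images generated with u inserted and ablated at the sampled locations, and \(f(c \mid x)\) measures how much of class c the segmenter finds in x.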

 

While these measures can be applied to a single unit, we have found that objects tend to depend on more than one unit. Thus we need to identify a set of units U that maximize the average causal effect for an object class c.

 

This set is found by optimizing an objective that looks for a maximum class difference between images with partial ablation and images with partial insertion, using a parameter that controls the contribution of each unit.
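Schematically (again my paraphrase, simplifying the paper's formulation), the set is relaxed to per-unit coefficients \(\alpha \in [0,1]^d\) and optimized as

\[
\alpha^{\star} \;=\; \arg\max_{\alpha}\; \mathbb{E}\big[\, f(c \mid x^{\alpha}_{\text{ins}}) \,\big] \;-\; \mathbb{E}\big[\, f(c \mid x^{\alpha}_{\text{abl}}) \,\big] \;-\; \lambda \lVert \alpha \rVert ,
\]

where the regularizer encourages a small set; thresholding \(\alpha^{\star}\) then yields the causal units U.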


The paper also illustrates the effect of ablating increasingly large sets of hidden units, in this case units identified as being associated with the class tree.

 

Findings from GAN analysis

• Units emerge that correlate with instances of an object class, with diverse visual appearances. The units are learning abstractions.

• The set of all object classes matched by units of a GAN provides a map of what a GAN has learned about the data.

 

The units that emerge are object classes appropriate to the scene type: for example, when we examine a GAN trained on kitchen scenes, we find units that match stoves, cabinets, and the legs of tall kitchen stools. Another striking phenomenon is that many units represent parts of objects: for example, the conference room GAN contains separate units for the body and head of a person.

 

• The type of information represented changes from layer to layer. Early layers remain entangled; middle layers have many units matching semantic objects and object parts; and later layers have units matching pixel patterns such as materials, edges, and colors.

The paper includes an interesting layer-by-layer breakdown of a progressive GAN trained to generate LSUN living room images.

• Compared to a baseline progressive GAN, adding minibatch stddev statistics increases the realism of the outputs. The unit analysis shows that it also increases the diversity of the concepts represented by units.


• Turning off (ablating) units identified as associated with common object classes causes the corresponding objects to mostly disappear from the generated scenes. Not every object can be erased, though. Sometimes the object seems to be integral to the scene. For example, when generating conference rooms, the size and density of tables and chairs can be reduced but they cannot be eliminated entirely.


• By forcing units on, we can try to insert objects into scenes. For example, activating the same door units across a variety of scenes causes doors to appear—but the actual appearance of the door will vary in accordance with the surrounding scene.

 

We also observe that doors cannot be added in most locations. The locations where a door can be added are highlighted by a yellow box... it is not possible to trigger a door in the sky or on trees. Interventions provide insight into how a GAN enforces relationships between objects. Even if we try to add a door in layer 4, that choice can be vetoed later if the object is not appropriate for the context.

 


 

By carefully examining representation units, we have found many parts of GAN representations can be interpreted, not only as signals that correlate with object concepts but as variables that have a causal effect on the synthesis of objects in the output. These interpretable effects can be used to compare, debug, modify, and reason about a GAN model.

 

There remain open questions for future work. For example, why can a door not be inserted in the sky? How does the GAN suppress the signal in the later layers? Understanding the relationships between the layers of a GAN is the next hurdle....

Read this post at Adrian's blog: https://blog.acolyer.org/2019/02/27/gan-dissection-visualizing-and-understanding-generative-adversarial-networks/.

Datacenter RPCs can be general and fast

Datacenter RPCs can be general and fast, Kalia et al., NSDI'19

 

We've seen a lot of exciting work exploiting combinations of RDMA, FPGAs, and programmable network switches in the quest for high-performance distributed systems. I'm as guilty as anyone in getting excited about all of that. The wonderful thing about today's paper, for which Kalia et al. won a best paper award at NSDI this year, is that it shows in many cases we don't actually need to take on that extra complexity. Or to put it another way, it seriously raises the bar for when we should.

 

eRPC (efficient RPC) is a new general-purpose remote procedure call (RPC) library that offers performance comparable to specialized systems, while running on commodity CPUs in traditional datacenter networks based on either lossy Ethernet or lossless fabrics... We port a production-grade implementation of Raft state machine replication to eRPC without modifying the core Raft source code. We achieve 5.5 μs of replication latency on lossy Ethernet, which is faster than or comparable to specialized replication systems that use programmable switches, FPGAs, or RDMA.

 

What eRPC needs is just good old UDP (User Datagram Protocol). Lossy Ethernet is just fine (no need for fancy lossless networks), and it doesn't need PFC (priority flow control). The received wisdom is that you can either have general-purpose networking that works everywhere and is nonintrusive to applications but has capped performance, or you have to drop down to low-level interfaces and do a lot of your own heavy lifting to obtain really high performance.

 

The goal of our work is to answer the question: can a general-purpose RPC library provide performance comparable to specialized systems?

 

Astonishingly, yes.

From the evaluation using two lossy Ethernet clusters (designed to mimic the setups used in Microsoft and Facebook datacenters):

• 2.3 μs median RPC latency.

• Up to 10 million RPCs/second on a single core.

• Large message transfer at up to 75 Gbps on a single core.

• Peak performance maintained even with 20,000 connections per node (2 million clusterwide).


 

eRPC's median latency on CX5 is only 2.3 μs, showing that latency with commodity Ethernet NICs and software networking is much lower than the widely-believed value of 10-100 μs.

 

(CURP [Consistent Unordered Replication Protocol] over eRPC in a modern datacenter would be a pretty spectacular combination!)

So, the question that immediately comes to mind is how? As in, "What magic is this?"

 

The secret to high-performance general-purpose RPCs...

...is a carefully considered design that optimizes for the common case and avoids triggering packet loss due to switch-buffer overflows for common traffic patterns.

That's it? Yep. You won't find any super low-level, fancy, new, exotic algorithm here. Your periodic reminder that thoughtful design is a high-leverage activity! You will, of course, find something pretty special in the way all the pieces come together.

So, what assumptions go into the common case?

• Small messages.

• Short-duration RPC handlers.

• Congestion-free networks.

 

Which is not to say that eRPC can't handle larger messages, long-running handlers, and congested networks. It just doesn't pay a contingency overhead price when they are absent.

Optimizations for the common case (which we'll look at next) boost performance by up to 66 percent in total. On this base, eRPC also enables zero-copy transmissions and a design that scales while retaining a constant NIC (network interface controller) memory footprint.

The core model is as follows. RPCs are asynchronous and execute at most once. Servers register request handler functions with unique request types, and clients include the request types when issuing requests. Clients receive a continuation callback on RPC completion. Messages are stored in opaque DMA (direct memory access)-capable buffers provided by eRPC, called msgbufs. Each RPC endpoint (one per end-user thread) has an RX and a TX queue for packet I/O, an event loop, and several sessions.
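To make the model concrete, here is a toy loopback rendering in Python (the real library is C++, and every name below is illustrative rather than eRPC's actual API):

```python
class LoopbackRpc:
    """One endpoint per thread: registered handlers keyed by request
    type, an event loop, and continuations fired on completion
    (asynchronous, at-most-once execution)."""

    def __init__(self):
        self.handlers = {}   # request type -> handler function
        self.rx_queue = []   # stands in for the endpoint's RX queue

    def register_handler(self, req_type, handler):
        self.handlers[req_type] = handler

    def enqueue_request(self, req_type, msgbuf, continuation):
        # msgbuf stands in for eRPC's DMA-capable, library-provided buffers.
        self.rx_queue.append((req_type, msgbuf, continuation))

    def run_event_loop_once(self):
        # Poll the RX queue: run the handler, then the client continuation.
        while self.rx_queue:
            req_type, msgbuf, cont = self.rx_queue.pop(0)
            cont(self.handlers[req_type](msgbuf))

rpc = LoopbackRpc()
rpc.register_handler(1, lambda req: req.upper())           # "server" side
rpc.enqueue_request(1, b"ping", lambda resp: print(resp))  # "client" side
rpc.run_event_loop_once()                                  # prints b'PING'
```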

 

The long and short of it

When request handlers are run directly in dispatch threads, you can avoid expensive interthread communication (which can add up to 400 nanoseconds to request latency). That's fine when request handlers are short in duration, but long handlers block other dispatch handling, increasing tail latency, and prevent rapid congestion feedback.

eRPC supports running handlers in dispatch threads for short-duration request types (up to a few hundred nanoseconds), and in worker threads for longer-running requests. Which mode to use is specified when the request handler is registered; this is the only additional user input needed in eRPC.
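The registration-time choice might look something like this (a hypothetical Python sketch; the real, C++ API differs):

```python
from concurrent.futures import ThreadPoolExecutor

DISPATCH, WORKER = "dispatch", "worker"   # handler modes chosen at registration
workers = ThreadPoolExecutor(max_workers=4)
handlers = {}                             # request type -> (mode, function)

def register_handler(req_type, mode, fn):
    handlers[req_type] = (mode, fn)

def dispatch(req_type, msgbuf, reply):
    mode, fn = handlers[req_type]
    if mode == DISPATCH:
        reply(fn(msgbuf))   # short handler: run inline, no inter-thread hop
    else:
        # Long handler: hand off so the dispatch loop stays responsive.
        workers.submit(lambda: reply(fn(msgbuf)))

register_handler(1, DISPATCH, lambda m: m)        # e.g., a fast key lookup
register_handler(2, WORKER, lambda m: sorted(m))  # e.g., a slow scan
```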

 

Scalable connection state

eRPC's choice of packet I/O over RDMA avoids the circular-buffer scalability bottleneck in RDMA (see §4.1.1 of the paper). By taking advantage of multipacket RX-queue (RQ) descriptors in modern NICs, eRPC can use constant space in the NIC instead of a footprint that grows with the number of connected sessions (see Appendix A of the paper).

Furthermore, eRPC replaces NIC-managed connection state with CPU-managed connection state.

 

This is an explicit design choice, based upon fundamental differences between the CPU and NIC architectures. NICs and CPUs will both cache recently used connection state. First, CPU cache misses are served from DRAM, whereas NIC cache misses are served from the CPU's memory subsystem over the slow PCIe bus, so the CPU's miss penalty is much lower. Second, CPUs have substantially larger caches than the ~2MB available on a modern NIC, so the cache-miss frequency is also lower.

 

Zero-copy transmission

Zero-copy packet I/O in eRPC provides performance comparable to lower-level interfaces such as RDMA and DPDK (Data Plane Development Kit). The msgbuf layout ensures that the data region is contiguous (so that applications can use it as an opaque buffer) even when the buffer contains data for multiple packets. The first packet's data and header are also contiguous so that the NIC can fetch small messages with one DMA read. Headers for remaining packets are at the end, to allow for the contiguous data region in the middle.
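A small sketch of the offset arithmetic that layout implies (HDR and MTU values here are illustrative, not eRPC's actual constants):

```python
HDR, MTU = 16, 1024   # per-packet header bytes; payload bytes per packet

def msgbuf_offsets(data_len):
    """First header and the whole data region sit contiguously up front
    (one DMA read suffices for single-packet messages); headers for
    packets 2..N go at the end, keeping the data region contiguous."""
    n_pkts = max(1, -(-data_len // MTU))   # ceiling division
    layout = {"hdr0": 0, "data": HDR}
    for i in range(1, n_pkts):
        layout[f"hdr{i}"] = HDR + data_len + (i - 1) * HDR
    return layout

print(msgbuf_offsets(3000))
# {'hdr0': 0, 'data': 16, 'hdr1': 3016, 'hdr2': 3032}
```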

eRPC must ensure that it doesn't mess with msgbufs after ownership is returned to the application, which it addresses fundamentally by retaining no reference to the buffer. Retransmissions can interfere with such a scheme, however. eRPC chooses to use "unsignaled" packet transmission, optimizing for the common case of no retransmission. The tradeoff is a more expensive process when retransmission does occur:

 

We flush the TX DMA queue after queuing a retransmitted packet, which blocks until all queued packets are DMA‑ed. This ensures the required invariant: when a response is processed, there are no references to the request in the DMA queue.

 

eRPC also provides zero-copy reception in the common case of single-packet requests handled by dispatch-mode request handlers, which boosts its message rate by up to 16 percent.

 

Sessions and flow control

Sessions support concurrent requests (eight by default) that can complete out of order with respect to each other. Sessions use an array of slots to track RPC metadata for outstanding requests, and slots have an MTU (maximum transmission unit)-size preallocated msgbuf for use by request handlers that issue short responses. Session credits implement packet-level flow control, and they also support end-to-end flow control to reduce switch queuing. Each session is given BDP (bandwidth-delay product)/MTU credits, which ensures that each session can achieve line rate.
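As a worked example with illustrative numbers of the order reported in the paper (a 25 Gbps fabric, a 6 μs RTT, and an MTU of roughly 1 KB):

\[
\text{BDP} \;=\; 25\ \text{Gbit/s} \times 6\ \mu\text{s} \;=\; 150{,}000\ \text{bits} \approx 19\ \text{KB},
\qquad
\text{credits} \;\approx\; \frac{19\ \text{KB}}{1\ \text{KB}} \;=\; 19,
\]

so roughly 19 packets in flight per session suffice to keep the link full.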

 

Client-driven wire protocol

 

We designed a wire protocol for eRPC that is optimized for small RPCs and accounts for per-session credit limits. For simplicity, we chose a simple client-driven protocol, meaning that each packet sent by the server is in response to a client packet.

 

Client-driven protocols have fewer moving parts, with only the client needing to maintain wire protocol state. Rate limiting becomes solely a client responsibility, too, freeing server CPU.

Single-packet RPCs (request and response require only a single packet) use the fewest packets possible. With multipacket responses and a client-driven protocol the server can't immediately send response packets after the first one, so the client sends an RFR (request-for-response) packet. In practice this added latency turned out to be less than 20 percent for responses with four or more packets.
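A toy rendering of that client-side pull loop (all names illustrative; the real protocol also interacts with credits and congestion control):

```python
from collections import namedtuple

Pkt = namedtuple("Pkt", "index total payload")

class FakeSession:
    """Stands in for a session whose server sends a packet only in
    response to a client packet (the client-driven property)."""

    def __init__(self, response_pkts):
        self.pkts = response_pkts
        self.next = 1

    def send_rfr(self):
        pass   # a real session transmits a request-for-response packet

    def recv(self):
        pkt = self.pkts[self.next]
        self.next += 1
        return pkt

def collect_response(session, first_pkt):
    # Single-packet responses take the fast path: no RFRs at all.
    payload = [first_pkt.payload]
    for _ in range(first_pkt.total - 1):
        session.send_rfr()                      # pull each further packet
        payload.append(session.recv().payload)
    return b"".join(payload)

pkts = [Pkt(i, 3, bytes([65 + i])) for i in range(3)]
print(collect_response(FakeSession(pkts), pkts[0]))   # b'ABC'
```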


 

Congestion control

eRPC can use either Timely or DCQCN (Data Center Quantized Congestion Notification) for congestion control. The evaluation uses Timely, as the cluster hardware could not support DCQCN. Three optimizations brought the overhead of congestion control down from around 20 percent to 9 percent:

• Bypassing Timely altogether when the RTT (round-trip time) of a received packet on an uncongested session is less than a low threshold value.

• Bypassing the rate limiter for uncongested sessions.

• Sampling timers once per RX or TX batch rather than once per packet for RTT measurement.

These optimizations work because datacenter networks are typically uncongested—for example, at one-minute timescales 99 percent of all Facebook datacenter links are uncongested, and for web and cache traffic on Google, 90 percent of ToR (top-of-rack) switch links (the most congested) are less than 10 percent utilized at 25 μs timescales.
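Sketched in code, the three bypasses might look like this (names and the threshold are illustrative; `timely_update` is a placeholder for the real Timely computation):

```python
from dataclasses import dataclass

RTT_THRESHOLD_US = 10.0   # the "low threshold" for congestion-free RTTs

@dataclass
class Session:
    rate: float            # current transmission rate (bits/s)
    line_rate: float
    uncongested: bool = True

def timely_update(rate, rtt_us):
    # Placeholder: the real Timely update uses RTT gradients.
    return rate * (0.8 if rtt_us > RTT_THRESHOLD_US else 1.05)

def on_rx_batch(session, rtts_us):
    # Optimization 3: the timestamps behind these RTTs are sampled once
    # per RX/TX batch rather than once per packet.
    for rtt_us in rtts_us:
        if session.uncongested and rtt_us < RTT_THRESHOLD_US:
            continue                   # Optimization 1: skip Timely entirely
        session.rate = min(session.line_rate,
                           timely_update(session.rate, rtt_us))
        session.uncongested = session.rate >= session.line_rate

def transmit(session, pkt, send_now, rate_limit):
    # Optimization 2: uncongested sessions bypass the rate limiter.
    (send_now if session.uncongested else rate_limit)(pkt)

s = Session(rate=25e9, line_rate=25e9)
on_rx_batch(s, [3.2, 4.1])             # low RTTs: Timely never runs
```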

 

Packet loss

eRPC keeps things simple by treating reordered packets as losses and dropping them (as do current RDMA NICs). When a client suspects a lost packet, it rolls back the request's wire protocol state using a go-back-N mechanism. It reclaims credits and retransmits from the rollback point.
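A sketch of that go-back-N recovery (illustrative names; `transmit` stands in for queuing a packet on the NIC):

```python
from dataclasses import dataclass

@dataclass
class Request:
    pkts: list      # payloads, one per packet
    acked: int      # packets the server has confirmed receiving
    sent: int       # packets handed to the NIC so far

@dataclass
class Session:
    credits: int    # per-session flow-control credits

def transmit(session, payload):
    print("tx", payload)
    session.credits -= 1

def go_back_n(session, req):
    """Roll back to the last server-confirmed packet, reclaim the
    credits held by the in-flight packets, and retransmit from there."""
    session.credits += req.sent - req.acked
    req.sent = req.acked
    while session.credits > 0 and req.sent < len(req.pkts):
        transmit(session, req.pkts[req.sent])
        req.sent += 1

s = Session(credits=0)
r = Request(pkts=["p0", "p1", "p2"], acked=1, sent=3)
go_back_n(s, r)   # retransmits p1 and p2
```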

 

Evaluation highlights

This write-up is in danger of getting too long again, so I'll keep this very brief. An ablation study in the paper shows the contribution of each of the optimizations.


 

We conclude that optimizing for the common case is both necessary and sufficient for high-performance RPCs.

 

The paper also reports latency as the number of threads increases: eRPC achieves high message rate, bandwidth, and scalability with low latency in a large cluster with lossy Ethernet.


For large RPCs, eRPC can achieve up to 75 Gbps with one core.


Section 7 discusses the integration of eRPC in an existing Raft library and in the Masstree key-value store. From the Raft section the authors conclude: "The main takeaway is that microsecond-scale consistent replication is achievable in commodity Ethernet datacenters with a general-purpose networking library."

 

eRPC's speed comes from prioritizing common-case performance, carefully combining a wide range of old and new optimizations, and the observation that switch buffer capacity far exceeds datacenter BDP. eRPC delivers performance that was until now believed possible only with lossless RDMA fabrics or specialized network hardware. It allows unmodified applications to perform close to the hardware limits.

 

Read this post at Adrian's blog: https://blog.acolyer.org/2019/03/18/datacenter-rpcs-can-be-general-and-fast/.

Adrian Colyer is a venture partner with Accel in London, where it's his job to help find and build great technology companies across Europe and Israel. (If you're working on an interesting technology-related business he would love to hear from you: you can reach him at [email protected].) Prior to joining Accel, he spent more than 20 years in technical roles, including CTO at Pivotal, VMware, and SpringSource.

Copyright © 2019 held by owner/author. Publication rights licensed to ACM.

Reprinted with permission from https://blog.acolyer.org


Originally published in Queue vol. 17, no. 2