Interviews


A Conversation with Kurt Akeley and Pat Hanrahan

Graphics veterans debate the evolution of the GPU

Interviewing either Kurt Akeley or Pat Hanrahan for this month’s special report on GPUs would have been a great opportunity, so needless to say we were delighted when both of these graphics-programming veterans agreed to participate.

Akeley was part of the founding Silicon Graphics team in 1982 and worked there for almost 20 years, during which he led the development of several high-end graphics systems, including GTX, VGX, and RealityEngine. He’s also known for his pioneering work on OpenGL, the industry-standard programming interface for high-performance graphics hardware. Akeley is now a principal researcher at Microsoft Research Silicon Valley, where he works on cutting-edge projects in graphics system architecture, high-performance computing, and display design.

Hanrahan also has years of experience in computer graphics, including industry positions at Pixar and DEC and academic posts at Princeton and Stanford, where he currently teaches and does research. While at Pixar, Hanrahan helped design the RenderMan Interface Specification, which has been integral to creating the sleek, photorealistic images seen in Pixar’s feature films. In 2004 Hanrahan received an Oscar for Technical Achievement in Computer Graphics for his role in modeling the ways light scatters below translucent surfaces such as skin.

Tom Duff, who works at Pixar, conducted our interview with Akeley and Hanrahan at Pixar’s studios in Emeryville, California. A graphics-computing veteran himself, Duff has spent more than 30 years writing software for feature films, from Star Trek II to Ratatouille. His current passion is building robots and their software for theme parks and other entertainment uses. Duff has a lengthy resume of publications and patents and has received two Academy Awards for his scientific and technical achievements.

TOM DUFF What are GPUs and how have they evolved over the years?

PAT HANRAHAN Graphics requires a lot of computation. People who want realtime graphics want as many cycles as they can get as cheaply as they can get them. GPUs were developed in response to that demand: a way to create a lot of inexpensive cycles that you can use for graphics.

KURT AKELEY It used to be that there was so little compute power available for the amount of money somebody could spend on a desktop that we really couldn’t do very high-quality graphics at all. What that meant was the problem had to be reduced to something simple enough that you could build hardware that was very specific and solved a very specific set of problems.

The graphics hardware pipeline, the abstraction that shows up in some of the articles in this issue of Queue, was an answer to that: an architecture for solving a very small subset of what’s required to do movie-quality graphics, but solving it efficiently enough that you could get it to work interactively 30 years ago.

What’s happened over those 30 years is that the Moore’s law increase in transistor counts, roughly 50 percent a year, and clock increases of maybe 20 percent a year over most of that period, have resulted in a fantastic amount of computing power. This means GPUs are actually solving a much more general problem now. There’s enough compute power that you don’t have to do just a little bit to a vertex and barely compute a pixel and stick it in a frame buffer—that was the whole story 30 years ago. Now you can do really advanced shading. Of course, that involves much more general operations.

So, where GPUs used to be quite specific and very distinct from CPUs, today we’re having this collision in terms of architecture; what a GPU is and what a CPU is are no longer disjoint sets.

What we’re talking about isn’t just whether we can use graphics processors to do general-purpose computing, but in the bigger sense, how will general-purpose computing be done? How will graphics processing and other technologies that have evolved influence the way computing is done in general? That’s a big issue that the world’s going to be working through for the next five or ten years.

PH One way to think about it is that GPU architects, because they had this huge problem to solve—that is, making pictures in realtime—had to come up with some very innovative techniques to develop multicore chips. In the process of innovating, they’ve actually created things that are more general than they might have thought. I mean, they knew they were general, but now people are starting to discover them, or at least the ideas behind them.

Several years ago the major CPU chip vendors weren’t interested in parallel computers. They would just say, “Clock rates will continue to increase; don’t worry about parallelism.” But then they decided a couple of years ago that they can’t keep making these things faster and faster without using parallelism. Now everybody realizes that converting your programs to run on multicores is a big thing, and you have to do it or you won’t get more performance.

You can view GPUs as a couple of steps ahead of the game. They were out there maybe a little bit further, and they still are out there further than the dual-core or quad-core CPUs. GPUs will have 16 or 32 cores, and they’re specialized for certain different classes of workloads that are more related to graphics.

It’s not that GPUs are on a weird, parallel track trying to solve only the graphics problem—they actually got ahead of the more-general computing game by innovating in computer architecture. That’s very interesting, and if you’re a programmer, it’s the main reason you should be aware of these techniques and what’s going on.

TD The interesting thing that’s been coming down the pike for the past several years is using these processors for computational purposes that don’t really have anything intrinsically to do with graphics. There were two competing directions driving all of this.

On the one side are the engineering workstations that SGI was building in the beginning that were running at very high speeds, basically just drawing lots of polygons with simple shading—a very circumscribed sort of thing. Pulling the other way is the trend toward using a very general model to describe shading.

Now, those things pull in opposite directions. The performance of old-school GPUs really depended on the fact that we knew exactly what the algorithm was. All of the control junk that was in a normal CPU was pretty much irrelevant.

KA It has been a smoother transition. People often say that programmability is a recent innovation in GPUs. Well, GPUs have been programmable for about the entire time that they’ve been built. With most SGI machines, if you opened one up and looked at what was actually in there—processing vertexes in particular, but for some machines, processing the fragments—it was a programmable engine. It’s just that it was not programmable by you; it was programmable by me. From an architecture standpoint, that’s a fairly subtle distinction. What we weren’t doing was selling application development. It’s a little like mobile phones now. In general, they’re not extensible except by a very small set of people, so they appear unprogrammable.

All along, those SGI machines had microcode engines that were programmable; we just weren’t exposing the programmability to the world. Frankly, part of the reason was that we didn’t have control of those components.

We went out to the market and said, “You know, the Intel 860 is the best floating-point-per-dollar solution this time, so we’ll put in one of those and build a microcode engine that runs it.”

Then the next time, we would go out and say, “Mmm, this TI 40-bit floating-point gizmo is the best one, so we’ll use that.” We couldn’t promise the same coding environment generation after generation, so we couldn’t reveal that it was programmable or else our customers would get very upset. We tried that. It actually does upset customers when you let them invest in coding and then sell them another machine that’s faster but doesn’t run their code. So for a variety of sort of tactical reasons, the programmability wasn’t exposed.

The story is more complicated now because there is less programmability in some areas. But the general notion that people woke up eight years ago and said, “Oh, it makes sense to put programmability in these things,” is definitely oversimplifying. This architectural trend has been smoother than that.

TD There was, in fact—and I was here for this—an awful lot of resistance from the big players in the GPU business to exposing that programmability.

KA I was part of that.

TD I like to think that the transition happened because of us in the movie-quality imaging business. We pressed hard for it and demonstrated that if you were going to make high-quality images, this was the way you were going to do it.

PH They always knew you were right; it’s just that it was too costly for them to consider. The market opportunity wasn’t there. But the games eventually started getting so sophisticated that there was no way of making them look better without exposing programmability to the [John] Carmacks and [Tim] Sweeneys of the world.

KA Games were a big enough market that you could afford to do it. That’s the part that’s less obvious. It cost a huge amount of engineering, and it took a lot of steps and a lot of years to build this into the marketplace, which is bigger than movies at this point. It costs a lot of money to engineer these things, so it wasn’t like you could just wake up one day and say we ought to do it. It took all these years to build up the capital expenditure capability that an Nvidia or an ATI has to actually do it.

If mistakes had been made along the way—big ones—it wouldn’t have happened. There are lots of examples of marketplaces with custom hardware that hasn’t evolved into the general-purpose space the way graphics has. I think a lot of that is market opportunity; it’s not pure technology. Those markets just wouldn’t support it.

TD If you look at the big computing machines in the world, you see that most of them are devoted to fluid dynamics and electrostatic simulation, for sort of obvious defense-related reasons.

PH And n-body calculations. To me, graphics is mostly about simulation. There are basic computational building blocks that go into simulation. To the extent that graphics uses a certain set of those in certain ways, a lot of other people use other sets of those in other ways. Once you start seeing the building blocks designed for simulation in a fairly general-purpose parallel way, you can say, “Yeah, it’s not just for graphics; it could be used for other things.” That’s what other people are starting to find out.

TD We’ve heard that GPU performance increases faster than Moore’s law. Is that just low-hanging fruit because of the primitive state of GPU architectures, or is this trend going to continue? Are those CPU and GPU curves going to merge?

KA Moore’s law, just to be clear, has to do with transistor count and is formulated, I think, as an economic law that the number of transistors on the most economically produced die size will go up exponentially—and it turned out around 50 percent a year has been the number. So 1.5 is the compound average growth. But remember, it isn’t a performance law; it is a transistor-count law.

TD Sure, but performance is related.

KA Performance is related to both transistor count and clock speed, and the clock speed mattered a lot. The clock speed has been going up around 20 percent a year.

PH One way to think of it is sort of as a rate-cubed effect. You get a square for the area, and if something shrinks in size by a half, you get four times as many of them, but the clock also goes up by roughly a factor of two.

KA It’s not purely linear, but if you go back and look at the compound rates, you could argue that if performance is the total number of transistor transitions per second—that’s a reasonable proxy—and call that capability, then that’s the compound of the Moore’s law transistor count and the clock-rate increase. Those two things together are the capability rate, and for quite a long time, until the past few years, it had been going up roughly 1.7 to 1.8 per year.

That’s how much faster you would expect an idealized thing to get year over year if the people doing it weren’t getting any smarter, if they weren’t learning anything. Indeed, GPUs have been getting faster by some metrics—not all, but by some—at a rate a little bit faster than that capability rate. So, we can say that their designers have been getting smarter.

CPUs are intrinsically sequential, which means they have a single thread of execution. The transistors didn’t go to more computation; they went to all kinds of cleverness to feed that one engine faster. It’s an interesting historical quirk that for a while the increase worked out close enough to a 1.5 compound growth rate that people started calling that Moore’s law, but it’s not the same thing.
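A back-of-the-envelope check of that capability rate, using only the growth figures quoted in this conversation (a sketch, not a measurement):

```python
# Growth figures as quoted above: ~50 percent more transistors per year
# (Moore's law) and ~20 percent faster clocks per year.
transistor_growth = 1.5
clock_growth = 1.2

# "Capability" proxy from the conversation: total transistor transitions
# per second, i.e. transistor count times clock rate.
capability_growth = transistor_growth * clock_growth
print(f"capability growth per year: {capability_growth:.2f}x")  # 1.80x

# Hanrahan's rate-cubed framing: halve the feature size and you get about
# 4x the transistors in the same area and roughly 2x the clock, so ~8x
# the raw capability per full process shrink.
print(f"per full process shrink: {4 * 2}x")
```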

PH This is really important for people to realize. You had this potential for CPUs to go, say, 75 percent faster every year, but they got only 50 percent faster. That means they were losing 25 percent a year relative to what they could have achieved, every year since the dawn of the microprocessor. Not only that, 25 percent of the 50 percent was for free because the clock got faster. So they had 50 percent more area or more transistors. They used only 25 percent of the capability of those extra transistors. Fewer than half of their extra transistors were turning into anything useful. That sounds like bad engineering to me.

When I used to consult at SGI, Kurt told me that if we turn only half of our new transistors into performance, we haven’t done our job as engineers. Our goal as engineers is to use our resources fully. Since the dawn of the microprocessor, however, we’ve been throwing away half our transistors. That’s just another way of saying how inefficient CPUs have become.

Now, when GPUs hit the market, they got a performance increase of about a factor of 20 over CPUs. One way of thinking about it is GPUs put us back on the Moore’s law curve—not the number-of-transistors one, but the real capability curve. CPUs have never been on that curve.
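Compounding those rates gives a feel for how a gap of that size opens up; a sketch using the figures quoted above, with the time span chosen only for illustration:

```python
# Rates as quoted in the conversation; the 20-year span is illustrative,
# not a historical claim.
capability_growth = 1.75  # what the silicon could have delivered per year
cpu_growth = 1.5          # what single-thread CPU performance delivered

years = 20
gap = (capability_growth / cpu_growth) ** years
print(f"gap after {years} years: {gap:.0f}x")  # ~22x, the same order as
                                               # the factor of 20 above
```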

KA And GPUs have arguably exceeded it, but when you look carefully at the numbers, the bandwidths aren’t going up at those rates. There’s more compression. Some trickery and clever engineering have made them get faster by a bit. Plus, the raw capability gives you this huge disparity between GPU peak performance on problems that are suited to them and what you can get on a CPU.

The interesting thing is that people in the CPU world are not sitting on their hands anymore. As soon as they made the decision to go parallel, the gloves came off. They’re going to stop squandering all those transistors on trying to make one thread go incrementally faster, and they’re going to start using them to make a bunch of threads go faster. This puts them potentially on the same curve as GPUs. One prediction you might make is that the disparity is going to stop growing as quickly as it has for the past 20 years or so.

TD When you get into this sort of architectural discussion, the first question that always has to come up is, where’s the bottleneck? Here’s where I see the problem right now: if I have a nice piece of silicon with 64 or 128 cores on it, and it’s only got a few hundred or a thousand pins, there’s still a serious communication problem off-chip. We don’t see much progress happening on that.

PH Right. Don’t fool yourself that this problem will be solved.

TD Really? When Seymour Cray was building the fastest computers in the world, it was precisely by addressing that problem, by making memory buses that were enormously wide paths to memory.

PH Let me tell you why my intuition is that the problem won’t go away. If you look at the cost of computing, it’s about communication. That’s where all the power goes. It’s hard and expensive to provide that bandwidth. Assuming the most expensive part is usually well engineered, you try to do the best job you can with the parts of the system that matter. People are working as hard as they can at making communication costs lower. The low-hanging fruit is to recast the problem from one involving communication into one that doesn’t involve your most expensive resource.

Our programming environments have to be more aware of communication. Imagine that every time you wrote an “equal sign,” you knew it was costing 1,000 times as much energy as a “multiply.”

Bill Dally [chair of the Stanford University computer science department] has this great number, just to put this in context. If you build a 32-bit floating-point unit, it takes a picojoule to do the floating-point operation. If you execute a 32-bit floating-point instruction on a processor, it takes a nanojoule, 1,000 times as much energy.

The actual computing part is essentially free, but sending the data to the floating-point unit, reading it back, putting it in the cache, and trying to put it onto the bus uses 1,000 times as much energy. You’re just fighting physics. Physics tells you communication is expensive, and your programming model has to revolve around the communication if it is going to be efficient. So, that problem is not going to go away—there’s just no way to defeat physics.
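Plugging Dally’s figures into a toy calculation shows the scale of that gap (the per-operation numbers are the ones quoted above; the operation count is arbitrary):

```python
FLOP_ENERGY_J = 1e-12        # ~1 picojoule: the floating-point math itself
INSTRUCTION_ENERGY_J = 1e-9  # ~1 nanojoule: the same op executed as a CPU
                             # instruction (fetch, decode, register/cache/bus traffic)

n_ops = 1_000_000_000        # a billion operations, say

print(f"arithmetic alone:    {n_ops * FLOP_ENERGY_J:.3f} J")         # 0.001 J
print(f"as CPU instructions: {n_ops * INSTRUCTION_ENERGY_J:.3f} J")  # 1.000 J
print(f"overhead factor:     {INSTRUCTION_ENERGY_J / FLOP_ENERGY_J:.0f}x")
```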

KA The way to minimize communication is by coherence, by having like things happen in like space and like time. SIMD (single instruction, multiple data) parallel processors are just a way of establishing execution coherence; putting in cache memory is a way to create locality, but it’s a very general way.

Again, the CPU people gave us a really pleasant abstraction. But in a C program, that equal sign might be a nanojoule or it might be a millijoule, depending on what actually happens. A C programmer has no visibility into that. It’s really hard to look at a C program and detect that 1,000:1 difference in the cost of an assignment operator.

On the other hand, in a parallel-programming environment—a fairly crude one today—it’s quite visible to you because you’re handed something that’s data-parallel, and you deal with the fact that, roughly speaking, the same thing is happening to similar data all at the same time. By being willing to deal with that, you’ve been able to get this huge increase in coherence that allows the performance to happen for a reasonable amount of power or a reasonable amount of communication. So the question is, what are some abstractions we can find that aren’t onerous to program to but that allow those things that matter to perform and to become more visible to programmers so that they can make more reasonable choices, or abstract them away so that choices are made automatically?

But, again, a modern CPU dedicating so much of its die area to cache is expensive. The cache saves power, but it costs a lot of die area and power to save that power. You can always do better if you move more responsibility to a higher level.
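A small NumPy sketch of that visibility problem: the two versions below look nearly identical in the source, yet one streams through memory coherently and the other hides a random gather behind the same innocent assignment (the arrays and index pattern are invented for illustration):

```python
import numpy as np

n = 1 << 20
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)

# Coherent, data-parallel: the same multiply applied to neighboring
# elements in order, which SIMD units, caches, and prefetchers can exploit.
coherent = a * b

# Incoherent: the same "=" and "*" in the source, but the computed indices
# turn it into a gather with essentially random memory traffic.
idx = np.random.randint(0, n, size=n)
incoherent = a[idx] * b[idx]
```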

TD The idea, then, is moving the work to where the data is instead of moving the data to where the work is.

KA It’s both. The important thing is having them be near each other. The original graphics pipeline was this gorgeous example of that: do a bunch of work here, move the data to something right next door and do a bunch more work, and then move it to something right next door and do a bunch more work.

If texture mapping hadn’t come along, your argument that graphics systems would be worthless for general-purpose computing would be true. Texture mapping is this awful sort of incoherent thing. It has some coherence, but as you put it into a shader and allow people to generate texture addresses, eventually you can completely destroy the coherence.

TD It’s incoherence, but it’s a scatter/gather kind of incoherence.

KA It’s a gather mostly, the way GPUs deal with it. The point is, as they’ve dealt with that more and more, the communications have gotten a lot richer. That ability to gather is a huge distinction from the old pipeline that really had no communication between the elements.

Dealing with that lower level of coherence has made the machine much more general purpose. It turns out that you can do that by caching a lot more cheaply than you can with the general-purpose caching on a CPU, so it’s not all the way to that extreme. But it’s a lot less coherent than the non-texture mapped pipelines that I started with. They were almost perfectly coherent.
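To make that contrast concrete: in the classic pipeline each stage hands its output to the stage next door, while texture mapping is a gather whose coherence depends entirely on how the addresses are computed. A minimal sketch (the stage functions and access pattern are invented for illustration):

```python
import numpy as np

def transform(vertices):
    # Stage 1: per-vertex work on data arriving in order.
    return vertices * 0.5 + 0.25

def rasterize(vertices):
    # Stage 2: hand results to the next-door stage; a stand-in for
    # fragment generation.
    return np.repeat(vertices, 4, axis=0)

def shade(fragments, texture):
    # Stage 3: texture mapping is a gather. Each fragment computes an
    # address and reads from wherever it lands, so coherence depends
    # entirely on how those addresses are generated.
    h, w = texture.shape
    u = ((fragments[:, 0] % 1.0) * (w - 1)).astype(int)
    v = ((fragments[:, 1] % 1.0) * (h - 1)).astype(int)
    return texture[v, u]

vertices = np.random.rand(1024, 2).astype(np.float32)
texture = np.random.rand(256, 256).astype(np.float32)
colors = shade(rasterize(transform(vertices)), texture)
```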

TD Pat, a couple of years ago, one of your former students gave a talk here about the future of computing on GPUs. His claim basically was that all of Pixar’s fancy rendering stuff—ray tracing and subsurface scattering and more complicated simulation effects—doesn’t happen on GPUs these days. He pointed out a series of papers that covered basically everything that we do that’s really hard.

The conclusion he drew from that was there’s no reason not to run the whole thing on a GPU right now, but the examples he showed us were all isolated examples that don’t play together. It was pretty obvious that there was no reasonable way to build a whole system out of those separate pieces. They all required different data structures for storing geometry and dealing with piles of rays. I guess the point is that kernels are not systems.

PH Exactly.

TD Two things to consider: First, somebody needs to be thinking about how to bridge that. The system-integration problem is really hard.

Second, unless the architecture of GPUs evolves in ways that I don’t expect, they’re going to be attached processors forever and there’s going to be a general-purpose processor somewhere that’s doing some of the work. How the work is allocated between two different heterogeneous kinds of machines is a really important problem, and it’s really hard because optimization strategies on the two kinds of machines are fundamentally different.

PH That is a great point, and I think that is actually the biggest challenge facing us right now—for another important reason, which we haven’t talked about.

As you probably know, AMD acquired ATI and both Intel and AMD are working on building heterogeneous multicore systems that basically combine a CPU and GPU on a single chip. In the future, it might even have some other specialized hardware on it, such as a video codec. This will be our mainstream computing platform.

A laptop, for example, will have one of these single-chip things in it. How are we going to program this thing? How are we going to schedule work on it? How are we going to deal with different instruction sets or different vector units?

I don’t really know, but I do know that people are going to build these things, and we had better start thinking about it. It’s going to be very challenging to figure out.

KA One way to think about this is to figure out what we’re going to mean by GPU and CPU over the next few years, and what is the difference between the two? A lot of people right now think of something that’s data-parallel, with lots of execution units, as a GPU, and something more sequential as a CPU. But that’s not going to be the right distinction down the road.

TD Certainly, a multicore Intel box with 64 or 128 CPUs on it looks an awful lot like a data-parallel machine from 50,000 feet.

KA But there is a fundamental difference, and I think in some underlying way this may get at your issue: ultimately, the way the resources are deployed and harnessed and the way the data is moved around on a CPU is under software control; the way the data and resources are deployed on a GPU is still significantly under non-software control.

There’s a lot of general-purpose computing in there, but the way it’s wired together, the way the data moves, is not general purpose or at least not exposed yet. It’s still a graphics pipeline, or it’s pretty much neutered in something like CUDA. You lose this notion of wiring a bunch of different things together, and you’re pretty much given a single data-parallel space to operate in. I’m simplifying a bit here, but that’s roughly true. In some sense, what makes something a GPU is that the resources aren’t organized by your software control; they are organized by somebody else.
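A minimal sketch of that single data-parallel space, written here with the numba CUDA bindings purely for illustration (it assumes numba is installed and a CUDA-capable GPU is available): every thread gets one flat index and applies the same operation, with no way to express wiring different stages together.

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale(out, x, k):
    i = cuda.grid(1)   # one flat index into a single data-parallel space
    if i < x.size:     # guard the tail of the grid
        out[i] = x[i] * k

x = np.arange(1 << 20, dtype=np.float32)
out = np.zeros_like(x)
threads_per_block = 256
blocks = (x.size + threads_per_block - 1) // threads_per_block
scale[blocks, threads_per_block](out, x, 2.0)
```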

Think back to the old 860. It was an Intel part that had a general-purpose CPU, but it had a little rasterizer thing on the side. It’s very clear that the CPU directed the rasterizer; the rasterizer didn’t direct the CPU. If you open up a GPU, the rasterizer is pretty much what doles out the work that makes the high-performance thing go.

In some sense, it’s that orientation that determines if it is a CPU or a GPU. When GPUs evolve to the point where that’s no longer true, that’s the day that some of your lower-level concerns get addressed. You say, “Gee, they’ve got different data structures, and how do we wire all this stuff up?”

Once you free up the special-purpose stuff to be slaves to the general-purpose stuff, instead of having the general purpose be a slave to the special purpose, that’s what software programmers are used to. That’s what allows you to change data structures and organize the shape of your overall computation.

I don’t think GPUs are so far away from that, and when that threshold is crossed, then there really aren’t GPUs and CPUs anymore. Now there are just resources that are optimized for highly parallel computation.

PH It’s a neat way of thinking about it.

TD Yes, it is.

KA And it gives you a chance to flip your hands up.


Originally published in Queue vol. 6, no. 2
