Reconfigurable Future

The ability to produce cheaper, more compact chips is a double-edged sword.

Mark Horowitz, Stanford University

Predicting the future is notoriously hard. Sometimes I feel that the only real guarantee is that the future will happen, and that someone will point out how it's not like what was predicted. Nevertheless, we seem intent on trying to figure out what will happen, and worse yet, recording these views so they can be later used against us. So here I go...

Scaling has been driving the whole electronics industry, allowing it to produce chips with more transistors at lower cost. But this trend is a double-edged sword: we not only need to figure out how to build the more complex devices that people want, we also must determine which complex devices lots of people want, since we have to sell many, many chips to amortize the significant design cost.

This push toward finding complex devices with large application classes was the driving force behind the creation of the early microprocessors. Remember, in the early 1970s Intel created the 4004 so it wouldn't have to create a new calculator chip for each company that wanted one. By programming the 4004, each company could configure the chip for its own application. For the past 30 years, the programmable processor model has been one of the most successful abstractions used in the semiconductor industry. Bear in mind that most of the processors sold (by more than an order of magnitude) don't go into PCs or other things we think of as computers; rather they are used to create some needed functionality in other electronic devices. The processor truly is a reconfigurable device; the configuration is done by executing a sequence of instructions.

As we have continued to scale technology, these processors have become increasingly sophisticated, consuming the available transistors to produce faster machines. This scaling led to the now famous corollary to Moore's law: Processor performance doubles every 18 months. What's remarkable about this performance growth is that the basic machine abstraction, sequentially executing instructions, remains constant. This stable programming abstraction makes it possible to run your old code on these faster machines, and to incrementally modify the code to create the desired more-complex systems.

As Nick Tredennick and Brion Shimamoto rightly point out in the accompanying article, "The Inevitability of Reconfigurable Systems," this dominance of the general-purpose processor is coming under pressure from a number of fronts these days, and it seems likely that other solutions will need to be constructed. (This does not mean, however, that the demand for simple processors will disappear. After all, how much computing does it take to run your microwave?)

As the authors note, one driving factor is power. Not only are we moving toward untethered systems, but past performance scaling has also, unfortunately, increased power dissipation along with performance. If you build the highest-performance solution you can think of in today's technology, you will likely be consuming more power than you can afford. Chips are now power-constrained rather than transistor-constrained, even in high-end desktop machines. When you look at general-purpose processors, they appear particularly power-inefficient compared with other approaches.

So now we face a true dilemma: What is the best computing platform going forward? Tredennick and Shimamoto claim that it will be reconfigurable systems. In some sense they are right. Clearly having a large enough market requires systems that can be programmed for a number of different applications. Users will have to be able to reconfigure their hardware. The authors are also correct that the resulting hardware will be explicitly parallel. There will not be a single processor running a piece of code.

It is well known that if there is explicit parallelism in an application, doing that computation in parallel consumes less power for a given level of performance than performing the computation sequentially. While we know that future computing substrates will be parallel and reconfigurable, less clear is what the basic reconfigurable block will be. Current field-programmable gate arrays (FPGAs) provide their customers with a field of logic gates that may be configured and reconfigured into complex logic. Using these logic gates, however, is not optimal for a couple of reasons. First, the memory needed to implement the reprogrammable interconnect creates a substantial overhead in power, area, and delay. Second, programmers of FPGAs typically use a register transfer level (RTL) language such as Verilog or VHSIC (very high-speed integrated circuit) Hardware Description Language (VHDL), rather than a programming language such as C, which would be much more familiar to most programmers. For reconfigurable systems to succeed, their designers need to develop a computation model that a language compiler can target.
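To make the contrast concrete, here is the kind of kernel at stake, written in C (my own illustrative sketch, not code from any particular tool). An RTL designer would describe this filter as an explicit structure of multipliers, adders, and registers; a software programmer just writes the loops and expects a compiler to find the parallelism and map it onto the reconfigurable fabric.

```c
#include <stdio.h>
#include <stddef.h>

/* 4-tap FIR filter: a minimal sketch of the kind of kernel a
 * reconfigurable-hardware compiler would have to handle.  In Verilog or
 * VHDL the designer instantiates four multipliers and an adder tree
 * explicitly; in C the parallelism is implicit in the loops and must be
 * extracted by the compiler. */
static void fir4(const float *in, float *out, size_t n, const float coeff[4])
{
    for (size_t i = 3; i < n; i++) {
        float acc = 0.0f;
        for (int k = 0; k < 4; k++)       /* four independent multiplies...   */
            acc += coeff[k] * in[i - k];  /* ...folded by an adder tree in HW */
        out[i] = acc;
    }
}

int main(void)
{
    float in[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    float out[8] = { 0 };
    float coeff[4] = { 0.25f, 0.25f, 0.25f, 0.25f };  /* simple moving average */

    fir4(in, out, 8, coeff);
    for (int i = 3; i < 8; i++)
        printf("%g ", out[i]);
    printf("\n");
    return 0;
}
```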

Following this argument often leads people to conclude that the larger functional blocks should be processors, especially considering how the overall system will be configured/programmed. While we know how to program at the gate level by writing in a hardware description language such as Verilog or VHDL, which can be synthesized into logic gates, we don't want to force software programmers to work at this level for the entire application. Clearly, programmers are going to need to work at a higher level of abstraction, and once they use these higher-level abstractions, they need a compiler that can translate those abstractions to configurable hardware. FPGAs are beginning to see higher-level abstractions through the use of libraries of larger components and Simulink-style block-diagram editors, but it's not clear that a gate-level FPGA is the best target for this type of programming.

Many people claim that the correct reconfigurable block is a processor, since it matches the computation model the compiler understands. Yet in this model it's up to the programmer to decide how all the processors can be used together to solve a large problem, and history has shown that general parallel programming is not easy without sufficient tools.

There's no question that we can build reconfigurable hardware substrates, whether the blocks are processors or gates. The real question is how to program them. The key is to think about computation models, or programming abstractions, that fit a large class of applications and then to find a computational substrate, built out of transistors, onto which they map well. The programming abstraction most in use today is a synchronous data-flow model (sometimes called a stream computational model). This is the model that many of the tools mentioned in Tredennick and Shimamoto's article use; it is also the model used in the Simulink-style block-diagram editors. The synchronous data-flow model works well for applications with large amounts of data parallelism, which is characteristic of many of the applications needing higher-performance computation, such as signal processing. Given this computation model, the question now is: What is the best computation substrate for a stream compiler? I don't think this will look like a reconfigurable system as we think of it today; nor will it look like a processor.
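As a rough illustration of what this model looks like to the programmer (my own sketch in C, not code from any of the tools mentioned), a stream program is just a graph of small kernels, each consuming and producing data at fixed rates, so the compiler can see all of the parallelism and all of the communication at compile time:

```c
#include <stdio.h>

#define N 16

/* Each kernel consumes and produces elements at a fixed rate, which is
 * what lets a synchronous data-flow compiler schedule the whole graph
 * statically and map it onto parallel hardware. */
static void scale(const float *in, float *out, int n)   /* 1 in, 1 out */
{
    for (int i = 0; i < n; i++)
        out[i] = 2.0f * in[i];
}

static void diff(const float *in, float *out, int n)    /* 2 in, 1 out */
{
    for (int i = 0; i < n - 1; i++)
        out[i] = in[i + 1] - in[i];
}

int main(void)
{
    float a[N], b[N], c[N];

    for (int i = 0; i < N; i++)
        a[i] = (float)i;

    /* The "program" is just the wiring of kernels: a -> scale -> diff -> c.
     * There is no data-dependent control flow between kernels, so their
     * execution rates are known at compile time. */
    scale(a, b, N);
    diff(b, c, N);

    for (int i = 0; i < N - 1; i++)
        printf("%g ", c[i]);
    printf("\n");
    return 0;
}
```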

Many people are researching these types of machines, and my colleague, Bill Dally, is one of the leaders in this area. His proposal uses a number of simple, distributed processors, where the configuration looks more like instructions merged together to form a very long instruction word (VLIW) program than like FPGA configuration bits [1]. Whether this turns out to be the "right" architecture for stream machines is still an open research question, but it clearly shows the solution may well look more like processors than FPGAs.
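As a purely illustrative sketch (mine, not Dally's), the difference from an FPGA configuration shows up even in a toy fragment of C: the operations below are independent, so a VLIW compiler can prove at compile time that they may issue together in one wide instruction, whereas a superscalar processor must rediscover that fact in hardware every time the code runs.

```c
#include <stdio.h>

int main(void)
{
    int a = 3, b = 4, c = 5, d = 6;
    int mem[2] = { 7, 8 };

    /* These three operations have no dependences on one another, so a
     * VLIW compiler could pack them into one wide instruction, e.g.
     *   { slot 0: ALU add | slot 1: multiplier | slot 2: load unit }
     * all issuing in the same cycle. */
    int t0 = a + b;
    int t1 = c * d;
    int t2 = mem[0];

    /* This operation depends on all three results, so it must be
     * scheduled into a later bundle. */
    printf("%d\n", t0 + t1 + t2);
    return 0;
}
```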

What will happen with memory devices is even harder to predict, but just as important to understand. While many exciting new technologies are on the horizon, displacing an existing standard is always difficult. The problem is that the standards for memory devices are extremely high. We expect dynamic RAM (DRAM) and erasable programmable ROM (EPROM) to have nearly 1 billion working memory bits on each device, dissipate less than one watt when active, and cost a few dollars. Getting any new technology to this point will take lots of money.

To make matters worse, designers have become clever about using a couple of devices to make the system look like it has a device with even better performance. For example, fast static-RAM (SRAM) caches mostly hide the fact that the DRAM is slow. In fact, they hide it so well that most people won't spend extra money/power to get faster DRAM. Similarly, you can use SRAM and EPROM to get a system that looks like it has nonvolatile SRAM. So keep your eyes open for new memory technology, but don't bet the farm on such new technology solving all your problems yet.
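For example (a hypothetical sketch, not any particular product's design): firmware can copy the nonvolatile image into SRAM at power-up, serve every read and write from the fast SRAM, and write the image back to the slow nonvolatile part only when the data must persist. To the rest of the system this looks like nonvolatile SRAM.

```c
#include <string.h>

#define IMAGE_SIZE 1024

/* Hypothetical sketch: a fast SRAM working copy shadowing a slow
 * nonvolatile array.  In a real system nv_image would be an EPROM/flash
 * region and the write-back would go through that device's programming
 * protocol. */
static unsigned char nv_image[IMAGE_SIZE];   /* stands in for the EPROM   */
static unsigned char sram_copy[IMAGE_SIZE];  /* fast working copy in SRAM */

void nvram_init(void)                  /* at power-up: load SRAM from EPROM */
{
    memcpy(sram_copy, nv_image, IMAGE_SIZE);
}

unsigned char nvram_read(int addr)     /* fast path: all accesses hit SRAM */
{
    return sram_copy[addr];
}

void nvram_write(int addr, unsigned char value)
{
    sram_copy[addr] = value;
}

void nvram_commit(void)                /* make the data persistent (slow) */
{
    memcpy(nv_image, sram_copy, IMAGE_SIZE);
}
```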

Clearly, in the future we will have chips that can be configured to perform a number of different functions. But as with many projects, building the hardware is the easier part of the problem. We know how to build these chips with the reconfiguration performed at the gate level (FPGAs), at the instruction level (chip-level multiprocessors), and anywhere in between. Unfortunately, the problem of programming any of these chips to yield efficient solutions is still unsolved, and it is the solution of this "software" problem that will eventually determine the chip organization of the future.

NOTE

1. The VLIW instruction results from the compiler being able to statically schedule the many parallel execution units to operate concurrently each cycle. This approach is much simpler than the superscalar, out-of-order processor architectures of today's CPUs, which try to extract the parallel schedule dynamically in the hardware.

Originally published in Queue vol. 1, no. 7
© ACM, Inc. All Rights Reserved.