If you were looking for lessons on energy-efficient computing, one person you would want to speak with would be Steve Furber, principal designer of the highly successful ARM (Acorn RISC Machine) processor. Currently running in billions of cellphones around the world, the ARM is a prime example of a chip that is simple, low power, and low cost. Furber led development of the ARM in the 1980s while at Acorn, the British PC company also known for the BBC Microcomputer, which Furber played a major role in developing.
In our interview this month he shares some of the lessons on energy-efficient computing he has learned through working on these and subsequent projects. He also fills us in on the innovative work he is doing at Manchester University, where he is a professor of computer engineering in the School of Computer Science. Furber's SpiNNaker (Spiking Neural Network Architecture) project is a massively parallel system designed to simulate the workings of part of the human brain. Composed of a million ARM processors, SpiNNaker could help unravel some of the mysteries of the brain and eventually could provide valuable lessons on energy-efficient, fault-tolerant computation.
Interviewing Furber is Queue editorial board member David Brown, who met Furber at Cambridge University, where they both received Ph.Ds. Brown, an engineer in Sun's Solaris Engineering Group, has also thought a lot about energy-efficient computing. He works on the Solaris operating system's core power-management facilities, with a particular focus on Sun's x64 hardware platforms. Brown's resume prior to coming to Sun includes stints at Silicon Graphics, which he cofounded, and DEC, where he helped build the team that developed the graphics architecture for DEC's MIPS workstations.
David Brown Can you tell us a little about your background and the early history of Acorn leading up to the BBC Microcomputer?
Steve Furber I was born and brought up in Manchester, U.K., and went to university in Cambridge. My first degree was in maths, and then I did a postgraduate year in maths before going into the engineering department to do a Ph.D in aerodynamics. In the course of my Ph.D work and the research fellowship in aerodynamics that followed, I got increasingly involved in using computers to get results from my aerodynamics experiments.
I got interested in how computers work, and I joined the Cambridge University Processor Group when it formed around 1977-78. This was a student society for those of us who liked building computers for fun. I built a machine using a Signetics 2650 microprocessor, an 8-bit micro that we had to order from California, which in those days was a bit exciting. I bought this and some other parts and wired a small machine together.
At that time, Hermann Hauser and Chris Curry were looking to form a company, which became Acorn Computers. They naturally went to the Cambridge University Processor Group to look for people with the technical knowledge to do some of the work, and I slowly got drawn into the embryonic Acorn.
In the final year of my fellowship, I was doing odd bits with Acorn and using machines built with Acorn parts in my aerodynamics research. Acorn got wind of the fact that the BBC was looking to use a particular microcomputer system to go with a TV program it was planning for early 1982. Around Easter 1981, Acorn persuaded the BBC to look at what it would offer, and then in a mad week, we built a prototype of what became the BBC Micro. We got it to work at 7 a.m. on the Friday morning before the BBC arrived at 10 a.m., and Acorn got the BBC Microcomputer contract. At the end of my research fellowship, I decided to join Acorn's full-time staff.
The BBC Micro was a 6502-powered 8-bit system. It had lots of expandability, so you could add a second processor onto it. You could add a receiver that would pick up software that the BBC broadcast through its Ceefax [the BBC's teletext information service] transmissions, and it was used extensively in schools. It was a big success in the U.K. and a few other markets, but not so visible in the U.S. market. Although Acorn did attempt to set up a U.S. sales operation, it was not terribly successful because we were competing with Apple on Apple's home ground, and that's always a fairly risky thing to do.
DB This represented the debut of personal computing in the U.K. in the same way that the Apple II might have done in this country, and in fact on a very parallel technology base: the 6502 processor. I think these are two social microcosms that didn't see one another all that well because of the respective companies and cultures involved.
SF We were aware of the Apple II. The BBC Micro used the same processor, but it was built a little bit later so we pushed it a bit faster. We were running the 6502 at 2 megahertz, whereas the Apple II was running a 1-megahertz 6502.
The BBC Micro went on to be very successful, and that was a really exciting time to be building small microcomputers. In the U.S. companies such as Apple were flourishing. I remember our early discussions with the BBC, and they were convinced that on the back of this program we would be able to sell 12,000 BBC Micros, which was a big enough number to be exciting. Of course, it turned out to be wrong by two orders of magnitude. We sold 1.5 million BBC Micros before the enthusiasm began to wane. That was a remarkable period—sufficiently remarkable that the BBC recently made a comedy-drama about that period and the competition between Acorn and another U.K. company, Sinclair, for the BBC contract and the sales that resulted in that very early boom.
DB How did the initial excitement with the BBC Micro lead to the ARM?
SF Looking back, this was in a very compressed period of time: the BBC Micro first went on sale in January 1982, the design of the first ARM microprocessor started in late '83, and we had the first working ARM silicon in April 1985.
With the BBC Micro being so successful, it was clear that Acorn needed to continue product development to build on that success, and a whole strange set of things came together that resulted in the ARM. We had been playing in the lab with various 16-bit microprocessors, and we found that they didn't really quite cut it for us for two reasons.
First, they were all based on pretty complex instruction sets, and this meant that they had very poor realtime interrupt response. The BBC Micro had no hardware support in the form of DMA (direct memory access) controllers, and all the realtime I/O was handled by software using interrupts. What we found was that the 16-bit micros in the early 1980s had worse interrupt response time than the 6502.
Second, we had developed this model of performance where the principal determinant of any computer's performance was how much accessible memory bandwidth the processor could use. In the early 1980s, cache memories were not commonplace, so the available memory bandwidth was determined by the performance of commodity DRAM. The 16-bit microprocessors of the early 1980s couldn't use all the bandwidth that even DRAM could provide, and that struck us as the wrong answer. The memory bandwidth was the primary resource, and it was the processor's job to make the maximum possible use of that.
We were sitting there thinking about these two issues and were not sure which way to go when we got wind of the RISC papers published by Berkeley and Stanford in the early 1980s. Their story—building a simple processor that very well matched the memory and had very simple instructions so it had good interrupt response time—resonated very strongly. We began to think in terms of whether we might design our own processor along the Berkeley and Stanford RISC lines.
Another factor in this, which is completely unrelated to the technical story, is that Andy Hopper, who is now head of the Cambridge computer lab, was a director of Acorn and persuaded Hermann [Hauser], who led the technical work, that it was really important to get into silicon design. I remember him saying around that time that in the future there would be two sorts of computer companies: those that have learned to make chips and those that have gone out of business.
With Hopper's advice, Acorn chose VLSI Technology as its supplier of chip design tools. We recruited a small but experienced chip design team and bought the Apollo workstation, so we had the machines, the software, and the people—but they didn't have any chips to design.
Sophie Wilson and I were doodling processor designs on bits of paper, and since these chip design guys had nothing to do, we were authorized to occupy them with sketching out some processor designs. We figured this project was not likely to succeed, but we thought, "We'll set off designing this microprocessor, we'll learn lots, and we'll find out why it's not a good idea before we've actually made the silicon." With what we learned in the process, we figured we would be better positioned to decide which is the right chip to buy for our next product.
At that stage microprocessor design was a black art for people like us. It had a mystique, and we didn't really think it was something small companies could do. But we just set about doing this design work anyway, and it turned out it wasn't a black art; the microprocessor is just a piece of logic, like other bits of logic we had designed. The chip guys could put it all together, and, in fact, in 18 months we had a piece of working silicon.
Although it needed work, it was highly competitive while being much simpler than the commercial offerings. At that point, the company said, "Well, OK, we've got a good microprocessor here. Let's configure our product plans around this."
That really is how the ARM emerged: a set of ideas coming together, thoughts coming together, resources coming together, and I guess quite a lot of luck in that we got first-time working silicon. And it worked well.
DB A couple of interesting ideas occur to me. One is the emphasis on design simplicity that RISC had as part of its principles, but that also was a constraint on you guys. The other is the lack of hubris about this that you might have had if you'd had more resources and a bigger company.
SF That's absolutely true. I know Hermann has said a few times that he thinks the Acorn team had two advantages that the big semiconductor teams didn't have: first, we had no money so everything had to be done very cheaply; and second, we had no people. This is good management retrospective: by depriving us of resources of any sort, they forced us to make decisions in favor of simplicity.
The RISC idea was a good starting point, but the Berkeley-Stanford designs were academic and not intended for commercial use. The ARM is not quite as pure as the Berkeley RISC designs. It's got slightly denser instruction encoding and slightly richer instructions in the instruction set. Basically, whenever we were looking at a decision, it was clear that we had to decide in favor of simplicity or we would never get the design finished and we would never get it made—and if we did, it would never work.
DB One of the things that came out of that was that the ARM turned out to be a very low-power chip. There's an interesting story in how that came to be that perhaps you would like to tell.
SF The ARM was conceived as a processor for a tethered desktop computer, where ultimate low power was not a requirement. We wanted to keep it low cost, however, and at that time, keeping the chip low cost meant ensuring it would go in low-cost packaging, which meant plastic. In order to use plastic packaging, we had to keep the power dissipation below a watt—that was a hard limit. Anything above a watt would make the plastic packaging unsuitable, and the package would cost more than the chip itself.
We didn't have particularly good or dependable power-analysis tools in those days; they were all a bit approximate. We applied Victorian engineering margins, and in designing to ensure it came out under a watt, we missed, and it came out under a tenth of a watt—really low power.
Of course, all the previous arguments about keeping it very simple also push in this direction. The first ARM chip had only about 25,000 transistors. It was a tenth the complexity by transistor count of some of the processors at the time.
DB So some of it was just the scale of the chip and the simplicity of the design, since you didn't have too much circuitry on there. Were there any circuit-level considerations as well?
SF No, there was nothing particularly clever done at the circuit level to keep the power under control. It was a CMOS (complementary metal-oxide semiconductor) chip, and these were fairly early days for high-speed CMOS. The processor that was used in the BBC Micro—the 6502—was an NMOS (negative-channel metal-oxide semiconductor) processor. For the ARM, we went to CMOS, but that was a fairly simple decision at the time.
We pushed the technology for design convenience rather than for low power, so we opted for a two-level metal process, even though at the time two-level metal was still considered fairly risky. Using two-level metal made it much easier to modularize the design to get bits designed separately and then wired together at the top level, but I don't recall doing anything at the circuit level to control the power.
DB There was a subsequent evolution where as a result of its power efficiency, the ARM was adopted in spaces where that was very important—in particular, the cellphone business. I was also thinking about what happened with the StrongARM at DEC (Digital Equipment Corporation), where perhaps more attention was given to some of the circuit-level design techniques that had great consequence for higher-end computers.
SF During the 1980s the ARM was predominantly used in Acorn's products. The big change happened when the ARM was spun out of Acorn as a joint venture with Apple and VLSI Technology in 1990. The processor was still going in Acorn's desktop products, but it was also being tuned for the Apple Newton, a portable hand-held device, where power was all important. At that point, the company really started looking at the issues of low-power design.
StrongARM came out in the mid-'90s. It was absolutely leading in terms of power performance at that stage, and that was achieved principally through low-voltage operation, using process technology that suited low-voltage operation, and using circuit-level design optimization that would maintain the speed despite the low voltage.
DB Dan Dobberpuhl was one of the principals at DEC who was involved in that sort of work [see Queue's interview with Dobberpuhl]. What's interesting is this same management of the thermal envelope, which you described for the plastic package at Acorn in the very early days, turned out to be a driver for the high-end computer systems at Digital as well, when it was building these things based on 100K ECL (emitter-coupled logic) technology. Digital was focused on power to manage the thermal considerations, but then found out it could get quite a lot of performance out of much lower-power designs. I think you touched on the salient observation there: at much lower voltage and frequency you could have vastly lower power but still get a great deal more performance per unit of power. That became sort of the driving design consideration even for these very high-end systems.
SF StrongARM was quite a remarkable processor when it first came out, because compared with every ARM that had gone before, it achieved a much higher clock rate but still delivered within the one-watt envelope. I think Digital had the Newton and Newton-like product market in mind for it, and low voltage was the key.
DB Was the target for StrongARM still portable or mobile devices?
SF Yes. Acorn used it in its desktop machines, but that wasn't the target. What was remarkable was that you had a 200-megahertz processor still running within an envelope of about a watt and in low-cost packaging that wouldn't require a heat sink of any sort. The Digital work was remarkable, and its derivation of the power target from the Alpha was quite illuminating. Digital basically said if you start with an Alpha at whatever it was—20 watts—and go from a 64-bit to a 32-bit data path, that's a factor of two, and then you do this other thing, that's another factor of two. The arithmetic to get from Alpha to StrongARM was pretty straightforward.
DB We've focused a lot on the hardware, and I think that's a very important context in the framework of what's going on with energy and computing. Looking forward, is most of this pursuit to be energy efficient going to be achieved by advances in the component and hardware technology, or is it also going to have great consequence for software developers?
SF If you want an ultimate low-power system, then you have to worry about energy usage at every level in the system design, and you have to get it right from top to bottom, because any level at which you get it wrong is going to lose you perhaps an order of magnitude in terms of power efficiency. The hardware technology has a first-order impact on the power efficiency of the system, but you've also got to have software at the top that avoids waste wherever it can. You need to avoid, for instance, anything that resembles a polling loop because that's just burning power to do nothing.
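The point about polling loops can be illustrated with a minimal Python sketch. The function names and the 50-millisecond delay are hypothetical, chosen only for this illustration, and the wakeup count is a rough stand-in for wasted work rather than a measurement of energy. One waiter spins on a flag; the other blocks until the flag is set, which lets the operating system deschedule the thread in the meantime.

```python
import threading
import time


def wait_by_polling(flag, poll_interval=0.0):
    """Spin until the flag is set; return how many times the loop ran."""
    wakeups = 0
    while not flag.is_set():
        wakeups += 1
        if poll_interval:
            time.sleep(poll_interval)
    return wakeups


def wait_by_blocking(flag):
    """Block until the flag is set; the thread does no work while waiting."""
    flag.wait()
    return 1  # a single wakeup: the event itself


def run_demo(delay=0.05):
    """Set a flag after `delay` seconds and compare wakeup counts (illustrative)."""
    results = {}
    for name, waiter in (("polling", wait_by_polling), ("blocking", wait_by_blocking)):
        flag = threading.Event()
        timer = threading.Timer(delay, flag.set)
        timer.start()
        results[name] = waiter(flag)
        timer.join()
    return results


results = run_demo()
```

On a typical machine the polling waiter runs its loop many thousands of times while accomplishing nothing, whereas the blocking waiter wakes exactly once. Passing a nonzero `poll_interval` reduces the waste but trades it against response latency.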
I think one of the hard questions is whether you can pass the responsibility for the software efficiency right back to the programmer. Do programmers really have any understanding of how much energy their algorithms consume? I work in a computer science department, and it's not clear to me that we teach the students much about how long their algorithms take to execute, let alone how much energy they consume in the course of executing and how you go about optimizing an algorithm for its energy consumption.
Some of the responsibility for that will probably get pushed down into compilers, but I still think that fundamentally, at the top level, programmers will not be able to afford to be ignorant about the energy cost of the programs they write.
What you need in order to be able to work in this way at all is instrumentation that tells you that running this algorithm has this kind of energy cost and running that algorithm has that kind of energy cost. You need tools that give you feedback and tell you how good your decisions are. Currently the tools don't give you that kind of feedback.
DB Yes, right now we have very little instrumentation or observability of what the costs are, so you can't even empirically search around by trying different techniques to seek a relative optimum. It's not even open loop; it's just blind at the moment.
SF You can't even see the shape of the surface that you're trying to optimize. But I think folks who work in consumer electronics, where energy has been an issue for a decade or more, are getting pretty good at doing this top-to-bottom optimization. It's not perfect, but you can walk around with your iPhone, which will automatically find networks if they're there and not waste too much power looking for them if they're not.
In other areas where there are concerns about energy usage—for example, data centers—the energy issue is just beginning to become a focus. From what I've seen, data-center computing could achieve some very big wins without doing anything deeply original or technical. Techniques such as just moving the power supply outside the air-conditioned space have a first-order effect on the power consumption of the data center.
DB I want to touch on the tension between the kinds of things that have been done for energy efficiency in the mobile space versus the more general-purpose computing space. The mobile space has a much better-defined problem, which gives you enough constraints or well-known repeatable patterns that you may be able to optimize, whereas in much more general-purpose computing, you have this terrible difficulty of trying to be all things to all people, where you don't necessarily know what application is going to be run on the system and how it's going to behave.
SF That's right, and we know that making a single thread go as fast as possible involves complex management of pipelines and such, and that's very expensive in energy terms. One of the interesting issues at the moment as we're going to multicore computing at all levels is whether you want your system to have 10 or 100 complex, high-performance cores, or whether you can manage with 100 or 1,000 much simpler cores. If you can cope with parallelizing the problem so that you can run a larger number of threads on a larger number of simple cores, then you can get a real energy win.
DB Intel is building multicore chips, sometimes called heterogeneous multicore, which have some cores with complex CPU microarchitecture and a larger number with simpler CPU microarchitecture. Then there's the challenge of trying to identify those applications that can benefit from the more complex cores to get greater performance at the cost of more energy, and those that can't really get those benefits, and then to target those applications to the simpler cores and not waste energy. Unfortunately, we're not yet at the point where either the hardware or the system software is easily able to detect where the best fit is.
SF The introduction of multicore a few years ago demands a disruptive change in how software is built, and until that change comes, all this optimization is extremely difficult. I guess you can do it extrinsically with system analysis techniques, but if you want to remain flexible, you've got to do it internally to the system using something more like the dynamic optimization techniques that are used in just-in-time compilation.
DB There's also the question of how much stuff is going to be happening in the components themselves to adjust performance for energy somewhat automatically: for example, the dynamic voltage and frequency scaling sorts of things where they say, "Well, based on how heavily utilized the hardware is, we'll adjust the performance level somewhat automatically." These things can be done fairly successfully almost autonomically in the hardware component, and that's because the changes can be made very quickly with very little energy cost.
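Why dynamic voltage and frequency scaling pays off can be seen from the standard first-order model of CMOS switching power, P = a * C * V^2 * f. The Python sketch below applies that model under loudly stated assumptions: leakage is ignored, frequency is assumed to scale in proportion to voltage, and the capacitance, voltage, and clock values are hypothetical numbers picked for illustration, not figures from the interview.

```python
def dynamic_power(capacitance, voltage, frequency, activity=1.0):
    """First-order CMOS switching power: P = activity * C * V^2 * f (leakage ignored)."""
    return activity * capacitance * voltage ** 2 * frequency


def scale_operating_point(voltage, frequency, factor):
    """Scale voltage and frequency together by `factor` (an idealized assumption)."""
    return voltage * factor, frequency * factor


# Hypothetical operating point; illustrative numbers only.
C = 1e-9         # effective switched capacitance, farads
V, F = 1.2, 1e9  # supply voltage in volts, clock frequency in hertz

p_full = dynamic_power(C, V, F)
v_half, f_half = scale_operating_point(V, F, 0.5)
p_half = dynamic_power(C, v_half, f_half)

# Ratio of power at the scaled point to power at the full point.
power_ratio = p_half / p_full
# Clock cycles delivered per watt, scaled point relative to full point.
perf_per_watt_gain = (f_half / p_half) / (F / p_full)
```

Under these assumptions, halving both voltage and frequency cuts switching power to about one-eighth while delivering half the clock rate, so work per unit of energy improves roughly fourfold. Real silicon departs from this idealization, particularly near the threshold voltage and once leakage matters.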
SF The technology is going to push us farther down that path because as we continue shrinking transistors, the variability and reliability of the components are going to become serious problems. DVS (dynamic voltage scaling) is one of the few weapons we have to cope with this problem. Near-future chips are going to have to have dynamic voltage adjustments at least at the submodule level right across the chip, so that a component that has a long critical path because of a weak transistor in that path can be brought back into line by cranking its voltage up a bit.
Also, error management is going to require these sorts of techniques. Very similar to DVS is the Razor technique, which ARM has been developing. This allows the processor to run toward the corner of the voltage and frequency envelope until some level of errors starts to occur. They can basically push into the optimization corners, detect errors, manage the errors as they arise, and just sit right on the edge of acceptable error rates. DVS is kind of doing this, but doing it blind. With DVS you've characterized the process off-line in some sense, and you know that at this clock frequency you need that voltage, but you may have a very big margin in that. With the Razor technique you can actually go into that margin and find out where things really begin to go wrong and use that to get another factor of 1.5 or 2 improvement in power efficiency.
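The idea of pushing into the margin until errors appear can be sketched as a simple search loop. This is not ARM's Razor implementation, which detects timing errors in hardware; the Python below only mimics the control idea. The error model, the voltages, and every name in it are made up for illustration.

```python
def find_operating_voltage(error_rate_at, v_start_mv, v_min_mv, step_mv, max_error_rate):
    """Lower the supply voltage stepwise while the observed error rate stays acceptable.

    error_rate_at is a callable mapping a voltage in millivolts to an error rate.
    It is a hypothetical stand-in for on-chip error detection.
    Returns the lowest tested voltage whose error rate is within max_error_rate.
    """
    voltage = v_start_mv
    while voltage - step_mv >= v_min_mv:
        candidate = voltage - step_mv
        if error_rate_at(candidate) > max_error_rate:
            break  # one more step would leave the acceptable region
        voltage = candidate
    return voltage


def toy_error_model(voltage_mv, fail_mv=900):
    """Made-up model: no errors at or above fail_mv, a rising error rate below it."""
    if voltage_mv >= fail_mv:
        return 0.0
    return min(1.0, (fail_mv - voltage_mv) / 100)


V_CHARACTERIZED_MV = 1200  # hypothetical off-line voltage that includes a safety margin
v_tuned_mv = find_operating_voltage(toy_error_model, V_CHARACTERIZED_MV, 600, 50, 0.01)

# Fraction of V^2 switching energy saved at the same clock frequency.
saving = 1 - (v_tuned_mv / V_CHARACTERIZED_MV) ** 2
```

With this toy model the loop settles at the lowest voltage that still meets the error budget, recovering the margin that a purely off-line characterization would leave unused. A real controller would also have to correct the errors it provokes and track changes in temperature and workload.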
DB Speaking of new developments, can you tell us a bit about the SpiNNaker project that you're leading at Manchester, which is addressing this grand challenge to develop a new style of computation that's essentially more biologically based?
SF My current work is building a million-processor, massively parallel machine. It's a brain-modeling application, so we want to build very large event-driven neural networks. We find that using very large numbers of small processors is a power-efficient approach to this problem.
One of the interesting things to observe as you get into biologically inspired architectures is that even with the best electronics we know how to build, we're still many orders of magnitude less energy efficient than the biology we're trying to model.
DB Although we're closing the gap on that, which is exciting.
SF We are closing the gap, but there's a very long way to go. If you look at what's happening inside the brain, you see very low-voltage swings, 100-millivolt logic swings, and very slow processing and communication, but very large amounts of it running in parallel. What you also get there is good fault tolerance. In the brain you lose about a neuron a second, and it keeps working just the same. With the variability and loss of components that we're going to see in the sub-20 nanometer technology of the very near future, we're going to have to understand how to do this tolerance stuff, how to work with chips where transistors fail and leave the game in massive numbers. At the moment, I don't think we have much idea of how to do that, except in memories where we can put error correction around them.
What we're doing here is designing a chip multiprocessor on fairly old technology: it's 130 nanometer, on which we've been able to put about 20 ARM 968 cores with quite a lot of local memory, and each of these cores will be running realtime software, modeling fairly simple abstract spiking neurons. These are models of brain cells that have 1,000 or 10,000 inputs and generate a local output, and the output is purely an asynchronous spike event, so all the communication inside the machine is little packets that simply carry information about spike events as they occur. The simulation runs in realtime, so the machine has no requirement for any sort of global synchronization. Each processor just runs in its own realtime domain, receiving incoming spike events and sending outgoing spike events as the neural models dictate, and implementing local learning algorithms based on neuroplasticity, which is close to the biological models that the neuroscientists say describe what happens in the biology.
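To make the event-driven style concrete, here is a minimal Python sketch of a single spiking neuron driven purely by incoming spike events. It assumes a leaky integrate-and-fire model, which the interview does not specify, and the class name, weights, threshold, and time constant are all hypothetical. It omits learning entirely and says nothing about how SpiNNaker's software is actually written.

```python
import math


class LeakyIntegrateAndFireNeuron:
    """Event-driven leaky integrate-and-fire neuron (an assumed, simplified model)."""

    def __init__(self, neuron_id, weights, threshold=1.0, tau_ms=20.0):
        self.neuron_id = neuron_id
        self.weights = weights      # source neuron id -> synaptic weight
        self.threshold = threshold
        self.tau_ms = tau_ms        # membrane decay time constant
        self.potential = 0.0
        self.last_update_ms = 0.0

    def receive_spike(self, source_id, time_ms):
        """Handle one incoming spike; return this neuron's id if it fires, else None."""
        # Decay the membrane potential for the time elapsed since the last event.
        elapsed = time_ms - self.last_update_ms
        self.potential *= math.exp(-elapsed / self.tau_ms)
        self.last_update_ms = time_ms

        # Add the contribution of the incoming spike; unknown sources contribute nothing.
        self.potential += self.weights.get(source_id, 0.0)
        if self.potential >= self.threshold:
            self.potential = 0.0    # reset after firing
            return self.neuron_id   # the outgoing "packet" says only who spiked
        return None


neuron = LeakyIntegrateAndFireNeuron(neuron_id=7, weights={1: 0.6, 2: 0.6})
events = [(1, 0.0), (2, 1.0), (1, 50.0)]  # (source id, arrival time in ms)
outgoing = [neuron.receive_spike(src, t) for src, t in events]
```

Nothing here runs between events: the neuron does work only when a spike arrives, and two inputs that arrive close together push it over threshold while an isolated one does not.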
DB The initial goal of this is to attempt to model to some degree the way that human biology works so that you can perform some experiments once you have something that seems to reflect that well. To leap forward, what are the consequences for that model of computation if you were to use it for what we're trying to do today with computers?
SF There are basically two big research questions in this project. The first is, can massively parallel computing accelerate our understanding of how the brain operates? We still really don't know the fundamental principles at work in the information processing inside the brain, and that is a scientific grand challenge in its own right.
The second question reflects that back: as our understanding of brain function grows, are there lessons there that we can apply back into producing machines that are more power efficient, more reliable when built on unreliable components, and maybe easier to use? One of my standard frustrations with today's computer technology is every time I get a new operating system or buy a new machine, I have to spend a month learning how to use it. Why doesn't it spend a month learning how to be used by me so I can just carry on doing my job the same way? All of that demands that we have better models of how humans work, and it's simply an unknown area at the moment.
DB It's very exciting to think what all that might mean in terms of the way one develops systems and software going forward. We don't yet understand the computational model all that well, so one has to begin by looking at the first question before we can begin to think about how we might apply it for other purposes.
SF Just to be clear, a big parallel computer modeled on the way the brain operates is not going to answer the question of how the brain works. It's going to provide a platform on which I hope we can get neuroscientists and psychologists and other people to test their hypotheses. The machine is not going to solve the problem; it's going to provide the platform that I hope can be used to help solve the problem.
© 2010 ACM 1542-7730/10/0200 $10.00
Originally published in Queue vol. 8, no. 2.
Hennessy stopped by our company (8x8 then) at one time, and I asked him, "Why 2 delay slots after branches and jumps?" His answer: "We were just trying that out."
I don't know if any of you looked at GCC for delay slot handling, but @2 Mips-X was alone and it broke GCC very badly having 2 delay slots needing filling.
I remember Sophie Wilson telling me years ago about the ARM chip development. "We wanted BASIC to run as fast on the ARM chip as machine code did on the 6502 in a BBC micro."