Blurring Lines Between Hardware and Software
Homayoun Shahri, Tufon Consulting

Software development for embedded systems clearly transcends traditional "programming" and requires intimate knowledge of hardware, as well as deep understanding of the underlying application that is to be implemented.

Driven by technology that now makes many millions of gates available on a single chip, a new design paradigm is emerging. This paradigm allows entire systems to be integrated and implemented on one chip.

These complex systems typically contain application-specific hardwired parts, as well as application-specific programmable parts. The programmable units typically consist of microcontrollers, digital signal processors (DSPs), RISC processors, or the new breed of "reconfigurable" processors. All of these parts and units need to work together seamlessly and flawlessly.

Frequently, the code running on a programmable unit needs to interface with a hardwired unit and exchange information. Likewise a hardwired unit exchanges information with various programmable parts in these systems. It is, however, not easy to separate the programmable from the hardwired, as they both exist within these systems. The hardwired units are controlled by the programmable units, and the programmable units function and adjust themselves based on the information received from the hardwired units.

In today's complex systems, software designers must be cognizant of the internals of the hardware for which they are developing the software, and hardware engineers must be very aware of the applications that will run on their designs. Furthermore, both hardware and software designers must be intimately familiar with the internals of the applications that are to run on these devices. Clearly, development and design of these systems transcends traditional programming and hardware design and requires intimate knowledge of the hardware, as well as deep understanding of the underlying application that is to be implemented.

In its most basic form, software can be thought of as a way of controlling the hardware. Some of us still remember, or have read, that in the old days of computing, programmers entered microcode by flipping switches. As computer hardware advanced (integrating more gates on chips) and became more powerful, programming languages emerged and evolved. This eliminated many of the difficulties of the early switch-flipping approach. Programmers no longer had to be overly concerned with the absolute efficiency of their software. Cycles were abundant, and tasks usually did not have to run in realtime. Programmers were now able to solve problems that could not be solved before, such as managing large databases or forecasting weather.

Contrary to the popular analogy, it is not the CPU that functions as the brain in computers; rather, it is the software running on the CPU that should be thought of as the brain. Recent advances in very large scale integration (VLSI) technologies and software have led to the emergence of the aforementioned paradigm, allowing complete and complex systems to be fully implemented on one chip. The so-called system on a chip (SoC) is everywhere. These systems contain programmable, as well as hardwired, units. As such they can also be called embedded systems. We need to look no further than our cars, watches, cellphones, PDAs, cameras, and household appliances to find these hybrid (software/hardware) systems.

Embedded systems are designed to perform a specific and dedicated function. They typically have tight constraints on functionality and implementation. They usually must guarantee realtime operations, as well as conform to size, power, and weight limits. They must satisfy safety and reliability requirements and meet cost targets, as well as time-to-market constraints. Such stringent constraints demand quite a lot from the designers of these systems. As mentioned previously, hardware designers need to understand the application, and software designers must understand the underlying hardware, as well as the specifics of the application. By its very nature, the SoC needs to be configurable. Various hardware units within the system need to perform different tasks. Programmers must be fully cognizant of the whole system. They must understand all the hardware resources, their limitations, and their capabilities. They must be able to carry out hardware/software partitioning effectively.

An implication of this for configurable designs is that programmers must be able to fully identify all required hardware resources, because once the design is complete, it cannot be changed. In other words, programmers must take part in the hardware design. Similarly for the on-the-fly reconfigurable devices, designers must understand the algorithms they are to implement and must be able to map the algorithms to the hardware resources. Of course, this is very true of the field programmable gate array (FPGA) as well. The programmer needs to be a skilled hardware designer.

The following offers a brief account of the evolution of design approaches for specialized programmable devices that helps to illustrate just how blurred the lines between hardware and software have become.

DESIGNING SPECIALIZED PROGRAMMABLE DEVICES

Many of you are familiar with the multiply-accumulate (MAC) hardware that exists within DSPs. Those who have programmed DSPs know how this hardware speeds up signal-processing algorithms. Programmers had to work out how to implement an equation of the following form, a convolution or finite impulse response (FIR) filter,

y[n] = h[0]x[n] + h[1]x[n-1] + ... + h[N-1]x[n-N+1]

using MAC operations. This equation can be implemented with N multiplies and N-1 adds. If the programmer uses the MAC instruction, however, it can be computed in only N MAC instructions. To take advantage of the MAC hardware, programmers have to write either assembly code or C code structured in a way that flags the compiler to use the MAC unit. DSPs have found applications in many devices.

Today more powerful processors support single-instruction stream, multiple-data stream (SIMD) and/or very long instruction word (VLIW) execution, both of which perform many operations per clock cycle, or they are superscalar, capable of issuing multiple instructions per clock cycle. For example, if a processor can perform two multiplies per cycle, the programmer must implement the previous equation differently. If the processor also has an arithmetic logic unit (ALU) that can be accessed in parallel, it is easy to see that for even N, it takes only about N/2 cycles to complete the computation. The programmer can usually take advantage of SIMD instructions and hardware by performing so-called loop unrolling. If a machine can perform two MAC operations in parallel, the inner loop in the following code, which computes one point of the FIR filter, can be unrolled for an efficient implementation:

int n, k, x[N], h[N], y[M];
for (n=0; n<M; n++){
    y[n] = 0;
    for (k=0; k<N; k++)
        y[n] += h[k] * x[n-k];
}

With the inner loop unrolled, it changes to:

int n, k, y1, y2, x[N], h[N], y[M];
for (n=0; n<M; n++){
    y1 = y2 = 0;
    for (k=0; k<N; k+=2) {
        y1 += h[k] * x[n-k];
        y2 += h[k+1] * x[n-k-1];
    }
    y[n] = y1 + y2;
}

Processors are also getting faster. This increase in speed necessitates more pipeline stages, which in turn demands more from programmers. Not only do programmers need to understand and modify the algorithms, but they also need to understand the pipeline delays corresponding to different instructions, as well as the underlying hardware resources, to achieve an efficient implementation. The emergence of high-density integrated circuits and devices is igniting a revolution in general-purpose processing as well. Tailoring and dedicating functional units and data paths to take advantage of application-dependent data flow is now becoming possible. Furthermore, machines have been proposed that dynamically change their configuration with changing data sets and algorithm needs. Early results in this area of reconfigurable computing are encouraging. They show that in a number of areas--including cryptography, signal processing, and searching--you can achieve 10 to 100 times the computational density of more conventional processor solutions. This approach was taken by Malleable Technologies (PMC-Sierra) with its on-the-fly reconfigurable processor.

Malleable's processor was a programmable logic core, based on the Malleable Data Path (MDP) concept, in which you start with a CPU-like data path and modify it so that many control bits are fed to the data path on every cycle. The MDP presents programmers with a collection of hardware units that can be used in every cycle. The programmer, however, must make certain that data reaches the function units and that the desired units are available. This allows specialized functionality to be implemented in a small number of cycles, somewhat analogous to the way hardware is defined in an FPGA.

The data path is not intended to replace a CPU at the system level. Instead, it is intended to perform all types of inner loop functions much more efficiently than any other programmable technology available, such as DSP chips or cores. At the system level, one or more MDP cores will be controlled by a traditional CPU, as is normally done in systems that use FPGAs and application-specific integrated circuits (ASICs). The CPU, which may be quite powerful or as simple as an 8-bit microcontroller, will be more efficient in its use of instruction bits for the outer loop control functions that it performs, thereby optimizing the use of memory throughout the system while keeping every element programmable.

The MDP does not replace the system CPU, but it replaces DSP cores and chips, FPGAs, and ASIC blocks that perform computational tasks (for example, forward-error-correction or cryptographic algorithms). It is designed to handle inner loops of complex algorithms in a stand-alone manner, without tight coupling to the controlling CPU. Inner loops may have primitive combinational elements that are very complex, such as large multiplications, complex linear feedback shift registers (LFSRs), or other bit-oriented functions. Examples of such algorithms are: the data encryption standard (DES) algorithm, which repeats a complex, bit-oriented step 16 times to scramble 64 bits of data; and the secure hash algorithm 1 (SHA-1), which uses 32-bit additions, shifts, and Boolean logic operations repeatedly.
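To make the flavor of such bit-oriented inner loops concrete, here is a short C sketch of a single SHA-1 round as used in rounds 0 through 19. The round function, rotation amounts, and constant follow the published algorithm; the message schedule and the remaining 60 rounds (which use different round functions and constants) are omitted.

#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/*
 * One SHA-1 round for rounds 0..19: 32-bit additions, shifts, and
 * Boolean logic -- exactly the kind of bit-oriented inner-loop step
 * described above.  w_t is the current word of the message schedule.
 */
static void sha1_round_0_19(uint32_t state[5], uint32_t w_t)
{
    uint32_t a = state[0], b = state[1], c = state[2];
    uint32_t d = state[3], e = state[4];

    uint32_t f = (b & c) | (~b & d);            /* round function, rounds 0..19 */
    uint32_t temp = rotl32(a, 5) + f + e + 0x5A827999u + w_t;

    state[4] = d;
    state[3] = c;
    state[2] = rotl32(b, 30);
    state[1] = a;
    state[0] = temp;
}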

The MDP could be considered a VLIW/SIMD device. It contains an ALU capable of multiple 8-, 16-, or 32-bit operations. It also contains a multiply unit capable of multiple 16- and 32-bit multiplies in inner-product form. What makes the MDP unique is the logic data path, a 64-bit single-operand function unit performing general-purpose Boolean logic and routing computations. This function unit implements cryptographic algorithms, Reed-Solomon/BCH codes, content addressable memory (CAM), and so on. Furthermore, the MDP instructions pertaining to the logic data path must themselves be designed, as they control the interconnections among gates through direct control bits.

A similar approach is taken by Equator Technologies in the design of its MAP-CA and BSP-15 series processors. These processors support VLIW/SIMD-type instruction sets. Equator has positioned its devices for media processing, and instructions are designed to speed up these applications. The MAP-CA and BSP-15 processors are primarily programmed in C. Equator has also provided the designer with a large collection of powerful "Media Intrinsics" to take advantage of the parallelism inherent in the architecture of its devices.

Yet a different approach has been taken by Improv Systems. The configurable DSP processor core (Jazz DSP) allows just the right amount of processing performance to be delivered to an application with specialized computational units created with the designer's own logic. This capability is extended by combining these preconfigured heterogeneous VLIW processors into a scalable, multiple processor platform. The Jazz DSP processor incorporates overlaid data paths, a distributed register system, and code compression. Designers can incorporate custom register transfer level (RTL) blocks and instructions to create a designer-defined DSP core. Improv's architecture can scale from a single, uniquely configured Jazz DSP processor core to a system-level platform implementation that consists of many of these configured Jazz processors in an interconnected structure defined by shared memory maps between the processors. Improv uses a variant of Java to program its processors. An efficient compiler generates microcode from the high-level Java code.

Unlike the on-the-fly reconfigurability of some devices, such as the MDP, configurable architecture is fixed once it is implemented in silicon. Although the firmware can still change, the hardware resources remain the same. The advantage of this approach is that programmers can simulate their algorithms to find exactly the right number of hardware resources needed, and then synthesize the design, thus achieving a high degree of efficiency. Whereas the on-the-fly reconfigurability provides greater flexibility once the system is designed, the configurable systems are more efficient once the design is fixed.

FPGAs provide a high degree of flexibility to the designer through reconfigurability. Although a large number of reconfigurable systems developed during the past several years have demonstrated the potential for achieving high performance for a range of applications, the performance improvements possible with these systems typically depend on the skill and experience of hardware/software designers.

These devices are typically programmed using hardware description languages (HDLs), such as Verilog or VHDL. A challenge in this area is developing efficient tools to help the designer accomplish performance improvements without being involved in complex low-level manipulations. (Mature design tools do exist for logic and layout synthesis for these programmable devices.)

With the introduction of FPGAs with faster reconfiguration times and partial reconfiguration support, it is possible to use FPGAs in a dynamically reconfigurable environment and thus achieve partial on-the-fly reconfigurability. This technology makes possible the concept of unlimited hardware, or "virtual hardware." The virtual hardware concept is implemented by timesharing a given reconfigurable processing unit (RPU). This requires a scheduler responsible for configuration, execution, and communication among RPUs. Most FPGA vendors today offer products with embedded programmable cores, such as ARM or PowerPC.
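As a rough illustration of this idea, the sketch below timeshares a single RPU among several tasks by reconfiguring it before each run. The structure and the hooks (rpu_task, rpu_load_bitstream, rpu_run) are hypothetical stand-ins for whatever configuration and execution interface a real device and its driver would expose.

#include <stddef.h>
#include <stdio.h>

/* One "virtual hardware" task: a partial-configuration bitstream for the
 * RPU plus the buffers it operates on.  All names here are hypothetical. */
struct rpu_task {
    const unsigned char *bitstream;     /* partial configuration data */
    size_t               bitstream_len;
    void                *in, *out;      /* task I/O buffers */
};

/* Placeholder hooks: a real driver would program and start the device. */
static void rpu_load_bitstream(const unsigned char *bits, size_t len)
{
    (void)bits;
    printf("reconfiguring RPU (%zu bytes)\n", len);
}

static void rpu_run(void *in, void *out)
{
    (void)in; (void)out;
    printf("running task on RPU\n");
}

/* Round-robin scheduler: timeshare one RPU among ntasks tasks by
 * reconfiguring it before each run.  A real scheduler would also overlap
 * reconfiguration with execution and manage communication among tasks. */
static void rpu_schedule(struct rpu_task *tasks, size_t ntasks, int rounds)
{
    for (int r = 0; r < rounds; r++)
        for (size_t i = 0; i < ntasks; i++) {
            rpu_load_bitstream(tasks[i].bitstream, tasks[i].bitstream_len);
            rpu_run(tasks[i].in, tasks[i].out);
        }
}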

Figure 1 shows the types of hardware resources that programmers of embedded systems are likely to see in their target systems. These include multipliers that perform multiplications of various widths in a SIMD fashion. They include barrel shifters and other ALU functions that are also organized in a SIMD style. A programmable logic unit can be programmed on a cycle-by-cycle basis, using variants of HDLs, to implement Boolean equations.

Figure 1

There are also hardwired units that perform tasks such as variable-length decoding (Huffman decoding, used in MPEG decompression), and hardware accelerators, such as those that accelerate the Fast Fourier Transform (FFT, used in xDSL and 802.11a), the Discrete Cosine Transform (DCT, used in MPEG compression/decompression), or the Viterbi algorithm (used in wireless and modem applications).

Programmers will likely see complex programmable address generators that can produce addresses corresponding to complex patterns, such as bit-reverse addressing used in computing FFTs, etc. Programmers will also see register files, which can be helpful for increasing system throughput. They must be able to manage the I/O to and from the register files and the system bus and external interface.
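As a software analogue of what such an address generator produces directly in hardware, the following sketch (a minimal illustration, not tied to any particular device) computes the bit-reversed index sequence used to reorder data for an FFT whose length n is a power of two; a dedicated address generator would emit one such address per cycle.

/* Map index i to the index obtained by reversing its log2(n) address
 * bits -- the access pattern needed to reorder FFT input or output. */
static unsigned bit_reverse(unsigned i, unsigned log2n)
{
    unsigned r = 0;
    for (unsigned b = 0; b < log2n; b++) {
        r = (r << 1) | (i & 1u);
        i >>= 1;
    }
    return r;
}

/* Example use: permute an array of n samples into bit-reversed order. */
static void bit_reverse_permute(float *x, unsigned n, unsigned log2n)
{
    for (unsigned i = 0; i < n; i++) {
        unsigned j = bit_reverse(i, log2n);
        if (j > i) {               /* swap each pair only once */
            float t = x[i];
            x[i] = x[j];
            x[j] = t;
        }
    }
}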

ACHIEVING MAXIMUM PERFORMANCE

Although most of these systems come with compilers, programmers must be prepared to program the systems in low-level languages as well. This is crucial for getting the most performance out of these systems. As discussed earlier, programmers must be intimately familiar with algorithms, so that they may optimally partition the implementation between hardware/hardwired and software/programmable. To illustrate the point further, consider an example that contrasts the more traditional methods of implementation with what is possible today in some reconfigurable devices. Those who are familiar with voice-over-IP (VoIP) systems know that to save bandwidth, these systems employ speech (or silence) detectors, so that when no speech is present in a given direction (path), no audio data is transmitted. If there is total silence, however, people may believe the line has gone "dead." Most designers of VoIP systems provide a measure of background noise and its spectral shape. The receiver then generates random noise and shapes the spectrum and intensity of this locally produced noise based on the information received. This technique is called comfort noise generation (CNG), which makes the overall experience much more pleasing.

Traditionally, designers used either samples of random noise stored in memory or a "random noise generator," which produced random noise with a given variance (intensity or power). This noise was then filtered to produce the "right" background noise. This is computationally expensive, however. Designers typically want to reduce this computational load so that the cycles can be spent on the encoding process and echo cancellation.

The approach taken at Malleable Technologies achieved precisely this goal. (See Figure 2.) The designers used a properly chosen "primitive polynomial" of degree 31. This polynomial determines the feedback taps of a linear feedback shift register (LFSR) and generates a pseudo-random noise sequence of length 2^31 - 1. The designers used the programmable logic unit of the MDP (the logic data path, or LDP) to implement this LFSR (similar to the way a CRC is implemented) and produce 32 bits per cycle. This vector of 32 bits was then partitioned into four bytes. The statistical properties of this new byte-based sequence and its variance were determined through simulation and used to generate the "right" noise power. This sequence of bytes was then used as input to the arithmetic unit of the MDP (ADP) to shape its spectrum. The MDP's arithmetic unit was capable of performing eight 8x16 multiplies. A FIR filter of length 8 was designed to approximate the background noise. This implementation had a throughput of one sample per cycle.

Figure 2
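The following plain-C sketch mirrors the structure of that implementation. The article does not specify the exact primitive polynomial or filter coefficients, so the LFSR taps below assume the commonly tabulated degree-31 primitive trinomial x^31 + x^28 + 1 (period 2^31 - 1), and the 8-tap FIR coefficients are placeholders standing in for a filter designed offline to match the measured background-noise spectrum.

#include <stdint.h>

static uint32_t lfsr = 1u;              /* any nonzero 31-bit seed */

/* Produce 32 pseudo-random bits per call, one bit per shift, as the
 * LDP implementation produced per cycle. */
static uint32_t lfsr_next32(void)
{
    uint32_t word = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t fb = ((lfsr >> 30) ^ (lfsr >> 27)) & 1u;   /* taps 31 and 28 (assumed) */
        lfsr = ((lfsr << 1) | fb) & 0x7FFFFFFFu;            /* keep 31 state bits */
        word = (word << 1) | fb;
    }
    return word;
}

/* Split each 32-bit word into 4 bytes, scale to the reported noise power
 * (gain_q15 is a Q15 gain found by simulation), and shape the spectrum
 * with an 8-tap FIR -- one comfort-noise sample per call. */
static int16_t cng_sample(int16_t gain_q15)
{
    /* placeholder Q15 coefficients; a real design would take these from
       a filter-design tool matched to the measured noise spectrum */
    static const int16_t h[8] = { 8192, 6144, 4096, 2048,
                                  1024,  512,  256,  128 };
    static int16_t d[8];                /* FIR delay line */
    static uint32_t word;
    static int bytes_left;

    if (bytes_left == 0) { word = lfsr_next32(); bytes_left = 4; }
    int16_t raw = (int16_t)(word & 0xFFu) - 128;   /* zero-mean noise byte */
    word >>= 8; bytes_left--;

    /* shift the delay line and accumulate the 8-tap inner product;
       the MDP's arithmetic unit evaluated all eight products in
       parallel, giving one output sample per cycle */
    for (int k = 7; k > 0; k--) d[k] = d[k - 1];
    d[0] = (int16_t)(((int32_t)raw * gain_q15) >> 15);

    int32_t acc = 0;
    for (int k = 0; k < 8; k++)
        acc += (int32_t)h[k] * d[k];
    return (int16_t)(acc >> 15);
}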

This example emphasizes most of the points made throughout this article. It demonstrates that programmers need to understand the guts of the algorithm and methods of modifying it. They must be able to describe and implement hardware (LFSR in this case) by writing the Boolean equations in a Verilog-like HDL and use the tools to generate the proper instructions. They also need to design filters using tools such as MATLAB/SPW/PTOLEMY. They must be able to simulate their designs before implementation to verify that they function correctly. They need to be familiar with the hardware resources at their disposal and with ways of using them efficiently.

UNDERSTANDING BOTH SIDES

The ever-blurring lines between hardware and software are partly a result of recent advances in VLSI technology, and partly a consequence of the need for devices to operate under tight timing, size, power, and weight limits. One of the earliest examples of this blurring came with the incorporation of MAC units into general-purpose processors. Programmers thus had to fully understand the algorithms, as well as be aware of the capabilities of the underlying hardware. They needed to be able to repartition and modify the algorithms to take advantage of the hardware resources at their disposal. The trend continued with the addition of multiple ALUs and MACs to processors, which in turn gave rise to SIMD, VLIW, and superscalar designs. Recent designs and products from DSP vendors, such as Texas Instruments and Analog Devices, reflect this trend.

Although being aware of hardware and coding at very low levels is nothing new to DSP practitioners, it is certainly new to the rest of the programming community. The lines between hardware and software design for embedded systems and SoCs are blurring to the point that the software designer needs to be fluent not only in traditional high-level languages, but also in assembly code and HDLs. The same is true of hardware designers, who can no longer design hardware in a semi-vacuum, as in the past. Yet another general-purpose DSP is not useful anymore. Hardware designers need to work with software designers to create efficient designs that meet these stringent requirements.

Today, programmers of embedded systems and systems on chips must not only understand algorithms (theory), but also be able to specify hardware and describe it in a hardware description language such as Verilog, if they are to survive in these competitive times.

References

Basoglu, Chris, Woobin Lee, and John O'Donnell. The Equator MAP-CA DSP: An End-to-End Broadband Signal Processor VLIW. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 12, No. 8, August 2002.

Malleable Technologies (PMC-Sierra). MDP-1 Hardware Architecture and Programming Manual. 2000.

Priebe, R. and C. Ussery. A Configurable Platform for Advanced System-on-Chip Applications. ICSPAT 2000, Dallas, TX, October 2000.

 

HOMAYOUN SHAHRI, PH.D., is a principal partner at Tufon Consulting and an adjunct professor of electrical engineering at the University of Southern California (USC).

Originally published in Queue vol. 1, no. 2