
Mapping Algorithms to Architectures
HOMAYOUN SHAHRI, TUFON CONSULTING

Knowledge of both the algorithm and target architecture is crucial.

Our complex world is characterized by representation, transmission, and storage of information—and information is mostly processed in digital form. With the advent of DSPs (digital signal processors), engineers are able to implement complex algorithms with relative ease. Today we find DSPs all around us—in cars, digital cameras, MP3 and DVD players, modems, and so forth. Their widespread use and deployment in complex systems has triggered a revolution in DSP architectures, which in turn has enabled engineers to implement algorithms of ever-increasing complexity. A DSP programmer today must be proficient in not only digital signal processing but also computer architecture and software engineering.

This article focuses on mapping DSP algorithms to DSP architectures. The goal is to describe efficient methodologies and techniques for doing so and to aid readers in their work as DSP programmers. We discuss the types of algorithms that map well to DSPs and those that don't, as well as the state of available tools that can facilitate simulation and mapping of algorithms to DSPs. We start with a short history of DSPs and their evolution, then introduce DSP algorithms and survey the available tools. We then describe efficient methodologies for mapping complex algorithms to DSP architectures, and to make the proposed methodology concrete, we conclude with an example in which we map a well-known algorithm to a hypothetical architecture.

A SHORT HISTORY

The inventions of the transistor, beginning in the 1950s, and the integrated circuit, beginning in the 1960s, led to many new and improved electronic products. By the end of the 20th century, consumer electronics had become a major sector of the world economy. A major factor responsible for this rapid growth has been digital signal processing.

Signals are categorized as either analog or digital. Analog signals vary continuously with time, whereas digital signals are represented by binary numbers, a sequence of 0s and 1s. Crucial advantages of digital signals are that they can be represented, transmitted, and stored exactly, and that they can be readily manipulated.

While seismic scientists had made use of digital filtering to solve certain interesting problems, it was not until the mid-1960s that a more formal theory of digital signal processing began to emerge. Some of the first major contributions to the field were made by James Kaiser1 at Bell Laboratories in the area of digital filter design and synthesis, and by Hank McDonald in the area of A/D (analog-to-digital) and D/A (digital-to-analog) conversion, as well as by Rader and Gold, who wrote the first book on DSP.2 At roughly the same time, the emerging field received a major impetus from the work of Cooley and Tukey on what is now known as the FFT (fast Fourier transform),3 a technique for computing the frequencies present in a signal.

By the mid-1960s the potential of integrated circuits was appreciated, and it was not unreasonable to imagine complete signal processing systems that could best be synthesized with digital components. It would take 20 years before this vision became a more prevalent and practical reality. In the meantime, throughout the 1960s and 1970s, most digital signal processing systems were implemented using hard-wired (fixed-function) solutions.

The invention of the microprocessor (computer on a chip, computing with low power) in the early 1970s changed this paradigm. For the first time it became possible to implement digital signal processing systems entirely in software. Although digital signal processing as a branch of engineering remains unknown to most people, many applications such as speech synthesis and recognition, and image, audio, and video compression have now become familiar. The techniques of digital signal processing, from the 1960s to the present, have played a large role in the expansion of electronic products.

DIGITAL SIGNAL PROCESSORS

Single-chip DSPs first appeared in the early 1980s. Initially, DSPs were general-purpose microprocessors with certain specialized hardware and related instructions (single-cycle multiplies, multiply-accumulate, multiple accesses to memory) that aided the implementation of DSP algorithms. When DSP algorithms were implemented on DSPs rather than on general-purpose microprocessors, the result was more than an order-of-magnitude increase in speed.

The tools initially available to DSP programmers were typically assemblers and debuggers. DSP engineers had to write programs in obscure assembly languages for these early DSPs. The resulting code was hard to write and maintain. The implementation of ever-more complex algorithms on DSPs resulted in the rapid evolution of DSPs in parallel with microprocessors, as well as the evolution of tools such as C compilers and integrated development environments. Now there are two general classes of architectures that achieve instruction-level parallelism and high performance: superscalar and VLIW (very long instruction word) architectures.

Superscalar architectures use special on-chip hardware to look through the instruction stream for independent operations that can be executed concurrently to maximize parallelism. These processors are designed to dynamically solve data dependencies in hardware and to issue parallel operations. Each instruction in superscalar processors performs a single operation only. To implement instruction-level parallelism, the on-chip hardware looks at several instructions at a time and tries to execute as many instructions concurrently as it can. To keep the pipeline busy and increase the throughput, the hardware performs branch predictions. This adds to the complexity of superscalar architectures.

VLIW processors have highly parallel architectures that offer an alternative to multiprocessors. Like superscalar processors, VLIW processors typically have many independent functional units, each sharing multiple banks of on-chip or external memory. Since each instruction contains an op code for each functional unit, many units can execute their assigned operations concurrently. As opposed to the superscalar approach, VLIW processors do not need a lot of on-chip hardware to resolve data dependencies or issue parallel operations. This is because the scheduling of all parallel operations is the responsibility of the compiler. Thus, the compiler must understand the target architecture intimately to be able to schedule any given algorithm for optimal performance. Even when coding in a high-level language like C, however, programmers must be familiar with details of the architecture. Thus, the burden of issuing parallel instructions is shifted to the software.

During the past few years, reconfigurable architectures have also emerged. Their proponents argue that to adapt to the numerous fast-changing standards in the wireless and consumer arenas, under constraints on die size, power consumption, and required computational throughput, next-generation communication and consumer products demand levels of flexibility and programmability that traditional fixed DSP devices cannot afford. Fully programmable DSP chips may be too inefficient for some of these applications. In reconfigurable architectures, the designer has the flexibility to define instructions that perform the needed task efficiently, or to add specialized hardware that performs part of the task very efficiently. Reconfigurable architectures, in short, are a hybrid between fully programmable DSPs and hard-wired ASICs.4

DSP ALGORITHMS

Webster defines algorithm as a procedure for solving a mathematical problem in a finite number of steps that frequently involves repetition of an operation, or a step-by-step procedure for solving a problem or accomplishing some end especially by a computer. The term DSP algorithm is very broad and covers a wide variety of algorithms—from relatively simple ones such as cyclic convolution, which is used in implementing FIR (finite impulse response) filters, to somewhat more complex algorithms such as FFT or DCT (discrete cosine transform), to very complex algorithms such as speech/audio, image/video coders, and modems.

Digital signal processing is the purposeful alteration of digital signals to improve their transmission or storage, or to enhance their quality. Examples of these alterations include filtering (in pre- and post-processing, as well as decomposition), transformation (in compression), estimation (in motion estimation for video compression), detection (in radar), analysis (in speech compression), recognition (in speech recognition), and synthesis (in speech synthesis and compression). As such, digital signal processing algorithms have certain blocks in common, and a digital signal processing system ties many of these small blocks together.

To implement complex DSP algorithms efficiently, therefore, we must start from efficient implementations of these small blocks, which typically consume a very large percentage of the total computation. DSP programmers need to consider parameters such as memory usage (important for portable and low-power devices), as well as complexity measured in number of operations, and they must choose algorithms based on a proper measure of complexity for their target architectures. For example, an FFT algorithm that minimizes the number of multiplies at the expense of a drastic increase in the number of additions is not suitable for mapping to most DSP architectures, since on a DSP an add is as expensive as a multiply. A better measure of complexity on DSPs is the number of multiply-adds or multiply-accumulates (MACs); even this metric is device-dependent, however, and DSP engineers must be cognizant of it.
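To see why multiply-accumulates dominate, consider the inner loop of an FIR filter, which is essentially nothing but MACs. (The following is a minimal sketch in C; the function and variable names are ours, for illustration only.)

/* Minimal FIR filter kernel: y[n] = sum over k of h[k] * x[n-k].
   Virtually all of the work is the multiply-accumulate in the inner
   loop, which is why the MAC count is the natural complexity metric
   on a DSP. */
long fir_sample(const long *x, const long *h, int ntaps)
{
    long acc = 0;
    int k;

    for (k = 0; k < ntaps; k++)
        acc += h[k] * x[-k];  /* one MAC per tap; x points at the newest sample */
    return acc;
}

A single-cycle MAC unit executes each iteration's multiply and add together, which is precisely the specialized hardware that distinguishes a DSP from a general-purpose microprocessor.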

Once you have chosen the right algorithms for your DSP system, you can proceed to the implementation, typically in a high-level language (usually C), and evaluate the correctness and functionality of the system. Because of power-consumption and die-size considerations, most DSPs still perform fixed-point operations. If your initial design and implementation use floating-point numbers, the next step is conversion of the algorithms to fixed-point (integer) arithmetic. DSP programmers must be intimately familiar with the internals of the algorithms to carry out this task and to guard against overflow and underflow. This process is typically carried out block by block, and at the end of the conversion of each block, its functionality and correctness are checked. Once the suitable algorithms for the device have been chosen and implemented, and all the blocks have been converted to fixed-point (for fixed-point devices), you are ready to embark on the next task: mapping the code to the target device. You must be aware of the capabilities of the DSP development tools before you start mapping the algorithms.
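As an illustration of guarding against overflow during the integerization step just described, the following sketch (our own; the helper name is hypothetical) converts a floating-point value in [-1, 1) to Q.31 format with explicit saturation:

#include <stdint.h>

/* Convert a floating-point value in [-1.0, 1.0) to Q.31 fixed point,
   saturating out-of-range values instead of letting them wrap. */
int32_t float_to_q31(double r)
{
    double scaled = r * 2147483648.0;        /* r * 2^31 */

    if (scaled >= 2147483647.0)              /* guard against overflow */
        return INT32_MAX;
    if (scaled <= -2147483648.0)             /* guard against underflow */
        return INT32_MIN;
    return (int32_t)(scaled + (scaled >= 0 ? 0.5 : -0.5));  /* round */
}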

Note that bit-oriented algorithms, as well as algorithms that contain complex state machines, because of their inherent sequential nature, do not map well to DSP architectures. Reconfigurable DSPs usually do give users the ability to design and include their own bit-level manipulation engines. Some specialized DSPs also include bit-level manipulation engines for implementing bit-oriented algorithms, such as Huffman decoders.

THE STATE OF DSP TOOLS

Over the past two decades DSP development tools have evolved significantly. When DSPs emerged in the late ’70s and early ’80s, the only tools available to DSP engineers were assemblers and debuggers. The first DSP architectures had many anomalies that made it very difficult to generate compact code. Today, however, DSP engineers have access to not only assemblers, but also compilers for high-level languages such as C and C++, optimizers, and profilers—as well as high-level design and simulation tools.

In the early days of DSPs, programmers were responsible for filling the delay slots corresponding to branches and setting up the loops to fill the pipeline (software pipelining), performing loop unrolling, and in-lining of certain critical functions, but today the compiler handles most of this complexity. For some DSPs with simpler architectures, as well as superscalar architectures, a well-designed compiler could function at upwards of 80 percent efficiency as compared with hand-optimized assembly, which is sufficient for most applications. The compilers for VLIW architectures may not perform as well, since it is difficult to extract parallelism from algorithms without direct feedback from the programmer. The compilers for this class of DSPs typically perform better if the number of pipeline stages is small; their performance degrades as the pipeline deepens.

Today several vendors provide high-level tools for design and simulation of complex digital signal processing algorithms and systems. Using high-level tools such as Matlab, SPW (Signal Processing Worksystem), and Mathematica,5 algorithm designers can divide the DSP algorithm into smaller interconnected blocks, which can be modified or replaced until the desired performance is reached (for example, change the FFT size at will for a DMT [discrete multitone]). Algorithm designers can represent their algorithms in block diagrams and easily analyze the input and output of these blocks.

These high-level simulation tools also provide a rich library of DSP primitives, functions, and algorithms that greatly help designers and can serve as building blocks in more complex algorithms or systems. These tools also offer optional packages that generate DSP code for certain fixed target devices. Although the generated code can serve as a good starting point for further optimization, the performance of such automatically generated code rarely meets the expectations of the programmer and falls short of the application requirements.

To assist programmers, most DSP vendors provide a library of optimized functions and algorithms, such as FFT, DCT, motion estimation, FIR and IIR filters, and adaptive filtering algorithms. As discussed earlier, these are the kinds of blocks that consume the large majority of cycles. Most DSP vendors also provide, along with their compilers, special high-level APIs (application programming interfaces), usually called intrinsics, which are function calls that translate into single assembly instructions. Intrinsics are most useful for VLIW architectures and vector processors, where they can help the compiler schedule instructions optimally. It is also through the use of intrinsics that programmers can gain access to specialized hard-wired blocks (motion estimation, DCT, Huffman decoders, etc.) from a high-level language such as C.

For programmers of reconfigurable architectures, the tools typically also provide the ability to add their own intrinsics to the compiler, and thereby use the custom blocks from high-level languages such as C or C++. In effect, the tools are modified to match the characteristics of the reconfigurable architecture. Some of these tools also give programmers the ability to ask "what if?" and assess how adding their intrinsics (representations of hardware) affects the performance of the algorithm.

The types of optimization performed by the compilers include, but are not limited to, the following:

- Filling branch delay slots
- Setting up loops and software pipelining to keep the pipeline full
- Loop unrolling
- In-lining of critical functions
- Scheduling instructions to keep the functional units (and VLIW instruction words) filled

Improving the performance of VLIW compilers is an active field of research.

VERIFICATION OF DSP ALGORITHMS AND SYSTEMS

Once the DSP algorithms are designed, typically using either high-level design and simulation tools or a high-level programming language such as C, a set of test vectors and possibly reference software are generated. The test vectors and reference software are crucial for verifying the implementation of DSP algorithms on target DSP architectures. For example, test vectors and reference software exist for most standards-based algorithms (ITU- and ISO-based algorithms for speech, audio, image, and video compression and decompression). The reference software may use fixed-point or floating-point representation. A proper metric is typically also provided, along with the test vectors and reference software, to measure the correctness of the implementation. If only a set of test vectors and a relevant metric are provided, it is prudent for DSP programmers to write software that conforms to the test vectors (produces the right results) running on a general-purpose computer. This software then serves as the reference software during the course of mapping the algorithm to the target architecture. Otherwise, verification becomes much more difficult.

Using the reference software and test vectors, DSP programmers can isolate each block (in terms of its input and output) in the algorithm and verify its implementation, since the performance of each block can be measured against that of the reference software using the appropriate test vectors. The advantage of this is that DSP programmers can verify the implementation as they map the algorithm to the target architecture. The reference software can also serve as the starting point for mapping the DSP algorithm to the target architecture.
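A minimal verification harness of this kind simply compares a block's output against the reference output for each test vector, using an agreed-upon metric. The sketch below (our own, with hypothetical names) uses maximum absolute error:

#include <stdio.h>
#include <stdlib.h>

/* Compare a block's output against the reference output for one test
   vector. Returns 0 on success, nonzero on failure. The tolerance
   allows for small rounding differences introduced by the
   fixed-point conversion. */
int verify_block(const long *out, const long *ref, int n, long tolerance)
{
    long max_err = 0;
    int i;

    for (i = 0; i < n; i++) {
        long err = labs(out[i] - ref[i]);
        if (err > max_err)
            max_err = err;
    }
    printf("max abs error: %ld (tolerance %ld)\n", max_err, tolerance);
    return max_err > tolerance;
}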

MAPPING DSP ALGORITHMS TO DSP ARCHITECTURES

Throughout this article, we have articulated the need to develop a detailed understanding of the algorithms to be mapped to the target DSP architecture, as well as of the underlying architecture itself. As a first step, it is important to study the architecture of the target DSP in detail and understand the hardware resources at our disposal. For example, you need to identify the number of multipliers and adders, the pipeline depth, the memory buses and number of memory ports (allowing simultaneous memory access), the number of cycles it takes to load data from memory, the amount of instruction and data cache, the DMA (direct memory access) controller, the existence of specialized hard-wired blocks, instruction dependencies, branch penalties, underflow and overflow protection (saturation), flags, register file(s), data paths, and so forth.

For reconfigurable architectures, you need to understand your budget in terms of power, die size, development time, and performance. You need to analyze the algorithm carefully and understand the trade-offs of adding specialized blocks—such as SAD (sum of absolute difference) for motion estimation, etc.—in terms of additional die size, power, and performance. New instructions and blocks should be added very judiciously. The new blocks must substantially increase the performance and not increase the die size above the budget. Since addition of these blocks increases the verification time of the silicon and creates more risks, this is an important pitfall to watch out for.

The next step is to understand the tools and determine their capabilities. Since today almost all DSP vendors provide C compilers, we do not consider assemblers in this article, even though they do have their proper applications and are still used. For example, programmers frequently review the generated assembly to ensure proper optimizations. You need to understand what aspects of the target hardware are accessible from high-level tools, which is important for memory partitioning, as well as mapping of portions of the algorithm to specialized blocks. You also need to understand to what extent the tools perform optimization, such as filling the branch delay slots, loop setup, loop unrolling, and instruction scheduling to fill up the pipeline, as well as keeping the VLIW instructions filled (all units busy). Finally, you need to understand the intrinsics that are provided by the compiler. Proper use of these intrinsics greatly aids the compiler in producing efficient code.

For reconfigurable architectures, you need to design intrinsics that use the newly designed blocks, and to convey to the compiler the latencies and data dependencies of these new blocks. Today most vendors give programmers the ability to iterate the design (configuration) until they find a near-optimal solution (in terms of power, MIPS, and die size), given the budget and the underlying architecture.

You should now be ready to start mapping to the target architecture. In my experience, to the extent possible, you should start from code that compiles correctly and performs properly on the target architecture. Since today most DSP vendors provide compilers, what I mean by mapping a DSP algorithm to an architecture is optimizing the code for the target architecture. For each portion of the algorithm, begin by identifying the parts that consume most of the cycles, and target these first for optimization.

Once you have identified a block for optimization, start by removing unnecessary branches and/or combining branches. Look for loops, especially inner loops containing few instructions, and judiciously unroll them, and identify certain critical functions for in-lining. Unrolling is a technique frequently used to remove loop overhead, since at every iteration the loop variables must be tested. Certain architectures can implement zero-overhead loops using specialized counters if the loop count is a constant; this is accomplished by filling the pipeline before entering the loop and flushing the pipeline after exiting it (software pipelining). Loops with small constant counts are also good targets for unrolling. With loop unrolling, you sacrifice code size for performance, as the sketch below illustrates.
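As a simple illustration (our own, not from the example that follows), unrolling a short multiply-accumulate loop by a factor of four removes three-quarters of the loop-count tests and branches:

/* Before: one loop-count test and branch per MAC. */
for (i = 0; i < 64; i++)
    acc += h[i] * x[i];

/* After unrolling by 4: one test and branch per four MACs.
   This assumes the trip count (64 here) is a multiple of 4. */
for (i = 0; i < 64; i += 4) {
    acc += h[i]   * x[i];
    acc += h[i+1] * x[i+1];
    acc += h[i+2] * x[i+2];
    acc += h[i+3] * x[i+3];
}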

It is also always a good idea to study the generated assembly code to observe the optimizations performed by the tools and look for other potential gains. For devices with very deep pipelines, you may have to hand-optimize certain critical inner loops. It is also important to point out that DSPs have highly optimized data paths. One of the most significant causes of performance degradation is misalignment and improper loading of data.
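For instance, most toolchains provide a way to force the alignment of buffers so that packed loads do not straddle bus boundaries. The attribute syntax below is GCC's; each DSP vendor has its own directive, so treat this purely as an illustration:

/* Align the window and sample buffers on 64-bit boundaries so that
   each packed load fetches two 32-bit operands in a single access. */
long ew[512]   __attribute__((aligned(8)));
long x[2][512] __attribute__((aligned(8)));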

The best way to demonstrate these optimizations is with an example. Consider the front-end portion of an MPEG Layer III (MP3) encoder,6 in which incoming audio is read 32 samples at a time and is windowed (512 samples, with a Hann window, i.e., multiplied point by point by a special vector) and then divided into equal frequency bands (sub-band analysis) using polyphase filters (a special way of implementing the filter banks). Implementing sub-band analysis with polyphase filters, when possible, is known to be very efficient from the complexity point of view (choosing the "right" algorithm). Assume that the code has been integerized, or converted to fixed-point representation. The formula below shows how a real number r (r in IR, the set of real numbers, with |r| < 1) is represented as an integer R in Q.n format:

R = round(r × 2^n)

Thus, a Q.31 number contains 31 binary digits to the right of the binary point. We assume that the window is represented in Q.31 format (sufficient resolution for this part of the code) by multiplying the window coefficients by 0x7fffffff. The coefficients are all less than one in magnitude, and thus the resulting 32-bit integer coefficients can be treated as fractions. The filter coefficients have been similarly converted to Q.31 format. Data is received as 16-bit integers (see figure 1 for the integerized code).

Figure 1 Integerized Code

// This is a helper function that is used to
// implement a 32-bit multiply and shift.
//
// Perform a 32-bit multiply and return the most
// significant 32 bits of the result.
// Returns Q.30.

#define HAN_SIZE 512
#define SBLIMIT 32

// For an optimized implementation this function should be inlined.
long mul(long a1, long a2)
{
    long long A;
    A = (long long) a1 * (long long) a2;
    return (long) (A >> 32);
}

// This code is from the shine implementation of the MP3 encoder.
// x, z, ew, off, and fl are globals defined elsewhere in the encoder.
void L3_window_filter_subband(short **buffer, long s[SBLIMIT], int k)
{
    long y[64];
    int i, j;

    // read 32 new samples and convert to Q.31
    for (i = 31; i >= 0; i--)
        x[k][i+off[k]] = ((long) *(*buffer)++) << 16;  // Q.31

    // Hann window is: 0.5*{1 - cos[2*PI*(i+1)/(HAN_SIZE+1)]}
    // window (ew[i]) the new vector, HAN_SIZE = 512
    for (i = HAN_SIZE; i--; )
        z[k][i] = mul(x[k][(i+off[k]) & (HAN_SIZE-1)], ew[i]);  // Q.30

    // get ready for the next block, and modify the offset
    off[k] = (off[k] + 480) & (HAN_SIZE-1);

    // implement the polyphase filters, accumulate first
    for (i = 64; i--; )
        for (j = 8, y[i] = 0; j--; )
            y[i] += z[k][i+(j<<6)];  // Q.30

    // the coefficients are: cos[2*(i+1)*(j-16)*PI/64]
    // now filter the data and produce 32 new subband coefficients
    for (i = SBLIMIT; i--; )
        for (j = 64, s[i] = 0; j--; )
            s[i] += mul(fl[i][j], y[j]);  // Q.31 * Q.30, result is Q.29
}

Now assume that our target hardware has two execution units, each capable of performing two 32-bit multiply-adds per cycle when both operands are packed in a 64-bit register, as well as a single 32-bit multiply-add. Furthermore, assume that the hardware can perform a dot product of vectors of length two packed in two 64-bit registers, optionally adding a 64-bit scalar to the result. Let's further assume that the compiler provides special types to represent 64-bit signed and unsigned integers, directives to align data on 64-bit boundaries, and intrinsics for right shifting with optional rounding. Since the hardware has two execution units, you can address either one using intrinsics. If both execution units are addressed (indexed by exe_1 and exe_2), the compiler schedules them in parallel. Assume the intrinsics are as depicted in figure 2.

Figure 2 Intrinsics

// returns a + ((b*c) >> 31) signed; operation is performed in execute unit 1.
int multiply_add_exe_1(int a, int b, int c);

// returns a + (b1*c1 + b0*c0) signed; operation is performed in execute unit 1.
u64 dotproduct_add_exe_1(u64 a, u64 b, u64 c);

// returns (a >> b) sign extended, with optional rounding
u64 right_shift_round_exe_1(u64 a, int b);

// The corresponding _exe_2 intrinsics address the second execution unit.

The new optimized code is shown in figure 3.

Figure 3 Optimized Code

// ew is a 64-bit aligned Hann window of length 512.
// x is a 64-bit aligned vector of length 512; y packs 64 32-bit
// values into 32 64-bit words.
void L3_window_filter_subband(short **buffer, long s[SBLIMIT], int k)
{
    int i, j, n, m, p, r;
    u64 s64_0, s64_1;
    u64 y[64/2];
    long *x_ptr32;
    long *y_ptr32;
    long *ew_ptr32;
    long y0, y1;

    x_ptr32 = (long *) x[k];  // &x[k][0]

    /* shift samples and place them at the proper window positions */
    for (i = 31; i >= 0; i--)
        x_ptr32[i+off[k]] = ((long) *(*buffer)++) << 15;  // result is Q.30

    /* convert the u64 pointers to long pointers for the next block */
    ew_ptr32 = (long *) ew;
    y_ptr32 = (long *) y;

    for (i = 0, j = 63; i < HAN_SIZE; i += 16, j -= 2) {
        n = 448;
        m = j + n;
        p = j + n - 1;
        y0 = y1 = 0;
        // This loop is an excellent choice for unrolling, which we
        // have not done here, to aid readability.
        for (r = 0; r < 8; r++) {
            y1 = multiply_add_exe_1(y1, x_ptr32[(m+off[k]) & (HAN_SIZE-1)],
                                    ew_ptr32[m]);  // Q.30
            y0 = multiply_add_exe_2(y0, x_ptr32[(p+off[k]) & (HAN_SIZE-1)],
                                    ew_ptr32[p]);  // Q.30
            // the compiler will execute these in both execute units!
            m -= 64;
            p -= 64;
        }
        y_ptr32[j] = y1;
        y_ptr32[j-1] = y0;
    }

    off[k] = (off[k] + 480) & (HAN_SIZE-1);  /* offset is modulo HAN_SIZE */

    for (i = SBLIMIT-1; i >= 0; i -= 2) {
        s64_0 = 0;
        s64_1 = 0;
        // this loop should also be unrolled to increase performance
        for (j = 64/2; j--; ) {
            // Q.31 * Q.30 = Q.61
            s64_0 = dotproduct_add_exe_1(s64_0, fl[i][j], y[j]);
            // Q.31 * Q.30 = Q.61
            s64_1 = dotproduct_add_exe_2(s64_1, fl[i-1][j], y[j]);
        }
        // Q.61 >> 32 = Q.29
        s[i]   = (long) right_shift_round_exe_1(s64_0, 32);
        // Q.61 >> 32 = Q.29
        s[i-1] = (long) right_shift_round_exe_2(s64_1, 32);
    }
}

Note that we have not unrolled the loops here, in order to aid the readability of the code. In the last loop, unrolling, or preconditioning of the inner loop if supported by the hardware (not testing the loop conditions at every iteration), will greatly enhance performance. Alternatively, if the hardware supports zero-overhead looping, you can achieve the same performance increase by using special counters.

Furthermore, notice how the second and third loops of figure 1 are combined in figure 3 to reduce data loads and stores. Because of nonconsecutive addressing, we cannot use the vectorized form of the multiply_add() instruction, but we can still achieve some parallelism by using both execution units. In the last loop, note how the dotproduct_add() instruction is used to gain a factor of at least four (and potentially much more) improvement over the code in figure 1.
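To make the unrolling suggestion concrete, here is one way the inner loop of the last loop in figure 3 could be unrolled by a factor of two, using the same hypothetical intrinsics (a sketch, not part of the original code):

// Inner dot-product loop unrolled by 2: half as many loop-count
// tests and branches for the same amount of work.
for (j = 64/2; j > 0; j -= 2) {
    s64_0 = dotproduct_add_exe_1(s64_0, fl[i][j-1],   y[j-1]);  // Q.61
    s64_1 = dotproduct_add_exe_2(s64_1, fl[i-1][j-1], y[j-1]);  // Q.61
    s64_0 = dotproduct_add_exe_1(s64_0, fl[i][j-2],   y[j-2]);
    s64_1 = dotproduct_add_exe_2(s64_1, fl[i-1][j-2], y[j-2]);
}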

Certain architectures also provide specialized address generators that greatly help with nonconsecutive addressing of data, and these can significantly enhance the performance of the computations in the second loop of figure 3. Such address generators are usually programmed using intrinsics and then used as indexes into arrays, thus avoiding array-index computations. Furthermore, in devices with separate ALUs and MAC (multiply-accumulate) units with separate data paths, the post-decrements of the array indexes in the second loop could take place simultaneously with the computations; this depends on the particular pipeline stage at which addresses are generated. This optimization should be transparent to the programmer, since the compiler usually does a good job of performing these types of optimizations. Again, it is a good idea to study the generated assembly code, especially the inner loops, to ensure that the proper optimizations have taken place.

Our example illustrates the procedures discussed in the course of this article. Obviously, there is no single optimal methodology for mapping DSP algorithms to DSP architectures. What we have discussed are some of the issues a DSP programmer faces when attempting to map algorithms to target architectures. The advantage of the approach taken is that it is methodical and, if followed, guarantees a good solution, although not necessarily an optimal one. A possible pitfall to be aware of is spending unnecessary time optimizing portions of the algorithm that are not part of the inner loops. It is therefore very important that the DSP programmer profile the code on the target architecture, as opposed to profiling the reference code running on a general-purpose computer, to determine the portions that should be targeted for optimization first; the results may be quite different.

Although most compilers for VLIW architectures can achieve some level of parallelism and performance increase by unrolling the code in figure 3, they fail to combine loops and take advantage of indexing tricks that come with a detailed knowledge of the algorithm.

NOT A TRIVIAL TASK

We claimed at the beginning of this article that DSPs are everywhere. We showed that in order to map DSP algorithms effectively to DSP architectures, we must be very familiar with the algorithms and have an intimate understanding of the target architecture. DSP design, development, and simulation tools have evolved alongside DSPs and have made mapping of algorithms less painstaking, but they are still not at a level of performance to completely automate the mapping and porting of DSP algorithms.

In the applications where DSPs are in widespread use, the issue of utmost importance is, to the extent possible, optimal mapping of the algorithms to target architectures, since this directly translates into reduction of power and die size. Mapping of DSP algorithms to target architectures is not a trivial task. It requires several different skill sets. It brings together knowledge of DSP algorithm development, software engineering, and computer architecture.

REFERENCES

1. James Kaiser oral history transcript. IEEE History Center, February 11, 1997; http://www.ieee.org/organizations/history_center/sloan/ASSR_Oral_Histories/jkaiser_transcript.html.

2. Rader, C. M., and Gold, B. Digital Processing of Signals. McGraw-Hill, New York, NY, 1969.

3. Cooley, J. W., and Tukey, J. W. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation 19 (1965), 297-301.

4. Abbott, C. MDP-1 Architecture and Programming Manual. Malleable Technologies (now PMC-Sierra).

5. Matlab: http://www.mathworks.com; SPW: http://www.cadence.com; Mathematica: http://www.wolfram.com/products/mathematica.

6. ISO/IEC 11172-3. Information technology: Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s, Part 3: Audio.

HOMAYOUN SHAHRI, Ph.D., is a principal partner at Tufon Consulting and an adjunct professor of electrical engineering at the University of Southern California (USC), where he teaches digital signal processing.


Originally published in Queue vol. 2, no. 1