## New FPGAs tackle real-time DSP tasks for defense applications

## by Rodger Hosking, Pentek

As software-defined radio technology further penetrates large communication systems, the need to accommodate a large number of agile frequency channels for radio receivers is quite apparent. These market forces drive the need for better DSP performance.

The defense engineering community is dealing with more complex waveforms, more channels in radar and beam-forming applications, and communication systems with wider bandwidths. As software-defined radio technology further penetrates large communication systems for battlefield military radio networks, wireless systems, manned and unmanned aerial vehicles, and monitoring facilities for SIGINT and COMINT, the need to accommodate a large number of agile frequency channels for radio receivers is quite apparent. These market forces drive the need for better DSP performance, high-speed interconnects, and increased data sampling-rates.

Without exception, the latest devices from the major FPGA vendors now offer second- or third-generation DSP blocks. They include extended-precision multiplier/accumulators, advanced arithmetic units, logic engines, and flexible memory structures that can be tailored into block memory, dual-port RAM, FIFO memory, and shift registers. Competition among FPGA vendors is stronger than ever, leading to an exciting race for features that deliver maximum performance and specific benefits. With so many different types of resources, including block RAM, distributed RAM, DSP blocks, logic blocks, microcontrollers, gigabit ports, I/O drivers, and pins, determining a single optimum ratio is futile because each application requires a different blend. For example, the design engineer selecting the best part for a logic-intensive application will avoid an FPGA heavily burdened in cost and power

| Sub-Family Specialty<br>Product No. XC | Virtex-4<br>LX Devices<br>LOGIC |          |          | Virtex-4<br>SX Devices<br>DSP |          | Virtex-4<br>FX Devices<br>MULTI-PURPOSE |          |          | Virtex-5<br>LX Devices<br>LOGIC |          |          |
|----------------------------------------|---------------------------------|----------|----------|-------------------------------|----------|-----------------------------------------|----------|----------|---------------------------------|----------|----------|
| Process Geometry                       | 90 nm                           | 90 nm    | 90 nm    | 90 nm                         | 90 nm    | 90 nm                                   | 90 nm    | 90 nm    | 65 nm                           | 65 nm    | 65 nm    |
| Core Voltage                           | 1.2 V                           | 1.2 V    | 1.2 V    | 1.2 V                         | 1.2 V    | 1.2 V                                   | 1.2 V    | 1.2 V    | 1.0 V                           | 1.0 V    | 1.0 V    |
| CLB Logic Cells                        | 59,904                          | 110,592  | 200,448  | 34,560                        | 55,296   | 56,880                                  | 94,896   | 142,128  | 82,944                          | 110,592  | 331,776  |
| Look-Up Table Inputs                   | 4                               | 4        | 4        | 4                             | 4        | 4                                       | 4        | 4        | 6                               | 6        | 6        |
| XtremeDSP Slices                       | 64                              | 96       | 96       | 192                           | 512      | 128                                     | 160      | 192      | -                               | -        | -        |
| DSP48E Slices                          | -                               | -        | -        | -                             | -        | -                                       | -        | -        | 48                              | 64       | 192      |
| Multipliers                            | 18 x 18                         | 18 x 18  | 18 x 18  | 18 x 18                       | 18 x 18  | 18 x 18                                 | 18 x 18  | 18 x 18  | 25 x 18                         | 25 x 18  | 25 x 18  |
| Distributed RAM (kbits)                | 410                             | 768      | 1,392    | 240                           | 384      | 395                                     | 659      | 987      | 840                             | 1,120    | 3,420    |
| Block RAM/FIFOs                        | 160 x 18                        | 240 x 18 | 336 x 18 | 192 x 18                      | 320 x 18 | 232 x 18                                | 376 x 18 | 552 x 18 | 96 x 36                         | 128 x 36 | 288 x 36 |
| Total Block RAM (kbits)                | 2,880                           | 4,320    | 6,048    | 3,456                         | 5,760    | 4,176                                   | 6,768    | 9,936    | 3,456                           | 4,608    | 10,368   |
| Configuration Memory<br>(bits)         | 18315K                          | 31818K   | 50648K   | 14476K                        | 24088K   | 22282K                                  | 35122K   | 50900K   | 21800K                          | 29100K   | 79700K   |
| Max SelectIO                           | 640                             | 960      | 960      | 448                           | 640      | 576                                     | 768      | 896      | 560                             | 800      | 1200     |

Figure 1. DSP resources for representative devices from Virtex-4 and -5 sub-families

with a wealth of powerful DSP blocks. As a solution, vendors have developed multi-pronged product offerings, each targeting different classes of applications. One example of such a family is the Xilinx Virtex-4 FPGA. Unlike the previous Virtex-II Pro family, Xilinx has split the seventeen Virtex-4 product offerings into three sub-families, each emphasizing distinct strengths. For the recently announced Virtex-5 family, there are a total of four distinct sub-families, but detailed information on only the first of these has been released. Referring to the table in figure 1, the Virtex-4 uses a 90 nm process and a core voltage of 1.2 V, while the Virtex-5 shrinks the feature size down to 65 nm and drops the core voltage down to 1.0 V. This raises the maximum clock speed from 500 MHz to 550 MHz while reducing power consumption.

Configurable Logic Blocks (CLBs) are the basic elements used for implementing state machines, combinatorial logic, controllers, and sequential circuits. They are composed of logic "slices" with flip-flops, look-up tables (LUTs), multiplexers, Boolean logic blocks, and adder/subtractors with carry look-ahead functions. The Virtex-5 uses 6-input LUTs instead of the 4-input LUTs in the Virtex-4, providing additional logic functions, fewer levels of logic for faster speed, and less power due to simpler routing.
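The "fewer levels of logic" point can be illustrated with a rough count. The sketch below assumes a wide associative reduction (such as a many-input AND, OR, or XOR) built as a tree of k-input LUTs; the function name `lut_levels` is hypothetical, and real synthesis tools map logic in more sophisticated ways.

```python
def lut_levels(num_inputs: int, lut_inputs: int) -> int:
    """Levels of k-input LUTs needed to reduce num_inputs signals to one.

    Assumes a simple tree: each level groups the remaining signals into
    bundles of at most lut_inputs and replaces every bundle with one LUT output.
    """
    if num_inputs < 1 or lut_inputs < 2:
        raise ValueError("need at least one input and a LUT with 2+ inputs")
    levels = 0
    signals = num_inputs
    while signals > 1:
        signals = -(-signals // lut_inputs)  # ceiling division
        levels += 1
    return levels


if __name__ == "__main__":
    for width in (16, 32, 36):
        print(width, "inputs:",
              lut_levels(width, 4), "levels with 4-input LUTs,",
              lut_levels(width, 6), "levels with 6-input LUTs")
```

For a 32-input reduction this model needs three levels of 4-input LUTs but only two levels of 6-input LUTs, which is the kind of saving in logic depth and routing the text describes.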

Another essential resource for DSP is memory, which has become much more flexible in these latest-generation FPGAs and comes in different forms. Distributed memory is used for LUTs, FIFOs, single- and dual-port RAMs, and shift registers. For larger memory structures, 18-Kbit block RAMs can be used for deep FIFOs, large circular delay memory buffers, deep caches, as well as bigger single- and dual-port RAMs. The Virtex-5 offers both 18- and 36-Kbit block RAMs to support wider memory structures of up to 72 bits within a single block.

One of the more significant advances in the Virtex-4 family is the new XtremeDSP slice. Following the market demand for more powerful signal processing structures, Xilinx surrounded the popular 18 x 18 hardware multipliers first introduced in the Virtex-II series with a 48-bit adder/subtractor capable of acting as a registered accumulator. Due to tight, dedicated logic, this facility can operate at clock speeds up to 500 MHz and can propagate the results between XtremeDSP slices with 48-bit precision at the same rate. The 48-bit path allows this fast, fixed-point hardware to rival the precision of floating-point engines by preserving the 36-bit multiplier outputs with plenty of overhead for bit growth as results propagate through cascaded slices. Each XtremeDSP slice features 40 dynamically controlled logical and arithmetic modes and supports mode changes during runtime without the need to recompile the FPGA.

Figure 2. Quad 4k-point complex FFT IP core 404 processes four parallel streams

In this way, each XtremeDSP slice behaves like a miniature DSP processor, and there are as many as 512 of these in a single FPGA. With so much demand for DSP capability, the Virtex-5 family features the DSP48E slice, which widens the 18 x 18 hardware multiplier to 25 x 18 so that single-precision floating-point arithmetic can be implemented within two DSP48E slices instead of the four slices required with the Virtex-4 XtremeDSP. Also, the adder has been enhanced with a logic stage that removes the need for an external logic block.
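The headroom argument for the 48-bit accumulator can be sketched in a few lines. This is a behavioral model only, assuming signed two's-complement operands and wraparound on overflow; the names `wrap_signed`, `mac`, and `guard_bits` are hypothetical and do not correspond to any vendor primitive.

```python
ACC_BITS = 48  # accumulator width described in the text


def wrap_signed(value: int, bits: int = ACC_BITS) -> int:
    """Wrap an integer into a signed two's-complement field of `bits` bits."""
    mask = (1 << bits) - 1
    value &= mask
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


def mac(acc: int, a: int, b: int) -> int:
    """One multiply-accumulate step with a 48-bit wrapping accumulator."""
    return wrap_signed(acc + a * b)


def guard_bits(a_bits: int, b_bits: int, acc_bits: int = ACC_BITS) -> int:
    """Accumulator bits left over above the full-width product."""
    return acc_bits - (a_bits + b_bits)


if __name__ == "__main__":
    worst = -(1 << 17)  # most negative 18-bit value; worst * worst = 2**34
    acc = 0
    for _ in range(4096):
        acc = mac(acc, worst, worst)
    print("guard bits, 18 x 18:", guard_bits(18, 18))
    print("guard bits, 25 x 18:", guard_bits(25, 18))
    print("4096 worst-case products held exactly:", acc == 4096 * (1 << 34))
```

With 18 x 18 operands the model leaves 12 guard bits above the 36-bit product, so at least 4096 worst-case products can be summed before the accumulator wraps.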

The Virtex-4 SX sub-family aims at DSP with 512 XtremeDSP slices, while the LX sub-families for the Virtex-4 and Virtex-5 deliver the most logic and I/O. The FX sub-family is a moderate blend of logic and DSP capability, along with some other key resources important to most DSP systems: an on-board IBM 405 PowerPC microcontroller, built-in gigabit serial transceivers to support the new switched fabrics, and Ethernet MACs (media access controllers) to simplify communication links to other system components. To take advantage of the high DSP power available, some example IP cores for key DSP algorithms are now presented.

One of the classic algorithms for DSP benchmarking, the FFT is deployed in a wide range of communications, radar, and signal intelligence applications. One of the most efficient methods of performing the FFT calculation is an iteration of the radix-4 "butterfly" algorithm. Each butterfly consists of multipliers and adders that accept four input points and compute four output points based on suitably chosen coefficients from a sine table. For a 4k-point FFT, six stages of butterflies are required, representing a total of 60 DSP slices.

Because the FFT is inherently a block-oriented algorithm, it operates most efficiently when quick access to all input and output samples is supported by a freely addressable RAM. However, this ideal model of random data availability is contrary to the sequential input data samples streaming from the A/D converter. By utilizing a proprietary memory structure implemented by configuring the block RAM resources of the FPGA, four input data memory ports feed the butterfly engine in parallel, thus solving the data availability problem. This unique memory architecture allows subsequent input blocks to be processed in a continuous, systolic manner so that all of the multipliers in all six stages can be productively engaged all the time.

Figure 2 illustrates the data flow scheme for the core. For each FPGA clock, the core accepts one real or complex sample from all four input data streams, for example from four A/D converters. Each 4k-point input block is completed in 4096 clocks, with the first sample of a block in each stream offset from the next stream by 1024 clocks. At the output, each clock delivers four parallel streams with similar offsets. If the core is driven from a single common input source, this arrangement allows 75% input overlap processing for a single input channel.
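The radix-4 butterfly structure described above can be sketched in software. The example below is a plain recursive decimation-in-time radix-4 FFT in floating point, checked against a direct DFT; it is an illustration of the algorithm only, not the pipelined fixed-point architecture of the IP core, and the function names are hypothetical.

```python
import cmath


def radix4_stages(n: int) -> int:
    """Number of radix-4 butterfly stages for an n-point FFT (n a power of 4)."""
    stages = 0
    while n > 1:
        if n % 4:
            raise ValueError("length must be a power of 4")
        n //= 4
        stages += 1
    return stages


def fft_radix4(x):
    """Recursive radix-4 decimation-in-time FFT of a power-of-4 length sequence."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 4:
        raise ValueError("length must be a power of 4")
    quarter = n // 4
    # Four sub-FFTs over the samples with index = r (mod 4).
    subs = [fft_radix4(x[r::4]) for r in range(4)]
    out = [0j] * n
    for k in range(quarter):
        # Apply twiddle factors W_n^(r*k), then one 4-point butterfly.
        t = [subs[r][k] * cmath.exp(-2j * cmath.pi * r * k / n) for r in range(4)]
        out[k] = t[0] + t[1] + t[2] + t[3]
        out[k + quarter] = t[0] - 1j * t[1] - t[2] + 1j * t[3]
        out[k + 2 * quarter] = t[0] - t[1] + t[2] - t[3]
        out[k + 3 * quarter] = t[0] + 1j * t[1] - t[2] - 1j * t[3]
    return out


def dft(x):
    """Direct O(n^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


if __name__ == "__main__":
    signal = [complex(m % 7, -(m % 3)) for m in range(64)]
    error = max(abs(a - b) for a, b in zip(fft_radix4(signal), dft(signal)))
    print("stages for 4096 points:", radix4_stages(4096))
    print("max error vs direct DFT:", error)
```

Each level of recursion corresponds to one butterfly stage, so a 4096-point transform passes through six stages, matching the count quoted in the text.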
In this popular DSP function, the core produces a new FFT every 1024 clocks derived from an input block of 25% new data and 75% data from the previous block. Alternatively, this core can handle 50% overlap processing on two channels or zero overlap processing on four independent channels. At the output of the FPGA, a multiplexer allows the results of each signal processing stage to be directed to the processor interface. Several proprietary techniques were employed to reduce rounding and truncation errors due to integer arithmetic, yielding a calculation dynamic range of better than 90 dB. Since the FFT engine is synchronous with the FPGA clock, the faster the clock, the faster the FFT. In the original Virtex-II design using a -7 speed grade part, the maximum clock speed is 160 MHz, yielding a calculation time of 6.4 µs per FFT. With the newer Virtex-4 and -5 devices, clock-rates up to 250 MHz and higher are possible. Compared with programmable RISC and DSP processors, the FPGA-based FFT is at least ten times faster.

Figure 3. 256-channel narrowband digital down converter Pentek IP core 430

IP core digital down converters (DDCs) such as the LogiCORE DDC from Xilinx can be scaled for various levels of SFDR (spurious-free dynamic range) performance to use more or fewer FPGA resources. For example, a complex DDC with 84 dB SFDR consumes approximately 1700 slices in a Virtex-II Pro. In a mid-sized FPGA, the XC2VP50 with 24,000 available slices, only about 14 of these DDCs can be accommodated.
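The throughput and resource figures quoted above follow from simple arithmetic, which the sketch below reproduces. The helper names are hypothetical, and the model assumes only what the text states: one new FFT per 1024 clocks from four parallel streams, and roughly 1700 slices per DDC.

```python
def fft_interval_us(clock_mhz: float, fft_points: int = 4096,
                    parallel_streams: int = 4) -> float:
    """Time between successive FFT results, in microseconds."""
    clocks_per_fft = fft_points // parallel_streams  # 1024 for the quad 4k core
    return clocks_per_fft / clock_mhz


def overlap_fraction(channels: int, parallel_streams: int = 4) -> float:
    """Input overlap when the parallel streams are shared among `channels` inputs."""
    return 1 - channels / parallel_streams


def ddcs_per_device(available_slices: int, slices_per_ddc: int) -> int:
    """Whole DDC instances that fit, ignoring routing and other overhead."""
    return available_slices // slices_per_ddc


if __name__ == "__main__":
    print("160 MHz:", fft_interval_us(160), "us per FFT")
    print("250 MHz:", fft_interval_us(250), "us per FFT")
    for channels in (1, 2, 4):
        print(channels, "channel(s):", overlap_fraction(channels), "overlap")
    print("DDCs in 24,000 slices:", ddcs_per_device(24_000, 1_700))
```

The slice count is an upper bound under these assumptions, since a real design also needs logic for interfaces and control.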

For applications requiring several dozen or even hundreds of channels, this approach can become impractical. This is because each conventional DDC requires its own local oscillator (phase accumulator and sine table) and complex mixer (two multipliers), and these blocks must operate at the full input sample clock-rate, between 100 and 200 MHz in many software radio applications. Since this is the same clock range at which the multipliers inside the DSP blocks are rated, the critical hardware resources required for one channel cannot be shared across other channels. However, imagine that the input data sample-rate is reduced by a factor N. By operating the critical DDC hardware resources required for one channel at the full clock-rate, those same resources can then be multiplexed (time shared) across N channels. Of course, provisions must be made for buffering the data for all channels while multiplexing. This can be done in RAM or in delay memory, a common feature in FPGAs.

Figure 3 shows an FPGA-based 256-channel DDC IP core that combines a channelizer stage with a multiplexed DDC stage. It accepts a single wideband input stream and delivers a channel bank of 1024 output bands equally spaced in frequency, but with significant overlap between adjacent bands.

The output sample-rate of each band equals the input sample-rate (Fs) divided by 256, rather than 1024, as it would be with a simple FFT approach to channelization. In fact, inside the channelizer are four high-speed 1024-point FFTs running in parallel using a proprietary windowing and overlap processing technique. The outputs of the four separate FFTs deliver samples at a rate of Fs/1024. These outputs are combined to form a single output at a sample-rate of Fs/256, supporting the wider bandwidth that will sufficiently overlap adjacent bands.
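The rate relationships in the channelizer can be written out explicitly. The sketch below uses exact fractions and a hypothetical input sample-rate of 102.4 MHz chosen only to give round numbers; the function name is likewise an assumption.

```python
from fractions import Fraction


def channelizer_rates(fs_hz: int, fft_points: int = 1024, parallel_ffts: int = 4):
    """Return (band spacing, per-band output rate, oversampling ratio)."""
    band_spacing = Fraction(fs_hz, fft_points)   # 1024 equally spaced bands
    per_fft_rate = Fraction(fs_hz, fft_points)   # each FFT outputs at Fs/1024
    band_rate = per_fft_rate * parallel_ffts     # four FFTs combined: Fs/256
    return band_spacing, band_rate, band_rate / band_spacing


if __name__ == "__main__":
    fs = 102_400_000  # hypothetical input sample-rate in Hz
    spacing, rate, ratio = channelizer_rates(fs)
    print("band spacing (Hz):", spacing)
    print("per-band output rate (Hz):", rate)
    print("output rate / band spacing:", ratio)
```

Because each band is sampled at four times the band spacing, every band is wide enough to overlap its neighbors, which is what lets a DDC channel be tuned anywhere in the input spectrum.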

A crossbar switch matrix accepts 1024 inputs from the channelizer and delivers 256 outputs, one for each DDC channel. The switch is nonblocking so that any of the 256 outputs can be independently sourced from any of the 1024 channelizer bands with no restrictions. The 256 compensated switch matrix outputs feed a bank of 256 independently tuned DDC sections, each with its own local oscillator, mixer, and FIR filter. Because the channelizer has dramatically reduced the input sampling-rate to each DDC section by a factor of 256, the DDCs are implemented using highly multiplexed hardware resources and block RAM to preserve the data for each channel. A gain stage, output multiplexer, and data formatter complete the design. Overall performance of the complete 256-channel FPGA-based DDC IP core includes a spurious-free dynamic range of 75 dB, a pass band ripple of 0.4 dB, a pass band edge droop of 1.0 dB, and a frequency tuning resolution of Fs/2^32. The maximum clock frequency depends on implementation details, but can be as high as 185 MHz in a Virtex-4 FPGA with speed grade 12.
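The idea of one mixer datapath shared among many channels, with per-channel state held in memory, can be modeled as follows. This is a behavioral sketch, not the IP core: it assumes a 32-bit phase accumulator (consistent with the Fs/2^32 tuning resolution), computes the oscillator with `cmath.exp` instead of a sine table, and omits the FIR filter and gain stage. The class and function names are hypothetical.

```python
import cmath
import math

PHASE_BITS = 32  # assumption: 32-bit phase accumulator per channel
PHASE_MASK = (1 << PHASE_BITS) - 1


def tuning_word(freq_hz: float, sample_rate_hz: float) -> int:
    """Phase increment per sample for the requested tuning frequency."""
    return round(freq_hz / sample_rate_hz * (1 << PHASE_BITS)) & PHASE_MASK


class MultiplexedDDC:
    """One oscillator/mixer datapath time-shared across several channels."""

    def __init__(self, tuning_words):
        self.tuning_words = list(tuning_words)
        # Per-channel phase state; stands in for the block RAM in the text.
        self.phases = [0] * len(self.tuning_words)

    def step(self, channel: int, sample: complex) -> complex:
        """Mix one sample of `channel` down to baseband and advance its phase."""
        phase = self.phases[channel]
        angle = -2 * math.pi * phase / (1 << PHASE_BITS)
        mixed = sample * cmath.exp(1j * angle)
        self.phases[channel] = (phase + self.tuning_words[channel]) & PHASE_MASK
        return mixed


if __name__ == "__main__":
    fs = 1.0
    freqs = [0.25, 0.125]  # one test tone per channel, as a fraction of fs
    ddc = MultiplexedDDC(tuning_word(f, fs) for f in freqs)
    for n in range(4):
        # Round-robin: the same datapath serves each channel in turn.
        for ch, f in enumerate(freqs):
            tone = cmath.exp(2j * math.pi * f * n)
            print(n, ch, ddc.step(ch, tone))
```

In the demonstration each channel is fed a tone at its own tuning frequency, so every mixed output sits at DC; only the small per-channel phase state distinguishes one channel from another, which is why the arrangement scales to hundreds of channels once the per-channel sample-rate is low enough.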