Programmable and Configurable Digit-Serial Signal Processors
Contract Number: DA/DABT63-96-C-0050 (AO: E335)
Program Directors: Jose Munoz, Col. Mark Swinson, Larry Carter
Quarterly Report: Oct. 1, 1998 - Dec 31, 1998

Technical POC:

Keshab K. Parhi
Professor, Dept. of Electrical & Computer Engineering
Univ. of Minnesota
Minneapolis, MN 55455
Tel: 612-624-4116
Fax: 612-625-4583
Email: parhi@ee.umn.edu

-----------------------------------------------------------------------------
1. SIGNIFICANT ACCOMPLISHMENTS
-----------------------------

1.1 Low Power Bit-Serial Viterbi Decoder

The fabricated chips are back from MOSIS. We measured the power dissipation of them and it turned out that the chip is the lowest power Viterbi decoder though the measured power is a little bit bigger than the simulated value. This work will be no longer continued.

1.2 Turbo decoder for wireless communication

The performance of Turbo decoder depends on various factors such as the algorithm of SISO (Soft In Soft Output) decoders, the length of interleavers and deinterleavers, number of states and the word length of soft inputs or intermediate registers. We have developed simulation models of Turbo decoders and simulated the performance of Turbo decoders with various parameters using MATLAB and C. We have tried the original MAP (maximum a posteriori) algorithm, the log-MAP algorithm and the sliding window version of log-MAP algorithm. In terms of VLSI implementation, the sliding window log-MAP algorithm is the most suitable because it requires much less memory than the other two algorithms. According to our simulation results, it turned out that the performance degradation of the sliding window log-MAP algorithm is negligible compared to the other two algorithms. As for number of states, we have simulated the performance of Turbo decoder with 4 states and 16 states. The 4 state decoder achieved SNR=1.2dB for BER=10**(-4) with 8 iterations and 1K interleaver. Compared to the 16 state decoder, the performance degradation of the 4 state decoder is 0.4dB at BER=10**(-4). Moreover, the 4 state decoder achieves better performance than Viterbi decoder with 64 states. We have also simulated the Turbo decoder performance with 1K interleaver and 4K interleaver and it turned out that the difference between the performance of the 1K interleaver decoder and that of the 4K interleave decoder is 0.4dB at BER=10**(-4).

1.3. Digit-Serial Fast Fourier Transform (FFT):

Pipelined fast Fourier transform (FFT) architectures for application specific real-time DFT computation utilize fast algorithms and are particularly suited to sequentially presented input data. Prior work on FFT architectures can be categorized by the the decimation method (decimation-in-time vs decimation-in-frequency), radix size (radix-2 vs radix-4), or commutator scheme (delay feedback or delay forward). Delay feedback commutator scheme can use the registers more efficiently by storing most of butterfly outputs into the feedback shift registers. However, the utilization for this scheme will not exceed 75% for multipliers and 50% for adders respectively. On the other hand, delay forward commutator architectures can take the best advantage of trivial factors in the radix-4 FFT algorithms although it demands more data memory for the storage of incoming data and intermediate results. Our work has resulted in a novel FFT architecture based on radix-4 decimation-in-time FFT with the use of digit-serial functional units. By splitting the input data stream into parallel digit-serial data streams and combining two commutator schemes, the proposed architecture not only achieves nearly 100% hardware utilization, but also requires much less memory compared with previous digit-serial FFT processor. Furthermore, FFT processors demand several modules of ROM to store the twiddle factors. The overall ROM size is in the order of N, where N is the transformation size. The size of ROM can be effectively reduced to half by exploiting the redundancy of the twiddle factors.

1.4. CORDIC:

Due to a flaw in the redundant CORDIC architecture, the architecture has to be modified. The location of the flaw is detected, however, it cannot be solved trivially. Hence, the architecture is being reexamined to remove the error. Furthermore, constant and non-constant scale factor architectures have been once more exploited to improve speed and power consumption of the implementation. The most promising architectures are currently evaluated.

1.5. Annihilation-Reordering Based Pipelined RLS Filters

The effects of finite precision arithmetic on Annihilation-Reordering Pipelined RLS filters are being investigated. We are trying to find out the minimum number of bits that can be used to reduce hardware while achieving satisfactory filtering performances. Stochastic error analysis approach is being used, specifically, we've found the expressions for the variables who have a steady-state value and the expressions for the mean and variance of the deviations of the elements stored in the systolic array, the sine and cosine of rotation angles and the filtering errors when they are represented in fixed-point arithmetic. These results from M-level pipelined RLS adaptive filter are used to compare with those from non-pipelined RLS filter.

1.6 Parallel/Pipelined Orthogonal IIR Digital Filter Architectures

CORDIC based cascade orthogonal IIR digital filters have sharp transition band and exhibit low sensitivity in both pass band and stop band which are suitable for VLSI implementations. However, the achievable sample rate of these filters is limited due to the presence of feedback loops. To overcome the speed limitation and achieve high-speed/low-power implementations, We have proposed a novel algorithm transformation technique based on retiming and orthogonal matrix decomposition techniques which can increase the maximum filter sample rate to O(1) level which is independent of the filter order. The proposed parallel and pipelined architectures consist of only Givens rotations which can be mapped onto CORDIC arithmetic based processors.

1.7. FPGA DCTs:

This research provides the performance trade-offs of Xilinx FPGA implementation for digit-serial distributed arithmetic (DA) and the digit-serial flow graph (DSFG) architectures (for digit-sizes 2,3 and 6) for an 8 Point 1-D discrete cosine transform (DCT) algorithm. The performance of DSFG architectures is paremetrized against digit-size and some useful FPGA design rules are pointed out.

Building on the research done previously our experiments show that the distributed arithmetic architectures are better than digit-serial flow graph in every respect. The bit-serial DA has a 60% better area*delay product, 88% lower latency and 48% lower clock energy per computation as compared to a 2-bit digit-serial flow-graph (DSFG) architecture for a throughput of about 50 MPix/Sec. For an HDTV throughput of 111 MPix/Sec however the 3 bit digit-serial DA architecture is the architecture of choice. It provides 10% better area*delay product, 50% lower latency and 14% lower clock energy per computation as compared to 6 bit DSFG architecture.

1.8. Digit-Serial FPGA for DSP Applications

The layout of the basic tile for the Digit-Serial FPGA (DS-FPGA) was implemented using a full-custom VLSI design approach. The layout was done using a 0.6 micron CMOS process having 3 metal layers. The major components of the tile include the digit-serial logic block (DLB), a connection block and a switch block.

The performance of the DS-FPGA is mainly determined by the delay of the programmable interconnection network. The delay increases quadratically with the number of transistor switches in series. It can be improved by inserting repeaters composed of pairs of inverter buffers. An optimal trade-off was found for the buffer strength and the sizing of the NMOS pass-transistor switch in terms of the speed and area required. We found that using a total of 6 switches between adjacent repeaters resulted in the minimal overall propagation delay.

The area of the DLB core is 230um X 200um, while the area of complete tile is 600um X 420um. Thus, the DLB core consumes only 19% of the total area while the routing resources make up the remaining 81%.

We have estimated the path delays for the DS-FPGA tile using Hspice. The results show that the total delay from a flip-flop output to another flip-flop input that is within a routing distance of 1 DLB is 6.1 ns.

We have compared the area and delay of the DS-FPGA with those of the Xilinx 4000-series FPGAs. The results show that the normalized area for a range of digit-serial DSP functions on the DS-FPGA is only 16%-33% of the area required on the Xilinx FPGA. Also, for this class of functions, the DS-FPGA in a 0.6um technology has 1-2 times the speed of than Xilinx FPGA in a 0.35um technology. This is primarily due to the larger and optimized functionality of the DLB core and the smaller routing delay of the DS-FPGA for these types of functions.

1.9. Elliptic Curve Cryptosystem Implementations:

After successful VHDL simulation of a class of elliptic curve cryptosystems characterized by a size parameter m, the designs have been synthesized onto the Xilinx XC4000XL-series FPGAs. The results for FPGA area utilization and throughput are shown below:

-----------------------------------------------------------------------------
Value of m | XC4000XL Device | CLB usage | Throughput(scalar mult./sec)
-----------------------------------------------------------------------------
5 | XC4010XL | 68% | 8x10^5
11 | XC4013XL | 83% | 1x10^5
29 | XC4028XL | 93% | 1x10^4
53 | XC4044XL | 84% | 2670
-----------------------------------------------------------------------------
To explore the potential improvement that may be available using our DS-FPGA, we have also made estimates for the complexity of these systems on that chip, with the following comparative results:

-----------------------------------------------------------------------------
Value of m | Xilinx FPGA | Xilinx FPGA | DS-FPGA | DS-FPGA
| CLB count | area(mm^2) | DLB count | area(mm^2)
-----------------------------------------------------------------------------
5 | 219 | 78.8 | 125 | 31.5
11 | 292 | 105.1 | 179 | 45.1
29 | 510 | 183.6 | 342 | 86.2
53 | 799 | 287.6 | 559 | 140.9
-----------------------------------------------------------------------------
Thus, using the DS-FPGA, it may be possible to reduce the area by as much as 50%. The smaller area may also lead to a corresponding reduction in the propagation delays.

2. Next Period Activities
---------------------------

2.1 Turbo decoder for wireless communication

We have simulated the performance of Turbo decoders using floating point arithmetic. The next step of our research is to estimate the quantization effects using fixed point arithmetic and to find the word precision which compromises both the performance and the hardware amount.

Another candidate for the SISO algorithm is Soft Output Viterbi Algorithm (SOVA). We will simulate the performance of Turbo decoders with SOVA as the SISO algorithm and compare it to the performance of the sliding window log-MAP algorithm.

2.2. Digit-Serial FFT

We will look into the detailed implementation issues of the FFT processor and investigate the impact of finite wordlength on the digit-serial architectures.

2.3 CORDIC:

Research on CORDIC architectures will continue.

2.4 Pipelined RLS Adaptive Filters:

This work will continue. A MATLAB based simulation will be carried out to verify the theoretical analysis.

2.5 FPGA DCTs

Following tasks will be carried out next.

1. Experiment with different partial product accumulation schemes for the 3 bit DA architecture.
2. Map the DSFG architecture to the digit-serial FPGA architecture being developed in the group and compare the performance with the implementation in Xilinx FPGA.
3. Analyze ways of reducing the ROM size in DA architectures
4. Experiment with Systolic architecture based DCTs

2.6. Digit-Serial FPGA

We will continue to evaluate and optimize the routing structure for the digit-serial FPGA. Further benchmark comparisons with existing FPGA devices will also be made.

2.7 FPGA Cryptosystem

We will also continue to evaluate the relative merits of the elliptic curve cryptosystem implementations on different FPGA devices.

3. Documentation:
----------------------

J.-G. Chung, K.K. Parhi, Y.-B. Kim, Z. Wang, and H.-G. Jeong, "Frequency Spectrum Based Low-Area Low-Power Parallel FIR Filter Design", Submitted to IEEE Trans. on Circuits and Systems, Part-II: Analog and Digital Signal Processing, Oct. 1998

A. Karandikar and K.K. Parhi, "Low-Power SRAM Design Using Hierarchical Divided Bit-Line Approach", Proc. of 1998 IEEE Int. Conf. on Computer Design, pp. 82-88, Austin, Oct. 1998

M. Kuhlmann and K.K. Parhi, "Fast Low-Power Shared Division and Square-Root Architecture", Proc. of 1998 IEEE Int. Conf. on Computer Design, pp. 128-135, Austin, Oct. 1998

Y.-N. Chang, and K.K. Parhi, "`High-performance digit-serial complex-number multiplier-accumulator", Proc. of IEEE Conf. on Computer Design, pp. 211-213, Austin, TX, Oct 1998.

M. Kuhlmann and K.K. Parhi, "Power Comparison of Flow-Graph and Distributed Arithmetic Based DCTs", Proc. of 1998 Asilomar Conf. on Signals, Systems and Computers, Nov. 1-4, 1998, Pacific Grove (CA)

H. Suzuki, Y.-N. Chang, K. K. Parhi, "Performance Tradeoffs in Digit-Serial DSP Systems", 32nd Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, Nov. 1998.

L. Gao, S. Shrivastava, H. Lee and G. Sobelman, ``Implementation of Elliptic Curve Cryptosystems on FPGAs,'' submitted to the 1999 Symposium on Field-Programmable Custom Computing Machines (FCCM '99).

4. TRAVEL
----------------

Y.-N. Chang, "IEEE Int. Conf. on Computer Design (ICCD)", Oct. 3-6,98, Austin
M. Kuhlmann, "IEEE Int. Conf. on Computer Design (ICCD)", Oct. 3-6,98, Austin
H. Lee presented a paper at the 1998 IEEE Workshop on Signal Processing Systems (SiPS 98), October 8-10 in Boston, MA.
K.K. Parhi, Monterey (Calif), Asilomar Conf., Nov. 1-4, 1998