# THE IMPACT OF DATA CHARACTERISTICS AND HARDWARE TOPOLOGY ON HARDWARE SELECTION FOR LOW POWER DSP

Gareth Keane

Jonathan Spanier

Roger Woods

School of Electrical Engineering and Computer Science, The Queen's University of Belfast, Ashby Building, Stranmillis Road, Belfast BT9 5AH, Northern Ireland, +44 1232 274275

G.Keane@ee.qub.ac.uk

J.R.Spanier@ee.qub.ac.uk

R.Woods@ee.qub.ac.uk

## 1. ABSTRACT

Adders and multipliers are key operations in DSP systems. The power consumption of adders is well understood, but few detailed results are available to guide the choice of multiplier. This paper considers how the power consumption of a number of multiplier structures, such as Carry-Save array and Wallace Tree multipliers, varies with data wordlength and layout strategy. In all cases, results were obtained from EPIC PowerMill<sup>™</sup> simulations of actual synthesised circuit layouts. Analysis of the results highlights the effects of routing and interconnect optimization for low power operation and gives clear indications on the choice of multiplier structure and design flow for the rapid design of DSP systems.

#### 1.1 Keywords

Low power DSP systems, optimum hardware selection, multiplier structures.

## 2. INTRODUCTION

The need for low power technologies has been prompted by the proliferation of portable computing and communications [2] and the cost of current packaging technologies [1]. Power can be reduced by manipulation at the technology, circuit, architectural and algorithmic levels [3]. For semi-custom design flows such as the one used here, the technology and circuit options are not available because the circuits and technology are already defined, so savings must be achieved using algorithmic and architectural optimization techniques. At these levels, the designer can reduce power consumption either by minimising the switched capacitance or by lowering the supply voltage. In the latter case, transformations are used to raise a system's throughput beyond what is necessary, and the surplus speed is then traded for lower power by reducing the supply voltage [2,3]. However, as the supply voltage is determined by the silicon foundry in the flow considered here, the designer must find other methods for saving power, such as reducing the switched capacitance.
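For reference, the two levers mentioned above can be read off the commonly used first-order model of dynamic power in CMOS circuits. The symbols are notation assumed for this note and do not appear in the original text: $\alpha$ is the switching activity, $C_L$ the load capacitance, $V_{dd}$ the supply voltage and $f$ the clock frequency.

$$P_{dyn} = \alpha \, C_L \, V_{dd}^{2} \, f$$

The product $\alpha C_L$ is the switched capacitance per cycle. With $V_{dd}$ fixed by the foundry and $f$ fixed by the throughput requirement, reducing this product is the remaining option under this model.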

In logic-based synthesis, it is essential to be able to determine the power consumption accurately as early as possible in the design flow, rather than at circuit layout level as is done at present. This becomes increasingly important in the Intellectual Property (IP) arena, where the focus is on accelerating the design flow. In IP approaches for DSP, designs are synthesised using pre-defined VHDL cores ranging from multipliers and adders to more complex blocks such as ADPCM blocks [5]. If accurate power models for these blocks are available for the various specified parameters, e.g. wordlength, then it should be possible to perform power estimation at an early stage in the design flow. The aim of this work is to develop accurate, parameterised models for pre-designed multiplier cores so that such early power estimation can be carried out.

As the percentage of power consumption due to interconnect in deep submicron designs can be as high as 90% [4], there is considerable scope for applying interconnect optimization as a means of power reduction. In particular, increasing regularity and locality at the silicon level should reduce power consumption in a standard-cell based design flow. Indeed, a high-level approach to implementing locality, which binds closely associated logical operations to adjacent hardware units during the scheduling and allocation stages of synthesis, has been reported to be an effective method for power reduction [6,7]. The work here also examines the impact of applying locality at the circuit layout level on the power performance of the multipliers.

The paper is organized as follows. Section 3 discusses the structures used for power comparison in this investigation and briefly describes the design flow used in this work. Section 4 presents some of our results and highlights the consequences of using architectural transformations (namely parallelism and pipelining). Finally, the conclusions are presented in Section 5.

## 3. BACKGROUND

The impact on power consumption of different data streams, silicon layout and number representations has been investigated for four multiplier structures: a Booth-encoded multiplier, a Booth-encoded Wallace Tree multiplier, a Carry-Save array multiplier and a Signed Binary structure. To examine the effects of enforcing circuit layout locality (i.e. regularity), the Carry-Save structure was synthesized in two different ways, flat and regular. The multiplier structures presented all have different capabilities, e.g. the Wallace Tree structures can operate at much higher frequencies than the array structures. This factor was taken into consideration when comparing the hardware units. The structures were optimised for an operating speed of 20MHz, to allow a consistent power consumption comparison.

To help with the investigations presented here, we have developed a design flow based around commercial tools such as Synopsys<sup>™</sup>, Compass Design Automation<sup>™</sup> and EPIC Design Technology's PowerMill<sup>™</sup> simulator. Designs are described in VHDL (both structurally and behaviorally) and then synthesized using Synopsys<sup>™</sup>. Synthesis is targeted to a 0.35 µm standard cell CMOS library, and layout is generated using the Compass DA<sup>™</sup> toolset. Physical netlists, which include accurate process-measured characteristics for interconnect capacitance and resistance, are then extracted. An essential component of the design flow is the availability of detailed SPICE descriptions of the standard cell library. These were provided by the vendor, along with HSPICE models for each of the transistors available. A glue framework has been developed in Perl to facilitate the transfer of data through the design flow.

## 4. RESULTS

#### 4.1 Multiplier cores

The performance of the different multiplier cores was first examined. Tables 1 and 2 give some idea of the physical characteristics of the different multiplier cores. Table 1 gives the multiplier area for wordlengths of 8, 16 and 24 bits, whilst Table 2 gives details of net lengths and interconnect distribution for the 16-bit versions. The power consumption of these cores is shown in Table 3. The difference in performance between the respective multiplier cores can be explained by examining the distribution of net lengths throughout the different structures.

| Name                                      | 8-Bit<br>($10^{-6}\ \text{mm}^2$) | 16-Bit<br>($10^{-6}\ \text{mm}^2$) | 24-Bit<br>($10^{-6}\ \text{mm}^2$) |
|-------------------------------------------|--------------------------------|-----------------------------------------------|-----------------------------------------------|
| Carry Save (Regular)                      | 0.20                           | 0.81                                          | 1.81                                          |
| Carry Save (Flat)                         | 0.10                           | 0.48                                          | 1.15                                          |
| Two's Comp. Carry Save                    | 0.10                           | 0.48                                          | 1.22                                          |
| Booth Encoded                             | 0.12                           | 0.48                                          | 1.32                                          |
| Booth Enc. Wallace Tree                   | 0.12                           | 0.56                                          | 1.42                                          |
| Two's Comp. Booth<br>Encoded Wallace Tree | 0.10                           | 0.49                                          | 1.19                                          |
| Signed Bin. No. Rep.                      | 0.18                           | 0.76                                          | 1.51                                          |

Table 1. Silicon area for multiplier structures

The information on net distributions summarised in Table 2 has been translated into the graphs of Figures 1 and 2. Figure 1 shows the net length distribution across three of the cores: the flat synthesized Carry-Save, the Wallace Tree and the Signed Binary multipliers.

| Name                                    | Number of<br>Nets | Longest<br>Net (λ) | Avg. Net<br>Length (λ) |
|-----------------------------------------|-------------------|--------------------|----------------------------------|
| Carry Save (Regular)                    | 3584              | 332                | 130                              |
| Carry Save (Flat)                       | 1072              | 8488               | 472                              |
| Two's Comp. Carry<br>Save               | 1087              | 6999               | 446                              |
| Booth Encoded                           | 979               | 7087               | 558                              |
| Booth Enc. Wallace<br>Tree              | 1053              | 10657              | 693                              |
| Two's Comp. Booth<br>Encoded Wall. Tree | 950               | 9786               | 654                              |
| Signed Bin. No. Rep.                    | 1356              | 11693              | 744                              |

Table 2. Net information for 16-bit structures

| Name                                      | 8-Bit<br>(mW) | 16-Bit<br>(mW) | 24-Bit<br>(mW) |
|-------------------------------------------|---------------|----------------|----------------|
| Carry Save (Regular)                      | 5.99          | 23.19          | 56.92          |
| Carry Save (Flat)                         | 3.23          | 25.94          | 67.96          |
| Two's Comp. Carry Save                    | 3.65          | 27.31          | 80.79          |
| Booth Encoded                             | 5.09          | 27.95          | 88.58          |
| Booth Encoded Wallace Tree                | 5.50          | 37.10          | 93.60          |
| Two's Comp. Booth Encoded<br>Wallace Tree | 4.00          | 32.24          | 86.12          |
| Signed Bin. No. Rep.                      | 7.776         | 48.45          | 138.24         |

Table 3. Power consumption of multiplier structures processing random data at 20MHz
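As an illustration of the kind of parameterised wordlength model that the introduction aims at, the Table 3 figures can be fitted with a simple power law, P = a·N^b, where N is the wordlength. This is a minimal sketch: the power-law form, the function names and the choice of three cores are assumptions made here for illustration and are not the authors' model. With only three wordlengths per core, the fit is indicative at best.

```python
import math

# Wordlengths (bits) and power figures (mW) copied from Table 3.
WORDLENGTHS = [8, 16, 24]
TABLE3_POWER_MW = {
    "Carry Save (Regular)": [5.99, 23.19, 56.92],
    "Carry Save (Flat)": [3.23, 25.94, 67.96],
    "Booth Encoded Wallace Tree": [5.50, 37.10, 93.60],
}


def fit_power_law(wordlengths, power_mw):
    """Least-squares fit of P = a * N**b in log-log space; returns (a, b).

    The power-law form is an assumption of this sketch, not of the paper.
    """
    xs = [math.log(n) for n in wordlengths]
    ys = [math.log(p) for p in power_mw]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict_power_mw(model, wordlength):
    """Evaluate a fitted (a, b) model at the given wordlength."""
    a, b = model
    return a * wordlength ** b


# One fitted model per core, keyed by the core name used in Table 3.
MODELS = {
    name: fit_power_law(WORDLENGTHS, power)
    for name, power in TABLE3_POWER_MW.items()
}

if __name__ == "__main__":
    for name, (a, b) in MODELS.items():
        print(f"{name}: P ~ {a:.3f} * N^{b:.2f} mW")
```

A fitted exponent near 2 would be consistent with power tracking the roughly quadratic growth in array size with wordlength, and a larger exponent would suggest an additional interconnect penalty as the layout grows.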

Figure 2 provides information on the average activity on nets of different lengths for two of these structures, the Carry-Save and the Wallace Tree. It can be seen from Figure 1 that the relative power dissipation of the three structures can be approximated as the integral of the interconnect distribution curve, while Figure 2 shows how the more distributed structures like the Wallace Tree have a greater incidence of switching on longer (i.e. more capacitive) nets. The increased activity on the longer nets indicates that regular structures will give the best power performance, provided that they meet speed requirements.
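The link between interconnect and power can be checked roughly against the tabulated data. The sketch below uses total wirelength (number of nets times average net length, from Table 2) as a crude stand-in for the integral of the net length distribution, and compares it with the 16-bit power figures of Table 3. This proxy is an assumption made here, not a quantity the paper computes, and it ignores the per-net switching activity that Figure 2 shows to be significant.

```python
import math

# (name, number of nets, average net length in lambda) from Table 2,
# paired with the 16-bit power in mW from Table 3.
CORES_16BIT = [
    ("Carry Save (Regular)", 3584, 130, 23.19),
    ("Carry Save (Flat)", 1072, 472, 25.94),
    ("Two's Comp. Carry Save", 1087, 446, 27.31),
    ("Booth Encoded", 979, 558, 27.95),
    ("Booth Enc. Wallace Tree", 1053, 693, 37.10),
    ("Two's Comp. Booth Encoded Wallace Tree", 950, 654, 32.24),
    ("Signed Bin. No. Rep.", 1356, 744, 48.45),
]


def total_wirelength(num_nets, avg_net_length):
    """Total wirelength in lambda: a proxy assumed for this sketch."""
    return num_nets * avg_net_length


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


WIRELENGTHS = [total_wirelength(n, avg) for _, n, avg, _ in CORES_16BIT]
POWERS_MW = [power for _, _, _, power in CORES_16BIT]

if __name__ == "__main__":
    for (name, _, _, power), wl in zip(CORES_16BIT, WIRELENGTHS):
        print(f"{name}: wirelength = {wl} lambda, power = {power} mW")
    print(f"correlation = {pearson(WIRELENGTHS, POWERS_MW):.3f}")
```

A high correlation here would support the reading that interconnect dominates the differences between cores, although seven data points at a single wordlength cannot separate the effect of wire length from that of switching activity.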



Figure 1. Detail of Netlength distributions for 16-bit multipliers



Figure 2. Average activity on different net lengths for 16-bit Carry-Save (top) and Wallace Tree (bottom).

#### 4.2 Architectural Transformations

In the case where regular multiplier structures do not meet performance requirements, the designer has the choice of either using a single fast multiplier core or applying speed-up transformations to a regular structure such as the Carry-Save array to achieve the required speed. To examine the effects of these transformations, four different implementations of a multiplier operating at 100MHz were examined: a single Wallace Tree multiplier, two Carry-Save multipliers operating in parallel at 50MHz, four Carry-Save multipliers operating in parallel at 25MHz, and a pipelined Carry-Save multiplier. Results for these structures are shown in Table 4. The parallel implementations consume slightly less power than the Wallace Tree (133.84 to 135.06 mW against 140.91 mW) but occupy roughly two to four times the area, so the irregular structure outperforms the transformed implementations when power-area product is considered.

| Name                        | Power<br>(mW) | Area<br>($10^{-6}\ \text{mm}^2$) |
|-----------------------------|---------------|--------------------------------------|
| Wallace Tree                | 140.91        | 0.56                                 |
| Pipelined Carry-Save        | 144.54        | 0.59                                 |
| Parallel Carry-Save (2 PEs) | 135.06        | 1.06                                 |
| Parallel Carry-Save (4 PEs) | 133.84        | 2.11                                 |

Table 4. Performance of transformed structures.
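The power-area comparison can be made explicit by multiplying the two columns of Table 4. The following is a minimal sketch using only the tabulated values. The dictionary and function names are chosen here for illustration, and the product is left in the table's own units (mW times the area unit of Table 4).

```python
# (power in mW, area in the units of Table 4), copied from Table 4.
TABLE4 = {
    "Wallace Tree": (140.91, 0.56),
    "Pipelined Carry-Save": (144.54, 0.59),
    "Parallel Carry-Save (2 PEs)": (135.06, 1.06),
    "Parallel Carry-Save (4 PEs)": (133.84, 2.11),
}


def power_area_product(power_mw, area):
    """Power-area product in the combined units of Table 4."""
    return power_mw * area


PRODUCTS = {
    name: power_area_product(power, area)
    for name, (power, area) in TABLE4.items()
}

# Structure with the lowest power-area product and with the lowest power.
BEST_POWER_AREA = min(PRODUCTS, key=PRODUCTS.get)
LOWEST_POWER = min(TABLE4, key=lambda name: TABLE4[name][0])

if __name__ == "__main__":
    for name, product in sorted(PRODUCTS.items(), key=lambda kv: kv[1]):
        print(f"{name}: power-area product = {product:.2f}")
    print("Lowest power-area product:", BEST_POWER_AREA)
    print("Lowest power:", LOWEST_POWER)
```

Ranking by power alone and by power-area product picks different structures, which is the trade-off the table is meant to expose.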

## 5. CONCLUSIONS

The importance of regular structures when designing low power systems with a synthesis-based design flow has been highlighted. The activity on long nets inherent in the more irregular structures examined makes regular implementations attractive in cases where the necessary performance can be achieved. In cases where the performance requirements cannot be met by a regular structure, parallel and pipelined solutions provide low power but inferior power-area products. By developing accurate models for the performance of each fundamental building block of a DSP system under different operating conditions, a high-level power prediction/estimation capability can be constructed.

## 6. ACKNOWLEDGMENTS

The technical assistance of ISS Ltd. and the financial assistance of the European Union ESPRIT program, the European Social Fund and the Engineering and Physical Sciences Research Council are gratefully acknowledged.

## 7. REFERENCES

- [1] Brodersen, R., Chandrakasan, A., and Sheng, S. "Low-Power Signal Processing Systems", VLSI Signal Processing V, pp 3-13, 1992.
- [2] Chandrakasan, A. and Brodersen, R. Low Power Digital Design, Kluwer Academic Publishers, 1996.
- [3] Chandrakasan, A., Sheng, S., Brodersen, R. "Low Power CMOS Digital Design", IEEE JSSC, Vol. 27, pp 473-484, 1992.
- [4] Lee, T. and Cong, J. "The new line in IC design", IEEE Spectrum, Vol 34, No. 3, pp 52-58, 1997.
- [5] McCanny, J., Ridge, D., Hu, Y. and Hunter, J. "Hierarchical VHDL Libraries for DSP ASIC Design", Proceedings ICASSP-97 Munich, Vol. 1, pp 675-679.
- [6] Mehra, R., Guerra, L. and Rabaey, J. "Low-Power Architectural Synthesis and the Impact of Exploiting Locality", Journal of VLSI Signal Processing Systems, Vol. 13, pp 239-258, 1996.
- [7] Mehra, R. and Rabaey, J. "Exploiting Regularity for Low-Power Design", Proceedings of the International Conference on Computer-Aided Design, 1996.