# Power Analysis and Implementation of a Low-Power 300-MHz 8-b x 8-b Pipelined Multiplier

Jinn-Shyan Wang, Department of Electrical Engineering, National Chung Cheng University, 160 San-Hsing, Ming-Hsiung, Chia-Yi 621, Taiwan. Tel: 886-5-2720411 ext. 5321; Fax: 886-5-2720862; e-mail: ieegsw@ccunix.ccu.edu.tw

Abstract - This paper analyzes the power consumption of a pipelined array multiplier. To realize a low-power pipelined multiplier accurately, an analytical model of the clocking system is presented. Simulation results show that the storage element is the key component in a high-performance pipelined multiplier macro. Compared with conventional DFFs and latches, the new low-power pulse-triggered TSPC flip-flop (PTTFF) [6] reduces the total power of a pipelined multiplier macro by 34 to 62 percent.

### I. Introduction

It is recognized that a high-performance DSP or CPU chip usually requires a high-speed, low-power multiplier. In previous works, multiplier design focused mainly on speed improvement. A pipelined architecture [1] is the common method of obtaining high speed; in particular, the heavily pipelined-to-bit architecture is adopted for very-high-speed multipliers [2]-[4]. In recent years, high speed and low power have become equally important goals in VLSI chip design. In this paper, we study the design considerations and derive the design methodology for a low-power, heavily pipelined-to-bit multiplier.

Owing to the heavily pipelined structure, a carry-save array is usually adopted [2]-[4] in the architecture design without any encoding techniques. A pipelined array multiplier is built from four kinds of basic components: clock drivers, storage elements, adders, and other logic gates.

In previous papers [2]-[4], the reported power consumption of the multiplier usually accounts for the array part only; the power consumption of the clock driver is either not shown explicitly or not included in the calculation. In fact, the clocking system, which comprises the clock driver and the storage elements, may account for a large portion of the total power and cannot be overlooked in the design of a low-power pipelined multiplier macro.

This paper is organized as follows. Section II analyzes the power sources in the heavily pipelined multiplier to derive the design concept. Section III presents whole-chip simulation and comparison results for a low-power multiplier design. The experimental results are described in Section IV, and the last section gives conclusions.

Po-Hui Yang, Department of Electrical Engineering, National Chung Cheng University, 160 San-Hsing, Ming-Hsiung, Chia-Yi 621, Taiwan. Tel: 886-5-2720411 ext. 5321; Fax: 886-5-2720862; e-mail: d8442005@ccunix.ccu.edu.tw

### II. Power Sources in the Pipelined Multiplier

A simplified block diagram of the pipelined multiplier is drawn in Fig. 1 with the parasitic capacitance of the clock distribution network shown explicitly. Based on a previous analysis for the clocking system [6], the total power  $P_{total}$  can be derived as follows.

$$\begin{aligned}
P_{total} &= P_{clock} + n_1 \times P_{storage} + n_2 \times P_{adder} + n_3 \times P_{gate} \\
&= P_{cd} + P_{gw} + P_{lw} + P_g + n_1 \times P_{storage} + n_2 \times P_{adder} + n_3 \times P_{gate} \\
&= (1 + \mathbf{b}^{-1} + \mathbf{b}^{-2} + \cdots + \mathbf{b}^{-N}) \times V_{DD}^{2} \times f \times (C_j + C_{gw} + n_1 \times C_{lw} + n_1 \times C_g) \\
&\quad + n_1 \times P_{storage} + n_2 \times P_{adder} + n_3 \times P_{gate}
\end{aligned} \tag{2}$$



Fig. 1 The simplified block diagram of the pipelined multiplier.

In the above derivation, $P_{cd}$, $P_{gw}$, $P_{lw}$, and $P_g$ denote the power consumed in the clock driver itself, the global clock wiring capacitance, the local clock wiring capacitance, and the total clocked gate capacitance, respectively. $n_1$, $n_2$, and $n_3$ are the total numbers of storage elements, adders, and logic gates in the pipelined multiplier. The other parameters are defined as follows.

- $N$: Number of tapering stages in the tapered clock driver
- **b**: Tapering factor of the clock driver
- $V_{DD}$: Power supply voltage
- $f$: Clock frequency
- $C_j$: Source-drain junction capacitance of the clock driver output
- $C_{gw}$: Total parasitic capacitance of the global clock wires
- $C_{lw}$: Parasitic capacitance of the local clock wires in one storage element
- $C_g$: Gate capacitance of the clocked transistors in one storage element
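
The clock term of Equation (2) can be sketched numerically as follows. This is only an illustrative sketch: the function names, the three-stage driver, and any capacitance values passed in are assumptions made for the example and are not taken from the design described in this paper.

```python
def tapering_overhead(b: float, n_stages: int) -> float:
    """Geometric series 1 + b**-1 + ... + b**-N from Equation (2)."""
    return sum(b ** -k for k in range(n_stages + 1))


def clock_power(b, n_stages, vdd, f, c_j, c_gw, n1, c_lw, c_g):
    """Clock-network power (watts) following the clock term of Equation (2).

    Capacitances are in farads, vdd in volts, f in hertz.
    """
    # Load seen by the last driver stage: junction + global wiring
    # + per-storage-element local wiring and clocked gate capacitance.
    c_load = c_j + c_gw + n1 * (c_lw + c_g)
    return tapering_overhead(b, n_stages) * vdd ** 2 * f * c_load


if __name__ == "__main__":
    # ASSUMPTION: a three-stage tapered driver, chosen only for illustration.
    for b in (2, 4, 10):
        print(f"b = {b:2d}: overhead factor = {tapering_overhead(b, 3):.4f}")
```

With three stages, the overhead factor is 1.875 for **b** = 2 and 1.111 for **b** = 10, so a larger tapering factor leaves less power spent in the intermediate driver stages.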

Equation (2) assumes that the clock driver is designed as a tapered buffer with a tapering factor **b**. For low-power operation, **b** should be chosen as large as possible [7]. If **b** = 10 is chosen, the geometric series in Equation (2) is approximately 1.11, and the equation can be simplified further as:

$$\begin{aligned}
P_{total} &= 1.11 \times V_{DD}^{2} \times f \times (C_j + C_{gw} + n_1 \times C_{lw} + n_1 \times C_g) + n_1 \times P_{storage} + n_2 \times P_{adder} + n_3 \times P_{gate} \\
&= V_{DD}^{2} \times f \times [1.11 \times (C_j + C_{gw} + n_1 \times C_{lw} + n_1 \times C_g) + n_1 \times C_{storage} + n_2 \times C_{adder} + n_3 \times C_{gate}] \\
&= V_{DD}^{2} \times f \times (1.11 \times C_{clock} + n_1 \times C_{storage} + n_2 \times C_{adder} + n_3 \times C_{gate})
\end{aligned} \tag{3}$$

The four variables in Equation (3) are defined as follows.

- $C_{clock}$: Equivalent capacitance of the clock signal at frequency $f$
- $C_{storage}$: Equivalent capacitance of one storage element at frequency $f$
- $C_{adder}$: Equivalent capacitance of one one-bit adder at frequency $f$
- $C_{gate}$: Equivalent capacitance of one logic gate at frequency $f$

### III. Low-Power Pipelined Multiplier Clocking System

The storage elements considered, shown in Figs. 2 and 3, and their pre-layout simulation results are summarized in Table 1. The supply voltage is set to 3.3 V.

In simulating each circuit to obtain its power consumption, the performance of the clocking system was taken into consideration: the parasitic capacitance of the clock distribution for each circuit was estimated first, and the clock buffer was then designed according to this capacitance. To simplify the comparison, we completed the layout only for the multiplier that was actually implemented.

After layout, the capacitance of the clock distribution was extracted and then added back to each netlist according to its clocking strategy, as shown in Fig. 4.



Fig. 2 (a) 9T-DFF, (b) 8T-DFF, (c) 6T-NDL, (d) 6T-PDL, (e) 5T-NDL, and (f)  $C^2MOS$ 



Fig. 3 Pulse-triggered TSPC flip-flop

Table 1 Pre-layout simulation results ($f = 300$ MHz, $V_{DD} = 3.3$ V)

| Pipelined storage element and clocking | Clock | $C_{gw}$ estimation | $t_{rf}$ (ns) | # of clocked Tr | Average power (mW) |
|----------------------------------------|-------|---------------------|---------------|-----------------|--------------------|
| 9T-DFF; Fig. 2(a) | $f$ | Fig. 4(a) | 1.65 | 1728 | 72.48 |
| $C^2MOS$; Fig. 2(f) | $f$ and $\bar{f}$ | Fig. 4(c) | 1.65 | 1770 | 57.43 |
| 8T-DFF; Fig. 2(b) | $f$ | Fig. 4(a) | 1.20 | 864 | 47.72 |
| 6T-PDL + 6T-NDL; Fig. 2(c) and 2(d) | $f$ | Fig. 4(a) | 1.65 | 864 | 46.30 |
| 6T-NDL; Fig. 2(c) | $f$ and $\bar{f}$ | Fig. 4(b) | 1.65 | 864 | 42.03 |
| PTTFF; Fig. 3 | $f$ | Fig. 4(a) | 0.8 | 432 | 38.68 |



Fig. 4 Clock distribution capacitances for (a) TSPC, (b) two-phase, and (c) two-phase with dynamic FA.

In this study, to compare the power consumption on an equal footing, the clock driver in each design is optimized to drive its pipelined multiplier at a maximum frequency of 300 MHz. The clock driver sizing depends not only on the clock loading but also on the required rise/fall time of the clock signal. The maximum rise/fall time for each case is also shown in Table 1. The PTTFF design needs a sharper rise/fall time, and this requirement tends to increase its power consumption. Fortunately, this penalty is more than compensated because the PTTFF presents the smallest load to the clock signal.



Fig. 5 Comparison of power dissipation for each multiplier.

From Table 1 and Fig. 5, several observations can be made.

- A. The power consumption of the clock driver ranges from 45% to 70% of the total power in each case. It therefore cannot be overlooked in the design of a low-power, heavily pipelined circuit.
- B. The adders and storage elements account for almost all of the remaining power consumption.
- C. The analysis in Section II does lead to the right guidelines for designing a pipelined multiplier, and the design and use of the PTTFF does result in superior performance.
- D. Compared with the pipelined multiplier using the S&D FA [4] and $C^2MOS$ latches, the proposed design saves up to 33% of the power, because the clock loading of the former design is much larger than that of the latter.
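
The relative savings implied by Table 1 can be checked with a short calculation. The sketch below uses only the average-power column of Table 1; the dictionary and function names are illustrative, and treating the $C^2MOS$ row as the S&D FA baseline of observation D is an assumption.

```python
# Average power in mW, copied from Table 1 (f = 300 MHz, VDD = 3.3 V).
TABLE1_POWER_MW = {
    "9T-DFF": 72.48,
    "C2MOS": 57.43,
    "8T-DFF": 47.72,
    "6T-PDL + 6T-NDL": 46.30,
    "6T-NDL": 42.03,
    "PTTFF": 38.68,
}


def saving_percent(reference_mw: float, proposed_mw: float) -> float:
    """Power saved by the proposed design, as a percentage of the reference."""
    return 100.0 * (reference_mw - proposed_mw) / reference_mw


def savings_vs(proposed: str = "PTTFF") -> dict:
    """Saving of the proposed design relative to every other row of Table 1."""
    p = TABLE1_POWER_MW[proposed]
    return {name: saving_percent(mw, p)
            for name, mw in TABLE1_POWER_MW.items() if name != proposed}


if __name__ == "__main__":
    for name, pct in savings_vs().items():
        print(f"PTTFF vs {name:16s}: {pct:4.1f}% lower power")
```

For the $C^2MOS$ row this gives about 32.6%, which is in line with the "up to 33%" figure quoted in observation D.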

### IV. Experimental Results

To verify the low-power design concept, an 8-b x 8-b pipelined array multiplier [5] is implemented in a single-poly double-metal 0.6-$\mu$m CMOS technology. Simulation waveforms are shown in Fig. 6. The photomicrograph of the chip is shown in Fig. 7. Post-layout simulation results and experimental results are summarized in Table 2.

Table 2 Characteristics of the implemented 8-b x 8-b heavily pipelined multiplier.

| Parameter          | Value                                               |
|--------------------|-----------------------------------------------------|
| Process            | 0.6-μm SPDM CMOS                                    |
| Core area          | $420\,\mu\text{m} \times 720\,\mu\text{m}$          |
| Transistor count   | 3849                                                |
| Power supply       | 3.3 V                                               |
| Clock frequency    | 300 MHz                                             |
| Pipeline latency   | 17 cycles                                           |
| Power consumption  | RMS: 58.7 mW (post-sim); RMS: 52.4 mW (measurement) |
| Power distribution | Clock gen.: 60.8%; Array: 39.2% (RMS post-sim)      |

The chip operates successfully at 300 MHz with the rise/fall time of the clock signal designed to be 0.8 ns. The operating frequency can in fact be raised above 500 MHz when the rise/fall time of the clock signal is reduced from 0.8 ns to 0.6 ns. However, such an operating frequency would exceed the capability of the tester, so we decided to keep the current design. The experimental results agree with the power analysis in Section III: the clock generator accounts for about 60 percent of the total power consumption. In this chip design, the PTTFF, in which only one transistor is clocked, significantly reduces both the clock-generator power and the total power.
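
The figures in Table 2 can be related back to the model of Section II by converting power into an equivalent switched capacitance, $C_{eq} = P / (V_{DD}^2 \times f)$, and into energy per clock cycle. The sketch below is a back-of-the-envelope check only; it assumes that the reported power values can be treated as average power, and the function names are illustrative.

```python
VDD = 3.3             # volts, from Table 2
F_CLK = 300e6         # hertz, from Table 2
P_POST_SIM = 58.7e-3  # watts, post-layout simulation
P_MEASURED = 52.4e-3  # watts, measurement
CLOCK_SHARE = 0.608   # clock-generator share of post-sim power


def equivalent_capacitance(power_w: float, vdd: float, f_hz: float) -> float:
    """Total equivalent switched capacitance (farads), from P = C * VDD^2 * f."""
    return power_w / (vdd ** 2 * f_hz)


def energy_per_cycle(power_w: float, f_hz: float) -> float:
    """Energy drawn per clock cycle, in joules."""
    return power_w / f_hz


if __name__ == "__main__":
    c_sim = equivalent_capacitance(P_POST_SIM, VDD, F_CLK)
    c_meas = equivalent_capacitance(P_MEASURED, VDD, F_CLK)
    print(f"Equivalent capacitance (post-sim): {c_sim * 1e12:.2f} pF")
    print(f"Equivalent capacitance (measured): {c_meas * 1e12:.2f} pF")
    print(f"Clock-generator power (post-sim):  {CLOCK_SHARE * P_POST_SIM * 1e3:.1f} mW")
    print(f"Energy per cycle (measured):       "
          f"{energy_per_cycle(P_MEASURED, F_CLK) * 1e12:.1f} pJ")
```

Under these assumptions the post-layout figure corresponds to roughly 18 pF of equivalent switched capacitance, and the measured power is about 11% below the post-layout simulation.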



Fig. 6 Simulation results.



Fig. 7 Chip photomicrograph.

### V. Conclusions

This paper describes the analysis of and experimental results for a low-power pipelined multiplier. Analysis and simulation results show that the power consumption of the clocking system cannot be overlooked in the design of such a multiplier. Moreover, the design of the storage element is the key to low overall power. The power of the proposed design is only 47% of that of a multiplier designed with conventional S&D full adders and $C^2MOS$ latches at $V_{DD}$ = 3.0 V. The chip has been fabricated, and the experimental results agree with the simulation results.

### References

[1] D. A. Henlin, M. T. Fertsch, M. Mazin, and E. T. Lewis, "A 16 bit x 16 bit pipelined multiplier macrocell," *IEEE J. Solid-State Circuits*, vol. 20, no. 2, pp. 542-547, Apr. 1985.

[2] F. Lu and H. Samueli, "A 200-MHz CMOS pipelined multiplier-accumulator using a quasi-domino dynamic full-adder cell design," *IEEE J. Solid-State Circuits*, vol. 28, no. 2, pp. 123-132, Feb. 1993.

[3] D. Somasekhar and V. Visvanathan, "A 230-MHz half-bit level pipelined multiplier using true single-phase clocking," *IEEE Trans. Very Large Scale Integration (VLSI) Systems*, vol. 1, no. 4, pp. 415-422, Dec. 1993.

[4] S.-J. Jou, C.-Y. Chen, E.-C. Yang, and C.-C. Su, "A pipelined multiplier-accumulator using a high-speed, low-power static and dynamic full adder design," *IEEE J. Solid-State Circuits*, vol. 32, pp. 114-118, Jan. 1997.

[5] C. R. Baugh and B. A. Wooley, "A two's complement parallel array multiplication algorithm," *IEEE Trans. Comput.*, vol. C-22, pp. 1045-1047, Dec. 1973.

[6] J.-S. Wang and P.-H. Yang, "A pulse-triggered TSPC flip-flop for high-speed low-power VLSI design applications," in *Proceedings of IEEE ISCAS 1998*, pp. II-93~II-96.

[7] H. J. M. Veendrick, "Short-circuit dissipation of static CMOS circuitry and its impact on the design of buffer circuits," *IEEE J. Solid-State Circuits*, vol. SC-19, no. 4, pp. 468-473, Aug. 1984.

[8] M. Afghahi and C. Svensson, "A unified single-phase clocking scheme for VLSI systems," *IEEE J. Solid-State Circuits*, vol. SC-25, no. 1, pp. 225-233, Feb. 1990.