# Energy Delay Analysis of Partial Product Reduction Methods for Parallel Multiplier Implementation 

R.V.K. Pillai, D. Al-Khalili ${ }^{\dagger}$ and A.J. Al-Khalili<br>Concordia University, Montreal, CANADA; ${ }^{\dagger}$ Royal Military College, Kingston, CANADA


#### Abstract

This paper examines the energy delay implications of partial product reduction methods employed in parallel multiplier implementations. Radix 4 Modified Booth Algorithm (MBA) is currently the most popular choice for partial product reduction in parallel multipliers although 4:2 compressors can also produce equivalent results. Our energy delay analysis of these two schemes taking into account the architectural as well as circuit implementation issues suggests the superiority of the 4:2 compressor based partial product reduction technique as far as circuit delays, power consumption and architectural regularity are concerned. SPICE simulations of partial product generation using these schemes for an 8 bit multiplier suggest a worst case energy delay advantage of the order of $36 \%$ and $15 \%$ respectively for the 4:2 compressor based scheme in comparison with two different implementations of MBA. The corresponding figures for power reduction are of the order of $26 \%$ and $11 \%$ respectively.


## I Introduction

Traditionally, Booth encoding techniques [1] [2] had been widely employed for partial product reduction in parallel multiplier implementations. For parallel multipliers using radix 4 MBA , the number of partial products are reduced to $n / 2$ for an $n X n 2$ 's complement multiplication. This same task can be accomplished by the use of 4:2 compressors [3] also, as proposed by D. Villeger et al. [4]. The authors in [4] have shown that $4: 2$ compressors can reduce the number of partial products to $n / 2$ in less time and using fewer gates compared to radix 4 MBA while C. Nagendra et al. [5] also arrived at similar results. Norio Ohkubo et al. [6] acknowledged the equivalence of partial product generation using MBA and $4: 2$ compressors, though they favored partial product generation using the MBA scheme owing to the area advantage offered by such a scheme. Even though these authors speculated that partial product generation need not be accomplished by using the MBA scheme alone, they didn't weigh the energy delay measures of various schemes for reaching their conclusions. This paper deals with the evaluation of partial product reduction using radix 4 MBA
and $4: 2$ compression technique on the basis of the energy delays of the relevant schemes.

## II Circuit Realization

In radix 4 MBA, partial product reduction is accomplished by selecting $0,+1,-1,+2$ or -2 times the multiplicand ( Y ) as the partial product in accordance with the values of the relevant multiplier (X) bits. There are two possible schemes to implement MBA, MBA I and MBA II. The gate level representations of these schemes are shown in Figs. 1 and 2. (In Figs. 1, 2 and 3, the triangular blocks represent 2X1 multiplexors). The MBA I makes use of three control signals $X 1, X 2$ and $N$ for effecting partial product bit generations while the MBA II makes use of five control signals, viz. $X 1,-X 1, X 2,-X 2$ and $Z$. Figs. 1 and 2 also lists the signal probabilities $\left(P_{g}\right)$ and fanouts ( $F_{g}$ ) of various circuit nodes. Fig. 3 shows the schematic and signal probabilities of a multiplexor based 4:2 compressor [6] [7]. The circuits had been implemented using CPL building blocks [7].

In Fig. 1, the generation of the negation control signal $N$ differs from popular implementations, in which case the logic is over-simplified as $N=X_{i}$ where $X_{i}$ represents the MSB of the multiplier bit group being scanned. The partial product must be zero for the bit combination $X_{i}=X_{i-1}=X_{i-2}=1$. The assertion of control signals for this condition with $N=$ $X_{i}$ are $X 1=0, X 2=0$ and $N=1$. This condition generates a partial product with all the bits asserted high. The addition of a correction bit of $l$ (for 2's complementation) at bit position zero results in a partial product that is effectively zero. Though the final partial product is effectively zero and the arithmetic result is correct, this condition generates unwanted activity, (which ripples through the entire carry save adder array as well as the final carry propagate adder) resulting in power wastage. The probability of occurrence of this situation is $1 / 8$ (considering a signal probability of $1 /$ 2 for the multiplier bits) for every partial product and hence the resulting power wastage is significant enough to warrant a review of the design decisions related to the generation of $N$.

## III Delay Model

Figs. 4 and 5 illustrate the delay models of MBA and 4:2 compressor based partial product generation schemes. Our delay estimate of MBA I is given by $\tau_{\max }=10.5 \tau+3 S \tau$ while that of MBA II and $4: 2$ based schemes are respectively $\tau_{\max }=9.76 \tau+3 S \tau$ and $\tau_{\max }=5.2 \tau+3 S \tau$ where $\tau$ represents the delay of a minimum sized inverter driving an identical inverter and $S$ represents the stage ratio
of the drivers driving the control signals for MBA and multiplier/multiplicand bits for 4:2 compressor based schemes respectively. In our delay analysis, we considered a delay of $1.3 \tau$ for a single stage of CPL gate while the delay of the 5 X 1 CPL multiplexor was estimated as $1.5 \tau$. The stage ratio $S$ of the drivers captures the effect of parasitic capacitance on circuit delays. The larger the parasitic loading, the higher the value of $S$. The value of stage ratio $S$ is different in various implementations owing to the differences in layout geometries. In the above delay analysis, a three stage driver implementation is assumed for the distribution of the control signals for MBA as well as multiplier/multiplicand bits for $4: 2$ compressor based scheme. The worst case fanout of multiplier bits for MBA I is four and hence a single stage driver is envisaged for the distribution of these signals resulting in a delay of $4 \tau$. For MBA II, the worst case fanout of multiplier bits is eight, resulting in a driver delay of $5.66 \tau$ considering a two stage driver for the distribution of these signals.

## IV Energy Delay Analysis

The time averaged power consumption at the output of a CMOS logic structure is given by, $P=A F f C_{L} V_{D D}^{2}=P_{g}\left(1-P_{g}\right) f C_{L} V_{D D}^{2}$, where $A F$ is the activity factor, $f$ is the operating frequency, $P_{g}$ is the probability of finding a logic high at the node under consideration, $C_{L}$ is the capacitive loading at the node and $V_{D D}$ is the power supply voltage. The load capacitance at the output of a gate is proportional to the fanout of the gate. The total energy consumption (during one cycle of operation) due to signal dynamics at circuit nodes, in any logic structure can be expressed by the following relation.

$$
\begin{equation*}
E \propto \sum_{\forall g} P_{g}\left(1-P_{g}\right) F_{g} \tag{1}
\end{equation*}
$$

where $F_{g}$ represents the fanout of the $g$ th gate. The right hand side of the above equation represents the energy consumption measure of logic circuits. The energy delay product of the logic structure is given by

$$
\begin{equation*}
E D \propto \sum_{\forall g} P_{g}\left(1-P_{g}\right) F_{g} \tau_{\max } \tag{2}
\end{equation*}
$$

where $\tau_{\max }$ represents the delay of the critical path of the circuit. The above relation can be used for evaluating the energy delay measures of gate level logic representations on the basis of signal probabilities, fanouts and circuit delays. The following expression gives an estimate of the energy delay measure of MBA I, for the generation of two partial products.

$$
\begin{align*}
E D M_{I} & =\left[\frac{527}{256} n+\left(\frac{1+k}{\eta_{1}}\right) \frac{15}{8} n\right](10.5 \tau+3 S \tau) \\
& +\left[\left(\frac{1+k}{\eta_{2}}\right) \frac{15}{32}(n+5)+\frac{1491}{256}\right](10.5 \tau+3 S \tau) \tag{3}
\end{align*}
$$

where $n$ represents the bitwidth of multiplier and multiplicand data, $k$ represents the co-efficient of parasitic loading (for long interconnects, in which case $C_{L}=(1+k) C_{g}$ where $C_{g}$ represents the actual gate capacitance loading the interconnects), $\eta_{1}$ represents the efficiency of the driver interfacing the control signals $X 1, X 2$ and multiplicand bits while $\eta_{2}$ represents the efficiency of the driver interfacing $N$, and S represents the stage ratio of the drivers. In the above analysis, the signal probabilities of the multiplier and multiplicand bits had been assumed to be $P_{1}=P_{0}=1 / 2$. The corresponding energy delay measures for MBA II and 4:2 compressor based schemes are given by

$$
\begin{align*}
E D M_{I I}= & {\left[\left(\frac{1+k}{\eta}\right)\left(\frac{41}{16} n+\frac{69}{32}\right)\right](9.76 \tau+3 S \tau) } \\
& +\left[\frac{15}{16} n+\frac{47}{4}\right](9.76 \tau+3 S \tau)  \tag{4}\\
E D M_{4 \rightarrow 2} & =[3.575 n-1.9832](5.2 \tau+3 S \tau) \\
& +\left(\frac{1+k}{\eta}\right) 2 n(5.2 \tau+3 S \tau) \tag{5}
\end{align*}
$$

The co efficient of parasitic loading $k$ and fanout $F_{g}$ decide the value of stage ratio $S$, as given by $S=\exp [(\ln$ $\left.\left.\left((1+k) F_{g}\right)\right) / 3\right]$ for a three stage driver. The driver efficiencies $(\eta)$ can be evaluated from a knowledge of the stage ratio as well as number of stages.

## V Circuit Simulation

CPL realizations of the various partial product generation schemes for an 8X8 multiplier had been simulated using HSPICE with level 3 device models of Nortel 0.8 micron BATMOS process. The simulations were carried out for the generation of two partial products using each scheme. The test vectors ( X and Y ) for the simulation had been generated using random number generation. In our experiments, we generated the random test vectors using a uniform distribution and the circuits were simulated for forty cycles of operation. The HSPICE measurements of time averaged power and delays of critical paths had been used for the computation of the energy delays of various schemes.

## VI Results

Figs. 6 and 7 give the $\%$ reduction in energy delay of the $4: 2$ compressor based partial product generation scheme in comparison with MBA schemes, as given by the energy delay models of equations (3) (4) and (5). Fig. 6 shows the reduction with respect to MBA I while Fig. 7 shows the reduction with respect to MBA II. From Figs. 6 and 7, it is clear that the energy delay improvement of $4: 2$ compressor based scheme is better than $20 \%$ for a 24 X 24 bit unsigned multiplier, considering a parasitic capacitance contribution that is equal to gate load capacitances for longer interconnects. The relevant figures for $8 \mathrm{X} 8,16 \mathrm{X} 16$ and 32 X 32 multipliers are of the order of $39 \%, 27 \%$ and $18 \%$ respectively.

Figs. 8 and 9 illustrate the performance improvement of the 4:2 compressor based scheme on the basis of HSPICE simulations of the various schemes. From Fig. 8, it can be seen that the $4: 2$ compressor based scheme offers an energy delay reduction of the order of better than $36 \%$ in comparison with MBA I and $15 \%$ against MBA II while the reduction in power consumption is better than $26 \%$ and $10 \%$ respectively, for various instances of parasitic loading on long interconnects. Fig. 9 gives the corresponding performance advantages with voltage scaling, considering a parasitic capacitance loading of 0.2 pF . The energy delay advantage of the $4: 2$ compressor based scheme is better for lower operating voltages, suggesting that this scheme is better suited for low voltage operation. SPICE simulations of the various schemes give performance figures that are not appreciably different from that theoretically predicted. MBA schemes are vulnerable to power losses due to glitching - owing to the dissimilar delays of control signals and multiplicand bits, by virtue of which the power consumption of these schemes is greater than that theoretically anticipated. Another subtle difference from theoretical analysis occurs as far as signal transmission delays are concerned. The CPL cells used for this analysis exhibited delays greater than $1.3 \tau$ because of which the delay of MBA I scheme had been appreciably greater than that of the $4: 2$ compressor based scheme while the delay of MBA II had been found comparable. Because of these issues, the energy delay and power advantages of $4: 2$ compressor based scheme over MBA I scheme is better than that theoretically anticipated while these figures are less than that theoretically anticipated in comparison with MBA II.

## VII Conclusion

In contrast to MBA schemes, the 4:2 compressor based partial product generation scheme offers the best results as far as circuit delays, power consumption and energy delay are concerned. Partial product reduction using MBA I uses six control signals for the generation of two partial products while MBA II uses ten control signals. The effect of parasitic loading on these lines is much greater than that on the four multiplier bit lines of the $4: 2$ compressor based scheme and hence this scheme is extremely suitable for the implementation of wide multipliers. With MBA, an extra partial product is generated for integer multiplication when $n$ is even. This, together with the sign extension correction bit and 2's complementation bit, effectively increases the number of partial products. Post processing of the extra partial products comes with added power as well as delay penalties. For design synthesis applications, the layout generation can be much faster for the $4: 2$ based schemes because of their structural regularity. In addition to the power and delay advantages, the computation of the 'sticky' bit for IEEE floating point multipliers turns out to be much simpler with the $4: 2$ compressor based partial product generation scheme, because of the absence of negative partial products.

## References

[1] Andrew D. Booth, "A signed binary multiplication technique", Quarterly J. Mechan. Appl. Math., 4: 236 240 (1951).
[2] O.L.MacSorley, "High speed arithmetic in binary computers", Proc. IRE, Vol. 49, pp. 67-91, 1961.
[3] A.Weinberger, "4:2 Carry save adder module", IBM Tech. Disclosure Bulletin, 1981, 23
[4] D. Villeger and V. G. Oklobdzija, "Evaluation of Booth encoding techniques for parallel multiplier implementation", Electronics Letters, Vol. 29, No. 23, pp. 2016-2017, 11th November 1993.
[5] C. Nagendra, R.M. Owens and M.J. Irwin, "Design Tradeoffs in High Speed Multipliers and FIR Filters", Proceedings of the 9th International Conference on VLSI Design, January 1996, Bangalore, India.
[6] Norio Ohkubo, Makoto Suzuki, Toshinobu Shinbo, Toshiaki Yamanaka, Akihiro Shimizu, Katsuro Sasaki, and Yoshinobu Nakagome, "A 4.4nS CMOS 54X54-b Multiplier Using Pass-Transistor Multiplexer," IEEE Journal of Solid State Circuits, Vol. 30, pp. 251-257, March 1995.
[7] J. C. H. Latour, "High - Speed Complex Multiplier Integrated Circuit with Emphasis on Low Power Design", Masters Thesis, Royal Military College, Kingston, CANADA, 1995.

Table 1: Signal
Probabilities and Fanouts

| Node | $\mathrm{P}_{\mathrm{g}}$ | $\mathrm{F}_{\mathrm{g}}$ |
| :---: | :---: | :---: |
| 1 | $1 / 2$ | n |
| 2 | $1 / 2$ | 1 |
| 3 | $1 / 4$ | n |
| 4 | $3 / 4$ | 1 |
| 5 | $3 / 8$ | $\mathrm{n}+5$ |
| 6 | $1 / 4$ | 1 |
| 7 | $1 / 8$ | 1 |
| 8 | $3 / 8$ | 1 |
| 9 | $15 /$ | 2 |
| 10 | $1 / 4$ | 1 |
| 11 | $7 / 16$ | 2 |
| 12 | $1 / 8$ | 1 |
| 13 | $13 /$ | 4 |
| 32 |  |  |

(d) - Generation of the nth, $(\mathrm{n}+1)$ th and $\mathrm{C}_{-1}$ bits

Fig. 1 - Implementation of modified Booth algorithm - Scheme I

(a) - Generation of control signals

(b) Partial product bit generation

Table 2: Signal Probabilities and Fanouts

| Node | $\mathrm{P}_{\mathrm{g}}$ | $\mathrm{F}_{\mathrm{g}}$ |
| :---: | :---: | :---: |
| 1 | $1 / 2$ | 2 |
| 2 | $1 / 4$ | $\mathrm{n}+1$ |
| 3 | $1 / 4$ | $\mathrm{n}+2$ |
| 4 | $1 / 4$ | 2 |
| 5 | $1 / 8$ | $\mathrm{n}+1$ |
| 6 | $1 / 4$ | 2 |
| 7 | $1 / 8$ | $\mathrm{n}+2$ |
| 8 | $1 / 4$ | $\mathrm{n}+1$ |
| PP <br> bit | $3 / 8$ | 2 |
| 9 | $3 / 8$ | 4 |


(c) - Generation of sign and carry in $\left(\mathrm{C}_{-1}\right)$ bits

Fig. 2 - Implementation of modified Booth algorithm - Scheme II


Fig. 3-4:2 compressor


Fig. 5 - Delay model of 4:2 Compressor based Partial product generation


Fig. 6-\% Reduction in ED


Fig. 7 - \% Reduction in ED


Fig. 8 - \% Reduction in ED and Power


Fig. 9-\% Reduction in ED and Power

