# Decreasing Low-Voltage Manufacturing-Induced Delay Variations with Adaptive Mixed-Voltage-Swing Circuits

L. Richard Carley Carnegie Mellon University Dept. of Electrical and Comp. Eng. Pittsburgh, PA 15213 USA 01-412-268-3597 carley@ece.cmu.edu

Akshay Aggarwal Carnegie Mellon University Dept. of Electrical and Comp. Eng. Pittsburgh, PA 15213 USA 01-412-268-7946 akshaya@ece.cmu.edu

Ram K. Krishnamurthy Microprocessor Research Labs Intel Corporation Hillsboro, OR 97124 USA 01-503-696-3275 ramk@ichips.intel.com

## 1. ABSTRACT

One of the major problems faced by the designer when operating CMOS static logic circuits at low power supply voltages (normalized to $V_{T}$) is that the delay spread introduced by today's IC manufacturing variations can increase dramatically. In this paper we describe an approach for decreasing the delay spread and power spread in ICs based on adaptively servoing the circuits between static CMOS operation and QuadRail operation. An on-chip series regulator employing a dummy delay path is used to generate the adaptive low-swing power supply rails, making this approach fully compatible with a standard CMOS IC design methodology. Simulation results are presented demonstrating that, for a 16\*16+36-bit multiplier-accumulator designed in a 0.5µm CMOS process, the proposed approach decreases the delay spread from 3.9X to 2.3X and the power spread from 3.6X to 1.8X.

## 1.1 Keywords

Low power CMOS logic, mixed-swing CMOS logic, manufacturing variations, low voltage logic circuits.

## 2. INTRODUCTION

Reducing the power dissipation of digital ICs is becoming one of the most important challenges facing the designer today. Voltage scaling of the power supply is one of the most efficient ways to reduce the power dissipation of static CMOS logic circuits. Although some degree of voltage scaling is a necessary corollary to decreasing process feature size, many advanced processes are scaled even further in operating voltage, primarily to decrease power dissipation. CMOS device threshold voltages are difficult to scale down with decreasing power supply voltages because of subthreshold leakage currents. In addition, the manufacturing variability in the threshold voltage is not scaling down proportionally with power supply voltages; therefore, these variations are becoming a larger fraction of the supply voltage. As a result, as the power supply voltages decrease, manufacturing variations cause unacceptably large variations in the delay and power dissipation of CMOS logic gates. Reducing the delay spread is important because the maximum system clock rate is typically limited by the delay of a given critical path. Hence, the specified clock rate for a given part must be based on the worst-case delay. In some applications selling different "speed" graded parts is possible (e.g., microprocessors), while in other applications the operating speed is fixed (e.g., a processor for a video rendering card). Reducing the power spread is important because the IC package has to be designed to handle the worst-case power.
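The way a fixed threshold-voltage variation turns into a larger delay spread at lower supply voltages can be illustrated with a small sketch. It assumes the alpha-power-law delay model (delay proportional to $V_{dd}/(V_{dd}-V_{T})^{\alpha}$), which is not the model used in this paper, and the threshold voltage, its variation, and the exponent below are hypothetical values chosen only for illustration.

```python
def gate_delay(vdd, vt, alpha=1.5, k=1.0):
    """Alpha-power-law delay estimate in arbitrary units (assumed model)."""
    if vdd <= vt:
        raise ValueError("model is only meaningful for vdd > vt")
    return k * vdd / (vdd - vt) ** alpha


def delay_spread(vdd, vt_nominal, vt_delta, alpha=1.5):
    """Ratio of slow-corner delay (high Vt) to fast-corner delay (low Vt)."""
    slow = gate_delay(vdd, vt_nominal + vt_delta, alpha)
    fast = gate_delay(vdd, vt_nominal - vt_delta, alpha)
    return slow / fast


# Hypothetical process: nominal Vt = 0.7 V with a +/-0.1 V manufacturing variation.
for vdd in (3.3, 2.5, 1.5):
    ratio = delay_spread(vdd, vt_nominal=0.7, vt_delta=0.1)
    print(f"Vdd = {vdd:.1f} V: slow/fast delay ratio = {ratio:.2f}")
```

Under these assumptions the slow/fast ratio grows as the supply is lowered, because the same threshold shift removes a larger fraction of the gate overdrive.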

Multiple voltage techniques have been reported for lowering the power dissipation by operating non-critical path gates at reduced voltages; e.g., [2]. These techniques employ multiple voltages while retaining the static CMOS based logic gate structure unaltered. A four-power-rail methodology called *Mixed-Swing QuadRail* has been proposed previously to construct standard digital logic gates using multiple voltages in a fine-grained way (within a gate) [3], [4].

In [13], an on-chip series-regulated version of Mixed Swing QuadRail was presented. This new methodology was used to design a 16\*16+36-bit DSP MAC fabricated in a 0.5µm CMOS process. This technique locally generated the low-swing supply rails from the regular, high-swing supply rails. In addition, [13] demonstrated an efficient sleep mode of operation for the series regulator. However, the inner rail voltages in [13] were adjusted in order to provide a fixed ratio between the off leakage current in the underdriven logic gate transistors and their on drive current. In this paper we explore the advantage of also taking the delay of the logic circuits into account in controlling the inner power supply rail voltages.

In Section 3, we will present a very brief background on series regulated mixed-swing QuadRail [13]. In Section 4, we will present the impact on delay and power spread of using delay to set the power supply voltages for QuadRail.



Figure 1. Mixed Swing QuadRail (3,2) counter.



Figure 2. 16\*16+36-bit MAC architecture.

## 3. SERIES-REGULATED QUADRAIL

Fig. 1 shows the Mixed Swing QuadRail gate topology for a (3,2) counter, consisting of a logic stage operating between the high-swing power rails (Vd1-Vs1 = $V_{logic}$) and a driver/buffer stage operating between the low-swing power rails (Vd2-Vs2 = $V_{buffer}$). $V_{logic}$ and $V_{buffer}$ are approximately centered to maximize high and low noise margins and to equalize rising and falling delays in either stage. In [13] a simple series-regulator circuit based on maintaining a fixed ratio of off- to average on-drive current ($I_{off}/I_{on}$) in the QuadRail circuit was used in order to balance static and dynamic power. This allowed local generation of the low-swing power rails (Vd2 and Vs2) from the regular, off-chip high-swing power rails (Vd1 and Vs1). This achieves the same goal of minimizing total power as [6] but without mandating any process modifications. The value used for $I_{off}/I_{on}$ was 150 for the Wallace Tree multiplier structures described in [13].

In this paper, we will also focus our attention on a 2's complement, fixed-point 16\*16+36-bit MAC. The MAC consists of an overlapped bit-pair Booth-recoded, (3,2) counter-based Wallace tree 16\*16-bit multiplier [7] and a 36-bit Block Carry Lookahead final accumulator [8], with a single pipeline stage between the multiplier and accumulator for enhanced throughput (Fig. 2). The Wallace tree multiplier is the most power-critical MAC component, consuming 75% of total power [13]. This is due to the substantial interconnect capacitances driven by the 28-transistor-based (3,2) counters [9] within the Wallace tree. In order to lower the multiplier power, three versions of the MAC were fabricated with the multiplier constructed in series-regulated QuadRail, off-chip regulated QuadRail, and conventional static CMOS to study the relative power-delay trade-offs. For measured performance see [13]. The series-regulated QuadRail MAC incurs an 18% area penalty due to having wells at two different potentials in the adder cells and due to the on-chip decoupling capacitors needed on the internally generated low-swing power supply rails [13].

Fig. 3 shows the multiplier power-delay comparisons for static CMOS vs. the QuadRail methodologies over a range of operating voltages (2.5-1.5V) and process corners. Power savings result from reductions in both the capacitive load power dissipation (point-to-point net capacitance extracted from the Wallace tree multiplier layout is 58fF and the activity factor is extremely high) and the short-circuit currents. Simulations indicated that 28% of the dynamic power within the multiplier was due to short-circuit power dissipation, despite the multipliers being optimally sized to maintain steep input rise/fall times. This is comparable to the short-circuit power reported for a similar multiplier in [10]. The reduced buffer stage swing offers a nearly cubic reduction in its short-circuit power.
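As a rough cross-check on the near-cubic behavior, the classic first-order estimate for a conventional, unloaded, symmetric CMOS inverter (a textbook model brought in here as an assumption, not the analysis used in this paper) gives the short-circuit power as

$$P_{sc} \approx \frac{\beta}{12}\,\left(V_{DD} - 2V_{T}\right)^{3}\,\frac{\tau}{T}$$

where $\beta$ is the transistor gain factor, $V_{DD}$ is the supply voltage (taken equal to the input swing), $V_{T}$ is the threshold voltage, $\tau$ is the input rise/fall time, and $T$ is the switching period. QuadRail stages do not meet these assumptions exactly, since their input swing and supply differ, so the expression only suggests why short-circuit power falls so steeply as the swing is reduced.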

To study the impact of series-regulated QuadRail on manufacturability, worst-case process and temperature corner analysis is performed across industrial Slow-NMOS-Slow-PMOS and Fast-NMOS-Fast-PMOS corners on the CMOS and Series-Regulated QuadRail multipliers in the 0.5µm process (Fig. 3). QuadRail demonstrates power\*delay dispersion similar to that of CMOS at high voltages. With voltage scaling, the dispersion remains well controlled, and at Vlogic=1.5V, Vbuffer=0.8V, the power\*delay dispersion is 1.8X lower than CMOS, demonstrating improved low-voltage parametric yield.

## 4. DELAY SERVOING

In this section we describe the new method proposed for decreasing the delay variations that occur at low operating voltages. Employing a dummy circuit to estimate the delay of the critical path in a logic network in order to control an **external** high-efficiency power supply has been proposed by many researchers; e.g., [1]. However, in this work we use a dummy circuit representative of the critical path delay in order to adjust the voltage generated by the **on-chip** series regulators, which changes the operation of our circuit from single-supply static CMOS (lower delay, higher power) to QuadRail (higher delay, lower power) based on the manufacturing and operating range constraints on the IC.

Because the difference between the main Vdd and the inner Vdd is not the same as the difference between Vss and the inner Vss, we define a factor K that determines the voltage applied to the inner logic rails as follows. The factor K is a potentiometer on the voltage regulator for the inner rails. It controls the degree of QuadRail versus static CMOS operation such that the circuit operates at the desired clock frequency, and hence acts as a delay control servo factor.

Figure 3. Static CMOS vs. series-regulated QuadRail power\*delay dispersion analysis in 0.5µm process.

Figure 4. Delay and Power as a function of the Servo-Factor K

Fig. 4 shows the power and delay of the 16\*16+36-bit MAC operating at 1.5V as the servo-factor K is varied for three process points. Fig. 4 indicates that for a 16\*16+36-bit MAC designed in a 0.5µm CMOS process the proposed approach decreases the delay spread from 3.9X to 2.3X and the power spread from 3.6X to 1.8X relative to static CMOS.
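To put the reported spreads in proportion, the short calculation below computes the factor by which each spread shrinks. It uses only the four numbers quoted above; the helper name `spread_reduction` is introduced here purely for illustration.

```python
def spread_reduction(spread_before, spread_after):
    """Factor by which a worst-case/best-case spread shrinks."""
    return spread_before / spread_after


# Spreads quoted in the text for the 16*16+36-bit MAC at 1.5 V.
delay_gain = spread_reduction(3.9, 2.3)  # static CMOS -> adaptive scheme
power_gain = spread_reduction(3.6, 1.8)

print(f"delay spread reduced by {delay_gain:.2f}x")
print(f"power spread reduced by {power_gain:.2f}x")
```

In other words, the delay spread shrinks by roughly 1.7 times and the power spread is halved.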

For this example the delay spread is minimized by operating the SNSP case as normal CMOS and the other two cases as QuadRail, as indicated by the circles on Fig. 4. Devices somewhere in between would operate with K values between 0.1 and 1. The low end is 0.1 because of series losses in the voltage regulator. Note that the maximum delays of CMOS and the proposed scheme are approximately the same, but the fast end for CMOS has a much smaller delay. To first order, this is not an advantage, since the specified clock rate is set by the worst-case delay. However, by slowing down the fast and typical cases using QuadRail, we save power for those cases, which can result in a less expensive package or longer battery life.
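The servo idea described above can be sketched as a simple search over K. This is a minimal illustration and not the authors' regulator circuit: it assumes that the dummy-path delay increases monotonically with K (that is, that larger K means deeper QuadRail operation), which the text does not state, and `toy_delay` is a made-up linear stand-in for the dummy delay path.

```python
K_MIN, K_MAX = 0.1, 1.0  # range of K quoted in the text


def servo_k(dummy_delay, target_delay, tol=1e-3):
    """Return the largest K in [K_MIN, K_MAX] whose dummy-path delay meets the target.

    Assumes dummy_delay(K) increases monotonically with K (hypothetical convention).
    """
    if dummy_delay(K_MIN) > target_delay:
        return K_MIN  # even the fastest setting misses timing; stay at the fastest
    if dummy_delay(K_MAX) <= target_delay:
        return K_MAX  # part is fast enough to run at the lowest-power setting
    lo, hi = K_MIN, K_MAX  # invariant: delay(lo) <= target < delay(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if dummy_delay(mid) <= target_delay:
            lo = mid
        else:
            hi = mid
    return lo


def toy_delay(k, speed):
    """Hypothetical dummy-path delay model: slower parts have a larger 'speed' scale."""
    return speed * (1.0 + k)


for name, speed in (("slow", 1.0), ("typical", 0.7), ("fast", 0.5)):
    k = servo_k(lambda kk: toy_delay(kk, speed), target_delay=1.15)
    print(f"{name:8s} corner: K = {k:.3f}, delay = {toy_delay(k, speed):.3f}")
```

With this toy model the slow part settles near the low end of K while the fast part is pushed to the high end, so all three parts land close to the same target delay, which mirrors the behavior described for Fig. 4.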

## 5. CONCLUSIONS

A novel way of controlling the on-chip series regulator used to generate the low-swing power supply rails for the QuadRail logic circuit methodology has been developed that combats manufacturing-induced delay and power variations.

## 6. ACKNOWLEDGEMENTS

This work was supported in part by DARPA (Order A564), the NSF (Grant MIP9408457), and the Semiconductor Research Corporation.

## 7. REFERENCES

- [1] L. Nielsen et al., "Low-Power Operation Using Self-Timed Circuits and Adaptive Scaling of Supply Voltage," Trans. VLSI Sys., vol. 2, no. 4, pp. 391-397, Dec. 1994.
- [2] K. Usami et al, "Automated Low-Power Technique Exploiting Multiple Supply Voltages Applied to a Media Processor", *CICC*, May 1997, pp. 131-134.
- [3] R.K. Krishnamurthy, I. Lys, and L.R. Carley, "Static Power-driven Voltage Scaling and Delay-driven Buffer Sizing in Mixed Swing QuadRail", Proc. Intl. Symposium on Low Power Electronics and Design, August 1996, pp. 381-386.
- [4] R.K. Krishnamurthy and L.R. Carley, "Exploring the Design Space of Mixed Swing QuadRail for Low Power Digital Circuits", IEEE Trans. on VLSI Systems, Vol. 5, December 1997, pp. 388-400.
- [5] A.J. Strojwas et al., "Manufacturability of Low Power CMOS Technology Solutions", Proc. Intl. Symposium on Low Power Electronics and Design, August 1996, pp. 225-232.
- [6] J.B. Burr and J. Shott, "A 200mV Self-Testing Encoder/ Decoder using Stanford Ultra Low Power CMOS", Digest of technical papers, IEEE Intl. Solid State Circuits Conference, February 1994, pp. 84-85.
- [7] J.F. Ardekani, "MxN Booth Encoded Multiplier Generator Using Optimized Wallace Trees", IEEE Trans. on VLSI Systems, Vol. 1, June 1993, pp. 120-125.
- [8] J.F. Cavanagh, Digital Computer Arithmetic: Design and Implementation, McGraw Hill, 1984.
- [9] R.K. Montoye et al, "An 18 ns 56-bit multiply-adder circuit", Digest of technical papers, IEEE Intl. Solid State Circuits Conference, February 1990, pp. 336-337.
- [10] M. Izumikawa et al., "A 0.25µm CMOS 0.9V 100MHz DSP Core", IEEE J. Solid-State Circuits, Vol. 32, Jan. 1997, pp. 52-61.
- [11] S. Shigematsu et al, "A 1-V High-speed MTCMOS Circuit Scheme for Power-down Applications", Digest of technical papers, Symposium on VLSI Circuits, June 1995, pp. 125-126.
- [12] T. Kobayashi and T.Sakurai, "Self-Adjusting Threshold-Voltage Scheme for Low-Voltage High-Speed Operation", Proc. IEEE Custom Integrated Circuits Conference, May 1994, pp. 271-274.
- [13] R.K. Krishnamurthy, H. Smidt, and L.R. Carley, "A Low-power 16-bit MAC using Series-Regulated Mixed-Swing Techniques," *CICC*, May 1998.