# High-Speed Dynamic Logic Styles for Scaled-Down CMOS and MTCMOS Technologies

Mohamed W. Allam

Mohab H. Anis

Mohamed I. Elmasry

VLSI Research Group, University of Waterloo, Waterloo, ON, Canada N2L 3G1

{mwaleed, manis, elmasry}@vlsi.uwaterloo.ca

#### **ABSTRACT**

A new high-speed Domino circuit, called HS-Domino, is developed. HS-Domino resolves the trade-off between performance and noise margins in conventional CD-Domino logic while dissipating low dynamic power with minimal area overhead. A dual-threshold (MTCMOS) implementation of HS-Domino and DDCVS logic is also devised. This implementation achieves low leakage during standby, while maintaining high performance and low dynamic power during the active mode.

## 1. INTRODUCTION

In a digital CMOS circuit, dynamic power dominates the total power dissipation. Reducing the supply voltage $V_{dd}$ is the most effective way to reduce dynamic power dissipation. Lowering $V_{dd}$ is also important in Deep SubMicron (DSM) technologies to avoid reliability problems [1]. However, reducing the supply voltage alone seriously degrades the circuit's performance. One way to maintain performance is to scale down both $V_{dd}$ and the threshold voltage $V_{th}$. However, reducing $V_{th}$ increases the subthreshold leakage current exponentially. Dynamic logic circuits such as Domino and Domino Differential Cascode Voltage Switch logic (DDCVS) have significantly worse tolerance to device subthreshold leakage than static CMOS [2]. This makes it risky to use low threshold voltage (LVT) devices in these circuits to improve the critical path delay [3]. A trade-off therefore exists between improving the gate's reliability and enhancing its speed. Noise margins (NM) are a good measure of reliability in dynamic logic styles.
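The two competing effects described above can be summarized with the standard first-order textbook expressions below. They are not taken from this paper, and the symbols $\alpha$ (switching activity), $C_L$ (switched capacitance), $f$ (clock frequency), $I_0$ (a process-dependent prefactor) and $S$ (subthreshold swing in mV/decade) are introduced here only for illustration:

$$P_{dyn} = \alpha \, C_L \, V_{dd}^{2} \, f, \qquad I_{sub} \approx I_0 \cdot 10^{-V_{th}/S}$$

The quadratic dependence on $V_{dd}$ is why supply scaling is so effective for dynamic power, while the exponential dependence on $V_{th}$ is why threshold scaling is costly in leakage.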

A new Domino logic style, High-Speed Domino (HS-Domino), is therefore devised to resolve this speed-NM trade-off in Domino circuits. It extends Domino's operation into the DSM regime with no degradation of the gate's NM. A new multi-threshold (MTCMOS) scheme for dynamic logic styles is then presented. This scheme is applied to both HS-Domino and DDCVS logic. The MTCMOS implementation substantially reduces the subthreshold leakage current during the standby mode, while attaining high performance and low dynamic power during the active mode.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

ISLPED '00, Rapallo, Italy.

Copyright 2000 ACM 1-58113-190-9/00/0007...\$5.00.

An overview of the operation of CD-Domino logic is first presented and the speed-NM problem is explained in detail.

## 2. CD-DOMINO LOGIC

Figure 1 shows an 8-input Clock-Delayed Domino (CD-Domino)<sup>1</sup> OR gate.

Figure 1: An 8-input Clock-Delayed Domino OR gate

CD-Domino operates as follows: during the precharge phase (CLK is LOW), the Domino node is charged to "1", and the keeper transistor  $Q_2$  turns ON to maintain the voltage of the Domino node. When the CLK goes HIGH (evaluation mode), depending on the inputs, the Domino node is either discharged to GND or remains HIGH.

At the beginning of evaluation, the keeper is ON, charging the Domino node to "1". At the same time, the pull-down devices are trying to discharge the Domino node. This is called contention, where one device is trying to charge a node while another device is trying to discharge it. Contention slows down evaluation and increases dynamic power dissipation because of the large current flowing from $V_{dd}$ to GND during evaluation. Therefore, it is preferable to size down the keeper to reduce the contention current flowing through $Q_2$ and thus enhance the evaluation speed. A small keeper, however, would not be able to compensate for leakage currents or charge sharing when all inputs are "0". This ultimately degrades the gate's NM. This is the basic speed-NM trade-off in conventional CD-Domino, which becomes more severe at lower $V_{th}$ values. Figure 2(a) shows how the Domino node starts switching while the keeper is still ON (the contention problem).

<sup>1</sup>CD-Domino is widely known in the industry as D2 Domino, while conventional Domino is known as D1 Domino.

Figure 2: Behavior of CD-Domino
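The contention described above can be illustrated with a small logic-level sketch of one precharge/evaluate cycle. This is only a behavioral abstraction under stated assumptions: the names `EvalResult` and `cd_domino_or` are hypothetical, analog effects (keeper strength, leakage, charge sharing) are not modeled, and the pull-down network is simply assumed to win against the keeper.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    domino_node: int   # final logic value of the dynamic node
    output: int        # gate O/P (inverse of the dynamic node)
    contention: bool   # True if keeper and pull-down fought at start of evaluation


def cd_domino_or(inputs):
    """Logic-level model of one precharge/evaluate cycle of a CD-Domino OR gate."""
    # Precharge (CLK LOW): node is charged to 1, O/P is 0, so the PMOS keeper is ON.
    domino_node = 1
    keeper_on = True

    # Evaluation (CLK HIGH): any HIGH input opens a pull-down path to GND.
    pull_down_on = any(inputs)

    # Contention: the keeper still sources current while the pull-down sinks it.
    contention = keeper_on and pull_down_on

    if pull_down_on:
        domino_node = 0  # assumption: the pull-down overpowers the keeper
    return EvalResult(domino_node, 1 - domino_node, contention)
```

In this abstraction every evaluation that discharges the node also reports contention, which is the cost that keeper down-sizing tries to limit.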

Figure 2(b) shows the NM of an 8-input conventional Domino OR gate for different  $V_{th}$  values while keeping the ratio  $W_{keeper}/W_n$  constant and equal to 1/10, where  $W_{keeper}$  and  $W_n$  are the widths of the PMOS keeper and NMOS pull-down devices respectively. The 8-input OR gate has been used, because Domino logic is usually used for wide fan-in OR gates. OR gates experience the worst case leakage current when all the inputs are "0"'s.

Figure 2(b) shows that the NM drops by 1mV for every 1mV decrease in  $V_{th}$ . The NM is defined as the input voltage change that causes a 10% drop of  $V_{dd}$  at the Domino node [2]. The NM is set to 10% of  $V_{dd}$  in this work. Figure 2(c) illustrates the normalized delay of a 3-stage chain of 8-input CD-Domino OR gates with a fan-out of 3 versus  $V_{th}$  for two cases: constant  $W_{keeper}/W_n$  ratio (i.e. ignoring the NM), and controlled NM (at least 10% of  $V_{dd}$  by increasing the  $W_{keeper}/W_n$  ratio).

The constant $W_{keeper}/W_n$ curve shows that the performance of CD-Domino increases as $V_{th}$ decreases. However, this curve neglects NM, which leads to an impractical design. On the other hand, to control the NM, the keeper has to be sized up, leading to more contention current. This degrades the gate's speed, particularly at low $V_{th}$ values, and ultimately increases the delay (Controlled NM curve).

Therefore, Domino circuits are not suitable for DSM technologies because of their high leakage currents and degraded performance. HS-Domino resolves this speed-NM trade-off.

## 3. HS-DOMINO LOGIC

The architecture of an 8-input HS-Domino OR gate is shown in Figure 3. It is similar to CD-Domino except that the gate O/P is connected to the keeper through an NMOS ($N_1$) and a PMOS ($P_1$) device, whose gates are connected to the delayed clock signal.



Figure 3: An 8-input HS-Domino OR gate

HS-Domino operates as follows: when the clock is LOW during precharge, the Domino node is precharged to  $V_{dd}$ . Transistor  $N_1$  is OFF,  $P_1$  is ON charging the gate of the keeper transistor  $Q_2$  to  $V_{dd}$ , thus turning  $Q_2$  OFF.  $Q_2$  is therefore OFF at the beginning of evaluation phase. Contention is thus eliminated between the keeper and the pull-down devices during evaluation. Therefore, the Domino gate evaluates faster and no contention current exists.

When the delayed clock becomes "1", if the Domino node evaluates to "0", the gate O/P is "1" and $N_1$ is ON, thus keeping $Q_2$ OFF. On the other hand, if all the pull-down devices are OFF, the Domino node stays "1", causing the gate O/P to be "0", which in turn discharges the keeper's gate through $N_1$. Therefore, the keeper turns ON to maintain the voltage of the Domino node at $V_{dd}$ and to compensate for any leakage currents. HS-Domino thus solves the contention problem by turning the keeper OFF at the start of the evaluation cycle. The keeper width can now be sized up as $V_{th}$ scales down to maintain a controlled NM, without increasing contention or degrading speed.
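The keeper control described above reduces to a small truth table, sketched here at the logic level. The function names `hs_keeper_gate` and `hs_keeper_on` are hypothetical, and the model ignores the threshold drop across $N_1$ and all timing.

```python
def hs_keeper_gate(delayed_clk, gate_output):
    """Logic level on the gate of keeper Q2 in an HS-Domino gate."""
    if delayed_clk == 0:
        return 1            # P1 ON, N1 OFF: keeper gate is tied to Vdd
    return gate_output      # N1 ON, P1 OFF: keeper gate follows the gate O/P


def hs_keeper_on(delayed_clk, gate_output):
    """Q2 is a PMOS device, so it conducts only when its gate is LOW."""
    return hs_keeper_gate(delayed_clk, gate_output) == 0
```

The keeper conducts only in the single case where the delayed clock is HIGH and the gate O/P is "0", that is, after the Domino node has been left HIGH by the pull-down network.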







- (a) Waveforms of HS-Domino gate  $(V_{dd}=2.5\,\mathrm{V})$
- (b) Normalized delay vs  $V_{th}$
- (c) Normalized dynamic power vs  $V_{th}$

Figure 4: Behavior of HS-Domino

The contention-free operation is clearly demonstrated in Figure 4(a) where during the precharge phase (CLK goes LOW), the keeper's input (gate) remains HIGH.

#### 3.1 Speed and Power Comparison

To compare the performance of HS-Domino with that of CD-Domino, a 3-stage chain of 8-input OR gates with a fan-out of 3 was simulated in 0.25 $\mu m$ CMOS technology. The normalized delay versus $V_{th}$ is shown in Figure 4(b).

The delay curves of CD-Domino with controlled NM and with constant $W_{keeper}/W_n$ ratio in Figure 2(c) are re-plotted to illustrate the speed advantage of HS-Domino. Figure 4(b) shows how the delay of the HS-Domino circuit continues to decrease as $V_{th}$ is scaled down, without compromising the NM. A slight speed difference starts to develop between HS-Domino and CD-Domino at constant $W_{keeper}/W_n$ ratio as $V_{th}$ decreases. This is because the loading at the Domino node increases as the keeper is sized up to keep the NM intact. HS-Domino has a 30% speed advantage over CD-Domino at very low $V_{th}$ values, while controlling NM.

Figure 4(c) compares HS-Domino with CD-Domino in terms of dynamic power at  $500 \mathrm{MHz}$  using the same OR gate chain.

Although HS-Domino with controlled NM introduces slightly higher clock loading, it reduces power dissipation by 15% to 24% compared to CD-Domino at controlled NM. This is attributed to the elimination of contention, meaning that there are no short-circuit currents flowing from $V_{dd}$ to GND. $N_1$ and $P_1$ are minimum-size devices, which have a minor effect on the total power dissipation. At very low $V_{th}$ values, the power of HS-Domino increases because the keeper is sized up to maintain the NM, which increases the loading at the Domino node. Nevertheless, HS-Domino consumes the least dynamic power.

Leakage power of HS-Domino is equal to that of CD-Domino (constant $W_{keeper}/W_n$ and controlled NM). Leakage current increases by an order of magnitude for every 85 mV reduction in $V_{th}$.
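The order-of-magnitude-per-85 mV figure quoted above translates directly into a scaling factor. The sketch below is a simple arithmetic illustration; the function name `leakage_multiplier` and its parameter names are assumptions, and the 85 mV/decade slope is treated as constant over the whole range.

```python
def leakage_multiplier(vth_reduction_mv, mv_per_decade=85.0):
    """Factor by which subthreshold leakage grows for a given Vth reduction.

    Assumes one decade of leakage per `mv_per_decade` millivolts of Vth.
    """
    return 10.0 ** (vth_reduction_mv / mv_per_decade)
```

For example, under this assumption a 170 mV reduction in $V_{th}$ corresponds to two decades, or roughly a 100x increase in leakage current.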

Therefore, HS-Domino resolves the speed-NM trade-off, but leakage power remains large. A multi-$V_{th}$ (MTCMOS) implementation of HS-Domino (MHS-Domino) is therefore devised, which achieves low leakage power while maintaining the high performance and low power of single-$V_{th}$ HS-Domino circuits.

Recently, MTCMOS technology has been proposed to reduce leakage current in DSM processes. MTCMOS uses low  $V_{th}$  (LVT) devices to reduce delay during normal operation and uses high  $V_{th}$  (HVT) devices to reduce leakage current during standby [4].

The next two sections explain the MTCMOS implementation of HS-Domino and its superiority over the conventional MTCMOS implementation of CD-Domino logic.

## 4. MTCMOS CD-DOMINO LOGIC

In a recent report, an MTCMOS CD-Domino implementation has been devised [5]. This implementation focuses on reducing the leakage power while achieving high performance. However, no attention has been paid to noise margins, which is impractical in dynamic logic. Figure 5 illustrates the schematic of the MTCMOS CD-Domino gate, where the LVT devices are highlighted.



Figure 5: An 8-input MTCMOS CD-Domino OR gate

During evaluation, the gate might have a "0" to "1" transition at the gate O/P node. To speed up evaluation, all the devices involved in this transition should be LVT. These devices are the pull-down transistors, transistor $Q_4$, the NMOS transistor of $I_2$, and the PMOS transistor of $I_3$. During standby, the gate operates in evaluation mode and all the inputs are forced HIGH. The HVT transistors $Q_1$, $Q_2$, $Q_5$, the PMOS of $I_2$, and the NMOS of $I_3$ are therefore OFF, reducing the leakage current. This scheme thus reduces the evaluation delay by using LVT devices, while HVT devices are used to reduce leakage power.

To keep the gate O/P HIGH during the standby mode, all inputs to the gate must be forced HIGH. This ensures that the entire pipeline of gates is evaluated and remains in a low-leakage state. However, forcing all the inputs HIGH requires gating circuitry at each input, which increases the dynamic and leakage power, delay, and area of the total circuit. This implementation also does not take noise margin into consideration, and it suffers from the contention explained in Section 2. To resolve the speed-NM trade-off and thus eliminate the contention, while achieving ultra-low leakage with minimal area overhead, the MTCMOS HS-Domino (MHS-Domino) circuit is devised.



Figure 6: An 8-input MHS-Domino OR Gate

## 5. MHS-DOMINO LOGIC

The MHS-Domino gate shown in Figure 6 is similar to the HS-Domino gate, except that the keeper's source is connected to a $\overline{SLEEP}$ signal instead of $V_{dd}$. The MHS-Domino circuit employs a Sleep Signal Generator (SSG), which is realized by an inverter as shown in the schematic. The SLEEP signal is "0" for active operation and "1" for standby.

During active mode ($\overline{SLEEP}$ is "1"), the keeper's source is connected to $V_{dd}$ through the SSG block, so the MHS-Domino gate operates in the same way as HS-Domino. In this scheme, all the transistors involved in the evaluation are chosen to be LVT devices to reduce delay. These devices are the NMOS logic block, $Q_2$, $Q_4$, $P_1$, $N_1$, the NMOS transistor of $I_2$, and the PMOS transistor of $I_3$. LVT devices are highlighted in Figure 6.

In standby mode, $\overline{SLEEP}$ becomes "0" and the clock is HIGH. The Domino node has two possible values, "1" or "0". If the node is "0", the gate output is "1" and therefore the input to the following gate is "1". When the Domino node is "1", the output is "0" and $N_1$ is ON, which turns ON the keeper. Since $\overline{SLEEP}$ is "0", the Domino node is discharged through the keeper, causing the gate output to switch to "1". This transition is fast because $N_1$ and the keeper are LVT. Therefore, at the beginning of standby operation, all MHS-Domino gates change their output to "1" regardless of the input values. Input gating circuitry is thus not needed, which reduces hardware and power. During standby mode, all HVT transistors ($Q_1$, $Q_5$, the PMOS of $I_2$, the NMOS of $I_3$, and the PMOS of the SSG) turn OFF to reduce the leakage current.
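The standby behavior described above can be checked with a small logic-level sketch. The function name `mhs_standby_output` is hypothetical, and the model assumes ideal logic levels (it ignores the fact that a PMOS keeper cannot pull a node fully to GND).

```python
def mhs_standby_output(domino_node):
    """Gate O/P of an MHS-Domino gate after it enters standby.

    Standby conditions: SLEEP-bar = 0, CLK and delayed CLK are HIGH.
    """
    sleep_bar = 0
    output = 1 - domino_node          # static inverter on the Domino node
    keeper_gate = output              # N1 is ON, so the keeper gate follows the O/P
    keeper_on = keeper_gate == 0      # PMOS keeper conducts with a LOW gate
    if keeper_on:
        domino_node = sleep_bar       # keeper ties the node to the SLEEP-bar rail
        output = 1 - domino_node
    return output
```

Both possible initial values of the Domino node lead to an output of "1", which is why no input gating is required in this scheme.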

#### 5.1 Speed and Power Comparison

The normalized delay of a 3-stage chain of 8-input Domino OR gates with a fan-out of 3 for the MHS-Domino circuit versus  $V_{th}$  is shown in Figure 7(a). The delay curves of the MTCMOS CD-Domino [5] with controlled NM and constant  $W_{keeper}/W_n$  ratio are also plotted on the same graph to illustrate the speed advantage of MHS-Domino. Figure 7(a) shows that the delay of the MHS-Domino circuit at controlled NM, continues to decrease as  $V_{th}$  is scaled down because there is no contention current. HVT is taken to be 550mV, while LVT is varied and is denoted by  $V_{th}$  in the three graphs of Figure 7.

Figure 7(b) compares the MHS-Domino circuit with the MTCMOS CD-Domino circuit in terms of dynamic power at a 500 MHz clock. Although the MHS-Domino circuit introduces slightly higher loading and an SSG, it has significantly lower power dissipation than the conventional MTCMOS version. This is attributed to the following:

1. Elimination of the contention in the MHS-Domino gate, which means that there are no short-circuit currents during switching.
2. $N_1$ and $P_1$ are minimum-sized devices, contributing a very small loading effect.
3. The sleep signal generator hardly consumes any dynamic power because its output is always "1" during the active mode and "0" during standby. Thus, no switching occurs except during the transition from one mode to the other. These transitions are infrequent, and their power dissipation penalty becomes less significant if the system stays in the idle state most of the time (95%).
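To give a feel for why the SSG contributes so little dynamic power, the sketch below compares how often the SSG output toggles with how often a node that switches every clock cycle toggles. The function name `ssg_activity_ratio` and the mode-transition rate are purely hypothetical; only the 500 MHz clock comes from the text.

```python
def ssg_activity_ratio(mode_transitions_per_s, clk_hz=500e6):
    """Ratio of SSG output toggles to toggles of a node that switches every cycle.

    `mode_transitions_per_s` is an assumed rate of active/standby changes.
    """
    if mode_transitions_per_s < 0 or clk_hz <= 0:
        raise ValueError("rates must be non-negative and clock must be positive")
    return mode_transitions_per_s / clk_hz
```

With an assumed 500 mode changes per second, the SSG output would toggle about a million times less often than a clocked node, so its switching power is correspondingly small for comparable capacitance.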

A comparison between the normalized leakage power of the MTCMOS CD-Domino and MHS-Domino is also shown in Figure 7(c). MHS-Domino consumes slightly higher leakage power than the MTCMOS CD-Domino (controlled NM). For low $V_{th}$ values, the MHS-Domino consumes less leakage power than the conventional MTCMOS version at controlled NM. The difference in leakage power between the two cases is negligible since the leakage current is on the order of pico-amperes. Therefore, MHS-Domino eliminates the contention, enhances the performance, reduces dynamic power dissipation, and exhibits very low leakage current during standby with minimal area overhead.

The scheme used to convert HS-Domino to an MTCMOS logic style is generic and may be applied to other dynamic logic styles. Although the speed-NM problem is already solved in Domino Differential Cascode Voltage Switch logic (DDCVSL), the new MTCMOS scheme is extended to DDCVSL (MDDCVSL) to reduce leakage power.



Figure 7: Behavior of MHS-Domino (LVT of the MHS-Domino circuit is denoted by  $V_{th}$  in the three Figures)

## 6. MDDCVS ARCHITECTURE AND OPERATION

MDDCVS has a similar implementation to the conventional double-branched DDCVS [6], except for a sleep signal generator (SSG), which connects the sources of the PMOS keeper devices  $(Q_1 \text{ and } Q_2)$  to  $V_{dd}$  during the active mode (conventional DDCVSL), or to GND during the sleep mode. Figure 8 shows a typical clock-delayed two input XOR MDDCVS stage.



Figure 8: A two input MDDCVS XOR logic gate

During normal operation, the $\overline{SLEEP}$ signal is HIGH and the MDDCVS circuit operates exactly like a conventional DDCVS circuit. Because precharge time is not critical, transistors involved in the precharge process should be HVT to reduce leakage during standby. These transistors are $Q_3$, the NMOS of inverters $I_1$ and $I_2$, the PMOS of $I_3$, the NMOS of $I_4$, and the PMOS of the SSG. On the other hand, when CLK is HIGH (when the logic is evaluated), transistors involved in the evaluation should be LVT to speed up the logic gate. The transistors responsible for evaluation are the NMOS block transistors, the NMOS of $I_3$, the PMOS of $I_4$, and the PMOS transistors of $I_1$ and $I_2$. LVT devices are highlighted in Figure 8.

During the standby mode, both the clock and SLEEP signal are HIGH, turning OFF the HVT devices. Whether standby occurs right after precharge or evaluation, this is not an issue. In both cases, whatever the input values to the gate, a state will always be reached where one branch is ON (a path to GND exists) and the other is OFF. Assuming that the value of the inputs to the N-block cause node  $N_1$  to be LOW and thus  $N_2$  to be HIGH,  $Q_2$  is therefore turned ON, allowing node  $N_2$  to start discharging through  $Q_2$ , until it eventually reaches "0". If the inputs were to cause  $N_2$  to be LOW and  $N_1$  to be HIGH,  $Q_1$  will turn ON allowing node  $N_1$  to start discharging through  $Q_1$ , until it eventually reaches "0".
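The cross-coupled discharge described above can be sketched at the logic level as follows. The function name `mddcvs_standby` is hypothetical, the assignment of outputs F and $\overline{F}$ to the inverted nodes is an assumption made for illustration, and the model uses ideal logic levels with no timing.

```python
def mddcvs_standby(n1, n2):
    """Return (n1, n2, f, f_bar) after the keepers settle in standby.

    Expects the state after evaluation, where at least one internal node is LOW.
    """
    if n1 == 1 and n2 == 1:
        raise ValueError("one branch must already have evaluated LOW")
    sleep_bar = 0            # keeper sources are tied to GND in standby
    if n1 == 0:
        n2 = sleep_bar       # Q2 (gate driven by N1) conducts, N2 discharges
    if n2 == 0:
        n1 = sleep_bar       # Q1 (gate driven by N2) conducts, N1 discharges
    # Assumption: each output is the inverse of one internal node.
    return n1, n2, 1 - n1, 1 - n2
```

Either complementary starting state ends with both internal nodes at "0" and both outputs HIGH, independent of which branch evaluated.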

Therefore, regardless of the inputs to the gate, both $N_1$ and $N_2$ will be "0" during the standby mode. The time taken for both $N_1$ and $N_2$ to reach "0" was calculated to be $\approx$ 200 psec. Both nodes F and $\overline{F}$ will thus go HIGH, which causes the input NMOS devices in the successive stage to turn ON completely and pull down the two internal nodes ($N_1$ and $N_2$ of the next stage) to GND very quickly. In the second stage, the discharge takes $\approx$ 60 psec, and the same holds for any other cascaded gates in the pipeline. Therefore, the first stage usually takes the longest time to reach the standby state, while the subsequent gates in the pipeline take shorter times. This is not critical, especially in mobile systems, where over 95% of the time is spent idle. The 200 psec is negligible compared to the minutes for which a mobile system could be idle.
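As a rough illustration of these numbers, the sketch below estimates the time for a pipeline to settle into standby. It assumes, pessimistically, that the stages settle one after another, and that every stage after the first takes the 60 psec quoted for the second stage. The function names and the sequential-settling assumption are not from the paper.

```python
def standby_entry_time_ps(stages, first_stage_ps=200.0, later_stage_ps=60.0):
    """Estimated time for an N-stage pipeline to reach standby, in picoseconds."""
    if stages < 1:
        raise ValueError("pipeline needs at least one stage")
    return first_stage_ps + (stages - 1) * later_stage_ps


def fraction_of_idle(stages, idle_seconds):
    """Standby entry time as a fraction of a given idle period."""
    if idle_seconds <= 0:
        raise ValueError("idle period must be positive")
    return standby_entry_time_ps(stages) * 1e-12 / idle_seconds
```

Under these assumptions, a 3-stage pipeline would settle in about 320 psec, which is a vanishingly small fraction of even a one-second idle period.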

An important advantage of the MDDCVS is that it does not require specific input values to the gate at the standby mode. This eliminates any increase in area, power, or delay as a result of gating the inputs to the DDCVS gate.



Figure 9: Behavior of MDDCVSL (LVT of the MDDCVS circuit is denoted by  $V_{th}$  in the three Figures)

#### 6.1 Speed and Power Comparison

To verify the functionality and benefit of the MDDCVS, simulations were performed on a pipeline of 3 MDDCVS XOR gates with a fan-out of 3 operating at 500MHz and using  $0.25 \mu m$  CMOS technology at 2.5 V supply voltage. XOR gates were used as a test vehicle because DCVS logic is normally used to implement XOR and MUX circuits due to its differential nature. Figure 9(a) shows the normalized delay of the 3-stage chain of DDCVS gates versus  $V_{th}$  for 3 cases: single  $V_{th}$  DDCVS at constant  $W_{keeper}$  (ignoring NM), single  $V_{th}$  DDCVS at controlled NM, and MDDCVS at controlled NM.  $W_{keeper}$  is the size of the keeper transistors  $Q_1$  and  $Q_2$ . HVT is also taken to be 550mV, while LVT is varied and is denoted by  $V_{th}$  in the three graphs of Figure 9.

Similar to the Domino logic case, the NM was defined as the input voltage above ground that causes a 10% drop from  $V_{dd}$  at the DDCVS internal nodes  $(N_1 \text{ or } N_2)$ . The NM was set to 10% of  $V_{dd}$ . Figure 9(a) shows that MDDCVS at controlled NM has similar performance to the single  $V_{th}$  at controlled NM case. The constant keeper curve shows the maximum possible gain in speed by lowering  $V_{th}$ , which wrongfully ignores the NM. Again, a slight difference in speed starts to develop as  $V_{th}$  decreases between MDDCVS and single  $V_{th}$  DDCVS with constant  $W_{keeper}$ . This is due to the increased loading at the DDCVS internal nodes  $(N_1 \text{ and } N_2)$  as the keeper transistors are sized up to keep the NM controlled.

Figure 9(b) shows that the MDDCVS consumes approximately the same dynamic power as the single-$V_{th}$ DDCVS at controlled NM. This is because the SSG hardly consumes any dynamic power during the active mode, as no switching occurs. The leakage power of the three cases is shown in Figure 9(c). MDDCVS has much lower leakage than the single-$V_{th}$ designs. This advantage follows from using HVT devices to reduce leakage currents while LVT devices in the evaluation path enhance the speed.

## 7. CONCLUSION

A modified Domino circuit, called HS-Domino, is developed. HS-Domino resolves the trade-off between performance and noise margins in conventional Domino logic. This circuit can now benefit from the scaling down of the technology and supply voltages, since it can tolerate the lower threshold voltages. The speed of the new Domino logic continues to improve as the threshold voltages are scaled down, while controlling the noise margin. HS-Domino has a speed advantage of 30% at low $V_{th}$ values. The new circuit also dissipates up to 24% less dynamic power.

An MTCMOS implementation of the new Domino logic style is also devised. This dual-threshold implementation achieves low leakage values during standby, when the HVT devices are turned OFF. High speed and low dynamic power are also maintained during the active mode when the LVT devices turn ON during evaluation. A dual-threshold implementation of the DDCVS logic gate is also presented. It achieves substantially low leakage values during standby, while attaining high performance and low dynamic power during the active mode.

## 8. REFERENCES

- [1] H. Iwai, "CMOS Technology-Year 2010 and Beyond", *IEEE JSSC*, pp. 357–366, 1999.
- [2] S. Thompson et al., "Dual Threshold Voltage and Substrate Bias: Keys to High Performance, Low Power, 0.1 $\mu$m Logic Designs", *IEEE Symp. on VLSI Tech.*, pp. 69–70, 1997.
- [3] Z. Chen et al., "0.18 $\mu$m Dual $V_t$ MOSFET Process and Energy-Delay Measurement", *IEDM Tech. Digest*, pp. 851–853, 1996.
- [4] S. Mutoh et al., "1-V Power Supply High-Speed Digital Circuit Technology with Multi-Threshold Voltage CMOS", *IEEE JSSC*, pp. 847–853, 1995.
- [5] J. Kao, "Dual Threshold Voltage Domino Logic", *Proc. of IEEE 25th ESSCIRC*, pp. 118–121, 1999.
- [6] P. Ng et al., "Performance of CMOS Differential Circuits", *IEEE JSSC*, pp. 841–846, June 1996.