# A Unified Approach in the Analysis of Latches and Flip-Flops for Low-Power Systems

Vladimir Stojanovic
University of Belgrade, Yugoslavia
Bulevar Revolucije 73, 11000 Beograd, Yugoslavia
+381 11 310 3306
sv01793d@kiklop.etf.bg.ac.yu

Vojin G. Oklobdzija
Integration, Berkeley, CA
1285 Grizzly Peak Blvd., Berkeley, CA, 94708
(510) 486-8171
vojin@nuc.berkeley.edu

Raminder Bajwa
Semiconductor Research Laboratories, Hitachi America Ltd
San Jose, CA
(408) 922-4112
rbajwa@hmsi.com

#### Abstract

In this paper we propose a set of rules for consistent estimation of the real performance and power features of the latch and flip-flop structures. A new simulation and optimization approach is presented, targeting both high-performance and power budget issues. The analysis approach reveals the sources of performance and power consumption bottlenecks in different design styles. Certain misleading parameters have been properly modified and weighted to reflect the real properties of the compared structures. Furthermore, the results of the comparison of representative latches and flip-flops illustrate the advantages of our approach and the suitability of different design styles for low-power and high-performance applications.

## Keywords

Master-Slave latch, flip-flop, power measurement, timing, optimization

## 1. INTRODUCTION

Interpretation of published results comparing various latches and flip-flops has been very difficult because of different simulation methods used for generation and presentation of results. Certain approaches, [1], [2], etc., did not illustrate real performance and power features of the presented structures. The main reason for that was the improper consideration and weighting of relevant parameters.

In this paper we establish a set of rules in order to make comparisons fair and realistic: first, a definition of the relevant set of parameters to be measured and rules for weighting their importance; and second, a set of relevant simulation conditions that emphasize the parameters of interest. The primary goal of the simulation and optimization procedures was the best compromise between power consumption and performance, given that the limitation in performance is usually imposed by the available power budget.

## 2. ANALYSIS

## 2.1 Power Considerations

The data activity rate, $\alpha$, is the average number of output transitions per clock cycle. We applied four different data sequences:

- ...010101010..., $\alpha=1$, reflects maximum internal dynamic power consumption; however, depending on the structure, the sequence ...111111... can in some cases dissipate more power.
- A pseudo-random sequence with equal probability of all transitions (data activity rate $\alpha=0.5$) is considered to reflect the average internal power consumption given a uniform data distribution.
- ...111111..., $\alpha=0$, reflects the power dissipation of precharged nodes.
- ...000000..., $\alpha=0$, reflects leakage power consumption and power spent on internal clock processing.

Dynamic power consumption can be estimated by

$$P_d = f\,C_{eff}\,V_{dd}^2, \qquad C_{eff} = \sum_{i=1}^{N} \alpha_i k_i C_i$$

where

- $\alpha_i$ is the switching probability of node $i$ (with regard to the clock cycle)
- $k_i$ is the swing range coefficient of node $i$ ($k_i = 1$ for rail-to-rail swing)
- $C_i$ is the total capacitance of node $i$
- $f$ is the clock frequency
- $V_{dd}$ is the rail-to-rail voltage range (supply voltage)

Figure 1 describes differences in switching activity, and therefore power consumption, for different design styles. Capacitances $C_{total}$, $C_{precharge}$ and $C_{out}$ are calculated taking into account the $C_i$ and $k_i$ coefficients of each node in the circuit.
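
The dynamic power estimate above can be sketched numerically as follows. This is a minimal illustration, not the measurement flow used in the paper: the `Node` type, the function names and the three node values are hypothetical, chosen only to exercise the formula at the 100 MHz clock and 2 V supply used later in the simulations.

```python
from dataclasses import dataclass


@dataclass
class Node:
    alpha: float  # switching probability per clock cycle
    k: float      # swing range coefficient (1.0 = rail-to-rail)
    cap: float    # total node capacitance in farads


def effective_capacitance(nodes):
    """C_eff = sum over nodes of alpha_i * k_i * C_i."""
    return sum(n.alpha * n.k * n.cap for n in nodes)


def dynamic_power(nodes, f, vdd):
    """P_d = f * C_eff * Vdd^2, in watts."""
    return f * effective_capacitance(nodes) * vdd ** 2


# Hypothetical node values, for illustration only.
nodes = [
    Node(alpha=1.0, k=1.0, cap=20e-15),   # a node that switches every cycle
    Node(alpha=0.5, k=1.0, cap=200e-15),  # a loaded output at average activity
    Node(alpha=0.5, k=0.5, cap=10e-15),   # a reduced-swing internal node
]

c_eff = effective_capacitance(nodes)
p_d = dynamic_power(nodes, f=100e6, vdd=2.0)
print(f"C_eff = {c_eff * 1e15:.1f} fF, P_d = {p_d * 1e6:.1f} uW")
```

With these assumed values the effective capacitance is 122.5 fF and the estimated dynamic power is 49 µW; the reduced-swing node contributes only a quarter of its capacitance because both $\alpha_i$ and $k_i$ are 0.5.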

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ISLPED98, Monterey, CA, USA © 1998 ACM 1-58113-059-7/98/0008..\$5.00

Semi-dynamic structures are generally composed of a dynamic (precharged) front end and a static output part. We therefore designated two major effective capacitances, $C_{precharge}$ and $C_{out}$, each representing the corresponding part of the circuit. Figure 1 shows that these two capacitances have different charging and discharging activities.

Precharge, Differential:

- $C_{prechQ}$: precharge nodes on one side of the differential tree
- $C_{outQ}$: single-output nodes

$$C_{eff} = C_{prechQ}\bigl(p(0 \rightarrow 1) + p(1 \rightarrow 0) + p(0 \rightarrow 0) + p(1 \rightarrow 1)\bigr) + 2C_{outQ}\bigl(p(0 \rightarrow 1) + p(1 \rightarrow 0)\bigr)$$

Figure 1. Sources of internal, dynamic power consumption

The total effective precharge capacitance of semi-dynamic, differential structures comprises two effective capacitances of the same size, $C_{prechargeQ}$ and $C_{prechargeQb}$, which represent the two complementary halves of the precharged differential tree.

We used the .MEASURE average power statement in HSPICE to measure the power dissipation of interest. Results were compared with the earlier power measurement method presented in [3] and showed the same level of accuracy.
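
As a worked simplification of the precharged, differential expression given with Figure 1, assume that the four transition probabilities describe a single data bit over one clock cycle, so that they sum to one, and that the two switching transitions together equal the data activity rate $\alpha$ of Section 2.1. Both assumptions are ours and are not stated explicitly in the figure.

$$p(0 \rightarrow 1) + p(1 \rightarrow 0) + p(0 \rightarrow 0) + p(1 \rightarrow 1) = 1, \qquad p(0 \rightarrow 1) + p(1 \rightarrow 0) = \alpha$$

$$C_{eff} = C_{prechQ} + 2\,\alpha\,C_{outQ}$$

Under these assumptions the precharge term does not depend on the data pattern, and only the output term scales with activity. For the pseudo-random sequence, $\alpha = 0.5$ gives $C_{eff} = C_{prechQ} + C_{outQ}$.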

There are three main sources of power dissipation in the latch:

- Internal power dissipation of the latch, including the power dissipated for switching the output loads
- Local clock power dissipation: the portion of power dissipated in the local clock buffer driving the clock input of the latch
- Local data power dissipation: the portion of power dissipated in the logic stage driving the data input of the latch

The parameter *Total power* refers to the sum of all three measured kinds of power.

## 2.2 Timing

The Stable region, Figure 2, is the region of the Data-Clk axis (Data-Clk being the time difference between the last transition of Data and the latching Clock edge) in which the Clk-Q delay does not depend on the Data-Clk time. As Data-Clk decreases, at a certain point the Clk-Q delay starts to rise monotonically and ends in failure. This region of the Data-Clk axis is the Metastable region. The Metastable region is defined as the region of unstable Clk-Q delay, where the Clk-Q delay rises exponentially, as indicated by Shoji in [7]. Changes in Data that happen in the Failure region of D-Clk are not transferred to the outputs of the circuit. The question arises: how much can we let the Clk-Q delay degrade in the Metastable region and still gain performance (due to the minimum in D-Q) while ensuring reliability?

Figure 2. StrongArm110 flip-flop, Stable, Metastable and Failure regions

$D_{cq}$, [6], is the value of the *Clk-Q* delay, Figure 2, in the *Stable region*, and $U$, [6], is the minimum point on the D-Clk axis that is still part of the Stable region. In the *Metastable region* the *D-Q* curve has its minimum as we move the last transition of data towards the latching edge of the clock. It is clear that beyond that *minimum D-Q* point it is no longer worthwhile to move the Data closer to the rising edge of the clock.
We refer to the D-Clk delay at that point as the optimum setup time, the limit beyond which the performance of the latch is degraded and the reliability is endangered.

Our interest is to minimize the D-Q delay (or $D_{cq}+U$, as defined by Unger and Tan, [6]), which represents the portion of time that the flip-flop or Master-Slave structure takes out of the clock cycle. Since $D_{cq}+U >$ *minimum D-Q* (as defined in Figure 2), the cycle time will be reduced if the change in Data is allowed to arrive no later than the *Optimum setup time* before the trailing edge of the clock. For these reasons, we accepted the *minimum D-Q* delay as the *Delay* parameter of a flip-flop or Master-Slave latch.

The Metastable region consists of Setup and Hold zones. The last data transition can be moved all the way to the optimum setup time. The first, or late, data transition is allowed to come after the hold zone.

The hybrid design technique, [9], [13], [14], shifts the reference point of the hold and setup time parameters from the rising edge of the clock to the falling edge of the buffered clock signal, which ends the transparency period. In this way the setup and hold times measured with reference to the rising edge of the clock (as conventionally defined for flip-flops) are functions of the width of the transparency period, since their real reference point is the end of that period (just as in custom transparent latches).

## 2.3 Power Delay Product

A point of minimum Power-Delay Product exists and represents the point of optimal energy utilization. The PDP<sub>tot</sub> parameter is the product of the *Delay* and *Total power* parameters. We have chosen PDP<sub>tot</sub> as the overall performance parameter for comparison in terms of speed and power.

## 3. SIMULATION

## 3.1 Test Bench

Figure 3. The simulation test bench

The buffering inverters in Figure 3 provide realistic Data and Clock signals, while themselves being fed from ideal voltage sources.
Capacitive loads simulate the fan-out signal degradation. Since the buffering inverters dissipate power even without any external load (due to their internal capacitances), we corrected the measured power of the shaded inverters, Figure 3, by interpolating the power over a wide range of loads. In the case of the Data inverter, the correction took into account not only the inverter's intrinsic capacitance, but also the load Cl.

Parameters of the MOS model used in our simulations are shown in Table 1. For the given technology, a load capacitance of Cl = 200 fF equals the load of 22 minimal inverters (wp/wn = 3.2 µm/1.6 µm). The dependence of power consumption on clock frequency appeared to be nearly linear (since the throughput was increased accordingly), so we decided to fix the frequency at 100 MHz.

| Technology: | |
|-------------------------|-------------------------|
| Channel length | 0.2 µm |
| Min. gate width | 1.6 µm |
| Max. gate width | 22 µm |
| Vtp,n | 0.7 V |
| MOSFET Model: | Level 28 modified BSIM Model |
| MOS Gate Capacitance Model: | Charge Conservation Model |
| Conditions: | |
| Nominal | Vdd = 2 V, T = 25 °C |

Table 1. MOS transistor model parameters

## 3.2 Transistor Width Optimization

All structures were optimized both in terms of speed and power. We used the Levenberg-Marquardt optimization algorithm embedded in HSPICE. A variety of other optimization algorithms is available today, such as the ones presented by Yuan and Svensson in [11] and [12]. Both algorithms will eventually lead to good results when applied to logic structures, but they do not take into account the setup time parameter and therefore the effective time taken from the cycle.

The first step is the optimization of both *Clk-Q* delay and *Total power*, which is essentially an optimization in terms of PDP with the addition of the *Total power* parameter. The next step is the calculation and correction of the *minimum D-Q*, taken as the *Delay* parameter.
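
The second step can be sketched as a post-processing pass over a D-Clk sweep. This is an illustration under our own assumptions, not the authors' tooling: the function names and the sweep values are hypothetical, D-Clk is taken as positive when Data arrives before the latching clock edge, and D-Q is computed as D-Clk + Clk-Q at each swept point.

```python
def minimum_d_q(sweep):
    """Return (minimum D-Q, optimum setup time), both in ps.

    sweep: iterable of (d_clk_ps, clk_q_ps) pairs from a D-Clk sweep;
    clk_q_ps is None where the latch failed to capture the data.
    """
    valid = [(d_clk, clk_q) for d_clk, clk_q in sweep if clk_q is not None]
    if not valid:
        raise ValueError("no successful capture in the sweep")
    # D-Q is the sum of the data-to-clock time and the resulting Clk-Q delay.
    d_clk, clk_q = min(valid, key=lambda point: point[0] + point[1])
    return d_clk + clk_q, d_clk


def pdp_tot_fj(delay_ps, total_power_uw):
    """PDP_tot = Delay * Total power; ps * uW = 1e-18 J = 1e-3 fJ."""
    return delay_ps * total_power_uw * 1e-3


# Hypothetical sweep: Clk-Q is flat in the stable region, rises in the
# metastable region, and the last point is a capture failure.
sweep = [
    (200, 150), (100, 150), (50, 152), (20, 160),
    (0, 172), (-10, 178), (-20, 195), (-30, 260), (-40, None),
]

delay_ps, setup_ps = minimum_d_q(sweep)
print(delay_ps, setup_ps)              # minimum D-Q and optimum setup time
print(round(pdp_tot_fj(266, 107)))     # Delay and Total power of the PowerPC row in Table 2
```

For the assumed sweep the minimum D-Q is 168 ps at a D-Clk of -10 ps, which is then the optimum setup time. Applying `pdp_tot_fj` to the PowerPC row of Table 2 (266 ps, 107 µW) gives about 28 fJ, in agreement with the tabulated PDP<sub>tot</sub>.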
The problem arises in how to calculate the *Delay* and find the minimum PDP<sub>tot</sub> in one step. Several iterations are needed to achieve satisfactory results. New automated tools are needed, especially because the existing ones consider the *Clk-Q* delay as the relevant parameter for the optimization.

If we try to optimize an MS latch in terms of the classical PDP (*Clk-Q* \* Internal Power), the result will be a minimal Master latch optimized for low power, and a Slave latch optimized for both speed and power. The "optimized" structure will have an excessively large setup time, thus requiring a longer clock cycle to meet the timing requirements. The reason for such a result is that the optimizer does not "see" the real performance through the *Clk-Q* delay.

## 4. RESULTS

We have chosen a set of representative latches and flip-flops which have been designed for use either in high-performance or in low-power processors. Results of the simulations are shown in Table 2. Power dissipation parameters presented in Table 2 are for the pseudo-random data sequence with equal probability of all transitions.

The main advantages of the PowerPC 603 MS latch, Figure 5, presented in [4], are a short direct path and low-power feedback. However, it has a big clock load, which greatly influences the total power consumption on chip.

The modification of the standard dynamic C<sup>2</sup>MOS MS latch, Figure 13, has a small clock load, achieved by local clock buffering, and low-power feedback assuring fully static operation. It is slower than the PowerPC 603 MS latch. The faster pull-up in the PowerPC 603 MS latch is achieved by the use of complementary pass-gates, which are less robust. Unlike the classical C<sup>2</sup>MOS structure, mC<sup>2</sup>MOS is robust to clock slope variation due to the local clock buffering.

Milestones of the hybrid-design technique are HLFF, Figure 8, [9] and SDFF, Figure 9, [13]. SDFF is the fastest of all the presented structures.
Its significant advantage over HLFF is the very small performance penalty for embedded logic functions. The disadvantages are a bigger clock load and a larger effective precharge capacitance, which results in increased power consumption for data patterns with more "ones".

The K6 Edge-Triggered-Latch, Figure 10, [14], is a dynamic, self-resetting, differential, hybrid structure. It is very fast but has very high power consumption, independent of the data pattern.

The precharged sense-amplifier stage SA-F/F, Figure 11, [10], and the flip-flop used in StrongArm110, Figure 12, [8], have their speed bottleneck in the output S-R latch stage. Uneven rise and fall times not only degrade speed but also cause glitches in succeeding logic stages, which increases total power consumption. The additional transistor in the StrongArm FF only provides fully static operation, with little penalty in power and delay. SA-F/F, StrongArm110 FF, and the self-reset stage in K6 ETL have the very useful feature of monotonic transitions at the outputs, which drive fast domino logic, [14], [15]. These structures also have a very small clock load.

The SSTC\* and DSTC\* MS latches, Figure 6 and Figure 7, were simulated with a minimized Master latch, as proposed in [5], and an optimized Slave latch. Using our optimization approach we got approximately 40% better results in terms of PDP<sub>tot</sub>. The minimized Master latch in SSTC\* and DSTC\* suffers from a substantial voltage drop at the outputs, due to the capacitive coupling effect between the common node of the Slave latch and the floating high output driving node of the Master latch. The optimized Master latch consumes more power than the minimized one but minimizes the portion of short-circuit power dissipated in the Slave latch. With this tradeoff, power remains the same and setup time is significantly reduced, which leads to a much better PDP<sub>tot</sub>. However, the presented capacitive coupling effect, along with the problems associated with glitches at the data inputs, noted by Blair in [16], results in much worse performance and power features compared with the other presented latches, even for the optimized structures SSTC and DSTC.

| Nominal conditions | # of transistors | Total gate width [µm] | Internal power [µW] | Clock power [µW] | Data power [µW] | Total power [µW] | Delay [ps] | PDP<sub>tot</sub> [fJ] |
|---------------------|-----|-----|-----|-----|-----|-----|------|-----|
| PowerPC | 16 | 185 | 56 | 46 | 5 | 107 | 266 | 28 |
| HLFF | 20 | 162 | 126 | 18 | 3 | 148 | 199 | 29 |
| SDFF | 23 | 167 | 178 | 27 | 2 | 207 | 187 | 39 |
| mC<sup>2</sup>MOS | 24 | 170 | 114 | 15 | 6 | 136 | 292 | 40 |
| SA-F/F | 19 | 214 | 137 | 18 | 3 | 158 | 272 | 43 |
| StrongArm | 20 | 215 | 141 | 18 | 3 | 162 | 275 | 45 |
| K6 ETL | 37 | 246 | 330 | 15 | 5 | 349 | 200 | 70 |
| SSTC | 16 | 147 | 134 | 22 | 4 | 160 | 592 | 95 |
| DSTC | 10 | 136 | 172 | 22 | 4 | 198 | 629 | 125 |
| SSTC\* | 16 | 86 | 132 | 14 | 1 | 146 | 898 | 131 |
| DSTC\* | 10 | 76 | 172 | 13 | 1 | 185 | 1060 | 196 |

Table 2. General Characteristics

Detailed timing parameters of the presented structures are shown in Table 3.

| Nominal conditions | Clk-Qhl [ps] | Clk-Qlh [ps] | Min. D-Qhl [ps] | Min. D-Qlh [ps] | Opt. setup time [ps] |
|---------------------|------|------|-------|-------|----------|
| HLFF | 195 | 191 | 199 | 155 | -21 |
| PowerPC | 145 | 139 | 266 | 220 | 79 |
| SDFF | 176 | 176 | 187 | 143 | -21 |
| mC<sup>2</sup>MOS | 193 | 188 | 292 | 282 | 92 |
| StrongArm | 262 | 162 | 275 | 171 | -35 |
| SA-F/F | 262 | 162 | 272 | 168 | -35 |
| K6 ETL | | 168 | | 200 | -4 |
| SSTC | 97 | 301 | 374 | 592 | 267 |
| DSTC | 98 | 318 | 375 | 629 | 263 |
| SSTC\* | 150 | 393 | 639 | 898 | 476 |
| DSTC\* | 200 | 500 | 716 | 1060 | 480 |

Table 3. Timing parameters

Figure 4 presents the ranges and distribution of PDP<sub>tot</sub> for different data patterns. The symbol • designates the point of power dissipation (PDP<sub>tot</sub>) for the average activity data pattern.

Figure 4. Ranges of PDP<sub>tot</sub>

For systems where high performance is of primary interest, within the available power budget, single-ended, hybrid, semi-dynamic designs are a very good choice, given their negative setup time and small internal delay. They have power dissipation comparable to that of static MS latches, but much better performance.

The low-power pass-gate style used in PowerPC 603 and the modified C<sup>2</sup>MOS style are good choices for designs where speed is not of primary importance.

On the basis of our comparisons, differential structures appear to be worse than single-ended ones. Differential structures switch for all data patterns and have doubled input and output capacitive load. Differential latches based on the DCVS logic style suffer from uneven rise and fall times, which can cause glitches and short-circuit power dissipation in succeeding logic stages.

Despite all the described disadvantages, differential structures have the unique property of differential signal amplification. Where logic in the pipeline operates with reduced voltage swing signals, these latches have the role of signal amplifiers, i.e.
swing recovery circuits, [10]. Thus, it is the logic in the pipeline that saves power, not the latches themselves. The overall power dissipation of such pipeline structures is decreased, but the latches themselves are not ideal low-power structures when tested in isolation. This is the reason why they appear to have a bad compromise between power and delay in comparison with single-ended structures. Since the future of low-power systems lies in reduced signal swing, the importance of differential logic and latching structures is increasing.

The amount of power consumed for driving the clock inputs of each structure is shown in Figure 14.

Figure 14. Local Clock power consumption

In Figure 15, hybrid structures show the best performance, as they should, due to the negative setup time. If only the Clk-Q parameter is taken as the valid performance indicator, the positive setup time of the MS structures is hidden and they become comparable to, if not better than, the hybrid ones. This is illustrated in Figure 16, where the PowerPC 603 MS latch becomes the "fastest", the mC<sup>2</sup>MOS MS latch becomes as "fast" as HLFF, and the DSTC and SSTC MS latches become comparable to other structures in terms of "speed".

Figure 15. Total Power range vs. Delay

Figure 16. Total Power range vs. Clk-Q

## 5. CONCLUSION

The problem of consistency in the analysis of various latch and flip-flop designs was addressed. A consistent analysis approach and a set of simulation conditions have been introduced. We strongly feel that any research on latch and flip-flop design techniques for high-performance systems should take those parameters into account. The problems of the transistor width optimization methods have also been described. Some hidden weaknesses and potential dangers in terms of reliability of previous timing parameters and optimization methods were brought to light.

## 6. REFERENCES

- [1] Ko, U., et al. Design techniques for high-performance, energy-efficient control logic. In ISLPED Digest of Technical Papers, Aug. 1996.
- [2] Yuan, J., and Svensson, C. Latches and flip-flops for Low Power Systems. In A. Chandrakasan and R. Brodersen, Low Power CMOS design, 233-238, IEEE Press, NJ, 1998.
- [3] Fisher, G. J. An Enhanced Power Meter for SPICE2 Circuit. In IEEE Transactions on Computer-Aided Design, vol. 7, no. 5, Oct. 1986.
- [4] Gerosa, G., et al. A 2.2 W, 80 MHz Superscalar RISC Microprocessor. In IEEE Journal of Solid-State Circuits, vol. 29, no. 12, 1440-1452, Dec. 1994.
- [5] Yuan, J., and Svensson, C. New Single-Clock CMOS Latches and Flipflops with Improved Speed and Power Savings. In IEEE Journal of Solid-State Circuits, vol. 32, no. 1, Jan. 1997.
- [6] Unger, S.H., and Tan, C. Clocking Schemes for High-Speed Digital Systems. In IEEE Transactions on Computers, vol. C-35, no. 10, Oct. 1986.
- [7] Shoji, M. Theory of CMOS Digital Circuits and Circuit Failures. Princeton University Press, Princeton, NJ, 1992.
- [8] Montanaro, J., et al. A 160-MHz, 32-b, 0.5-W CMOS RISC microprocessor. In IEEE Journal of Solid-State Circuits, vol. 31, no. 11, 1703-14, Nov. 1996.
- [9] Partovi, H., et al. Flow-through latch and edge-triggered flip-flop hybrid elements. In ISSCC Digest of Technical Papers, Feb. 1996.
- [10] Matsui, M., et al. A 200 MHz 13 mm<sup>2</sup> 2-D DCT Macrocell Using Sense-Amplifier Pipeline Flip-Flop Scheme. In IEEE Journal of Solid-State Circuits, vol. 29, no. 12, 1482-91, Dec. 1994.
- [11] Yuan, J., and Svensson, C. CMOS Circuit Speed Optimization Based on Switch Level Simulation. In Proceedings of International Symposium on Circuits and Systems, ISCAS 88, 1988.
- [12] Yuan, J., and Svensson, C. Principle of CMOS circuit power-delay optimization with transistor sizing. In Proceedings of International Symposium on Circuits and Systems, ISCAS 96, vol. 1, 1996.
- [13] Klass, F. Semi-Dynamic and Dynamic Flip-Flops with embedded logic. In Digest of Technical Papers, 1998 Symposium on VLSI Circuits, Honolulu, HI, USA, 13-15 June 1998.
- [14] Draper, D., et al. Circuit techniques in a 266-MHz MMX-enabled processor. In IEEE Journal of Solid-State Circuits, vol. 32, no. 11, 1650-64, Nov. 1997.
- [15] Gieseke, B.A., et al. A 600 MHz superscalar RISC microprocessor with out-of-order execution. In ISSCC Digest of Technical Papers, 176-7, 451, Feb. 1997.
- [16] Blair, G.M. Comments on New single-clock CMOS latches and flip-flops with improved speed and power savings. In IEEE Journal of Solid-State Circuits, vol. 32, no. 10, 1610-11, Oct. 1997.