SIGDA, Super Compendium, ISLPED 1998, Abstracts

ISLPED'98 Abstracts


Keynote Session

Chair: Anantha Chandrakasan

M0.1 High Performance DSPs - What's Hot and What's Not? [p. 1]

Bryan Ackland, Chris Nicol

This paper compares low power techniques current in the research literature with those used in commercial DSP design and explores why some techniques have not yet had commercial impact. It also examines the low power needs of future DSP applications.
Keywords: DSP, low power, architecture, circuit design.

M0.2 Low Power and Low Voltage CMOS Digital Circuits Techniques [p. 7]

Christer Svensson, Atila Alvandpour

One of many important factors affecting power consumption is the choice of circuit technique for logic, latches and flip-flops. We analyze the power consumption at circuit level and use the results to guide the choice of circuit technique. Several types of latches and flip-flops are compared regarding power consumption and speed. Comparing logic styles clearly indicates that simple static logic in general has the lowest power consumption. Another very important factor affecting power consumption is the supply voltage. We discuss the effect of low supply voltage on the choice of circuit technique.
Keywords: Low Power, Low voltage, CMOS, Digital circuits.

Session M1: RF Building Blocks

Session Chair: Lou Williams
Associate Chair: T. R. Viswanathan

M1.1 CMOS Front End Components for Micropower RF Wireless Systems [p. 11]

Tsung-Hsien Lin, Henry Sanchez, Razieh Rofougaran, William J. Kaiser

New applications have recently appeared for a low power, low cost, "embedded radio". These wireless interfaces for handheld mobile nodes and Wireless Integrated Network Sensors (WINS) must provide spread spectrum signaling for multi-user operation at 902-928 MHz. Cost considerations motivate the development of complete micropower CMOS RF systems operating at previously unexplored low power levels. Micropower CMOS VCO and mixer circuits, developed for these emerging narrow-band communication systems, are reported here. Design methods combining high-Q inductors and weak inversion MOSFET operation enable the lowest reported operating power for RF front end components including a voltage-controlled oscillator (VCO) and mixer operating at frequencies of 400 MHz - 1 GHz. In addition, the VCO, by virtue of its high-Q inductive components, displays the lowest reported phase noise for 1 GHz CMOS VCO systems for any power dissipation.

M1.2 A 1.4-GHz 3-mW CMOS LC Low Phase Noise VCO Using Tapped Bond Wire Inductance [p. 16]

Tamara I. Ahrens, Thomas H. Lee

A 1.4-GHz LC voltage-controlled oscillator has been implemented in a MOSIS 0.5-um CMOS process. Complementary cross-coupled PMOS and NMOS transistors enhance single-ended symmetry at each of the resonant nodes, reducing close-in phase noise. Tapped bond wires provide a resonant tank with high Q. At an offset frequency of 100 kHz, the measured phase noise is -107 dBc/Hz with 3mW power dissipation from a 3.0 V supply. NMOS gate capacitors achieve a 17% tuning range.

M1.3 A 3.8-mW 2.5-GHz Dual-Modulus Prescaler in a 0.8 um Silicon Bipolar Production Technology [p. 20]

Herbert Knapp, Wilhelm Wilhelm, Mira Rest, Hans-Peter Trost

This paper presents a dual-modulus ÷128/÷129 prescaler operating up to 2.5 GHz. It consumes only 3.8 mW from a 2.3 V supply when driving an 8 pF capacitive load. The circuit is operational with supplies ranging from 2 V to over 7 V. With a 2 V supply it consumes only 1.38 mA while still operating up to 2 GHz. The circuit is manufactured in a standard silicon bipolar production process (Siemens B6HF). This 25 GHz-f_T double-polysilicon technology uses 0.8 um lithography and LOCOS isolation. The chip is mounted in a 6-pin SOT363 SMD package.

Session M2: RT-Level Power Modeling and Analysis

Session Chair: Jose Monteiro
Associate Chair: Luca Benini

M2.1 Towards the Capability of Providing Power-Area-Delay Trade-off at the Register Transfer Level [p. 24]

Chun-hong Chen, Chi-ying Tsui

This paper presents a new register-transfer level (RT-level) power estimation technique based on technology decomposition. Given the Boolean description of a circuit function, the power consumption is estimated for each of two typical circuit implementations, namely the minimum-area implementation and the minimum-delay implementation. This makes it possible to obtain a full power-delay-area trade-off curve at the RT level. Our method captures the structural and/or functional information of a circuit without going through actual gate-level implementation. Experimental results show that the accuracy is very reasonable.
Keywords: RT-level, power estimation, entropy, technology decomposition

M2.2 Stream Synthesis for Efficient Power Simulation Based on Spectral Transforms [p. 30]

Alberto Macii, Enrico Macii, Massimo Poncino, Riccardo Scarsi

One way of minimizing the time required to perform simulation-based power estimation is to reduce the length of the input trace fed to the simulator. The use of a reduced stream may, however, introduce errors in the estimation results. The short input sequence used for power simulation must therefore be generated (or synthesized) in such a way that the resulting error is minimized. This paper introduces a new stream synthesis method that uses spectral analysis techniques based on the discrete Fourier transform to determine a reduced sequence of vectors, shortening the overall power simulation time with only a very limited loss of accuracy. The effectiveness of the proposed synthesis procedure is demonstrated by results obtained on the ISCAS'85 combinational benchmarks for a variety of input streams characterized by different statistical and correlation properties.

M2.3 Theoretical Bounds for Switching Activity Analysis in Finite-State Machines [p. 36]

Diana Marculescu, Radu Marculescu, Massoud Pedram

The objective of this paper is to provide lower and upper bounds for the switching activity on the state lines in Finite State Machines (FSMs). Using a Markov chain model for the behavior of the states of the FSM, we derive theoretical bounds for the average Hamming distance on the state lines which are valid irrespective of the state encoding used in the final implementation. Such lower and upper bounds, in addition to providing a target for any state assignment algorithm, can also be used as parameters in a high-level model of power, and thus provide an early indication about the performance limits of the target FSM. Experimental results obtained for the MCNC'91 benchmark suite show that our bounds are tighter than the bounds reported previously by other researchers and can be effectively used in a high-level power estimation framework.
Keywords: Lower/upper bounds, Hamming distance, Markov chains, switching activity, power estimation.
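
To illustrate the quantity that M2.3 bounds, the following minimal sketch computes the average Hamming distance on the state lines for one particular encoding, given a Markov chain model of an FSM. The transition matrix, the two encodings, and all function names are hypothetical examples and are not taken from the paper; the paper's encoding-independent bounds are not reproduced here.

```python
def stationary(P, iters=1000):
    """Approximate the stationary distribution of a row-stochastic
    matrix P (list of rows) by repeated multiplication (power iteration)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def hamming(a, b):
    """Number of bit positions in which the integer codes a and b differ."""
    return bin(a ^ b).count("1")


def avg_hamming(P, codes):
    """Expected number of state-line toggles per clock cycle:
    sum over transitions i -> j of pi[i] * P[i][j] * H(code_i, code_j)."""
    pi = stationary(P)
    n = len(P)
    return sum(pi[i] * P[i][j] * hamming(codes[i], codes[j])
               for i in range(n) for j in range(n))


# Hypothetical 4-state FSM: advance to the next state with probability 0.9,
# stay in the current state with probability 0.1.
P = [[0.1, 0.9, 0.0, 0.0],
     [0.0, 0.1, 0.9, 0.0],
     [0.0, 0.0, 0.1, 0.9],
     [0.9, 0.0, 0.0, 0.1]]

binary_codes = [0b00, 0b01, 0b10, 0b11]  # plain binary state assignment
gray_codes = [0b00, 0b01, 0b11, 0b10]    # Gray-code state assignment

print(avg_hamming(P, binary_codes))
print(avg_hamming(P, gray_codes))
```

For this cyclic example the Gray-code assignment toggles fewer state lines per cycle than the binary one, which is the kind of encoding-dependent variation that encoding-independent bounds are meant to bracket.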

Session M3: Enabling Device Technology for Low-Power Applications

Session Chair: Bill Kaiser
Associate Chair: Jim Burr

M3.1 Low Power Salient Integration Mode Image Sensor with a Low Voltage Mixed-Signal Readout Architecture [p. 42]

Eric Y. Chou, A. J. Budrys, Kit M. Cham

CMOS image sensors are very suitable for battery-operated camera systems due to their low power nature. In this research work, a salient integration mode CMOS image sensor pixel design which requires only 1 or 2 transistors per pixel and a low power readout architecture was developed in a 0.35 um CMOS technology. High fill factor and small pixel size are achieved at the same time for the 2T pixel design. The readout architecture includes a low voltage low power multi-stage analog data buffer which works as a differential to single-ended conversion mechanism for a new correlated double sampling method. Total data bandwidth and switching power are also greatly reduced. The architecture was developed to be scalable to 0.18 um technology with 1.2 volt supply voltage, and lower. An experimental chip with an array size of 256 x 256 and a pixel size of 6.3 um x 6.3 um was fabricated in HP's 0.35 um CMOS technology. Promising experimental results strongly indicate that the new pixel design and readout architecture are suitable for low voltage CMOS camera chips in future generations of CMOS technology.
Keywords: Salient integration mode image sensor, active pixel sensor, CMOS imaging, low power mixed-signal design, deep submicron technology

M3.2 A Delay Distribution Squeezing Scheme with Speed-Adaptive Threshold-Voltage CMOS (SA-Vt CMOS) for Low Voltage LSIs [p. 48]

Masayuki Miyazaki, Hiroyuki Mizuno, Koichiro Ishibashi

In a speed-adaptive threshold-voltage CMOS (SA-Vt CMOS) circuit, the substrate bias is controlled so that delay in the circuit stays constant. Distributions of device speeds are squeezed under fast-operation conditions. With a ring oscillator using 0.25-um CMOS devices as a test circuit, we found that the worst-case operating frequency was improved from 20 MHz to 55 MHz, and the fluctuation of the operating frequency was suppressed from 44% to 15% while the supply-voltage variation was under 0.1 V with a 1.8 V supply voltage.

M3.3 3D CMOS SOI for High Performance Computing [p. 54]

S. J. Abou-Samra, P. A. Aisa, A. Guyot, B. Courtois

This paper addresses three topics: first, a new three-dimensional CMOS-SOI on SOI technology is presented; then, design methodologies are proposed for this technology; and last, a comparison is carried out between 2D and 3D designs. In this technology the P-channel devices are stacked over the N-channel ones. All gates have a length of 100 nm. New design constraints are introduced. Consequently, new design methodologies have to be developed in order to take full advantage of the outstanding features of 3D integration, such as the reduced length of interconnections. A 16x16 bit multiplier was designed in this technology. Comparative results between 2D and 3D integration are given here in terms of energy consumption, delay and area.

M3.4 A High Speed and Low Power SOI Inverter using Active Body-Bias [p. 59]

Joonho Gil, Minkyu Je, Jongho Lee, Hyungcheol Shin

We propose a new high speed and low power SOI inverter that can operate with efficient body-bias control and free supply voltage. The performance of the proposed circuit is evaluated by both the BSIM3SOI circuit simulator and the ATLAS device simulator, and then compared with other reported SOI circuits. The proposed circuit is shown to have excellent characteristics. At the supply voltage of 1.5V, the proposed circuit operates 27% faster than the conventional SOI circuit with the same power dissipation.
Keywords: SOI inverter, low power, dynamic threshold, body-bias

Session M4: Low-Power Architectural Techniques for General Purpose Systems

Session Chair: Vivek Tiwari
Associate Chair: Chris Nicol

M4.1 Power and Performance Tradeoffs using Various Caching Strategies [p. 64]

R. Iris Bahar, Gianluca Albera, Srilatha Manne

In this paper, we propose several different data and instruction cache configurations and analyze their power as well as performance implications on the processor. Unlike most existing work in low power microprocessor design, we explore a high performance processor with the latest innovations for performance. Using a detailed, architectural-level simulator, we evaluate full system performance using several different power/performance sensitive cache configurations such as increasing cache size or associativity and including buffers alongside L1 caches. We then use the information obtained from the simulator to calculate the energy consumption of the memory hierarchy of the system. As an alternative to simply increasing cache associativity or size to reduce lower-level memory energy consumption (which may have a detrimental effect on on-chip energy consumption), we show that, by using buffers, energy consumption of the memory subsystem may be reduced by as much as 13% for certain data cache configurations and by as much as 23% for certain instruction cache configurations without adversely affecting processor performance or on-chip energy consumption.

M4.2 Architectural and Compiler Support for Energy Reduction in the Memory Hierarchy of High Performance Microprocessors [p. 70]

Nikolaos Bellas, Ibrahim Hajj, Constantine Polychronopoulos, George Stamoulis

In this paper we propose a technique that uses an additional mini cache located between the I-Cache and the CPU core to buffer instructions that are nested within loops and would otherwise be continuously fetched from the I-Cache. This mechanism is combined with code modifications, through the compiler, that greatly simplify the required hardware, eliminate unnecessary instruction fetching, and consequently reduce signal switching activity and the dissipated energy. We show that the additional cache, dubbed L-Cache, is much smaller and simpler than the I-Cache when the compiler assumes the role of allocating instructions in it. Through simulation, we show that, for the SPECfp95 benchmarks, the I-Cache remains disabled most of the time, and the "cheaper" extra cache is used instead. We present experimental results that validate the effectiveness of this technique, and present the energy gains for most of the SPEC95 benchmarks.

M4.3 The Simulation and Evaluation of Dynamic Voltage Scaling Algorithms [p. 76]

Trevor Pering, Tom Burd, Robert Brodersen

The reduction of energy consumption in microprocessors can be accomplished without impacting the peak performance through the use of dynamic voltage scaling (DVS). This approach varies the processor voltage under software control to meet dynamically varying performance requirements. This paper presents a foundation for the simulation and analysis of DVS algorithms. These algorithms are applied to a benchmark suite specifically targeted for PDA devices.
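
The first-order reasoning behind DVS can be sketched with the standard dynamic-energy relation, energy per cycle ≈ C_eff · Vdd². The sketch below compares running a fixed workload at full speed against running it at half speed with a lower supply voltage. The operating points, the effective capacitance, and the assumption that supply voltage scales linearly with clock frequency are all hypothetical simplifications, not values from the paper; leakage and idle energy are ignored.

```python
def task_energy(cycles, c_eff, vdd):
    """Dynamic switching energy in joules for a task of `cycles` clock
    cycles, using the first-order model E = cycles * C_eff * Vdd^2."""
    return cycles * c_eff * vdd ** 2


def task_time(cycles, freq_hz):
    """Execution time in seconds at a fixed clock frequency."""
    return cycles / freq_hz


# Hypothetical operating points: (clock frequency in Hz, supply voltage in V).
FAST = (100e6, 3.3)
SLOW = (50e6, 1.65)   # assumption: Vdd scaled linearly with frequency

CYCLES = 50e6         # hypothetical workload, must finish within 1 second
C_EFF = 1e-9          # hypothetical effective switched capacitance (farads)

e_fast = task_energy(CYCLES, C_EFF, FAST[1])
e_slow = task_energy(CYCLES, C_EFF, SLOW[1])

print("fast: %.3f s, %.4f J" % (task_time(CYCLES, FAST[0]), e_fast))
print("slow: %.3f s, %.4f J" % (task_time(CYCLES, SLOW[0]), e_slow))
print("energy ratio fast/slow: %.2f" % (e_fast / e_slow))
```

Under these assumptions both settings meet the deadline, but the slower, lower-voltage setting uses a quarter of the dynamic energy, because energy per cycle depends on the square of the supply voltage rather than on the clock rate.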

M4.4 Optimizing the DRAM Refresh Count for Merged DRAM/Logic LSIs [p. 82]

Taku Ohsawa, Koji Kai, Kazuaki Murakami

In merged DRAM/logic LSIs, the DRAM portion could suffer from shorter data retention time because of heat and noise caused by the logic portion. Frequent refreshes increase power consumption. They also disturb normal DRAM accesses, leading to performance degradation. In order to overcome this problem, we propose several DRAM refresh architectures. We have estimated the DRAM refresh count in executing benchmark programs under several architecture models. As a result, in the most effective combination of the architectures, we have obtained more than 80% reduction against a conventional DRAM refresh architecture for most benchmark programs. In addition, even when we have taken normal DRAM access into account, we have obtained more than 50% reduction for several benchmarks.

Session P1: Circuits and Technology

Session Chair: Sayfe Kiaei

P1.1 Integrated DC/DC Converter with Digital Controller [p. 88]

Ferdinand Sluijs, Kees Hart, Wouter Groeneveld, Stephan Haag

A DC/DC converter with integrated digital controller and switches is realized. This DC/DC converter only needs an external coil, diode and capacitor. The main advantages of this type of digital DC/DC converter are the fast response on load variation and the high efficiency over a wide power range. The DC/DC converter uses low resistance CMOS switches and operates in multi mode. This controller uses a small output voltage window as reference for control actions.

P1.2 CMOS VCOs for Frequency Synthesis in Wireless Biotelemetry [p. 91]

Rafael J. Betancourt-Zamora, Thomas H. Lee

A new phase noise model was used to optimize a differential ring VCO for minimum power consumption. We compare the phase noise performance of three buffer stages using clamped, symmetric and cross-coupled loads, respectively. We propose a cross-coupled buffer topology that achieves lower phase noise by exploiting symmetry. Measured phase noise for a 1.2 mW, 150 MHz VCO fabricated in 0.5 um CMOS is -103.9 dBc/Hz at 500 kHz offset, showing good agreement with the theory.
Keywords: CMOS, frequency synthesis, phase noise, ring oscillator, VCO

P1.3 The Impact of Data Characteristics and Hardware Topology on Hardware Selection for Low Power DSP [p. 94]

Gareth Keane, Jonathan Spanier, Roger Woods

Adders and multipliers are key operations in DSP systems. The power consumption of adders is well understood but there are few detailed results on the choice of multipliers available. This paper considers how the power consumption of a number of multiplier structures such as Carry-Save array and Wallace Tree multipliers varies with data wordlengths and different layout strategies. In all cases, results were obtained from EPIC PowerMill™ simulations of actual synthesised circuit layouts. Analysis of the results highlights the effects of routing and interconnect optimization for low power operation and gives clear indications on choice of multiplier structure and design flow for the rapid design of DSP systems.
Keywords: Low power DSP systems, optimum hardware selection, multiplier structures.

P1.4 Low Threshold CMOS Circuits with Low Standby Current [p. 97]

Mircea R. Stan

Multi-Voltage CMOS (MVCMOS) is a design methodology for very low power supply voltages that uses low-threshold transistors in series with the supply rails. The control voltages on the gating transistors need to be outside of the Vdd - Vss range (hence the name MVCMOS) in order to reduce the standby current, but the resulting circuits operate at lower supply voltages and have a lower area overhead than the previously proposed Multi-Threshold CMOS (MTCMOS).

P1.5 Minimum Supply Voltage for Bulk Si CMOS GSI [p. 100]

Azeez J. Bhavnagarwala, Blanca Austin, James D. Meindl

Limits on energy dissipation are investigated for bulk Si CMOS circuits at each node of the 1997 National Technology Roadmap for Semiconductors (NTRS). Physical, continuous and smooth MOSFET Transregional drain current models that consider high-field effects in scaled devices, and permit trade-offs between saturation drive current and subthreshold leakage current are described and employed to model CMOS circuit performance and power dissipation at low voltages. The Transregional models are used in conjunction with physical threshold voltage roll-off models and stochastic interconnect distribution, at performance, chip sizes and transistor counts forecast by the 1997 NTRS, to project optimal supply and threshold voltages, minimizing total energy dissipated by CMOS logic circuits. Techniques exploiting datapath parallelism to further reduce supply voltage are shown to offer decreasing reductions in power dissipation with technology scaling.

P1.6 0.5 V CMOS Logic Delivering 200 Million 8x8 Bit Multiplications/s at Less Than 100 fJ Based on a 50 nm T-Gate SOI Technology [p. 103]

Volker Dudek, Reinhard Grube, Bernd Höfflinger, Michael Schau

High-performance CMOS logic at a very low voltage of 0.5 V can deliver 150 Million 8x8 multiplications/s at an energy level of only 30 fJ, if 0.35 um SOI technology is enhanced with self-aligned 50 nm T-Gate transistors, and if a new adder with a differential Manchester chain including special accelerators and the DIGILOG multiplier, a leading-one-first pseudo-log multiplier with complexity of order (n), are optimized simultaneously.
Keywords: Adder, Multiplier, T-Gate, low power, high-performance

P1.8 Decreasing Low-Voltage Manufacturing-Induced Delay Variations with Adaptive Mixed-Voltage-Swing Circuits [p. 106]

L. Richard Carley, Akshay Aggarwal, Ram K. Krishnamurthy

One of the major problems faced by the designer when operating CMOS static logic circuits at low power supply voltages (normalized to V_T) is that the delay spread introduced by today's IC manufacturing variations can increase dramatically. In this paper we describe an approach for decreasing the delay spread and power spread in ICs based on adaptively servoing the circuits between static CMOS operation and QuadRail operation. An on-chip series-regulator employing a dummy delay path is used to generate the adaptive low swing power supply rails making this approach fully compatible with a standard CMOS IC design methodology. Simulation results are presented demonstrating that for a 16*16+36-bit multiplier-accumulator designed in 0.5 um CMOS process the proposed approach decreases the delay spread from 3.9X to 2.3X and the power spread from 3.6X to 1.8X.
Keywords: Low power CMOS logic, mixed-swing CMOS logic, manufacturing variations, low voltage logic circuits.

P1.9 Power-Delay Tradeoffs for Radix-4 and Radix-8 Dividers [p. 109]

Alberto Nannarelli, Tomas Lang

The use of higher radices in division reduces the number of iterations to complete the operation, but increases the complexity of the circuit. In this paper we explore the influence of the radix on the power dissipation of a floating-point divider and the power-delay tradeoffs. We compare the performance and the energy consumption per operation for a radix-4 and a radix-8 divider, realized in CMOS technology. A reduction of about 40% in the energy consumption is obtained for both radices (about 70% if low-voltage gates, for dual voltage implementation, are available). Also the results show that the radix-8 divider is about 20% faster and the energy dissipated to perform a division is about the same, with respect to the radix-4.

Session P2: Systems and CAD

Session Chair: Ingrid Verbauwhede

P2.1 Automatic Characterization and Modeling of Power Consumption in Static RAM's [p. 112]

Mauro Chinosi, Roberto Zafalon, Carlo Guardiani

This paper presents an automatic modeling technique for building an accurate model of power consumption in embedded memory blocks. A software neural-network is used to create a regression tree by automatically splitting those variables that have a discontinuous effect on the power consumption. An application of the methodology to the modeling of a 0.35 um CMOS embedded SRAM is presented.
Keywords: Power estimation, Memory modeling, Static RAMs

P2.2 Improving Sampling Efficiency for System Level Power Estimation [p. 115]

Chih-Shun Ding, Cheng-Ta Hsieh, Massoud Pedram

In this paper, we propose an efficient statistical sampling technique which is suitable for estimating the total power consumption of a large VLSI system. The basic idea is to generate simulation units for each module in the system independently and then form samples of the system power by randomly selecting simulation units for each module. Hence, sampling is performed both temporally (across different clock cycles) and spatially (across different modules). A module clustering step ensures that the module types are compatible with this sampling strategy. Experimental results show a 4x reduction in the simulation time compared to existing Monte-Carlo simulation techniques.

P2.3 Power Invariant Vector Compaction Based on Bit Clustering and Temporal Partitioning [p. 118]

Nicola Dragone, Roberto Zafalon, Carlo Guardiani, Cristina Silvano

Power dissipation in digital circuits is strongly pattern dependent. Thus, to derive accurate simulation-based power estimates, a large number of input vectors is usually required. This paper proposes a vector compaction technique aiming at providing accurate power figures in a shorter simulation time for complex sequential circuits characterized by some hundreds of inputs. Starting from pair-wise spatio-temporal signal correlations, the proposed approach uses bit clustering and temporal partitioning of the input stream to preserve the statistical properties of the original stream and maintain the typical switching behavior of the circuit. The effectiveness of the proposed approach has been demonstrated over a significant set of industrial case studies implemented in CMOS submicron technology. While achieving a 10x to 50x stream size reduction, the reported results show average and maximum errors of 2.4% and 7.1% respectively, relative to the simulation-based power estimates derived from the original input stream.
Keywords: Power Estimation, Vector Compaction, Markov Chains, Low Power VLSI Design

P2.4 An Empirical Comparison of Algorithmic, Instruction, and Architectural Power Prediction Models for High Performance Embedded DSP Processors [p. 121]

Catherine H. Gebotys, Robert J. Gebotys

This paper presents a comparison of statistically-derived power prediction models at the algorithmic, instruction, and architectural levels for embedded high performance DSP processors. The approach is general enough to be applied to any embedded DSP processor. Results from 168 power measurements of DSP code show that power can be predicted at instruction and architecture levels with less than 2% error. This result is important for developing a general methodology for power characterization of embedded DSP software since low power is critical to complex DSP applications in many cost sensitive markets.

P2.5 Power Calculation and Modeling in Deep Submicron [p. 124]

Jay Abraham

Over the past few years it has become increasingly apparent that modern IC design is no longer bounded by timing and area constraints. Power has become significantly more important. In an era of hand held devices ranging from mobile computing to wireless communication systems, managing and controlling power takes on an important role. Several benefits are realized with low power designs in addition to extended battery life. Low power devices often run at lower junction temperatures and this leads to high reliability and low cost cooling systems [1,2,3,6]. Calculation and modeling of power (and delay) in deep-submicron (less than 0.25 microns) designs poses several challenges. This paper discusses the use of the Delay and Power Calculation System (DPCS) as a means by which EDA (Electronic Design Automation) tools can accurately calculate and model power.

P2.6 Partial Bus-Invert Coding for Power Optimization of System Level Bus [p. 127]

Youngsoo Shin, Sook-Ik Chae, Kiyoung Choi

We present a partial bus-invert coding scheme for power optimization of a system-level bus. In the proposed scheme, we select a sub-group of bus lines to be involved in bus encoding, avoiding unnecessary inversion of bus lines not in the sub-group and thereby reducing the total number of bus transitions. We propose a heuristic algorithm that selects the sub-group of bus lines for bus encoding. Experiments on benchmark examples indicate that the partial bus-invert coding reduces the total bus transitions by 62.6% on average, compared to the unencoded patterns.
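
The idea of inverting only a sub-group of bus lines can be sketched as follows. This is a minimal illustration of bus-invert coding restricted to a mask of selected lines, with one extra invert line; it is not the authors' implementation, and the heuristic that chooses the sub-group is not shown. The mask value, the example words, and the function names are hypothetical.

```python
def popcount(x):
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def encode(words, mask):
    """Partial bus-invert encoding. Lines selected by `mask` are inverted
    (and the invert line set to 1) whenever that causes fewer transitions
    on those lines plus the invert line. Returns (bus_value, invert) pairs.
    The bus and the invert line are assumed to start at 0."""
    k = popcount(mask)
    prev_bus, prev_inv = 0, 0
    out = []
    for w in words:
        # Transitions if the word is sent unchanged with invert = 0.
        d = popcount((prev_bus ^ w) & mask) + prev_inv
        # Sending the inverted sub-group instead costs (k + 1) - d.
        inv = 1 if 2 * d > k + 1 else 0
        bus = w ^ mask if inv else w
        out.append((bus, inv))
        prev_bus, prev_inv = bus, inv
    return out


def decode(encoded, mask):
    """Recover the original words from (bus_value, invert) pairs."""
    return [bus ^ mask if inv else bus for bus, inv in encoded]


def count_transitions(encoded):
    """Total bit transitions on the bus lines and the invert line."""
    total, prev_bus, prev_inv = 0, 0, 0
    for bus, inv in encoded:
        total += popcount(bus ^ prev_bus) + (inv ^ prev_inv)
        prev_bus, prev_inv = bus, inv
    return total


# Hypothetical 8-bit example: the data alternates between all-0 and all-1.
words = [0x00, 0xFF, 0x00, 0xFF]
full = encode(words, 0xFF)                    # all 8 lines in the sub-group
plain = [(w, 0) for w in words]               # unencoded reference
print(count_transitions(plain), count_transitions(full))
```

With a mask covering every line the sketch reduces to conventional bus-invert coding; a narrower mask leaves the unselected lines untouched, which is the case the abstract targets.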

P2.7 The Petrol Approach to High-Level Power Estimation [p. 130]

Rafael Peset Llopis, Kees Goossens

High-level power estimation is essential for designing complex low-power ICs. However, the lack of flexibility, or restriction to synthesizable code of previously presented high-level power estimation approaches limits their use. In this paper we present a novel, more general and flexible high-level power estimation approach, that avoids these limitations. Petrol, as we call it, is not limited to specialized application domains, synthesizable VHDL, or data path parts of a design. We show that glitches can be usefully modeled at higher levels of abstraction. The Petrol approach shows good correlation with gate-level power estimates. It is currently used for commercial designs.

P2.8 Power Consumption of Parallel Spread Spectrum Correlator Architectures [p. 133]

Won Namgoong, Teresa Meng

Parallel correlation in direct-sequence spread spectrum systems allows faster and more reliable coarse acquisition. However, the power consumed becomes significant, especially for receivers that employ a large number of parallel correlators. In this paper, the power efficiency of various parallel correlator architectures is explored assuming baseband sampled signals of two samples per chip. Active correlators placed in parallel that use both two's complement and sign-magnitude accumulators are first presented. Functionally equivalent M-parallel passive correlators are then studied. In this approach, the baseband sampled signals are passed through a tapped delay-line. Each tap is then multiplied by a stationary reference pseudonoise code and summed using a binary tree network. The passive correlators are generally more power efficient for large M values. Further reduction in power consumption is possible by splitting the tapped delay-line into even and odd delays and summing using two smaller binary tree adders. This proposed architecture consumes significantly less power compared to all other architectures. The power dissipation of M-parallel correlator architectures is evaluated for M = 8, 16, 32 using TSMC 0.35-um CMOS technology at 3.3 V supply voltage.
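
The passive correlator structure described above (tapped delay-line, per-tap multiplication by a stationary reference code, binary tree summation) can be sketched behaviorally. This is a functional model only, simplified to one sample per chip rather than the two samples per chip assumed in the paper, and it says nothing about power. The 8-chip code and the function names are hypothetical.

```python
def tree_sum(values):
    """Add values pairwise, level by level, as a binary adder tree would."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd-sized levels with a zero input
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0] if level else 0


def passive_correlator(samples, code):
    """Shift samples through a delay line with len(code) taps and return
    one correlation value per input sample."""
    ref = code[::-1]           # newest sample sits at tap 0, so reverse the code
    taps = [0] * len(code)
    out = []
    for s in samples:
        taps = [s] + taps[:-1]  # shift the new sample into the delay line
        out.append(tree_sum(t * c for t, c in zip(taps, ref)))
    return out


# Hypothetical 8-chip reference code with chips in {+1, -1} (M = 8).
CODE = [1, -1, 1, 1, -1, -1, 1, -1]

# Feeding the code itself yields a peak once all 8 chips are aligned.
outputs = passive_correlator(CODE, CODE)
print(outputs)
```

The peak appears on the sample at which the whole code is aligned with the reference taps, which is the event that coarse acquisition looks for.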

P2.9 A Low Power Video Processor [p. 136]

Uzi Zangi, Ran Ginosar

Multiple power saving methods were applied to a video processor for color digital video and still cameras. Architectural level methods failed to save power: asynchronous design, dynamic voltage scaling, bus switching minimization, pipeline stage merging, reduction of switching times and clock gating. However, changing the algorithm to work on pixel differences yielded 3-15% power reduction in typical cases.

P2.10 Power Dissipated by CMOS Gates Driving Lossless Transmission Lines [p. 139]

Yehea I. Ismail, Eby G. Friedman, Jose L. Neves

The dynamic and short-circuit power consumption of a CMOS gate driving an LC transmission line as a limiting case of an RLC transmission line is investigated in this paper. Closed form solutions for the output voltage and short-circuit power of a CMOS gate driving an LC transmission line are presented. These solutions agree with AS/X simulations within 11% error for a wide range of transistor widths and line impedances. The ratio of the short-circuit to dynamic power is less than 7% for CMOS gates driving LC transmission lines where the line is matched or underdriven. Therefore, the total power consumption is expected to decrease as inductance effects become more significant as compared to an RC model of the interconnect.

Panel: Past and Future Blockbusters in Low-Power Design [p. 142]

Moderator: Jan M. Rabaey, Bryan Ackland, Bob Brodersen, Christer Svensson, Bruce Wooley

Invited Talks: Session T0

Session Chair: Farid N. Najm

T0.1 Emerging Power Management Tools for Processor Design [p. 143]

D. T. Blaauw, A. Dharchoudhury, R. Panda, S. Sirichotiyakul, C. Oh, T. Edwards

Power management is an increasing concern for processor design. In this paper, we present an overview of traditional power simulation tools and discuss two emerging power management design technologies: power distribution integrity analysis and standby current measurement and optimization. We present methods for accurate peak current simulation, which is needed for power grid integrity analysis, and discuss the generation and compression of the simulation vectors. Also, static approaches for calculating an upper-bound on the maximum peak current are presented. Standby leakage current is state dependent and we present methods for calculating both the average and maximum leakage current. Finally, optimization methods for minimizing the leakage current by either assigning a standby state to the circuit or by using a dual-Vt process are discussed.

T0.2 Recent Developments in High Integration Multi-Standard CMOS Transceivers for Personal Communication Systems [p. 149]

Jacque C. Rudell, Jia-Jiunn Ou, Sekhar Narayanaswami, George Chien, Jeffrey A. Weldon, Li Lin, King-Chun Tsai, Luns Tee, Kelvin Khoo, Danelle Au, Troy Robinson, Danilo Gerna, Masanori Otsuka, Paul Gray

Issues associated with the integration of transceiver components onto a single silicon substrate are discussed. In particular, recently proposed receiver and transmitter architectures for high integration are examined for their promise of providing multi-standard capability. In addition, existing barriers to lower power transceiver operation are examined, as well as some proposed directions for future integrated transceiver research and development.

Session T1: Low-Power Logic Circuits

Session Chair: Brock Barton
Associate Chair: Rick Carley

T1.1 Low-Energy Embedded FPGA Structures [p. 155]

Eric Kusse, Jan M. Rabaey

This paper introduces an energy-efficient FPGA module, intended for embedded implementations. The main features of the proposed cell include a rich local-interconnect network, which drastically reduces the energy dissipated in the wiring, and a dual-voltage scheme that allows pass-transistor networks to operate at low voltages yet maintain decent performance. Simulations on a benchmark set demonstrate that the proposed module succeeds in its goal of reducing energy consumption by an order of magnitude over existing implementations.
Keywords: FPGAs, Low Energy, Dual Voltage, Pass-transistors, Power, Embedded, Low Swing, Interconnect Network.

T1.2 Low Swing Interconnect Interface Circuits [p. 161]

Hui Zhang, Jan Rabaey

This paper reviews a number of low-swing on-chip interconnect schemes, and presents a thorough analysis of their effectiveness and limitations. In addition, several new interface circuits, presenting even more energy savings, are proposed. Some of these circuits not only reduce the interconnect swing, but also use very low supply voltages, so as to obtain quadratic energy savings. The performance of each of the presented circuits is thoroughly examined using simulation on a benchmark interconnect circuit. Energy savings of a factor of seven have been observed for some of the schemes.

T1.3 True Single-Phase Energy-Recovering Logic for Low-Power, High-Speed VLSI [p. 167]

Suhwan Kim, Marios C. Papaefthymiou

In dynamic logic families that rely on energy recovery to achieve low energy dissipation, the flow of data through cascaded gates is controlled using multi-phase clocks. Consequently, these families require multiple clock generators and can exhibit increased energy consumption on their clock distribution networks. Moreover, they are not attractive for high-speed design due to clock skew management problems. In this paper, we present TSEL, the first energy-recovering logic family that operates with a single-phase clocking scheme. TSEL outperforms previous energy-recovering logic families in terms of energy efficiency and operating speed. In HSPICE simulations with a standard 0.5 um technology from MOSIS, pipelined carry-lookahead adders in TSEL function correctly for operating frequencies exceeding 280 MHz. For operating frequencies above 80 MHz, they dissipate considerably less energy per operation than alternative implementations of the same adder architecture in other energy-recovering logic families. In comparison with their CMOS counterparts, the TSEL adders dissipate about half as much energy at 280 MHz. Our results indicate that TSEL is an excellent candidate for high speed and low power VLSI system design.

Session T2: System Level Power Issues

Session Chair: Renu Mehra
Associate Chair: Maurizio Damiani

T2.1 System-Level Power Estimation and Optimization [p. 173]

Luca Benini, Robin Hodgson, Polly Siegel

Most work to date on power reduction has focused on the component level, not on the system level. In this paper, we propose a framework for describing the power behavior of system-level designs. The model consists of a set of resources, an environmental workload specification, and a power management policy, which serves as the heart of the system model. We map this model to a simulation-based framework to obtain an estimate of the system's power dissipation. Accompanying this, we propose an algorithm to optimize power management policies. The optimization algorithm can be used in a tight loop with the estimation engine to derive new power-management policy algorithms for a given system-level description. We tested our approach by applying it to a real-life low-power portable design, achieving a power estimation accuracy of ~10%, and a 23% reduction in power after policy optimization.

T2.2 Memory Modeling for System Synthesis [p. 179]

Sari L. Coumeri, Donald E. Thomas

We present our methodology for developing models of on-chip SRAM memory organizations. The models were created to enable the quick evaluation of energy, area, and performance of different memory configurations considered during synthesis. The models are defined in terms of parameters, such as size and mode of operation, which are known at synthesis time. Our methodology does not require knowledge of the underlying memory circuitry and provides models with average percentage errors within 8%. We found that only 10 different memories from a large span of possible memory sizes are needed to obtain reasonably accurate models, with average errors within 15%. We further use these models to evaluate different low power memory organizations and have seen energy reductions of up to 88%. In this paper we present our modeling methodology, discuss the important aspects in developing the models, and show results of using the models in evaluating low power memory organizations.

T2.3 Monitoring System Activity for OS-Directed Dynamic Power Management [p. 185]

Luca Benini, Alessandro Bogliolo, Stefano Cavallucci, Bruno Riccò

In this paper we describe a workload monitoring system that has been specifically designed for supporting dynamic power management in personal computers with tight power constraints (such as laptop or notebook computers). Our monitoring system is minimally intrusive, and has negligible impact on system activity. Moreover, it can be used both for on-line system monitoring and off-line data collection. We used our monitoring tool to collect data on the usage of system resources (disks, CPU, keyboard and mouse) for a laptop computer, under several workload conditions. Our analysis shows that resource usage is strongly resource and workload dependent, and that on-line usage monitoring capability is a critical issue for the implementation of effective power management policies.

Session T3: Variable Voltage and Analog Techniques

Session Chair: Christian Enz
Associate Chair: Venu Gopinathan

T3.1 A Reconfigurable Dual Output Low Power Digital PWM Power Converter [p. 191]

Abram Dancy, Anantha Chandrakasan

This versatile power converter controller provides dual outputs at a fixed switching frequency and can regulate either output voltage or target system delay (using an external L-C filter). In the voltage regulation mode, the output voltage is monitored with an A/D converter, and the feedback compensation network is implemented digitally. The generation of the PWM signal is done with a hybrid delay line/counter approach, which saves power and area relative to previous implementations. Power devices are included on chip to create the two independently regulated output PWM signals. The key features of this design are its low power dissipation, reconfigurability, use of either delay or voltage feedback, and multiple outputs.

T3.2 Voltage Scheduling Problem for Dynamically Variable Voltage Processors [p. 197]

Tohru Ishihara, Hiroto Yasuura

This paper presents a model of a dynamically variable voltage processor and basic theorems for power-delay optimization. A static voltage scheduling problem is also proposed and formulated as an integer linear programming (ILP) problem. In the problem, we assume that a core processor can vary its supply voltage dynamically, but can use only a single voltage level at a time. For a given application program and a dynamically variable voltage processor, a voltage schedule that minimizes energy consumption under an execution time constraint can be found.
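The scheduling problem described in this abstract can be illustrated with a minimal sketch. It is not the authors' ILP formulation: an exhaustive search stands in for the solver, and the operating points, the energy model (cycles times V squared) and the task sizes are all hypothetical values chosen for illustration.

```python
from itertools import product

# Hypothetical (supply voltage in V, clock in MHz) operating points.
# These numbers are assumptions for illustration, not from the paper.
LEVELS = [(5.0, 50.0), (4.0, 40.0), (2.5, 25.0)]


def schedule(task_mcycles, deadline_s, levels=LEVELS):
    """Pick one voltage level per task, minimizing energy under a deadline.

    Exhaustive search stands in for an ILP solver. Energy per task is
    modeled as cycles * V^2 (arbitrary units) and time as Mcycles / MHz.
    Returns (energy, time, voltages), or None if nothing meets the deadline.
    """
    best = None
    for choice in product(levels, repeat=len(task_mcycles)):
        time = sum(c / f for c, (_, f) in zip(task_mcycles, choice))
        if time > deadline_s:
            continue  # violates the execution time constraint
        energy = sum(c * v * v for c, (v, _) in zip(task_mcycles, choice))
        if best is None or energy < best[0]:
            best = (energy, time, tuple(v for v, _ in choice))
    return best


if __name__ == "__main__":
    # Three tasks of 100, 200 and 300 Mcycles with a 16 s deadline.
    print(schedule([100, 200, 300], deadline_s=16.0))
```

The search grows exponentially with the number of tasks, which is one reason a real tool would hand the same objective and constraint to an ILP solver instead.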

T3.3 On the Optimum Design of Regulated Cascode Operational Transconductance Amplifiers [p. 203]

Thomas Burger, Qiuting Huang

An optimal design procedure to achieve minimum power consumption for a given technology and gain bandwidth is presented. Regulated cascode gain enhancement is used to ensure sufficient DC gain with minimum gate length transistors. To validate the approach, five folded cascode OTAs have been implemented, spanning a bias range of 1 uA to 10 mA, with measured unity-gain bandwidths within 20% of the designed value. For 17 mW at 3 V, a 0.5 um CMOS OTA achieves 630 MHz with 51 degree phase margin. The method has been applied in the design of a 3rd order ΔΣ modulator for GSM receivers. The modulator consumes 2.8 mW at 3 V and has a dynamic range of 86 dB for a 100 kHz input signal bandwidth.

Session T4: Logic Synthesis for Low Power

Session Chair: George Stamoulis
Associate Chair: Sarma Vrudhula

T4.1 Low Power Logic Synthesis under a General Delay Model [p. 209]

Unni Narayanan, Peichen Pan, C. L. Liu

Till now most efforts in low power logic synthesis have concentrated on minimizing the total switching activity of a circuit under a zero delay model. This simplification ignores the effects of glitch transitions which may contribute as much as 30% of the total power consumption of a circuit. Hence, low power logic synthesis techniques which optimize power under a zero delay model are often not successful in attaining "real" power savings as measured under a more accurate general delay model. In practice, to accurately estimate the switching activity in a circuit under a general delay model can be computationally expensive. Hence, to repeatedly call accurate but slow power estimation tools to direct the synthesis flow is not a viable approach in the design of low power synthesis tools. In this paper we take advantage of a fast method for estimating the total switching activity in a circuit under a general delay model to synthesize low power circuits. Specifically, we use the approximation as a basis for algorithms that solve two problems: (1) low power technology decomposition of gates under a general delay model (2) low power retiming of sequential circuits under a general delay model.

T4.2 Local Transformation Techniques for Multi-Level Logic Circuits Utilizing Circuit Symmetries for Power Reduction [p. 215]

Ki-Seok Chung, C. L. Liu

In this paper, we present several optimization techniques for power reduction utilizing circuit symmetries. There are four kinds of symmetries that we detect in a given circuit implementation. First, we propose an algorithm for detecting the four different types of symmetries in a given circuit implementation of a Boolean function. Several re-synthesis techniques utilizing such symmetries are proposed. These techniques enable us to optimize power consumption and delay with no (or very little) area overhead. We have carried out experiments on MCNC benchmark circuits to demonstrate the efficiency of the proposed techniques. The average power reduction is 14% with little or no area and/or delay overhead.

T4.3 A Power Optimization Method Considering Glitch Reduction by Gate Sizing [p. 221]

Masanori Hashimoto, Hidetoshi Onodera, Keikichi Tamaru

We propose a power optimization method considering glitch reduction by gate sizing. Our method reduces not only the amount of capacitive and short-circuit power consumption but also the power dissipated by glitches, which has not been exploited previously. In the optimization method, we improve the accuracy of the statistical glitch estimation method and devise a gate sizing algorithm that utilizes perturbations for escaping a bad local solution. The effect of our method is verified experimentally using 12 benchmark circuits with a 0.5 um standard cell library. Gate sizing reduces the number of glitch transitions by 38.2% on average and by 63.4% at maximum. This results in the reduction of total transitions by 12.8% on average. When the circuits are optimized for power without delay constraints, the power dissipation is reduced further from that of the minimum-sized circuits, by 7.4% on average and by 15.7% at maximum.

Session T5: Circuit-Level Power Analysis and Estimation

Session Chair: Suresh Rajgopal
Associate Chair: Chi-Ying Tsui

T5.1 A Unified Approach in the Analysis of Latches and Flip-Flops for Low-Power Systems [p. 227]

Vladimir Stojanovic, Vojin Oklobdzija, Raminder Bajwa

In this paper we propose a set of rules for consistent estimation of the real performance and power features of latch and flip-flop structures. A new simulation and optimization approach is presented, targeting both high-performance and power budget issues. The analysis approach reveals the sources of performance and power consumption bottlenecks in different design styles. Certain misleading parameters have been properly modified and weighted to reflect the real properties of the compared structures. Furthermore, the results of the comparison of representative latches and flip-flops illustrate the advantages of our approach and the suitability of different design styles for low-power and high-performance applications.

T5.2 Estimation of Maximum Power Supply Noise for Deep Sub-Micron Designs [p. 233]

Yi-Min Jiang, Kwang-Ting Cheng, An-Chang Deng

We propose a new technique for generating a small set of patterns to estimate the maximum power supply noise of deep sub-micron designs. We first build the charge/discharge current and output voltage waveform libraries for each cell, taking power and ground pin characteristics, the power net RC and other input characteristics as parameters. Based on the cells' current and voltage libraries, the power supply noise of a 2-vector sequence can be estimated efficiently by a cell-level waveform simulator. We then apply the Genetic Algorithm based on the efficient waveform simulator to generate a small set of patterns producing high power supply noise. Finally, the results are validated by simulating the obtained patterns using a transistor level simulator. Our experimental results show that the patterns generated by our approach produce a tight lower bound on the maximum power supply noise.

T5.3 Estimation of Standby Leakage Power in CMOS Circuits Considering Accurate Modeling of Transistor Stacks [p. 239]

Zhanping Chen, Mark Johnson, Liqiong Wei, Kaushik Roy

Low supply voltage requires the device threshold to be reduced in order to maintain performance. Due to the exponential relationship between leakage current and threshold voltage in the weak inversion region, leakage power can no longer be ignored. In this paper we present a technique to accurately estimate leakage power by accurately modeling the leakage current in transistor stacks. The standby leakage current model has been verified by HSPICE. We demonstrate that the dependence of leakage power on primary input combinations can be accounted for by this model. Based on our analysis we can determine good bounds for leakage power in the standby mode. As a by-product of this analysis, we can also determine the set of input vectors which can put the circuits in the low-power standby mode. Results on a large number of benchmarks indicate that proper input selection can reduce the standby leakage power by more than 50% for some circuits.

T5.4 Separation and Extraction of Short-Circuit Power Consumption in Digital CMOS VLSI Circuits [p. 245]

Atila Alvandpour, Per Larsson-Edefors, Christer Svensson

In this paper, we present a new technique which indirectly separates and extracts the total short-circuit power consumption of digital CMOS circuits. We avoid a direct encounter with the complex behavior of the short-circuit currents. Instead, we separate the dynamic power consumption from the total power and extract the total short-circuit power. The technique is based on two facts: first, the short-circuit power consumption disappears at a V_dd close to V_T and, secondly, the total capacitance depends on supply voltage in a sufficiently weak way in standard CMOS circuits. Hence, the total effective capacitance can be estimated at a low V_dd. To avoid reducing V_dd below the specified forbidden level, a polynomial is used to estimate the power versus supply voltage down to V_T based on a small voltage sweep over the allowed supply voltage levels. The result shows good accuracy for the short-circuit current ranges of interest.
Keywords: Short-circuit current, Power consumption, Power estimation.
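The separation idea in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' procedure: the polynomial is a plain Lagrange interpolant through the sweep points, dynamic power is modeled as C * f * V_dd^2, and the "measurements" come from a synthetic toy model (`toy_total_power`, with made-up constants) rather than any real circuit.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the unique polynomial through the points (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def split_power(vdds, powers, v_t, freq_hz, vdd_nominal):
    """Split total power at vdd_nominal into dynamic and short-circuit parts.

    Extrapolates the swept power curve down to v_t, where short-circuit
    power is assumed to vanish, to estimate the effective capacitance.
    Returns (c_eff, p_dynamic, p_short_circuit).
    """
    p_at_vt = lagrange_eval(vdds, powers, v_t)
    c_eff = p_at_vt / (freq_hz * v_t ** 2)
    p_total = lagrange_eval(vdds, powers, vdd_nominal)
    p_dyn = c_eff * freq_hz * vdd_nominal ** 2
    return c_eff, p_dyn, p_total - p_dyn


# Synthetic toy model used only to exercise the sketch (assumed values).
C_TRUE = 2e-12   # effective capacitance in F
FREQ = 1e8       # switching frequency in Hz
V_T = 0.7        # threshold voltage in V
K_SC = 1e-5      # short-circuit coefficient in W/V^3


def toy_total_power(v):
    """Dynamic term plus a cubic short-circuit term that vanishes at V_T."""
    return C_TRUE * FREQ * v ** 2 + K_SC * max(v - V_T, 0.0) ** 3


if __name__ == "__main__":
    sweep = [2.0, 2.4, 2.8, 3.3]  # allowed supply voltages in V
    print(split_power(sweep, [toy_total_power(v) for v in sweep],
                      V_T, FREQ, vdd_nominal=3.3))
```

Because the toy model is itself a cubic in the supply voltage, a four-point interpolant recovers it exactly; with measured data the extrapolation error would depend on the sweep range and the polynomial order.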

Session T6: Low-Power Design for Application Specific Processors

Session Chair: Naresh Shanbhag
Associate Chair: Mary Jane Irwin

T6.1 Decorrelating (DECOR) Transformations for Low-Power Adaptive Filters [p. 250]

Sumant Ramprasad, Naresh R. Shanbhag, Ibrahim N. Hajj

Presented in this paper are decorrelating transformations (referred to as DECOR transformations) to reduce the power dissipation in adaptive filters. The coefficients generated by the weight update block in an adaptive filter are passed through a decorrelating block such that fewer bits are required to represent the coefficients. Thus, the size of the arithmetic units in the filter (F-block) is reduced, thereby reducing the power dissipation. The DECOR transform is well suited for narrow-band filters because there is significant correlation between adjacent coefficients. In addition, the effectiveness of DECOR transforms increases with increase in the order of the filter and decrease in coefficient precision. Simulation results indicate reduction in power dissipation in the F-block ranging from 12% to 38% for filter bandwidths ranging from 0.15fs to 0.025fs (where fs is the sample rate).
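Why decorrelation can shrink coefficient word lengths is easy to see in a small sketch. The abstract does not specify the transform, so the code below assumes the simplest possible decorrelator, a first-order difference of adjacent coefficients (equivalent to multiplying the transfer function by 1 - z^-1); the coefficient values are hypothetical.

```python
def signed_bits(values):
    """Smallest two's-complement word length that holds every value."""
    def width(v):
        return (v.bit_length() if v >= 0 else (-v - 1).bit_length()) + 1
    return max(width(v) for v in values)


def decorrelate(coeffs):
    """First-order difference: coefficients of H(z) * (1 - z^-1)."""
    padded = [0] + list(coeffs) + [0]
    return [padded[i + 1] - padded[i] for i in range(len(padded) - 1)]


def recorrelate(diffs):
    """Undo decorrelate() with a running sum (drops the trailing zero)."""
    out, acc = [], 0
    for d in diffs[:-1]:
        acc += d
        out.append(acc)
    return out


# Hypothetical integer coefficients of a narrow-band low-pass filter;
# adjacent values are close, which is what the transform exploits.
COEFFS = [2, 9, 23, 41, 57, 64, 57, 41, 23, 9, 2]

if __name__ == "__main__":
    diffs = decorrelate(COEFFS)
    print("original bits:", signed_bits(COEFFS))
    print("decorrelated bits:", signed_bits(diffs))
    print("round trip ok:", recorrelate(diffs) == COEFFS)
```

In a filter, the narrower differenced coefficients would feed the multipliers, and the original response would have to be restored by a separate inverse stage; the running sum here only checks that no information is lost.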

T6.2 The Logarithmic Number System for Strength Reduction in Adaptive Filtering [p. 256]

John R. Sacha, Mary Jane Irwin

An important technique for reducing power consumption in VLSI systems is strength reduction, the substitution of a less-costly operation, such as a shift, for a more-costly operation, such as a multiplication. Using a logarithmic number representation provides several opportunities for strength reduction; in particular, multiplication is performed as the fixed-point addition of logarithms, and extracting a square root is implemented via a shift. These reductions occur transparently at the hardware level; consequently relatively little algorithmic modification is required, and they are readily applicable to adaptive filtering. For performing Givens rotations in the QR decomposition recursive least squares adaptive filter, logarithmic arithmetic is shown to compare favorably to other strength reduction techniques, such as CORDIC arithmetic, in terms of switched capacitance and numerical accuracy.
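The two strength reductions named in this abstract can be shown in a few lines. This is a minimal sketch, assuming positive values only (a real logarithmic number system also carries a sign bit and a zero flag) and a hypothetical 12-bit fractional precision for the stored logarithm.

```python
import math

FRAC_BITS = 12  # hypothetical fixed-point precision of the stored logarithm


def lns_encode(x):
    """Encode a positive real as a fixed-point base-2 logarithm."""
    if x <= 0:
        raise ValueError("this sketch handles positive values only")
    return round(math.log2(x) * (1 << FRAC_BITS))


def lns_decode(code):
    """Convert a fixed-point logarithm back to a real value."""
    return 2.0 ** (code / (1 << FRAC_BITS))


def lns_mul(a, b):
    """Multiplication becomes a fixed-point addition of logarithms."""
    return a + b


def lns_sqrt(a):
    """Square root becomes an arithmetic right shift by one bit."""
    return a >> 1


if __name__ == "__main__":
    print(lns_decode(lns_mul(lns_encode(8), lns_encode(4))))
    print(lns_decode(lns_sqrt(lns_encode(16))))
```

Values that are not powers of two pick up a small rounding error set by `FRAC_BITS`, which is the numerical accuracy trade-off the abstract refers to.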

T6.3 Low-Power Architecture of the Soft-Output Viterbi Algorithm [p. 262]

David Garrett, Mircea Stan

This paper investigates the low power implementation issues of the soft-output Viterbi algorithm (SOVA), a building block for turbo codes. By briefly explaining the theory of turbo codes, and by reviewing several of the decoding algorithms, we develop the computational requirements for a SOVA implementation, and ultimately develop an architecture that completes those computations with reduced power consumption. The architecture builds on previous work on the Viterbi and Soft-Output Viterbi algorithms, and incorporates a novel orthogonal access memory structure, which provides parallel access across sequentially received data.
Keywords: SOVA, turbo codes, VA, low power.

T6.4 Low Power Methodology and Design Techniques for Processor Design [p. 268]

J. Patrick Brennan, Alvar Dean, Stephen Kenyon, Sebastian Ventrone

IBM's ASIC design methodology is used to develop a low power microprocessor for the mobile (battery powered) marketplace. The design called for a reduction of active power by a factor of 10 from an estimate of a product designed in a standard 3 volt ASIC design system. An overview of the design methodology and some of the innovative power reduction techniques are presented.

Invited Talks : Session W0

Session Chair: Chuck Traylor

W0.1 Power Distribution in High-Performance Design [p. 274]

Michael Benoit, Sandy Taylor, David Overhauser, Steffen Rochel

Power distribution design in high-performance chips is a task that is not eased through the application of power reduction techniques. Although the average power of a high-performance design can be reduced, the peak to average power current ratio of blocks increases as a result, aggravating the challenges faced prior to average power reduction. This paper discusses the power distribution design challenge: to reliably deliver a predictable voltage to all transistors under all operating conditions. Steps in power estimation, approaches to power distribution implementation, and verification of power distribution are reviewed. The myths versus reality of power distribution design in high-performance chips are provided.
Keywords: Power grid, Power distribution, IR drop

W0.2 Low-Power Miniaturized Information Display Systems [p. 279]

Michael Bolotski, Philip Alvelda

This paper discusses low power issues in the design of miniature information display devices built on silicon substrates.
Keywords: LCOS, microdisplay, power, field-sequential color

Session W1: Low-Power Memory

Session Chair: Bill Athas
Associate Chair: Dan Dobberpuhl

W1.1 Low-Power Embedded SRAM Macros with Current-Mode Read/Write Operations [p. 282]

Jinn-Shyan Wang, Po-Hui Yang, Wayne Tseng

The newly proposed SRAM performs both read and write operations in the current mode. Due to the current-mode operations, voltage swings at bit-lines and data-lines are kept very small during read and write. The AC power dissipation of bit-lines and data-lines can thus be saved efficiently. For an embedded SRAM macro used in an 8-bit µ-controller, the SRAM using the fully current-mode technique consumes only 30% of the power consumed by the SRAM with only current-mode read operation. Experimental results show good agreement with the simulation results and prove the feasibility of the new technique.

W1.2 A Three-Port Adiabatic Register File Suitable for Embedded Applications [p. 288]

Stephen Avery, Marwan Jabri

Adiabatic logic promises extremely low power consumption for those applications where slower clock rates are acceptable. However, there have been very few adiabatic memory designs, and any circuit of even moderate complexity requires some form of RAM. This paper presents a register file implemented entirely with adiabatic logic, and fabricated using a 1.2 um CMOS technology. Comparison with a conventional CMOS logic implementation, using both measured and simulated results, indicates significant power savings have been realised.

W1.3 A Low Power SRAM using Auto-Backgate-Controlled MT-CMOS [p. 293]

Koji Nii, Hiroshi Makino, Yoshiki Tujihashi, Chikayoshi Morishima, Yasushi Hayakawa, Hiroyuki Nunogami, Takahiko Arakawa, Hisanori Hamano

We have proposed a low power SRAM using an effective method called "ABC-MT-CMOS" [1]. It controls the backgates to reduce the leakage current when the SRAM is not activated (sleep mode) while retaining the data stored in the memory cells. We also adopted a "CSB Scheme" which clamps both the source lines of the memory cell array and the bit lines. We designed and fabricated test chips containing a 32K-bit gate array SRAM. The experimental results show that the leakage current is reduced to 1/1000 in sleep mode. The active power is 0.27 mW/MHz at 1 V, which is 1/12 that of a conventional SRAM operating at 3.3 V.

Session W2: High-Level Power Analysis and Optimization

Session Chair: Anand Raghunathan
Associate Chair: Joerg Henkel

W2.1 Fast High-Level Power Estimation for Control-Flow Intensive Designs [p. 299]

Kamal S. Khouri, Ganesh Lakshminarayana, Niraj K. Jha

In this paper, we present a power estimation technique for control-flow intensive designs that is tailored towards driving iterative high-level synthesis systems, where hundreds of architectural trade-offs are explored and compared. Our method is fast and relatively accurate. The algorithm utilizes the behavioral information to extract branch probabilities, and uses these in conjunction with switching activity and circuit capacitance information, to estimate the power consumption of a given architecture. We test our algorithm using a series of experiments, each geared towards measuring a different indicator. The first set of experiments measures the algorithm's accuracy when compared to the actual circuit power. The second set of experiments measures the average tracking index, and tracking index fidelity for a series of architectures. This index measures how well the algorithm makes decisions when comparing the relative power consumption of two architectures contending as low-power candidates. Results indicate that our algorithm achieved an average estimation error of 11.8% and an average tracking index of 0.95 over all examples.

W2.2 The Energy Complexity of Register Files [p. 305]

Victor Zyuban, P. Kogge

Register files (RF) represent a substantial portion of the energy budget in modern processors, and are growing rapidly with the trend towards wider instruction issue. The actual access energy costs depend greatly on the register file circuitry used. This paper compares various RF circuitry techniques for their energy efficiencies, as a function of architectural parameters such as the number of registers and the number of ports. The Port Priority Selection technique was found to be the most energy efficient. The dependence of register file access energy upon technology scaling is also studied. However, as this paper shows, it appears that none of these will be enough to prevent centralized register files from becoming the dominant power component of next-generation superscalar computers, and alternative methods for inter-instruction communication need to be developed. Split register file architecture is analyzed as a possible alternative.

W2.3 Power Exploration for Dynamic Data Types Through Virtual Memory Management Refinement [p. 311]

Julio L. da Silva Jr., Francky Catthoor, Diederik Verkest, Hugo De Man

In this paper we present our novel power exploration methodology for applications with dynamic data types. Our methodology is crucial to obtain effective solutions in an embedded (HW or SW) processor context. The contributions are twofold. First we define the complete search space for Virtual Memory Management (VMM) mechanisms in a structured way with orthogonal decision trees. Secondly we present our systematic methodology for exploration of the maximal power that takes into account characteristics of the application to heavily prune the search space guiding the choices of a VMM mechanism. Finally we demonstrate for two industrial examples that power can vary considerably depending on the VMM chosen. Moreover these experiments show the effectiveness of our exploration methodology.