We present a comprehensive methodology for the electrodynamic modeling of substrate noise coupling. A new, efficient method is introduced for calculating the Green's function; it accommodates arbitrary substrate doping profiles and thus facilitates substrate noise analysis using boundary element methods. In addition to a discussion of the application of the method and its validation in the context of substrate transfer resistance extraction, preliminary results from its application to frequency-dependent substrate noise modeling are also presented.
In mixed-signal designs, substrate noise originating in the digital part can seriously affect the functionality of the analog part. Accurately modeling the properties of the substrate as a noise propagator is therefore becoming ever more important. A model can be obtained through the Finite Element Method (FEM) or the Boundary Element Method (BEM). The FEM performs a full 3D discretization of the substrate, which makes it very accurate and flexible but also slow. The BEM discretizes only the contact areas on the boundary of the substrate, which makes it less flexible but significantly faster. A combination of BEM and FEM can be efficient when both flexibility and speed are needed. This paper briefly describes the BEM, the FEM, and their combination, but concentrates mainly on the theoretical validation of the combined method and on its experimental verification through implementation in the SPACE layout-to-circuit extractor and comparison with commercial BEM and FEM tools.
Full-wave analysis, based on rigorous solution of the differential or integral form of Maxwell's equations, is too slow for all but the smallest designs. Traditional on-chip extraction engines are, therefore, being pushed to extract inductance and provide accurate high-frequency interconnect modeling while maintaining computational efficiency and capacity. This paper describes further accuracy-improving enhancements to the commercial full-chip RLCK extraction engine Assura RLCX [1], based on the return-limited inductance formulation. Specifically, we incorporate substrate losses due to eddy currents and power-ground losses while, based on design-driven assumptions, avoiding explicit extraction of the power-ground network and the substrate. Results are validated on small test cases where comparison with a full-wave solution is practical.
Approaches to achieving low-power, high-speed VLSIs are described, with an emphasis on techniques that span multiple technology and design levels. To suppress leakage current in standby mode, Boosted Gate MOS (BGMOS), which couples the technology and circuit levels, is effective. To reduce power in active mode, VDD-hopping and VTH-hopping, cooperative approaches between circuit and software, are promising. Power consumed in the interconnect system can be reduced by a cooperative approach between application and layout, as in bus shuffling. Other low-power design approaches are also discussed.
Controlling VDD and VTH is essential for low-power, high-speed CMOS design. In this paper, it is shown that these two parameters can be treated by designers as objectives of design optimization to find better trade-offs between power and speed. A quantitative analysis of the trade-offs between power and speed is presented. Popular circuit techniques and design examples for controlling VDD and VTH are introduced. A simple theory for computing optimum multiple VDD's and VTH's is described. Scaling scenarios of variable and/or multiple VDD's and VTH's are discussed to indicate future technology directions.
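As a concrete illustration of the trade-off being optimized, a standard alpha-power-law model (our assumption, not the paper's exact formulation) makes the coupling between the two knobs explicit:

```latex
% Illustrative power/delay model. a, C, f, I_0, S, alpha denote activity,
% switched capacitance, frequency, leakage scale, subthreshold swing, and
% the velocity-saturation index -- all assumed, not taken from the paper.
\begin{aligned}
P(V_{DD},V_{TH}) &= a\,C\,V_{DD}^{2} f \;+\; V_{DD}\,I_{0}\,10^{-V_{TH}/S},\\
t_{d}(V_{DD},V_{TH}) &\propto \frac{C\,V_{DD}}{\bigl(V_{DD}-V_{TH}\bigr)^{\alpha}},
\qquad \alpha \approx 1.3 .
\end{aligned}
```

Lowering VDD cuts dynamic power quadratically but must be paired with a lower VTH to hold delay, which raises leakage exponentially; the optimum VDD/VTH pair balances these two sensitivities.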
This paper presents methods for efficient power minimization at the circuit and micro-architectural levels. The potential energy savings are strongly related to the energy profile of a circuit. These savings are obtained by using gate sizing, supply voltage, and threshold voltage optimization to minimize energy consumption subject to a delay constraint. True power minimization is achieved when the energy-reduction potentials of all tuning variables are balanced. We derive the sensitivity of energy to delay for each tuning variable, connecting its energy-saving potential to the physical properties of the circuit. This helps develop an understanding of optimization performance and identify the most efficient techniques for energy reduction. The optimizations are applied to examples that span typical circuit topologies, including inverter chains, SRAM decoders, and adders. At a delay 20% larger than the minimum, energy savings of 40% to 70% are possible, indicating that achieving peak performance is expensive in terms of energy. Energy savings of about 50% can be achieved without delay penalty by balancing sizes, supplies, and thresholds.
We propose a novel framework for multilevel routing that considers both routability and performance. The two-stage multilevel framework consists of coarsening followed by uncoarsening. Unlike previous multilevel routing, we integrate global routing, detailed routing, and resource estimation at each level of the framework, leading to more accurate routing-resource estimation during coarsening and thus facilitating solution refinement during uncoarsening. Further, the exact routing information obtained at each level makes our framework more flexible in dealing with various routing objectives (e.g., crosstalk and power). Experimental results show that our approach obtains significantly better routing solutions than previous works. For example, for a set of 11 commonly used benchmark circuits, our approach achieves 100% routing completion for all circuits, while the previous multilevel routing, three-level routing, and hierarchical routing complete routing for only 3, 0, and 3 circuits, respectively. Moreover, our router uses fewer routing layers. We have also performed experiments on timing-driven routing, with very promising results.
In this paper, we present several novel techniques that make the recently published multilevel routing scheme [19] more effective and complete. Our contributions include: (1) resource reservation for local nets during the coarsening process, (2) congestion-driven, graph-based Steiner tree construction during the initial routing and the refinement process, and (3) multi-iteration refinement considering the congestion history. The experiments show that each of these techniques helps to improve the completion rate considerably. Compared to [19], the new routing system reduces the number of failed nets by 2x to 18x, with less than 50% increase in runtime in most cases.
Routing is one of the most complex stages in the back-end design process. Simple routing algorithms based on two stages, global routing and detailed routing, do not offer appropriate opportunities to address problems arising from signal delay, crosstalk, and process constraints. An intermediate stage of track assignment between global and detailed routing proves to be an ideal place to address these problems: global routing information can be used to address them efficiently and to aid the detailed router in achieving wiring completion. In this paper we formulate routing as a three-stage process: global routing, track assignment, and detailed routing. We describe the intermediate track assignment problem and suggest an efficient heuristic for its solution. We introduce cost metrics to model basic effects arising from connectivity and discuss extensions that include signal integrity and process constraints. We propose a heuristic based on weighted bipartite matching as a core routine; to improve its performance, additional heuristics based on lookahead and segment splitting are also suggested. Experimental results highlight the efficacy of the track assignment stage in the routing process.
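A minimal sketch of the bipartite-matching core (the cost below is an assumed proxy, distance from the globally assigned track, not the paper's connectivity-based metrics):

```python
# Track assignment as weighted bipartite matching: segments in a routing
# panel are matched to tracks at minimum total cost.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_tracks(ideal_tracks, tracks):
    """ideal_tracks[i]: preferred track of segment i (from global routing)."""
    cost = np.abs(np.subtract.outer(ideal_tracks, tracks)).astype(float)
    seg_idx, trk_idx = linear_sum_assignment(cost)   # min-cost matching
    return dict(zip(seg_idx, (tracks[j] for j in trk_idx)))

# three segments preferring tracks 0, 0, 2 among four tracks:
print(assign_tracks([0, 0, 2], [0, 1, 2, 3]))        # e.g. {0: 0, 1: 1, 2: 2}
```

In the paper's setting the costs additionally model connectivity effects (and can be extended with signal-integrity terms); lookahead and segment splitting then repair cases a single matching cannot resolve.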
Design ECOs commonly happen in industry due to constraint or target changes from manufacturing, marketing, reliability, or performance. At each step, designers usually want to modify the existing solution incrementally and keep the design as close as possible to the existing one. In this paper, we address the PSO (Power rail - Signal wire Overlap) problem, which resolves overlaps between power rails and signal wires caused by changes to the power rail design on the top layer of a multiple-layer routing region. PSO problems are frequently caused by changes in the power delivery system or the package design. The new routing solution satisfies the following constraints: 1) The routing of power rails in the new design is kept unchanged. 2) Only the routing of the top two layers is changed. 3) Horizontal (vertical) signal wire segments on the top layer can only move up/down (left/right). At the same time, the new routing solution keeps the routing pattern unchanged. This requires: a) If one end point of a horizontal (vertical) wire segment on the top layer is a fixed pin, this segment cannot move. b) If the vertical (horizontal) projections of two horizontal (vertical) signal wire segments overlap, their up/down (left/right) relationship must not change. c) If two horizontal (vertical) segments belonging to different nets are on the same track, their left/right (up/down) relationship must not change as long as the two segments still exist in the new solution. 4) For each signal wire segment, the deviation (i.e., the difference between its new position and the old one) must not exceed a user-defined allowable deviation bound; different bounds can be set on different segments. We propose two algorithms to solve the PSO problem. Both are guaranteed to find a feasible solution whenever one exists. One is faster, while the other also strives to minimize the total deviation as well as the maximum deviation. According to time and quality requirements, users can choose the appropriate algorithm. For a set of industrial test circuits, we were able to remove all overlaps between power rails and signal wires with minimal wire deviation.
Solving a system of linear equations has been widely used to compute seeds for LFSR reseeding to compress test patterns. However, as chip size grows, solving linear equations requires a number of computations proportional to n^3. This paper proposes a new scan chain architecture and algorithm so that the order of computation is proportional to the number of scan cells in a chip. The new architecture is a methodology change that does not require complex Design-for-Testability (DFT) structures as proposed in previous techniques. Instead of solving linear equations, the proposed seed computation algorithm determines seeds for test vectors topologically. The compression ratio may be slightly lower than that of other approaches, but the proposed approach can handle larger designs in a reasonable amount of time. Computation analysis shows that, for a one-million-scan-cell design, if we assume the proposed technique takes 1 ms to compute seeds, techniques that solve linear equations would take more than 14 minutes.
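For context, the conventional approach solves a linear system over GF(2) relating seed bits to the care bits of a test cube, and the Gaussian elimination is where the n^3 cost comes from. A minimal sketch of that baseline (illustrative only; the paper replaces it with a topological method):

```python
# Baseline seed computation: solve A x = b over GF(2), where row i of A
# maps the seed bits to the i-th specified (care) bit of the test cube.
import numpy as np

def gf2_solve(A, b):
    A = np.array(A, dtype=np.uint8) % 2
    b = np.array(b, dtype=np.uint8) % 2
    rows, cols = A.shape
    pivot_cols, r = [], 0
    for c in range(cols):
        hit = next((i for i in range(r, rows) if A[i, c]), None)
        if hit is None:
            continue
        A[[r, hit]], b[[r, hit]] = A[[hit, r]].copy(), b[[hit, r]].copy()
        for i in range(rows):            # eliminate column c from other rows
            if i != r and A[i, c]:
                A[i] ^= A[r]
                b[i] ^= b[r]
        pivot_cols.append(c)
        r += 1
        if r == rows:
            break
    if b[r:].any():                      # inconsistent system: no seed exists
        return None
    x = np.zeros(cols, dtype=np.uint8)   # free variables fixed to 0
    for row, col in enumerate(pivot_cols):
        x[col] = b[row]
    return x
```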
We provide a definition of undetectable faults in partial scan circuits under a test application scheme where a test consists of primary input vectors applied at-speed between scan operations. We also provide sufficient conditions for a fault to be undetectable under this test application scheme. We present experimental results on finite-state machine benchmarks to demonstrate the effectiveness of these conditions in identifying undetectable faults.
This work presents several new techniques for enhancing the performance of deterministic test pattern generation for VLSI circuits. The techniques introduced are called dynamic decision ordering, conflict-driven recursive learning, and conflict learning. An important feature shared by all these techniques is that they are triggered by the occurrence of a conflict in the generation of tests; hence, they are active neither all the time nor for all faults. This feature allows the ATPG system that uses these techniques to resolve hard-to-resolve faults with far fewer backtracks while leaving the system as efficient as before in the absence of conflicts. We have incorporated these techniques into a commercial D-algorithm-based ATPG tool. The experimental results on full-scan versions of the ITC'99 benchmark circuits demonstrate improvements in both the number of aborted faults and the test generation time.
In current industrial practice, critical path selection is an indispensable step for AC delay test and timing validation. Traditionally, this step relies on the construction of a set of worst-case paths based upon discrete timing models. The assumption of discrete timing models can be invalidated by delay effects in the deep-submicron domain, where timing defects and process variation are statistical in nature. In this paper, we study the problem of optimizing critical path selection under both fixed-delay and statistical-delay assumptions. With a novel problem formulation and new theoretical results, we prove that the problem is computationally intractable in both cases. We then discuss practical heuristics and their theoretical performance bounds, and demonstrate that among all heuristics under consideration, only one is theoretically feasible. Finally, we provide consistent experimental results based upon defect-injected simulation using an efficient statistical timing analysis framework.
This paper presents a way of encoding certain kinds of dynamic reconfiguration behaviour in the interface portion of circuit descriptions. This has many advantages: the user of a reconfigurable circuit has some knowledge about the reconfigurable interface of the circuit; static analysis tools can make better decisions about how to schedule virtual hardware; and, most importantly, the compiler can automatically synthesize the required interface between the reconfigurable portions of the system and the regular portions of the design. Several existing models of dynamic reconfiguration from the literature are captured using our type system extension based on sum types. This is especially important in System-on-Chip (SoC) contexts where a reconfigurable IP block may have to communicate over a non-trivial IP bus like CoreConnect™.
Interconnects (wires, buffers, clock distribution networks, multiplexers, and busses) consume a significant fraction of total circuit power. In this work, we demonstrate the importance of optimizing on-chip interconnects for power during high-level synthesis and present a methodology to integrate interconnect power optimization into high-level synthesis. Our binding algorithm not only reduces power consumption in functional units and registers in the resultant register-transfer-level (RTL) architecture, but also optimizes interconnects for power, taking physical design information into account. To estimate interconnect power accurately for deep-submicron (DSM) technologies, wire coupling capacitance is taken into consideration. We observed significant spurious (i.e., unnecessary) switching activity in the interconnects and propose techniques to reduce it. Compared to interconnect-unaware power-optimized circuits, our experimental results show that interconnect power can be reduced by 53.1% on average, while reducing overall power by an average of 26.8% with 0.5% area overhead. Compared to area-optimized circuits, the interconnect power reduction is 72.9% and the overall power reduction is 56.0%, with 44.4% area overhead.
Predictability is the quantified form of accuracy. We propose a predictability-driven design methodology; the novelty lies in defining and using the idea of predictability. To illustrate the basic concepts, we focus on the low-power binding problem. The binding problem for low power was solved in [3], [5], but in the presence of inaccuracies, their claims of optimality are imprecise. Our experiments show that these inaccuracies can be as high as 33%. Our methodology reduces this unpredictability to as low as 11% with a minimal power penalty (7% on average).
We present an algorithm for simplifying the solution of conjunctive Boolean constraints over state and input variables, in the context of constrained random vector generation using BDDs. The basis of our approach is the extraction of "hold-constraints" from the constraint system. Hold-constraints are deterministic and trivially resolvable; in addition, they can be used to simplify the original constraints as well as refine the conjunctive partition. Experiments demonstrate significant reductions in the time and space needed for constructing the conjunction BDDs, and in the time spent in vector generation during simulation.
We address the verification of imprecise datapath circuits with sequential elements. Using the Arithmetic Transform (AT) and its extensions, we verify sequential datapath circuits of finite precision. An efficient formulation of precision verification is presented as a polynomial maximization search over Boolean inputs. Using a branch-and-bound search for the precision error and the block-level composition of ATs, we verify approximated, rounded, and truncated pipelined datapaths.
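The flavor of the search, in miniature: maximize a pseudo-Boolean polynomial by branch and bound. The bound used here (count every live positive term, and negative terms only when already forced) is a simple admissible bound of our own; the paper's AT-based bounds are tighter.

```python
# Branch-and-bound maximization of a polynomial over Boolean inputs.
# A polynomial is a dict {frozenset_of_vars: coefficient};
# frozenset() is the constant term.
def max_over_booleans(poly, variables):
    best = [float("-inf")]

    def upper_bound(assign):
        total = 0.0
        for mon, c in poly.items():
            vals = [assign.get(v) for v in mon]
            if 0 in vals:
                continue                 # term already killed
            if c > 0 or None not in vals:
                total += c               # optimistic for undecided positives
        return total

    def search(assign, rest):
        if not rest:
            best[0] = max(best[0], sum(c for mon, c in poly.items()
                                       if all(assign[v] == 1 for v in mon)))
            return
        if upper_bound(assign) <= best[0]:
            return                       # prune this subtree
        v, tail = rest[0], rest[1:]
        for bit in (1, 0):
            assign[v] = bit
            search(assign, tail)
        del assign[v]

    search({}, list(variables))
    return best[0]

# maximize 3x - 2xy + y over {0,1}^2:
print(max_over_booleans({frozenset({"x"}): 3, frozenset({"x", "y"}): -2,
                         frozenset({"y"}): 1}, ["x", "y"]))   # -> 3
```

In the paper's setting, the polynomial being maximized encodes the precision error (e.g., of a truncated or rounded datapath block) via its Arithmetic Transform.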
An essential problem in component-based design is how to compose components designed in isolation. Several approaches have been proposed for specifying component interfaces that capture behavioral aspects such as interaction protocols, and for verifying interface compatibility. Likewise, several approaches have been developed for synthesizing converters between incompatible protocols. In this paper, we introduce the notion of adaptability as the property that two interfaces have when they can be made compatible by communicating through a converter that meets specified requirements. We show that verifying adaptability and synthesizing an appropriate converter are two faces of the same coin: adaptability can be formalized and solved using a game-theoretic framework, and then the converter can be synthesized as a strategy that always wins the game. Finally we show that this framework can be related to the rectification problem in trace theory.
As technology scales, subthreshold leakage currents grow exponentially and become an increasingly large component of total power dissipation. CAD tools to help model and manage subthreshold leakage currents will be needed for developing ultra low power and high performance integrated circuits. This paper gives an overview of current research to control leakage currents, with an emphasis on areas where CAD improvements will be needed. The first part of the paper explores techniques to model subthreshold leakage currents at the device, circuit, and system levels. Next, circuit techniques such as source biasing, dual Vt partitioning, MTCMOS, and VTCMOS are described. These techniques reduce leakage currents during standby states and minimize power consumption. This paper also explores ways to reduce total active power by limiting leakage currents and optimally trading off between dynamic and leakage power components.
One of the bottlenecks in the recent movement of hardware synthesis from behavioral C programs is the difficulty of reasoning about runtime pointer values at compile time. The pointer analysis problem has been investigated in the compiler community for two decades, yielding efficient, polynomial-time algorithms for context-insensitive analysis. However, at the accuracy level desired for hardware synthesis, namely context- and flow-sensitive analysis, the time and space complexity of the best reported algorithms grows exponentially with program size. In this paper, we present a first step towards a new analysis technology that potentially leads to almost-linear time complexity and sub-linear space complexity even for the most accurate analysis. The key idea behind this efficiency is to implicitly encode the points-to relation in the Boolean domain, thereby capturing the procedure transfer function completely, compactly, and canonically. This represents a wide departure from traditional techniques, all of which explicitly capture the points-to relation using variations of the points-to graph, which must be re-evaluated for different calling contexts. Experiments with our first flow-insensitive algorithm on common benchmarks show promising results.
While previous compiler research indicates that significant improvements in energy efficiency may be possible if properly optimized code is used, the energy constraints under which a given application code should be optimized may not always be available at compile time. More importantly, these constraints may change dynamically during the course of execution. In this work, we present a dynamic recompilation/linking framework with which the energy behavior of a given application can be optimized while the application is executing. Our preliminary experiments indicate that large energy gains are possible through dynamic code recompilation/linking at the expense of a relatively small increase in execution time.
Partitioning an embedded system application among a microprocessor and custom hardware has been shown to improve the performance, power, or energy of numerous examples. The advent of single-chip microprocessor/FPGA platforms makes such partitioning even more attractive. Previous partitioning approaches have partitioned sequential program source code, such as C or C++. We introduce a new approach that partitions at the software binary level. Although source code partitioning is preferable from a purely technical viewpoint, binary-level partitioning provides several very practical benefits for commercial acceptance. We demonstrate that binary-level partitioning yields competitive speedup results compared to source-level partitioning, achieving an average speedup of 1.4 compared to 1.5 for eight benchmarks partitioned on a single-chip microprocessor/FPGA device.
Keywords: Hardware/software partitioning, synthesis, binary translation, decompilation, low power, assembly language, FPGA, codesign.
Net weighting for timing-driven placement has been popular in both industry and academia; it has various advantages, such as low complexity, high flexibility, and ease of implementation. Existing net weighting algorithms, however, are often ad hoc, and no provably good net weighting algorithm is known. In this paper, we present a novel net weighting algorithm based on the concept of path counting and apply it to timing-driven FPGA placement. To our knowledge, this is the first accurate all-path-counting algorithm. Experimental data shows that, compared with the weighting algorithm used in the state-of-the-art FPGA placement package VPR [1], the new algorithm reduces the longest path delay by up to 38.8% (15.6% on average), with no runtime overhead and only a 4.1% increase in total wirelength.
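A sketch of the all-path-counting idea (our generic reconstruction, not the paper's exact weighting): the number of source-to-sink paths through a timing edge is the product of forward and backward path counts at its endpoints.

```python
# Count source-to-sink paths through every edge of a timing DAG and use
# the normalized counts as net weights.
from collections import defaultdict

def path_count_weights(nodes_topo, edges):
    """nodes_topo: nodes in topological order; edges: list of (u, v)."""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
    fwd, bwd = {}, {}
    for n in nodes_topo:                       # paths from any input to n
        fwd[n] = sum(fwd[u] for u in pred[n]) or 1
    for n in reversed(nodes_topo):             # paths from n to any output
        bwd[n] = sum(bwd[v] for v in succ[n]) or 1
    total = sum(fwd[n] for n in nodes_topo if not succ[n])  # all full paths
    return {(u, v): fwd[u] * bwd[v] / total for u, v in edges}
```

Path counts grow exponentially with circuit depth, which Python's arbitrary-precision integers absorb; a placer can then derive a net's weight from the counts of its timing edges, scaled by criticality.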
Design hierarchy plays an important role in timing-driven placement for large circuits. In this paper, we present a new methodology for delay-budgeting-based timing-driven placement. A novel slack assignment approach is described, as well as its application to delay budgeting with design hierarchy information. The proposed timing-driven placement flow is implemented in a placement tool named Dragon (timing-driven mode) and evaluated using an industrial place-and-route flow. Compared to Cadence QPlace, timing-driven Dragon generates placement results with a shorter clock cycle and better routability.
In this paper we present multi-objective hMetis partitioning for simultaneous cutsize and circuit delay minimization. We change the partitioning process itself by introducing a new objective function that incorporates a truly path-based delay component for the most critical paths. To prevent semi-critical paths from becoming critical, the traditional slack-based delay component is also included in the cost function. The proposed timing-driven partitioning algorithm is built on top of the highly efficient hMetis algorithm. Simulation results show that a 14% average delay improvement can be obtained, and our algorithm allows a smooth trade-off between cutsize and delay.
This paper introduces a new hybrid ASIC/FPGA chip architecture that is being developed in collaboration between IBM and Xilinx, and highlights some of the design challenges this offers for designers and CAD developers. We will review recent data from both the ASIC and FPGA industries, including technology features, and trends in usage and costs. This background data indicates that there are advantages to using standard ASICs and FPGAs for many applications, but technical and financial considerations are increasingly driving the need for a hybrid ASIC/FPGA architecture at specific volume tiers and technology nodes. As we describe the hybrid chip architecture we will point out evolving tool and methodology issues that will need to be addressed to enable customers to effectively design hybrid ASIC/FPGAs. The discussion will highlight specific automation issues in the areas of logic partitioning, logic simulation, verification, timing, layout and test.
This paper discusses Voltage Islands, a system architecture and chip implementation methodology that can be used to dramatically reduce active and static power consumption in System-on-Chip (SoC) designs. As technology scales for increased circuit density and performance, the need to reduce power consumption grows in significance as designers strive to utilize the advancing silicon capabilities. The consumer product market further drives the need to minimize chip power consumption. Effective use of Voltage Islands for meeting SoC power and performance requirements, while meeting turnaround-time (TAT) demands, requires novel approaches throughout the design flow as well as special circuit components and chip powering structures. This paper outlines methods being used today to design Voltage Islands in a rapid-TAT product development environment, and discusses the industry EDA advances needed to create an industry-wide Voltage Island design capability.
Future high performance microprocessor design with technology scaling beyond 90nm will pose two major challenges: (1) energy and power, and (2) parameter variations. Design practice will have to change from deterministic design to probabilistic and statistical design. This paper discusses circuit techniques and design automation opportunities to overcome the challenges.
A novel circuit topology for inductive coupling between interconnecting wires is presented. The model is local, i.e., only coupling between neighboring wires is explicitly modeled. However, the topology accounts for long-range coupling by propagating the vector potential from one wire to the next. Examples of model calibration, both directly from layout and as model-order reduction of a given inductance matrix, are presented for simple wiring structures.
We develop INDUCTWISE, a robust, efficient, and accurate tool that integrates inductance extraction and simulation. This paper advances state-of-the-art inductance extraction and simulation techniques and contains two major parts. In the first part, the INDUCTWISE extractor, we show that the recently proposed inductance matrix sparsification algorithm, the K-method [1], despite its great efficiency benefits, has a major stability flaw; we provide both a counterexample and a remedy. A window selection algorithm is also presented to preserve the accuracy of the sparsification method. The second part, the INDUCTWISE simulator, demonstrates the efficiency of integrating the nodal analysis formulation with the improved K-method. Experimental results show that INDUCTWISE achieves over 250x speedup compared to SPICE3. The proposed sparsification algorithm accelerates the simulator by another 175x and speeds up the extractor by 23.4x within 0.1% error. INDUCTWISE can extract and simulate a 118K-conductor RKC circuit within 18 minutes. It has been well tested and released on the web for public use (http://vlsi.ece.wisc.edu/Inductwise.htm).
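To illustrate the kind of windowing involved (a generic sketch under our own simplifications, not INDUCTWISE's algorithm): for each conductor, invert only the small inductance sub-matrix on a window of nearby conductors and keep that row of the inverse, yielding a sparse approximation of K = L^-1.

```python
# Window-based sparsification of K = L^{-1} (illustrative sketch).
import numpy as np

def windowed_K(L, windows):
    """L: n x n inductance matrix; windows[i]: indices of conductors near i
    (must contain i). Returns a sparse (mostly-zero) approximation of K."""
    n = L.shape[0]
    K = np.zeros_like(L, dtype=float)
    for i in range(n):
        w = sorted(windows[i])
        inv = np.linalg.inv(L[np.ix_(w, w)])   # local inversion only
        K[i, w] = inv[w.index(i), :]           # keep row i's window entries
    return 0.5 * (K + K.T)                     # symmetrize the result
```

Note that symmetrizing does not by itself guarantee positive definiteness; preserving stability is exactly the issue the abstract flags, and it is what the paper's window selection and remedy address.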
The simulation of on-chip inductance using PEEC-based circuit analysis methods often requires the solution of a subproblem where an extracted inductance matrix must be multiplied by a current vector, an operation with a high computational cost. This paper presents a highly accurate technique, based on a precorrected-FFT approach, that speeds up this calculation. Instead of computing the inductance matrix explicitly, the method exploits the properties of the inductance calculation procedure while implicitly considering the effects of all of the inductors in the layout. An optimized implementation of the method has been applied to accurately simulate large industrial circuits with up to 121,000 inductors and nearly 7 billion mutual inductive couplings in about 20 minutes. Techniques for trading off the CPU time with the accuracy using different approximation orders and grid constructions are also illustrated. Comparisons with a block diagonal sparsification method in terms of accuracy, memory and speed demonstrate that our method is an excellent approach for simulating on-chip inductance in a large circuit.
This paper describes the similarities and differences between two widely publicized methods for analyzing oscillator phase behavior, presented in [3] and [6]. The two methods are closely related: while the one in [3] can be shown to be mathematically more exact, the approximate method in [6] is somewhat simpler, facilitating its use for analysis and design. In this paper, we show that, for stationary input noise sources, both methods produce equal results for the oscillator's phase noise behavior. When considering injection locking, however, the methods yield different results, with the approximation in [6] unable to predict the locking behavior. In general, when the input signal causing the oscillator phase perturbations is non-stationary, the exact model produces the correct results, while results obtained using the approximate model break down.
Circuit-level simulation of ΔΣ modulators is a time-consuming task (taking one or more days for meaningful results). While there is a great variety of techniques and tools that speed up the simulation of discrete-time (DT) ΔΣ modulators, there is no rigorous methodology, implemented in a tool, to efficiently simulate and design the continuous-time (CT) counterpart. Yet, given today's low-power, high-accuracy and/or very high-speed demands on A-to-D converters, designers are often forced to resort to CT ΔΣ topologies. In this paper, we present a method for the high-level simulation of continuous-time ΔΣ modulators that is based on behavioral models and exhibits the best trade-off between accuracy, speed, and extensibility compared to other possible techniques, which are reviewed briefly in this work. A user-friendly tool implementing this methodology is then presented. Non-idealities such as finite gain, finite GBW, and output impedance, as well as nonlinearities such as clipping and harmonic distortion and the important effect of jitter, are modeled. Finally, experiments were carried out using the tool, exploring important design trade-offs.
Fourier-envelope algorithms are an important component of the mixed-signal/RF verification toolbox. In this paper, we address the unpredictability and lack of robustness that have been reported for these algorithms. We show that the problem stems from fast oscillations in envelopes that are expected to be slowly varying, and demonstrate that this is related to the fact that the envelope equations are always stiff, whether or not the underlying system is. We show that careful choice of envelope initial conditions is necessary to obtain useful solutions, and propose two techniques for finding good initial conditions. Applying these, and solving the envelope equations with stiffly-stable numerical methods, we improve the robustness and reliability of Fourier-envelope methods. We illustrate the new methods with a direct-downconversion mixer circuit.
Shrinking process geometries and the increasing use of IP components in SoC designs give rise to new problems in routing and buffer insertion. A particular concern is that cross-chip routing will require multiple clock cycles. Another is the integration of independently clocked components. This paper explores simultaneous routing and buffer insertion in the context of single and multiple clock domains. We present optimal and efficient polynomial algorithms that can be used to estimate communication overhead for interconnect and resource planning in single and multi-clock domain systems. Experimental results verify the correctness and practicality of our approach.
As VLSI technology scales down, electromigration becomes one of the major concerns in high-performance IC design, for both power networks and signal interconnects. For a uniform-width metal interconnect, the current flowing through the driving point is much larger than that flowing through the fan-out point, since much of the current bypasses to ground through the parasitic capacitance. This causes the lifetime of the driving point to be much shorter than that of the fan-out point due to electromigration. To avoid breakdown at the driving point, wire sizing is an effective solution. We therefore present a wire shape for which the current density, and hence the lifetime, is uniform along the wire. SPICE simulation results confirm the uniformity of the current density of this wire shape. Under the same current density bound, we demonstrate that chip area and power consumption are significantly reduced for this wire shape compared to a uniform-width wire. The wire shape functions we derive are continuous; however, it is not necessary to reproduce the continuous shape ultra-accurately on silicon, since rounding the continuous shape to the nearest available litho width does not degrade the uniformity of the current density.
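To see why a non-uniform shape can equalize current density, consider a simplified distributed model in which the shunt (capacitive) current per unit length is proportional to the local width w(x) (our simplification; the paper's derivation uses the full parasitics). With metal thickness t and target density J_0:

```latex
% k: shunt current drawn per unit width per unit length (assumed constant)
\frac{dI}{dx} = -k\,w(x),\qquad
J(x) = \frac{I(x)}{t\,w(x)} \equiv J_0
\;\Longrightarrow\;
\frac{dI}{dx} = -\frac{k}{J_0\,t}\,I(x)
\;\Longrightarrow\;
w(x) = w(0)\,e^{-kx/(J_0 t)} .
```

That is, the width tapers continuously from the driver toward the fan-out, shrinking in proportion to the remaining current, consistent with the continuous shape functions the paper derives.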
We propose to introduce redundant interconnects for manufacturing yield and reliability improvement. By introducing redundant interconnects, the potential for open faults is reduced at the cost of an increased potential for short faults; overall, manufacturing yield and fault tolerance can be improved. We focus on a post-processing, tree augmentation approach which can be easily integrated in current physical design flows. Our contributions are as follows:
- We formulate the problem as a variant of the classical 2-edge-connectivity augmentation problem in which we take into account such practical issues as the wirelength increase budget, routing obstacles, and the use of Steiner points.
- We show that an optimum solution can always be found on the Hanan grid defined by the terminals and the corners of the feasible routing region.
- We give a compact integer program formulation which, for nets with up to 100 terminals, is solved in practical runtime by the commercial optimization package CPLEX.
- We give a well-scaling greedy algorithm which has practical runtime up to 1,000 terminals and comes on average within 1-2% of the optimum computed by CPLEX (see the sketch following this abstract).
- We give a comprehensive experimental study comparing the solution quality and runtime of our methods with the best reported methods for 2-edge-connectivity augmentation, including a sophisticated heuristic based on minimum-weight branchings [9] and a recent genetic algorithm [14].
Experiments on randomly generated and industry testcases show that our greedy augmentation method achieves a significant increase in reliability (as measured by the percentage of biconnected tree edges) with a very small increase in wirelength. For example, on 1,000-terminal nets the average percentage of biconnected tree edges is 34.19% for a wirelength increase of only 1%, and 87.73% for a wirelength increase of 20%. SPICE simulations on industry routed nets show that non-tree routing has the additional benefit of reducing maximum sink delay by an average of 28.26% compared to Steiner routing, and by an average of 3.72% compared to timing-optimized routing. SPICE simulations further imply that non-tree routing has smaller delay variation due to process variability.
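The promised sketch of a greedy augmentation loop (illustrative, not the paper's exact algorithm): repeatedly add the candidate non-tree edge that biconnects the most still-bridging tree edges per unit of added wirelength, within a wirelength budget. Candidate edges and their lengths are inputs here; the paper restricts them to the Hanan grid of the feasible routing region.

```python
import networkx as nx

def greedy_augment(tree, candidates, length, budget):
    """tree: nx.Graph routed tree; candidates: list of (u, v) non-tree edges;
    length[(u, v)]: wirelength of a candidate; budget: max added wirelength."""
    G = tree.copy()
    tree_edges = {frozenset(e) for e in tree.edges()}
    spent = 0.0
    while True:
        bridges = {frozenset(e) for e in nx.bridges(G)} & tree_edges
        best, best_gain = None, 0.0
        for u, v in candidates:
            l = length[(u, v)]
            if G.has_edge(u, v) or spent + l > budget:
                continue
            G.add_edge(u, v)               # tentatively add, measure healing
            healed = bridges - {frozenset(e) for e in nx.bridges(G)}
            G.remove_edge(u, v)
            gain = len(healed) / max(l, 1e-9)
            if healed and gain > best_gain:
                best, best_gain = (u, v), gain
        if best is None:
            return G                       # no improving edge fits the budget
        G.add_edge(*best)
        spent += length[best]
```

The fraction of tree edges that are no longer bridges of the augmented graph corresponds to the "biconnected tree edges" metric reported above.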
For many years, CMOS process scaling has allowed a steady increase in the operating frequency and integration density of integrated circuits. Only recently, however, have we reached a point where it takes several clock cycles for global signals to traverse a complex digital system such as a modern microprocessor. Thus, interconnect latency must be taken into account in current and future design tools at the architectural as well as the synthesis level. To this purpose, this work proposes a new latency-aware technique for the performance-driven concurrent insertion of flip-flops and repeaters in VLSI circuits. Overwhelming evidence is also presented showing an exponential increase in the number of pipelined interconnects with process scaling, for high-performance microprocessors as well as high-end ASICs. This increase signals the need for a radical change in current design methodologies to cope with this emerging problem.
As the scale of system integration continues to grow, on-chip communication is becoming the ultimate bottleneck of system performance and the primary determinant of system architecture. In this paper we propose a throughput-driven synthesis methodology for on-chip communication fabrics based on optimized bus models. Compared with traditional delay-driven, wire-by-wire planning methods, the throughput-driven methodology provides a feasible and accurate system-level solution that addresses delay and congestion problems simultaneously during early-phase design planning. Unlike conventional methods based on rather inaccurate RC models and simplistic delay metrics, in our methodology the communication fabrics are characterized in terms of realistic Partial Element Equivalent Circuits (PEEC) extracted from the multi-layer interconnects and transistor-level transient analysis via SPICE-like tools. The characterized models facilitate a flexible interconnect fabric optimization engine that can be embedded into a system planner for throughput-driven synthesis. Engineering trade-offs involving repeater area and interconnect power consumption are also considered as part of this methodology.
As technology advances towards billion-transistor systems, the cost of complex wire networks will require area-efficient wiring methodologies. This paper explores the trade-offs between wire latency, throughput, and area for deep-submicron (DSM) interconnect technologies. From basic physical models, optimal wire sizings for repeater networks are rigorously derived and compared to HSPICE simulations. Key case studies from 250nm to 70nm technologies reveal that significant wire area reduction (20-50%) can be achieved with optimal wire sizing to maximize the throughput per unit wire area.
With increasing design sizes and the adoption of System-on-a-Chip (SoC) methodology, design synthesis and test automation tools are hitting capacity and performance bottlenecks. Currently, hierarchical synthesis flows for large designs lack complete design-for-test (DFT) support. In this paper, we present a solution, involving the introduction of test models into a traditional DFT synthesis flow, that we term Hierarchical DFT Synthesis (HDS). We discuss the use of Core Test Language (CTL) based test models combined with physical and timing models to provide a complete flow for chip-level DFT. In doing so we address some challenges the new flow presents, such as Design Rule Checking (DRC), DFT architecting, and optimization. We describe methods to overcome these challenges, thereby presenting a new methodology for handling complex next-generation designs.
We present a new method of built-in self-test (BIST) for sequential circuits and systems-on-a-chip (SoC) using characteristic faults and circuit-specific spectral information in the form of one or more Hadamard coefficients. The Hadamard coefficients are extracted from the test sequences for a small set of characteristic faults of the circuit. We show that detecting these characteristic faults is sufficient to detect the vast majority of the remaining faults in the circuit. The small number of characteristic faults allows us to reduce the number of coefficients necessary for BIST. State relaxation is performed on the compacted test sequences to further reduce the spectral noise. Since we target only a very small number of characteristic faults, the execution times for computing the spectra are greatly reduced. Our experimental results show that the new method achieves high BIST coverage with lower computational effort and storage, using very few characteristic faults.
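The spectral extraction step itself is standard: a fast Walsh-Hadamard transform of a ±1 encoding of a scan input's bit-stream, whose largest-magnitude coefficients pick out the dominant Hadamard basis vectors. A minimal sketch (how the coefficients then drive BIST pattern generation follows the paper):

```python
# In-place fast Walsh-Hadamard transform (length must be a power of two).
def fwht(bits):
    x = [1 if b else -1 for b in bits]   # 0/1 stream -> +/-1 encoding
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            for j in range(i, i + h):
                x[j], x[j + h] = x[j] + x[j + h], x[j] - x[j + h]
        h *= 2
    return x

print(fwht([1, 0, 1, 1, 0, 1, 1, 1]))    # 8-point spectrum of a bit-stream
```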
Scan-based testing methodologies remedy the testability problem of sequential circuits; yet they suffer from prolonged test time and excessive test power due to numerous shift operations. The high density of the unspecified bits in test data enables the utilization of the test response data captured in the scan chain for the generation of the subsequent test stimulus, thus reducing both test time and test data volume. The proposed scan-based test scheme accesses only a subset of scan cells for loading the subsequent test stimulus while freezing the remaining scan cells with the response data captured, thus decreasing the scan chain transitions during shift operations. The experimental results confirm the significant reductions in test application time, test data volume and test power achieved by the proposed scan-based testing methodology.
This paper describes an optimization technique able to optimize a complete wireless receiver architecture in a reasonable amount of time. The optimizer alternates between SPICE-level optimizations of simple building blocks and a full-architecture optimization of the whole, based on accurate models of the building blocks. The models of the building blocks are interpolated over the data points acquired in the SPICE-level simulations. The technique has been applied to the optimization of an architecture for a GPS receiver; the optimal design has been implemented in a standard 0.25µm CMOS process.
The relentless move toward single-chip integration of RF, analog, and digital blocks results in challenging noise coupling effects that must be controlled. In this paper, we propose a methodology that uses a suite of commercial tools in combination with a high-speed extractor based on an innovative semi-analytical method to deal with noise coupling problems and enable RF designers to achieve first-silicon success for their chips. The integration of the methodology into a typical RF design flow is illustrated, and its successful application to the single-chip integration of a transceiver is demonstrated.
In this paper we present a method for the design of analog-to-digital converters (ADCs). This method computes the sizes of the different components (transistors, capacitors, etc.) in a predefined ADC topology so that the design specifications are met in the desired process technology.
The method is based on formulating the ADC design constraints, such as specifications on power, signal-to-noise ratio (SNR), area, and sampling frequency, in special convex form in terms of the component sizes of the ADC and intermediate design variables. More specifically, we cast the problem of sizing the components of the ADC as a geometric program, so all design constraints are formulated as posynomial inequality or monomial equality constraints. Very efficient numerical algorithms are then used to solve the resulting geometric program and to compute the component sizes of an ADC that meets the desired specifications. The synthesis method is fast and determines the globally optimal design; in particular, the final solution is completely independent of the starting point (which can even be infeasible), and infeasible specifications are unambiguously detected.
This paper introduces the concept of hierarchical problem formulation within a geometric programming framework. This modular formulation allows a high degree of re-use of the ADC posynomial model.
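A toy geometric program in the same spirit, solved with CVXPY's geometric-programming mode (the variables, constants, and constraint forms below are invented for illustration; the paper's ADC model is far richer):

```python
# Minimize power subject to posynomial/monomial constraints.
import cvxpy as cp

gm = cp.Variable(pos=True)        # transconductance
C = cp.Variable(pos=True)         # sampling capacitor
power = 2.0 * gm                  # monomial power model, P ~ gm
constraints = [
    1e9 * C / gm <= 1.0,          # settling: bandwidth gm/C >= 1e9 rad/s
    4e-12 / C <= 1.0,             # kT/C noise: C >= 4 pF for target SNR
    C <= 10e-12,                  # area proxy: cap no larger than 10 pF
]
prob = cp.Problem(cp.Minimize(power), constraints)
prob.solve(gp=True)               # log-log convex (geometric) program
print(gm.value, C.value, prob.value)
```

Because a geometric program is convex after a log-log change of variables, the solver either returns the global optimum or certifies infeasibility, which is exactly the starting-point independence the abstract highlights.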
Modeling the exponentially varying current distributions in conductor interiors associated with high-frequency interconnect behavior causes a rapid increase in the computation time and memory required even by recently developed fast electromagnetic analysis programs. In this paper we describe a procedure to numerically generate a set of basis functions which efficiently represent conductor current variation, thus improving solver efficiency. The method is based on solving a sequence of template problems and is easily generalized to arbitrary conductor cross-sections. Results are presented to demonstrate that the numerically computed basis functions are seven to twenty times more efficient than the commonly used piecewise-constant basis functions.
Categories & Subject Descriptors: B.7.2 Simulation, B.8.2 Performance Analysis & Design Aids, I.6 Simulation & Modeling.
General Terms: Algorithms, Performance.
Keywords: Skin Effect, Proximity Effect, Parasitic extraction, Interconnect analysis.
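The flavor of the construction, with the template solves abstracted away (a generic sketch, not the paper's procedure): stack the current distributions from the template problems as columns of a snapshot matrix and keep the leading singular vectors as the reduced basis.

```python
import numpy as np

def current_basis(snapshots, tol=1e-3):
    """snapshots: (n_filaments, n_templates) template current solutions.
    Keep singular vectors until the discarded energy falls below tol."""
    U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(energy, 1.0 - tol)) + 1
    return U[:, :k]          # orthonormal basis with k << n_filaments
```

Expanding each conductor's current in this basis replaces many piecewise-constant filament unknowns with a handful of coefficients, which is where the reported 7-20x efficiency gain comes from.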
We investigate appropriate regimes for transmission-line propagation of signals on digital integrated circuits. We start from exact solutions to the transmission line equations proposed by Davis and Meindl and make appropriate modifications due to finite rise time; these affect the delay calculation and the hypotheses pertaining to the constancy of the electromagnetic parameters, and we study these effects in detail. To find the domain of physical variables where transmission-line behavior is feasible, we pose the problem as a nonlinear minimization problem in a space spanned by two continuous variables, with four parameters. From the resulting solutions, and employing monotonicity properties of the functional, we extract regimes of validity. These regimes happen to be commensurate with what is reachable and doable with today's leading technologies. We complete this study with a qualitative analysis of driver insertion in the presence of transmission lines. The resulting configurations are suitable for the development of an improved clock design discipline.
In this paper, we present a novel wire-duplication-based interconnect modeling technique. The proposed technique exploits the sparsity of the L^-1 matrix, where L is the inductance matrix, and constructs a sparse and stable equivalent RLC circuit by windowing the original inductance matrix. The model avoids matrix inversions and, most importantly, is more accurate and more efficient than many existing techniques.
The challenge of extending Moore's Law past the physical and economic barriers of present semiconductor technologies calls for novel nanoelectronic solutions. Circuits composed of mixed silicon semiconductors and nanoelectronics can provide a means for gradually switching technology paradigms. We suggest a design methodology to accompany this concept. Furthermore, we explore design tradeoffs for a nanoscale crossbar technology that supports CMOS/nano co-design.
Reversible or information-lossless circuits have applications in digital signal processing, communication, computer graphics and cryptography. They are also a fundamental requirement in the emerging field of quantum computation. We investigate the synthesis of reversible circuits that employ a minimum number of gates and contain no redundant input-output line-pairs (temporary storage channels). We prove constructively that every even permutation can be implemented without temporary storage using NOT, CNOT and TOFFOLI gates. We describe an algorithm for the synthesis of optimal circuits and study the reversible functions on three wires, reporting distributions of circuit sizes. Finally, in an application important to quantum computing, we synthesize oracle circuits for Grover's search algorithm, and show a significant improvement over a previously proposed synthesis algorithm.
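A small sketch of optimal synthesis by breadth-first search over the NOT/CNOT/TOFFOLI library on three wires (the gate encoding as permutations of the eight states is ours; the paper's synthesizer adds much stronger pruning). On three wires this library reaches every permutation of the eight states, so the search tabulates the size distribution mentioned above.

```python
from itertools import combinations

def gates(n=3):
    gs = {}
    for t in range(n):                                 # NOT(t)
        gs[f"NOT{t}"] = tuple(s ^ (1 << t) for s in range(1 << n))
    for c in range(n):                                 # CNOT(c -> t)
        for t in range(n):
            if c != t:
                gs[f"CNOT{c}{t}"] = tuple(
                    s ^ (1 << t) if s >> c & 1 else s for s in range(1 << n))
    for c1, c2 in combinations(range(n), 2):           # TOFFOLI(c1,c2 -> t)
        for t in range(n):
            if t not in (c1, c2):
                gs[f"TOF{c1}{c2}{t}"] = tuple(
                    s ^ (1 << t) if (s >> c1 & 1) and (s >> c2 & 1) else s
                    for s in range(1 << n))
    return gs

def bfs_all(n=3):
    gs = gates(n)
    ident = tuple(range(1 << n))
    dist = {ident: []}                  # permutation -> minimal gate list
    frontier = [ident]
    while frontier:
        nxt = []
        for p in frontier:
            for name, g in gs.items():
                q = tuple(g[p[s]] for s in range(1 << n))  # append gate g
                if q not in dist:
                    dist[q] = dist[p] + [name]
                    nxt.append(q)
        frontier = nxt
    return dist
```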
As the design of integrated MicroElectroMechanical Systems (MEMS) matures, there is an increasing need for verification of MEMS layouts. This requires a mixed-domain LVS (layout-versus-schematic) methodology capable of extracting an integrated schematic from the mixed-domain layout and verifying it against the designed schematic. This paper reports on a prototype implementation of MEMS LVS and a MEMS extractor which, in addition to reconstructing the extracted schematic, also captures the domain-specific parasitics in the individual devices. This schematic is then used by a custom schematic-versus-schematic comparator to match the connectivity of various elements between the designed and extracted schematics. Finally, simulation of the extracted schematic also helps in capturing the true behavior of the system.
Keywords: MEMS extraction, parasitics, MEMS LVS, verification, integrated MEMS
Schematic-based, lumped, parameterized behavioral modeling and simulation methodologies have become available since the emergence of analog HDLs. They greatly ease iterative hierarchical multi-domain simulation, which is critical to the design of MEMS. NODAS is one such tool, with models written in VerilogA and simulation performed within the Cadence framework. This paper focuses on several key modeling issues in NODAS, including schematic representation, element communication, linear, nonlinear, and multi-domain modeling, and extensibility to new physical effects, processes, and physical domains. A nonlinear beam model and an electrostatic gap model are discussed as examples. Comparison of simulations to finite element analyses and experimental data verifies the accuracy of the models and validates the simulation methodology.
Keywords: MEMS, schematic-based, lumped, parameterized, behavioral, nonlinear beam, electrostatic gap
This paper presents a novel enumerative approach, with provable and efficient pruning techniques, for dual threshold voltage (Vt) assignment at the transistor level. Since the use of low Vt may entail a substantial increase in leakage power, we formulate the problem as one of combined optimization for leakage-delay tradeoffs under Vt optimization and sizing. Based on an analysis of the effects of these two transforms on the delay and leakage, we justify a two-step procedure for performing this optimization. Results are presented on the ISCAS85 benchmark suite favorably comparing our approach with an existing sensitivity-based optimizer.
Due to increasing clock speeds, increasing design sizes, and shrinking technologies, it is becoming more and more challenging to distribute a single global clock throughout a chip. In this paper we study the effect of using a Globally Asynchronous, Locally Synchronous (GALS) organization for a superscalar, out-of-order processor, in terms of both power and performance. To this end, we propose a novel modeling and simulation environment for multiple-clock cores with static or dynamically variable voltages for each synchronous block. Using this design exploration environment, we assess the power/performance trade-offs available for Multiple Clock, Single Voltage (MCSV) as well as Multiple Clock, Dynamic Voltage (MCDV) cores. Our results show that MCSV processors are 10% more power efficient than single-clock, single-voltage designs, with a performance penalty of about 10%. By exploiting the flexibility of independent dynamic voltage scaling of the various clock domains, the power efficiency of GALS designs can be improved by 12% on average, and by up to 20% more in select cases. The power efficiency of MCDV cores becomes comparable to that of Single Clock, Dynamic Voltage (SCDV) cores, while being up to 8% better in some cases. Overall, MCDV cores consume 22% less power at an average 12% performance loss.
An effective way to compare logic techniques, logic families, or cell libraries is by means of power (or area) versus delay plots, since the efficiency of achieving a particular delay is of crucial significance. In this paper we describe a method of producing an optimized power versus delay curve for a combinational circuit. We then describe a method for comparing the relative merits of a set of power versus delay curves for a circuit, each generated with a different cell library. Our results indicate that very few combinational functions need to be in a cell library, at most 11. The power-delay points achieved by Design Compiler from Synopsys using the state-of-the-art Artisan Sage-X library compare unfavorably to our approach. In terms of minimum energy-delay product, our approach is superior by 79% on average. Our approach yields the same delay points with a 107% savings in power consumption, on average. We also show that the specified VDD for a process technology should only be used for the absolute fastest implementations of a circuit.
In this paper, we present Forge, an optimal algorithm for gate sizing using the Elmore delay model. The algorithm utilizes Lagrangian relaxation with a fast gradient-based pre-processing step that provides an effective set of initial Lagrange multipliers. Compared to the previous Lagrangian-based approach, Forge is considerably faster and does not have the inefficiencies due to difficult-to-determine initial conditions and constant factors. We compared the two algorithms on 30 benchmark designs, on a Sun UltraSparc-60 workstation. On average Forge is 200 times faster than the previously published algorithm. We then improved Forge by incorporating a slew-rate-based convex delay model, which handles distinct rise and fall gate delays. We show that Forge is 15 times faster, on average, than the AMPS transistor-sizing tool from Synopsys, while achieving the same delay targets and using similar total transistor area.
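A condensed sketch of the Lagrangian machinery on an inverter chain (our simplification to a single delay constraint with fixed-step subgradient updates; Forge handles general circuits, distinct rise/fall delays, and, crucially, a gradient-based initialization of the multipliers):

```python
# Minimize sum_i x_i + lam*(D(x) - T) for an inverter chain with
# Elmore-style stage delays d_i = (r/x_i)*load(i); outer loop updates lam.
import math

def size_chain(n, T, r=1.0, c=1.0, cL=5.0, iters=300, step=0.2):
    x = [1.0] * n                     # gate sizes (area cost = sum of sizes)
    lam = 1.0
    load = lambda i: c * x[i + 1] if i + 1 < n else cL  # cap seen by stage i
    for _ in range(iters):
        for i in range(n):            # closed-form coordinate minimization
            own = 1.0                 # d(area)/dx_i
            if i > 0:
                own += lam * r * c / x[i - 1]   # x_i also loads stage i-1
            x[i] = math.sqrt(lam * r * load(i) / own)   # stationary point
        D = sum(r / x[i] * load(i) for i in range(n))
        lam = max(1e-6, lam + step * (D - T))           # subgradient step
    return x, D

sizes, delay = size_chain(n=4, T=6.0)
```

A fixed-step subgradient loop like this converges slowly and is sensitive to the starting multiplier, which is precisely the inefficiency that Forge's gradient-based pre-processing of the initial Lagrange multipliers targets.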
In this paper, we present a novel sequence generator based on a Markov chain model. Specifically, we formulate the problem of generating a sequence of vectors with given average input probability p, average transition density d, and spatial correlation s as a transition matrix computation problem, in which the matrix elements are subject to constraints derived from the specified statistics. We also give a practical heuristic that computes such a matrix and generates a sequence of l n-bit vectors in O(nl + n^2) time. Derived from a strongly mixing Markov chain, our generator yields binary vector sequences with accurate statistics, high uniformity, and high randomness. Experimental results show that our sequence generator can cover more than 99% of the parameter space. Sequences of 2,000 48-bit vectors are generated in less than 0.05 seconds, with average deviations of the signal statistics p, d, and s equal to 1.6%, 1.8%, and 2.8%, respectively.
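For a single bit, the construction can be made fully explicit. Writing a = P(0→1) and b = P(1→0), stationarity gives p = a/(a+b), and the transition density is d = p·b + (1−p)·a = 2ab/(a+b); solving yields a = d/(2(1−p)) and b = d/(2p). A per-bit sketch follows (spatial correlation s, which the paper's transition-matrix construction handles, is ignored here):

```python
import random

def markov_bits(p, d, length, seed=0):
    """Requires d <= 2*(1-p) and d <= 2*p so flip probabilities stay <= 1."""
    rng = random.Random(seed)
    a, b = d / (2 * (1 - p)), d / (2 * p)    # P(0->1), P(1->0)
    bit = 1 if rng.random() < p else 0       # start from the stationary dist.
    out = []
    for _ in range(length):
        out.append(bit)
        if rng.random() < (a if bit == 0 else b):
            bit ^= 1
    return out
```

Generating one such stream per input gives an n-bit vector sequence whose per-bit p and d match the targets exactly in the stationary limit.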
Our generator enables the detailed study of power macromodeling. Using our tool and the ISCAS-85 benchmark circuits, we have assessed the sensitivity of power dissipation to the three input statistics p, d, and s. Our investigation reveals that power is most sensitive to transition density, while only occasionally exhibiting high sensitivity to signal probability and spatial correlation. Our experiments also show that input signal imbalance can cause estimation errors as high as 100% in extreme cases, although errors are usually within 25%.
This paper proposes a circuit power estimation method using Bayesian inference and neural networks. Based on the statistical distributions of circuit leakage power and switching energy, the entire state and transition space of a circuit is classified by neural networks into a limited number of classes that represent different average power consumption values. This technique enables efficient table lookup of circuit power over the entire state and transition space. The theoretical basis of Bayesian inference and feature extraction for neural networks of circuit leakage power and switching energy are discussed. Experiments on a wide range of circuit topologies demonstrate the robustness of the proposed method for estimating the circuit leakage power of all possible states and the switching energy of all possible transitions.
Delay due to capacitive coupling of interconnects has become an important reliability issue in the design of nanometer circuits. In this paper we present a probabilistic approach to analyzing the impact of capacitive coupling noise on signal delay. The variation in the delay is due to the variation in the relative arrival times of the aggressors and the victim. We derive expressions for the moments of the victim voltage in the presence of noise, and from these we compute estimates of the earliest and latest possible arrival times of the victim. We compare the analytical results with Monte Carlo simulations using SPICE. Even though the analytical calculations are 200 times faster than the Monte Carlo simulations, the differences in the estimates of the mean and standard deviation of the arrival time are no more than 2.8%. In addition, the width of the timing intervals using the proposed approach is reduced by as much as 48% at a confidence level of 0.984; that is, 98.4% of the Monte Carlo simulations result in an arrival time that falls within the derived interval, which is 48% shorter.
Every 18 to 24 months, the areal density of VLSI doubles and the predicted date for the End Of CMOS Scaling is pushed out approximately 18 to 24 months. This rate of growth has been controlled mainly by the increasing capabilities of lithographic patterning. However, the rate of improvement of lithography systems in key physical parameters such as illumination wavelength has begun to slow. To make up the shortfall, lithography has increasingly turned to using CAD tools to transform the geometric shapes in designs into shapes on photomasks which have been compensated for systematic pattern-distorting effects. As the requirements for this compensation grow, however, it becomes increasingly difficult to hide in post-tapeout data preparation, and must be considered in back-end design tools and methodologies. We describe a number of the challenges to lithographic patterning, highlighting the factors that limit "physical scaling" and introduce the layout-to-mask shape transformations that compensate for these limitations. We describe the implementation of these transformations in general-purpose and specialized CAD tools, pointing out challenges like growth of computation effort. Finally, we describe how limitations of post-tapeout compensation drive the need for "litho-aware" physical design tools, showing examples in cell design, place-and-route, and layout migration.
New electronics technologies are emerging which may carry us beyond the limits of lithographic processing down to molecular-scale feature sizes. Devices and interconnects can be made from a variety of molecules and materials including bistable and switchable organic molecules, carbon nanotubes, and single-crystal semiconductor nanowires. They can be self-assembled into organized structures and attached onto lithographic substrates. This tutorial reviews emerging molecular-scale electronics technology for CAD and system designers and highlights where ICCAD research can help support this technology.
Within the verification community, there has been a recent increase in interest in Quantified Boolean Formula evaluation (QBF) as many interesting sequential circuit verification problems can be formulated as QBF instances. A closely related research area to QBF is Boolean Satisfiability (SAT). Recent advances in SAT research have resulted in some very efficient SAT solvers. One of the critical techniques employed in these solvers is Conflict Driven Learning. In this paper, we adapt conflict driven learning for application in a QBF setting. We show that conflict driven learning can be regarded as a resolution process on the clauses. We prove that under certain conditions, tautology clauses obtained from resolution in QBF also obey the rules for implication and conflicts of regular (nontautology) clauses; and therefore they can be treated as regular clauses and used in future search. We have implemented this idea in a new QBF solver called Quaffle and our initial experiments show that conflict driven learning can greatly speed up the solution process for most of the benchmarks we tested.
Optimized solvers for the Boolean Satisfiability (SAT) problem have many applications in areas
such as hardware and software verification, FPGA routing, planning, etc. Further uses are
complicated by the need to express "counting constraints" in conjunctive normal form (CNF).
Expressing such constraints by pure CNF leads to more complex SAT instances. Alternatively,
those constraints can be handled by Integer Linear Programming (ILP), but generic ILP solvers
may ignore the Boolean nature of 0-1 variables. Therefore specialized 0-1 ILP solvers extend
SAT solvers to handle these so-called "pseudo-Boolean" constraints.
This work provides an update on the on-going competition between generic ILP techniques and
specialized 0-1 ILP techniques. To make a fair comparison, we generalize recent ideas for fast
SAT-solving to more general 0-1 ILP problems that may include counting constraints and
optimization. Another aspect of our comparison is evaluation on 0-1 ILP benchmarks that
originate in Electronic Design Automation (EDA), but that cannot be directly solved by a SAT
solver. Specifically, we solve instances of the Max-SAT and Max-ONEs optimization problems, which seek to maximize the number of satisfied clauses and the number of "true" values over all satisfying assignments, respectively. Those problems have straightforward applications to SAT-based
routing and are additionally important due to reductions from Max-Cut, Max-Clique, and Min
Vertex Cover. Our experimental results show that specialized 0-1 techniques tend to outperform
generic ILP techniques on Boolean optimization problems as well as on general EDA SAT
problems.
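The compactness argument can be seen in miniature. The sketch below is our illustration, not the paper's solver: a counting constraint is stored directly as coefficient/bound pairs instead of being expanded into CNF, and a tiny Max-ONEs instance is solved by exhaustive search, where a real 0-1 ILP solver would use SAT-style search with learning.

```python
from itertools import product

def pb_satisfied(coeffs, bound, x):
    """Pseudo-Boolean constraint sum(c_i * x_i) <= bound over 0-1 vars,
    kept in arithmetic form rather than expanded into clauses."""
    return sum(c * x[i] for i, c in coeffs) <= bound

def clause_sat(clause, x):
    """CNF clause given as a tuple of signed, 1-indexed literals."""
    return any(x[abs(l) - 1] == (l > 0) for l in clause)

clauses = [(1, -2), (2, 3), (-1, -3)]
counting = ([(0, 1), (1, 1), (2, 1)], 2)     # x1 + x2 + x3 <= 2

# Max-ONEs: maximize the number of true variables over all assignments
# satisfying the clauses plus the counting constraint.
best = None
for x in product([0, 1], repeat=3):
    if all(clause_sat(c, x) for c in clauses) and pb_satisfied(*counting, x):
        if best is None or sum(x) > sum(best):
            best = x
print("Max-ONEs optimum:", best)   # -> (1, 1, 0)
```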
This paper introduces a new method for performing time-frame expansion based on writing the number of time frames in terms of powers of two. In the proposed method, the behavior of a circuit for t time frames, where 0 ≤ t < n, is modeled by unrolling the circuit 2^0, 2^1, 2^2, ..., 2^(log2 n - 1) times and combining them. This formulation of the problem makes it possible to prune the search space quickly when the problem is infeasible. To show the advantage of this method, we have used it to model the state justification problem and solve the problem using a SAT-solver. Experimental results show several orders of magnitude speedup for some non-trivial infeasible problems. Furthermore, in most cases the CPU time requirement grows linearly in terms of the number of time frames.
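The decomposition underlying the method is just the binary representation of the frame count. The bookkeeping can be sketched in a few lines (ours; the paper's contribution is the circuit unrolling and SAT formulation built on top of it):

```python
def unrolling_blocks(t, n):
    """Return the power-of-two block sizes whose chain models t time
    frames, 0 <= t < n, using unrollings of sizes 2^0 .. 2^(log2(n)-1).
    n is assumed to be a power of two in this sketch."""
    assert 0 <= t < n and n & (n - 1) == 0
    blocks = [1 << k for k in range(n.bit_length() - 1) if (t >> k) & 1]
    assert sum(blocks) == t      # binary expansion recovers t exactly
    return blocks

# 13 = 8 + 4 + 1 time frames: chain the 2^0, 2^2 and 2^3 unrollings.
print(unrolling_blocks(13, 16))   # -> [1, 4, 8]
```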
Computer simulation is an important tool for improving our understanding of biomolecule electrostatics, in part to aid in drug design. However, the numerical techniques used in these simulation tools do not exploit fast solver approaches widely used in analyzing integrated circuit interconnects. In this paper we describe one popular formulation used to analyze biomolecule electrostatics, present an integral formulation of the problem, and apply the precorrected-FFT method to accelerate the solution of the integral equations.
We present efficient computational methods for scattered point and meshless analysis of
electrostatic microelectromechanical systems (MEMS). Electrostatic MEMS are governed by
coupled mechanical and electrostatic energy domains. A self-consistent analysis of electrostatic
MEMS is implemented by combining a finite cloud method based interior mechanical analysis
with a boundary cloud method based exterior electrostatic analysis. Lagrangian descriptions are
used for both mechanical and electrostatic analyses. Meshless finite cloud and boundary cloud
methods combined with fast algorithms and Lagrangian descriptions are flexible, efficient and
attractive alternatives compared to conventional finite element/boundary element methods for
self-consistent electromechanical analysis. Numerical results are presented for an electrostatic
comb drive device.
Keywords:
coupled electro-mechanical analysis, meshless, finite cloud method, boundary cloud
method, Lagrangian electrostatics
In this paper we present a fast and efficient program for extraction of the frequency dependent inductance of structures with permeable materials. The program, FastMag, uses a magnetic surface charge formulation, efficient techniques for evaluating the required integrals, and a preconditioned GMRES method to solve the resulting linear system. Results from examples are presented to demonstrate the accuracy and versatility of the FastMag program.
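The linear-algebra core of such a solver can be illustrated with off-the-shelf components. The sketch below is generic, not FastMag's actual kernels: a dense, diagonally dominant matrix stands in for the discretized integral operator, and a Jacobi (diagonal) preconditioner stands in for the problem-specific one.

```python
import numpy as np
from scipy.sparse.linalg import gmres, LinearOperator

rng = np.random.default_rng(0)
N = 200
# Dense, diagonally dominant stand-in for a discretized integral operator.
A = rng.standard_normal((N, N)) / N + np.diag(np.full(N, 2.0))
b = rng.standard_normal(N)

# Jacobi preconditioner: apply the inverse of diag(A) as a LinearOperator.
d_inv = 1.0 / np.diag(A)
M = LinearOperator((N, N), matvec=lambda v: d_inv * v)

x, info = gmres(A, b, M=M, atol=1e-10)
print("converged:", info == 0, " residual:", np.linalg.norm(A @ x - b))
```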
We present a novel approach to optimization-based variation-tolerant analog circuit sizing. Using formal methods based on affine arithmetic, we calculate guaranteed bounds on the worst-case behavior and deterministically find the global optimum of the sizing problem by means of branch-and-bound optimization. To solve the nonlinear circuit equations with parameter variations, we define a novel affine-arithmetic Newton operator that gives a significant improvement in computational efficiency over an implementation using interval arithmetic. The calculation of guaranteed worst-case bounds and the global optimization are demonstrated by a prototype implementation.
This paper presents a simulator for the statistical analysis of MOS integrated circuits affected by
mismatch effect. The tool is based on a rigorous formulation of circuit equations including
random current sources to take into account technological tolerances. The simulator requires a simulation time several orders of magnitude lower than that of Monte Carlo analysis, while ensuring good accuracy.
Keywords:
Device mismatch, stochastic simulation, MNA, MOS ICs, non-Monte Carlo analysis.
The traditional way of approaching device-level placement problems for analog layout is to explore a huge search space of absolute placement representations, where cells are allowed to illegally overlap during their moves [2, 7, 8]. This paper presents a novel analog placement technique operating on the set of binary tree representations of the layout [3], where the typical presence of an arbitrary number of symmetry groups of devices is directly taken into account during the exploration of the solution space. The efficiency of the novel approach is due to a data structure called segment tree, mainly used in computational geometry.
Existing layout optimization methods often assume a set of interconnects with given RLC crosstalk bounds in a routing region. RLC crosstalk bound partitioning is critical for effectively applying these methods at the full-chip level. In this paper, we develop an optimal crosstalk budgeting scheme based on a linear programming (LP) formulation, and apply it to shield insertion and net ordering at the full-chip level. Experimental results show that, compared to the best alternative approach, the LP-based method reduces the total routing area by up to 7.61% and also uses less runtime. To the best of our knowledge, this is the first in-depth work that studies the RLC crosstalk budgeting problem.
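A minimal version of such a budgeting LP fits in a few lines of scipy.optimize.linprog. The numbers and the linear shielding-cost model below are our simplifications, not the paper's formulation: each net i gets a noise bound b_i between a floor and its unshielded noise u_i, the bounds on a victim must sum to a total budget B, and shielding area is modeled as w_i per volt of tightening.

```python
import numpy as np
from scipy.optimize import linprog

u = np.array([0.30, 0.25, 0.40])    # noise with no shielding (V)
w = np.array([5.0, 2.0, 8.0])       # shield area per volt of tightening
lo = np.array([0.05, 0.05, 0.05])   # tightest achievable bound (V)
B = 0.55                            # total noise budget on the victim (V)

# Shield area = sum_i w_i * (u_i - b_i); minimizing it is the same as
# maximizing w . b, so pass -w to the (minimizing) LP solver.
res = linprog(c=-w, A_ub=[np.ones(3)], b_ub=[B],
              bounds=list(zip(lo, u)), method="highs")
b = res.x
print("bounds:", b.round(3), " shield area:", float(w @ (u - b)))
```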
The challenges for developing an ESD (Electro-static Discharge) layout extractor originate from unconventional layout patterns of ESD protection devices, parasitic ESD device extraction and device count reduction. This paper reports a new technology-independent layout extractor, ESDExtractor, which is capable of extracting all types of ESD devices and answers the demands for ESD design verification. General methodology to extract both intentional and parasitic ESD devices, specific algorithms and implementation methods for efficiency-enhancement are presented, followed by a design example.
Electron projection lithography (EPL) is a leading candidate for next generation lithography (NGL) in VLSI production. The membrane mask used in EPL is divided into sub-fields by struts for structural support. A layout must be partitioned into these sub-fields on mask and then stitched back together by the EPL tool on wafer. To minimize possible stitching errors, partitioning of a mask layout should minimize cuts of layout features in the overlapping area between two adjacent sub-fields. This paper presents the first formulation of the mask layout partitioning problem for EPL as a graph problem. The graph formulation is optimally solved with a shortest path approach. Two other techniques are also presented to speed up computation. Experimental runs on data from a real industry design show excellent results.
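The shortest-path view is easy to reproduce on a one-dimensional toy version of the problem (ours; the paper partitions full two-dimensional mask layouts). Here cut[i] counts the features a cut at position i would sever, sub-field widths must lie in [wmin, wmax], and a dynamic program, which is exactly a shortest path in the implied DAG of feasible cut positions, picks the cheapest cut sequence.

```python
def partition_cost(cut, wmin, wmax):
    """cut[i] = features severed by cutting at position i; cut[0] and
    cut[-1] are the free layout edges.  Choose interior cuts so every
    sub-field width lies in [wmin, wmax], minimizing severed features."""
    L = len(cut) - 1
    INF = float("inf")
    best = [INF] * (L + 1)       # best[i]: min cost with a cut at i
    best[0] = 0
    prev = [None] * (L + 1)
    for i in range(1, L + 1):
        for j in range(max(0, i - wmax), i - wmin + 1):
            if best[j] + cut[i] < best[i]:
                best[i], prev[i] = best[j] + cut[i], j
    cuts, i = [], L
    while i is not None:         # walk the shortest path backwards
        cuts.append(i)
        i = prev[i]
    return best[L], cuts[::-1]

# Hypothetical severing counts along a 12-unit strip of layout.
cost, cuts = partition_cost([0, 3, 1, 4, 0, 2, 5, 1, 0, 2, 3, 1, 0],
                            wmin=3, wmax=5)
print("cost:", cost, " cut positions:", cuts)   # -> 0, [0, 4, 8, 12]
```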
In this paper we present an advanced functional extraction tool for automatic generation of high-
level RTL from switch-level circuit netlist representation. The tool is called FEV-Extract and is
part of a comprehensive Formal Equivalence Verification (FEV) system developed at Intel to
verify modern microprocessor designs. FEV-Extract employs a powerful hierarchical analysis
procedure, and advanced and generic algorithms for automatic recognition of logical primitives,
to cope with a variety of circuit design styles and their complexity. Logic equations are then extracted to generate a behavioral RTL model described in industry-standard HDLs, to be used in the formal equivalence verification, logic simulation, synthesis and testability flows.
Categories and Subject Descriptors
D.3 [VERIFICATION, MODELING AND SIMULATION]:
Formal verification techniques. Switch, logic and high-level
simulation, design validation, HW/SW co-simulation,
combinational and sequential equivalence checking. Model
checking. Theorem proving.
General Terms
Algorithms for design verification.
Keywords:
Switch Level Analysis, Functional Abstraction, Formal Equivalence Verification
(FEV), Design For Testability (DFT), Synthesis, Logic simulation, Binary Decision Diagrams
(BDDs), Satisfiability procedures, Hardware Description Languages (HDL).
Circuits can be simplified for combinational equivalence checking by transforming internal
functions, while preserving their ranges. In this paper, we investigate how to effectively apply
the idea to improve equivalence checking. We propose new heuristics to identify groups of nets
in a cut, and elaborate detailed aspects of the new equivalence checking method. With a given
miter, we identify a group of nets in a cut and transform the function of each net into a more
compact representation with fewer variables. These new compact parametric representations
preserve the range of nets as well as of the cut. This transformation significantly reduces the size
of intermediate BDDs and enables the verification to be conclusive for many designs which
state-of-the-art equivalence checkers fail to verify. Iterative groupings and transformations are
performed until no grouping is possible for a cut. Then we proceed to the next cut and continue
until the compare point is reached. Our experimental results show the effectiveness of our
strategy and new grouping heuristics on the new method.
Categories and Subject Descriptors
J.6 [Computer-aided engineering]: Verification
General Terms
Algorithms, Experimentation, Verification
Keywords:
Combinational verification, equivalence checking
Generalized Symbolic Trajectory Evaluation (GSTE) [17, 18, 19] is a very significant extension of STE that has the power to verify all ω-regular properties but at the same time preserves the benefits of the original STE [16]. It also extends the symbolic quaternary model used by STE to support seamless model refinement for efficiency and accuracy trade-off in GSTE model checking. In this paper, we present a case study on FIFO verification to illustrate the strength of GSTE and demonstrate its methodology in specifying and verifying large scale designs.
A regular circuit structure called a Whirlpool PLA (WPLA) is proposed. It is suitable for the implementation of finite state machines as well as combinational logic. A WPLA is logically a four-level Boolean NOR network. By arranging the four logic arrays in a cycle, a compact layout is achieved. Doppio-ESPRESSO, a four-level logic minimization algorithm, is developed for WPLA synthesis. No technology mapping, placement or routing is necessary for the WPLA. There is no area-delay trade-off, because these two goals are usually compatible in WPLA synthesis.
Routability, or wiring congestion, in a VLSI chip is becoming increasingly important as chip complexity increases. Congestion has a significant impact on performance, yield and chip area. Although advances in placement algorithms have attempted to alleviate this problem, the inherent structure of the logic netlist has a significant impact on routability irrespective of the placement algorithm used. Placement algorithms find an optimal assignment of locations to the logic and do not have the ability to change the netlist structure. Significant decisions regarding the circuit structure are made early in synthesis, such as during the technology-independent logic optimization step. Optimizations in this step use literal count as a metric and do not adequately capture the intrinsic entanglement of the netlist: two circuits with identical literal counts may have significantly different post-placement congestion characteristics. In this paper, we argue that a property of the network structure called adhesion can contribute significantly to routing congestion. We then provide a metric to measure this property. We also show that adhesion, as measured by this metric, can be used in addition to literal counts to estimate and optimize post-routing congestion early in the design flow.
We discuss the simplification of non-deterministic MV networks and their internal nodes using internal flexibilities. Given the network structure and its external specification, the flexibility at a node is derived as a non-deterministic MV relation. This flexibility is used to simplify the node representation and enhance the effect of Boolean resubstitution. We show that the flexibility derived is maximum. The proposed approach has been implemented and tested in MVSIS [16]. Experimental results show that it performs well on a variety of MV and binary benchmarks.
With the increasing cost of global communication on-chip, high-performance designs for data-
intensive applications require architectures that distribute hardware resources (computing logic,
memories, interconnect, etc.) throughout a chip, while restricting computations and
communications to geographic proximities. In this paper, we present a methodology for high-
level synthesis (HLS) of distributed logic-memory architectures, i.e., architectures that have
logic and memory distributed across several partitions in a chip. Conventional HLS tools are
capable of extracting parallelism from a behavior for architectures that assume a monolithic
controller/datapath communicating with a memory or memory hierarchy. This work provides
techniques to extend the synthesis frontier to more general architectures that can extract both
coarse- and fine-grained parallelism from data accesses and computations in a synergistic
manner. Our methodology considers many possible ways of organizing data and computations,
carefully examines the trade-offs (i.e., communication overheads, synchronization costs, area
overheads) in choosing one solution over another, and utilizes conventional HLS techniques for
intermediate steps.
We have evaluated the proposed framework on several benchmarks by generating register-
transfer level (RTL) implementations using an existing commercial HLS tool with and without
our enhancements, and by subjecting the resulting RTL circuits to logic synthesis and layout.
The results show that circuits designed as distributed logic-memory architectures using our
framework achieve significant (up to 5.31X, average of 3.45X) performance improvements over
well-optimized conventional designs with small area overheads (up to 19.3%, 15.1% on
average).
Multiport memories are extensively used in modern system designs because of the performance advantages they offer. The increased memory access throughput could lead to significantly faster schedules in behavioral synthesis. However, they also have an associated area and energy penalty. We describe a technique for mapping data accesses to multiport memories during behavioral synthesis that results in significantly better energy characteristics than an unoptimized multiport design. The technique consists of an initial colouring of the array access nodes in the data flow graph based on spatial locality, followed by attempts to consecutively access memory locations with the same colour on the same port. Our experiments on several applications indicate a significant reduction in address bus switching activity, leading to an overall energy reduction over an unoptimized design, while still maintaining a performance advantage over a single-port solution.
Data transfer intensive applications consume a significant amount of energy in memory access.
The selection of a memory location from a memory array involves driving row and column
select lines. A signal transition on a row select line often consumes significantly more energy
than a transition on a column select line. In order to exploit this difference in energy
consumption of row and column select lines, we propose a novel address assignment
methodology that aims to minimize high energy row transitions by assigning spatially and
temporally local data items to the same row. The problem of energy efficient address assignment
has been formulated as a multi-way graph partitioning problem and solved with a heuristic. Our
experiments demonstrate that our methodology achieves row transition counts very close to the
optimum and that the methodology can, for some examples, reduce row transition count by 40-
70% over row major mapping. Moreover, we also demonstrate that our methodology is capable
of handling access sequences with over 15 million accesses in moderate time.
Categories and Subject Descriptors
B.5.1 [Register-Transfer-Level Implementation]: Design—Memory design; B.5.2 [Register-Transfer-Level Implementation]: Design Aids—Automatic synthesis; Optimization
General Terms
Algorithms, Design, Experimentation, Performance
Keywords
address assignment, data layout, memory synthesis
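The flavor of the row-assignment problem can be shown with a small greedy sketch (ours; the paper formulates it as multi-way graph partitioning with a dedicated heuristic). Consecutive accesses to items in different rows cost a row transition, so items that are frequently adjacent in the trace are packed into the same row when capacity permits.

```python
from collections import Counter

def assign_rows(trace, row_size):
    """Greedy address assignment: count how often item pairs are adjacent
    in the access trace, then co-locate heavy pairs in the same row."""
    adj = Counter()
    for a, b in zip(trace, trace[1:]):
        if a != b:
            adj[frozenset((a, b))] += 1
    row_of, fill = {}, Counter()
    for pair, _ in adj.most_common():
        for item in pair:
            if item not in row_of:
                partner = next(iter(pair - {item}))
                r = row_of.get(partner)
                # Join the partner's row if it has space, else open a row.
                if r is not None and fill[r] < row_size:
                    row_of[item] = r
                else:
                    row_of[item] = max(fill, default=-1) + 1
                fill[row_of[item]] += 1
    return row_of

trace = [0, 1, 0, 1, 2, 3, 2, 3, 0, 1, 4, 5, 4, 5]
rows = assign_rows(trace, row_size=2)
transitions = sum(rows[a] != rows[b] for a, b in zip(trace, trace[1:]))
print(rows, " row transitions:", transitions)
```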
For crosstalk noise calculation, computing the switching windows of a net helps us identify noise sources accurately. Traditional approaches use a single continuous switching window per net. Under this model, signal switching is assumed to happen at any time within the window. Although conservative and sound, this model can result in too much pessimism, since in reality the exact timing of signal switching is determined by a path delay up to the net, i.e. the underlying circuit structure does not always allow signal switching at arbitrary times within the continuous switching window. To address this inherent inaccuracy of the continuous switching window, we propose a refinement of the traditional approaches, where signal switching is characterized by a set of discontinuous switching windows instead of a single continuous window. Each continuous switching window is divided into multiple windows, called time slots, and the signal switching activity of each slot is analyzed separately to calculate the maximum noise with more accuracy. By controlling the size of a time slot we can trade off accuracy and runtime, which makes this approach highly scalable. We have confirmed by experiments on industrial circuits that up to 90% of the noise violations detected by the traditional approach can be spurious.
Noise analysis has become a critical concern in advanced chip designs. Traditional methods suffer from two common issues. First, noise that is propagated through the driver of a net is combined with noise injected by capacitively coupled aggressor nets using linear summation. Since this ignores the non-linear behavior of the driver gate, the noise that develops on a net can be significantly underestimated. We therefore propose a new linear model that accurately combines propagated and injected noise on a net and which maintains the efficiency of linear simulation. After the propagated and injected noise are correctly combined on a victim net, it is necessary to determine if the noise can result in a functional failure. This is the second issue that we discuss in this paper. Traditionally, noise failure criteria have been based on unity gain points of the DC or AC transfer curves. However, we will show that for digital designs, these approaches can result in a pessimistic analysis in some cases, while in other cases, they allow circuit operation that is extremely close to regions that are unstable and do not allow sufficient margin for error in the analysis. In this paper, we compare the effectiveness of the discussed noise failure criteria and also present a propagation-based method, which is intended to overcome these drawbacks. The proposed methods were implemented in a noise analysis tool, and we demonstrate results on industrial circuits.
This paper describes a fast method to estimate crosstalk noise in the presence of multiple aggressor nets for use in physical design automation tools. Since noise estimation is often part of the inner-loop of optimization algorithms, very efficient closed-form solutions are needed. Previous approaches have typically used simple lumped 3-4 node circuit templates. One aggressor net is modeled at a time assuming that the coupling capacitances to all quiet aggressor nets are grounded. They also model the load from interconnect branches as a lumped capacitor and use a dominant pole approximation to solve the template circuit. While these approximations allow for very fast analysis, they result in significant underestimation of the noise. In this paper, we propose a new and more comprehensive fast noise estimation model. We use a 6 node template circuit and propose a novel reduction technique for modeling quiet aggressor nets based on the concept of coupling point admittance. We also propose a reduction method to replace tree branches with effective capacitors which models the effect of resistive shielding. Finally, we propose a new double pole approach to solve the template circuit. We tested the proposed method on noise-prone interconnects from an industrial high performance processor. Our results show a worst-case error of 7.8% and an average error of 2.7%, while allowing for very fast analysis.
This paper presents a heuristic scheduling algorithm for heterogeneous specifications, those formed by operations of different types and widths. The algorithm extracts the common operative kernel of the operations, and afterwards binds operations to cycles with the aim of distributing the number of bits calculated per cycle uniformly. In consequence, operations may be fragmented and executed during a set of not necessarily consecutive cycles, and over a set of several linked simple hardware resources. The proposed algorithm, in combination with allocation algorithms able to guarantee bit-level reuse of hardware resources, obtains considerably smaller datapaths than those produced by conventional synthesis algorithms. In the datapaths produced, the type, number, and width of the hardware resources are independent of the type, number, and width of the specification operations and variables.
Ultra deep submicron (UDSM) technology and system-on-chip (SoC) have resulted in a considerable portion of power dissipated on buses, in which the major sources of the power dissipation are (1) the transition activities on the signal lines and (2) the coupling capacitances of the lines. However, there has been no easy way of optimizing (1) and (2) simultaneously at an early stage of the synthesis process. In this paper, we propose a new (on-chip) bus synthesis algorithm to minimize the total sum of (1) and (2) in the microarchitecture synthesis. Specifically, unlike the previous approaches in which (1) and (2) are minimized sequentially without any interaction between them, or only one of them is minimized, we, given a scheduled dataflow graph to be synthesized, minimize (1) and (2) simultaneously by formulating and solving the two important issues in an integrated fashion: binding data transfers to buses and determining a (physical) order of signal lines in each bus, both of which are the most critical factors that affect the results of (1) and (2). Experimental results on a number of benchmark problems show that the proposed integrated low-power bus synthesis algorithm reduces power consumption by 24.8%, 40.3% and 18.1% on average over those in [12] (for minimizing (1) only), [1] (for (2) only) and [12, 1] (for (1) and then (2)), respectively.
In deep submicron (DSM) technology, the interconnects are as important as, or more important than, the logic gates. In particular, to achieve timing closure in DSM technology, it is critical to consider interconnect delay at an early stage of the synthesis process. It is well known that resource sharing in high-level synthesis is one of the major synthesis tasks that greatly affect the final synthesis/layout results. In this paper, we propose a new layout-driven resource sharing approach to overcome some of the limitations of previous works, in which the effects of layout on synthesis have either not been taken into account, been considered only in local and limited ways, or required excessive computation time. The proposed approach consists of two steps: (Step 1) we relax the integrated resource sharing and placement problem into an efficient linear programming (LP) formulation based on the concept of discretizing the placement space; (Step 2) we derive a feasible solution from the solution obtained in Step 1. We then employ an iterative mechanism based on the two steps to tightly integrate the resource sharing and placement tasks so that slack time violations due to interconnect delay (determined by placement) as well as logic delay (determined by resource sharing) are minimized. Experiments using a set of benchmark designs show that the approach is effective and efficient, completely removing the slack time violations produced by conventional methods.
Physical design optimizations such as placement, interconnect synthesis, floorplanning, and routing require fast and accurate analysis of RC networks. Because of its simple closed form and fast evaluation, the Elmore delay metric has been widely adopted. The recently proposed delay metrics PRIMO and H-gamma match the first three circuit moments to the probability density function of a Gamma statistical distribution. Although these methods demonstrate impressive accuracy compared to other delay metrics, their implementations tend to be challenging. As an alternative to matching to the Gamma distribution, we propose to match the first two circuit moments to a Weibull distribution. The result is a new delay metric called Weibull-based Delay (WED). The primary advantages of WED over PRIMO and H-gamma are its efficiency and ease of implementation. Experiments show that WED is robust and has satisfactory accuracy at both near- and far-end nodes.
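The two-moment match is simple enough to sketch. We assume (our reading, not necessarily the paper's exact normalization) that mu and r2 are the mean and second raw moment of the RC impulse response; the Weibull shape beta is recovered by root-finding on the moment ratio, which depends on beta alone, and the 50% step-response delay is then the matched Weibull's median.

```python
from math import gamma, log
from scipy.optimize import brentq

def weibull_delay(mu, r2, threshold=0.5):
    """Match mean mu and second raw moment r2 of the impulse response to
    a Weibull(lam, beta) and return the time at which the step response
    crosses `threshold`, i.e. the corresponding Weibull quantile."""
    ratio = r2 / mu**2                       # Gamma(1+2/b) / Gamma(1+1/b)^2
    f = lambda b: gamma(1 + 2 / b) / gamma(1 + 1 / b) ** 2 - ratio
    beta = brentq(f, 0.2, 20.0)              # ratio is monotone in beta
    lam = mu / gamma(1 + 1 / beta)           # scale from the first moment
    return lam * (-log(1 - threshold)) ** (1 / beta)

# Exponential special case (beta = 1): mu = 1, r2 = 2 gives delay = ln 2.
print(weibull_delay(1.0, 2.0))   # ~0.6931
```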
Existing static timing analyzers make several assumptions about circuits, implicitly trading off accuracy for speed. In this paper we examine the validity of these assumptions, notably the slope approximation to waveforms, single-input transitions, and the choice of a propagating signal based on a single voltage-time point. We provide data on static CMOS gates that show delays obtained in this way can be optimistic by more than 30%. We propose a new approach, Waveform-based Timing Analysis that employs a state-of-the-art circuit simulator as the underlying delay modeler. We show that such an approach can achieve more accurate delays than slope-based timing analyzers at a computation cost that still allows iterations between design modification and delay analysis.
The paper presents a simple yet powerful general theoretical framework and efficient
implementation for removal of clock network timing pessimism. We address pessimism in static
timing analysis (STA) tools caused by considering delay variation along common segments of
clock paths. The STA tools compute setup (hold) timing slack based on conservative
combinations of late (early) launching and early (late) capturing arrival times. To avoid
exponential-time path-based analysis, the STA tools use both early and late arrival times on gates common to both the launching and capturing paths. Such a combination cannot occur in a real circuit, and the resulting slack underestimate is observed as clock network pessimism in STA. Our approach supports any kind of delay variation, though the typical causes of the pessimism are process, voltage, and temperature on-chip variation and reconvergence in the clock network. We propose a new theoretical framework that makes it possible to apply known graph algorithms instead of time-consuming forward and backward multi-pass tracing algorithms and heuristics that are limited to certain network topologies [4]. The new graph-based
framework supports clock networks of virtually any size and type, e.g., tree, mesh, hybrid, clock
gating, chains of multipliers and dividers, loops in such chains, etc. The implementation based on
the proposed framework has proven its strength in a commercial sign-off static timing analyzer
and thus is helping hundreds of designers to achieve faster clock speeds of their chips.
Keywords:
Clock network reconvergence, static timing analysis, process, voltage and
temperature delay variation, deep sub-micron.
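The core bookkeeping of pessimism removal is small. In this sketch (our toy version; the framework in the paper also covers meshes, clock gating, and divider chains), the clock network is a tree given as a parent map with early/late delays per edge, and the slack of a launch/capture pair is credited with the late-minus-early delay accumulated along the common prefix of the two clock paths.

```python
def clock_path(parent, node):
    path = []
    while node is not None:
        path.append(node)
        node = parent.get(node)
    return path[::-1]                       # root .. node

def cppr_credit(parent, early, late, launch, capture):
    """Pessimism credit = sum of (late - early) delays over edges common
    to both clock paths; STA used both extremes on those shared edges,
    a combination that cannot occur in a real circuit."""
    lp, cp = clock_path(parent, launch), clock_path(parent, capture)
    credit = 0.0
    for a, b in zip(lp[1:], cp[1:]):        # skip the shared root node
        if a != b:
            break
        credit += late[a] - early[a]        # delay of edge parent[a] -> a
    return credit

# Tiny clock tree: root -> b1 -> {ff1, ff2}; delays keyed by sink node.
parent = {"b1": "root", "ff1": "b1", "ff2": "b1", "root": None}
early = {"b1": 0.9, "ff1": 0.4, "ff2": 0.5}
late = {"b1": 1.1, "ff1": 0.5, "ff2": 0.6}
print("credit:", cppr_credit(parent, early, late, "ff1", "ff2"))  # 0.2
```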
Efficiency and flexibility are critical, but often conflicting, design goals in embedded system design.
The recent emergence of extensible processors promises a favorable tradeoff between efficiency
and flexibility, while keeping design turnaround times short. Current extensible processor design flows automate several tedious tasks, but typically require designers to manually select the parts of
the program that are to be implemented as custom instructions.
In this work, we describe an automatic methodology to select custom instructions to augment an
extensible processor, in order to maximize its efficiency for a given application program. We
demonstrate that the number of custom instruction candidates grows rapidly with program size,
leading to a large design space, and that the quality (speedup) of custom instructions varies
significantly across this space, motivating the need for the proposed flow. Our methodology
features cost functions to guide the custom instruction selection process, as well as static and
dynamic pruning techniques to eliminate inferior parts of the design space from consideration.
Further, we employ a two-stage process, wherein a limited number of promising instruction
candidates are first selected, and then evaluated in more detail through cycle-accurate instruction
set simulation and synthesis of the corresponding hardware, to identify the custom instruction
combinations that result in the highest program speedup or maximize speedup under a given area
constraint.
We have evaluated the proposed techniques using a state-of-the-art extensible processor
platform, in the context of a commercial design flow. Experiments with several benchmark
programs indicate that custom processors synthesized using automatic custom instruction
selection can result in large improvements in performance (up to 5.4X, average of 3.4X), energy (up to 4.5X, average of 3.2X), and energy-delay product (up to 24.2X, average of 12.6X), while
speeding up the design process significantly.
Application-specific instructions can significantly improve the performance, energy, and code size of configurable processors. A common approach used in the design of such instructions is to convert application-specific operation patterns into new complex instructions. However, processors with a fixed instruction bitwidth cannot accommodate all the potentially interesting operation patterns, due to the limited code space afforded by the fixed instruction bitwidth. We present a novel instruction set synthesis technique that employs an efficient instruction encoding method to achieve maximal performance improvement. We build a library of complex instructions with various encoding alternatives and select the best set of complex instructions while satisfying the instruction bitwidth constraint. We formulate the problem using integer linear programming and also present an effective heuristic algorithm. Experimental results using our technique generate instruction sets that show improvements of up to 38% over the native instruction set for several realistic benchmark applications running on a typical embedded RISC processor.
Embedded system programs tend to spend much time in small loops. Introducing a very small
loop cache into the instruction memory hierarchy has thus been shown to substantially reduce
instruction fetch energy. However, loop caches come in many sizes and variations; using the configuration that is best on average may actually result in worsened energy for a specific program.
We therefore introduce a loop cache exploration tool that analyzes a particular program's profile,
rapidly explores the possible configurations, and generates the configuration with the greatest
power savings. We introduce a simulation-based approach and show the good energy savings
that a customized loop cache yields. We also introduce a fast estimation-based approach that
obtains nearly the same results in seconds rather than tens of minutes or hours.
Keywords:
Low power, low energy, tuning, loop cache, embedded systems, instruction fetching,
customized architectures, memory hierarchy, estimation, synthesis, architecture tuning.
The communication sub-system of complex IC systems is increasingly critical for achieving system performance. Given this, it is important that the on-chip communication architecture should be included in any quantitative evaluation of system design during design space exploration. While there are several mature methodologies for the modeling and evaluation of architectures of processing elements, there is relatively little work done in modeling of an extensive range of on-chip communication architectures, and integrating this into a single modeling and simulation environment combining processing element and on-chip communication architectures. This paper describes a modeling framework with accompanying simulation tools that attempts to fill this gap. Based on an analysis of a wide range of on-chip communication architectures, we describe how a specific hierarchical class library can be used to develop new on-chip communication architectures, or variants of existing ones with relatively little incremental effort. We demonstrate this through three case studies including two commercial on-chip bus systems and an on-chip packet switching network. Here we show that through careful analysis and construction it is possible for the modeling environment to support the common features of these architectures as part of the library and permit instantiation of the individual architectures as variants of the library design. As part of this methodology we also show how different levels of abstraction of the model can be supported and viewed as different variants that can be used in an accuracy versus simulation time trade-off.
This paper presents an in-depth study of the theory and algorithms for SPFD-based (Set of Pairs of Functions to be Distinguished) rewiring, and explores the flexibility in the SPFD computation. Our contributions are in the following two areas: (1) We present a theorem and a related algorithm for more precise characterization of feasible SPFD-based rewiring. Extensive experimental results show that for LUT-based FPGAs, the rewiring ability of our new algorithm is 70% higher than SPFD-based local rewiring algorithms (SPFD-LR) [19][21] and 18% higher than the recently developed SPFD-based global rewiring algorithm (SPFD-GR) [20]. (2) In order to achieve more rewiring ability on certain selected wires used in various optimizations, we study the impact of using different atomic SPFD pair assignment methods during the SPFD-based rewiring. We develop several heuristic atomic SPFD pair assignment methods for area or delay minimization and show that they lead to 10% more selected rewiring ability than the random (or arbitrary) assignment methods. When combining (1) and (2), we can achieve 38.1% higher general rewiring ability.
SPFDs, a mechanism for expressing flexibility during logic synthesis, were first introduced for FPGA synthesis. They were then extended to general, combinational Boolean networks and later the concept of sequential SPFDs was introduced. In this paper, we explore the idea of using SPFDs for functional decomposition. A new type of functional decomposition called topologically constrained decomposition is introduced. An algorithm is provided for solving this problem using SPFDs. Preliminary experimental results are encouraging and indicate the feasibility of the approach. A scheme is also presented for generating instances of the topologically constrained decomposition problem.
We apply recently introduced constructive multi-level synthesis in the resynthesis loop targeting convergence of industrial designs. The incremental ability of the resynthesis approach allows more predictable circuit implementations while still permitting aggressive optimization. The approach is based on a very general symbolic decomposition template for logic synthesis that uses information-theoretic properties of a function to infer its decomposition patterns (rather than more conventional measures such as literal counts). Using this template, the decomposition is done in a Boolean domain unrestricted by the representation of the function, enabling superior implementation choices driven by additional technological constraints. The symbolic optimization is applied to the resynthesis of industrial circuits with tight timing constraints, yielding much improved timing properties.
The paper describes a folding method for logic functions that reduces the size of the memories keeping the functions. The folding is based on relations between fractions of logic functions. We show that the fractions of the full-adder function are related by bit-wise NOT and bit-wise OR, so the memory size is halved (to 8 bits). We propose a new 3-1 LUT with folding mechanisms that can implement a full adder with one LUT. A fast carry propagation line is introduced for multi-bit addition. The folding and fast carry propagation mechanisms are shown to be useful for implementing other multi-bit operations and general 4-input functions without extra hardware resources; more than half of all 4-input functions can be implemented with two 3-1 LUTs. Comparing the area consumption of our 3-1 LUTs with that of 4-1 LUTs on several benchmark circuits, the area can be reduced by up to 56%.
This paper presents an approach to the analysis of task sets implemented on multiprocessor systems, when the task execution times are specified as generalized probability distributions. Because of the extreme complexity of the problem, an exact solution is practically impossible to obtain even for toy examples. Therefore, our methodology is based on approximating the generalized probability distributions of execution times by Coxian distributions of exponentials. Thus, we transform the generalized semi-Markov process corresponding to the initial problem into a continuous-time Markov chain (CTMC) which, however, is extremely large and, hence, most often impossible to store in memory. We have elaborated a solution that allows us to generate and analyze the CTMC in an efficient way, such that only a small part has to be stored at any given time. Several experiments investigate the impact of various parameters on complexity, in terms of time and memory, as well as the trade-offs regarding the accuracy of the generated results.
This paper is concerned with the problem of maximizing capacity utilization of the battery power source in a portable electronic system under latency and loss rate constraints. First, a detailed stochastic model of a power-managed, battery-powered electronic system is presented. The model, which is based on the theories of continuous-time Markovian decision processes and stochastic networks, captures two important characteristics of today's rechargeable battery cells, i.e., the current rate-capacity characteristic and the relaxation-induced recovery. Next, the battery-aware dynamic power management problem is formulated as a policy optimization problem and solved exactly by using a linear programming approach. Experimental results show that the proposed method outperforms existing heuristic methods for battery management by as much as 17% in terms of the average energy delivered per unit weight of battery cells.
In this paper, we study leakage power reduction using power gating in the forms of the Virtual power/ground Rails Clamp (VRC) and Multi-threshold CMOS (MTCMOS) techniques. We apply power gating to two circuit types: memory-based units and datapath components. Using a microarchitecture-level power simulator, as well as power and timing models derived from detailed circuit designs, we further study leakage power modeling and reduction at the system level for modern high-performance VLIW processors. We show that the leakage power can be over 40% of the total power for such processors. Moreover, we propose time-out scheduling of VRC, which reduces L2 cache power by up to 85.65%. This saving corresponds to close to one third of the total power dissipation of the VLIW processors we study.
Dynamic voltage scaling (DVS) reduces the power consumption of processors when peak performance is unnecessary. However, the achievable power savings by DVS alone is becoming limited as leakage power increases. In this paper, we show how the simultaneous use of adaptive body biasing (ABB) and DVS can be used to reduce power in high-performance processors. Analytical models of the leakage current, dynamic power, and frequency as functions of supply voltage and body bias are derived and verified with SPICE simulation. We then show how to determine the correct trade-off between supply voltage and body bias for a given clock frequency and duration of operation. The usefulness of our approach is evaluated on real workloads obtained using real-time monitoring of processor utilization for four applications. The results demonstrate that application of simultaneous DVS and ABB results in an average energy reduction of 48% over DVS alone.
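The shape of this trade-off can be reproduced with textbook device models. All constants below are illustrative, not the paper's fitted values: frequency follows the alpha-power law, body bias Vbs shifts Vth linearly, subthreshold leakage grows exponentially as Vth drops, and a grid search picks the (Vdd, Vbs) pair that meets a target frequency at the lowest total power.

```python
import numpy as np

# Illustrative constants (not the paper's fitted values).
ALPHA, VTH0, K_BODY, S = 1.5, 0.35, 0.25, 0.1    # alpha-power and Vth model
K_FREQ, C_EFF, I_LEAK0 = 2.0e9, 1e-9, 1e-3       # Hz scale, F, A at VTH0

def freq(vdd, vbs):
    vth = VTH0 - K_BODY * vbs       # forward body bias (vbs > 0) lowers Vth
    return K_FREQ * (vdd - vth) ** ALPHA / vdd if vdd > vth else 0.0

def power(vdd, vbs, f):
    vth = VTH0 - K_BODY * vbs
    dyn = C_EFF * vdd ** 2 * f                          # switching power
    leak = vdd * I_LEAK0 * np.exp(-(vth - VTH0) / S)    # subthreshold leakage
    return dyn + leak

f_target, best = 8e8, None
for vdd in np.arange(0.6, 1.31, 0.01):
    for vbs in np.arange(-0.5, 0.51, 0.01):             # reverse..forward bias
        if freq(vdd, vbs) >= f_target:
            p = power(vdd, vbs, f_target)
            if best is None or p < best[0]:
                best = (p, vdd, vbs)
print("P = %.3g W at Vdd = %.2f V, Vbs = %+.2f V" % best)
```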
Voltage scheduling is indispensable for exploiting the benefit of variable voltage processors. Though extensive research has been done in this area, current processor limitations such as transition overhead and voltage level discretization are often considered insignificant and are typically ignored. We show that for hard, real-time applications, disregarding such details can lead to sub-optimal or even invalid results. We propose two algorithms that guarantee valid solutions. The first is a greedy yet simple approach, while the second is more complex but significantly reduces energy consumption under certain conditions. Through experimental results on both real and randomly generated systems, we show the effectiveness of both algorithms, and explore what conditions make it beneficial to use the complex algorithm over the basic one.
This paper describes a dynamic voltage and frequency scaling (DVFS) technique for MPEG decoding to reduce the energy consumption while maintaining a quality of service (QoS) constraint. The computational workload for an incoming frame is predicted using a frame-based history so that the processor voltage and frequency can be scaled to provide the exact amount of computing power needed to decode the frame. More precisely, the required decoding time for each frame is separated into two parts: a frame-dependent (FD) part and a frame-independent (FI) part. The FD part varies greatly according to the type of the incoming frame whereas the FI part remains constant regardless of the frame type. In the DVFS scheme presented in this paper the FI part is used to compensate for the prediction error that may occur during the FD part such that a significant amount of energy can be saved while meeting the frame rate requirement. The proposed DVFS algorithm has been implemented on a StrongArm-1110 based evaluation board. Measurement results demonstrate a higher than 50% CPU energy saving as a result of DVFS.
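The scaling decision itself reduces to a few lines. In this sketch (our simplification of the scheme, with hypothetical frequency levels), the FD cycles of the next frame are predicted from the history of frames of the same type, and any FD misprediction is absorbed by re-picking the frequency for the constant FI part so the frame still meets its deadline.

```python
def pick_freq(cycles, time_left, levels):
    """Lowest available frequency that finishes `cycles` in `time_left`."""
    need = cycles / time_left
    return min((f for f in levels if f >= need), default=max(levels))

def decode_frame(ftype, fd_actual, fi_cycles, history, deadline, levels):
    # Predict FD cycles as the mean over past frames of the same type.
    past = history.get(ftype, [fd_actual])
    fd_pred = sum(past) / len(past)
    f_fd = pick_freq(fd_pred + fi_cycles, deadline, levels)
    t_fd = fd_actual / f_fd            # time actually spent on the FD part
    # The FI part compensates: re-pick so the frame meets its deadline.
    f_fi = pick_freq(fi_cycles, max(deadline - t_fd, 1e-9), levels)
    history.setdefault(ftype, []).append(fd_actual)
    return f_fd, f_fi, t_fd + fi_cycles / f_fi

history = {"P": [9e6, 11e6]}
f_fd, f_fi, t = decode_frame("P", fd_actual=12e6, fi_cycles=4e6,
                             history=history, deadline=1 / 30,
                             levels=[2e8, 3e8, 4e8, 5e8, 6e8])
print(f"FD at {f_fd:.0e} Hz, FI at {f_fi:.0e} Hz, frame time {t*1000:.1f} ms")
```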
This paper presents a new congestion minimization technique for standard cell global placement. The most distinct feature of this approach is that it does not follow the traditional "estimate-then-eliminate" strategy. Instead, it avoids the excessive usage of routing resources by the "local" nets so that more routing resources are available for the uncertain "global" nets. The experimental results show that our new technique, SPARSE, achieves better routability than the traditional total wire length (bounding box) guided placers, which had been shown to deliver the best routability results among the placers optimizing different cost functions [2]. Another feature of SPARSE is its capability of allocating white space implicitly. SPARSE exploits the well-known empirical Rent's rule and is able to improve routability even more in the presence of white space. Compared to the most recent academic routability-driven placer, Dragon [8], SPARSE is able to produce solutions with equal or better routability.
IP blocks and large macro cells are increasingly prevalent in physical design, actually causing an increase in the available free space for the dust logic. We observe that top-down placement based on recursive bisection with multilevel partitioning performs poorly on these porous designs. However, analytic solvers have the ability to find the natural distribution of cells in the layout. Consequently, we propose an enhancement to cut-based placement called Analytic Constraint Generation (ACG). ACG utilizes an analytic engine to set constraints for the multi-level partitioner. We show that for real industry designs, ACG significantly improves the performance of cut-based placement, as implemented within a state-of-the-art industrial placer.
This paper presents an algorithm to update the placement of logic elements given an incremental netlist change. Specifically, the algorithm is targeted at incrementally placing logic elements created by layout-driven circuit restructuring techniques. The incremental placement engine assumes that the restructuring algorithms provide a list of new logic elements along with preferred locations for each of these new elements. It then tries to shift non-critical logic elements in the original placement out of the way to satisfy the preferred location requests. Our algorithm considers modern FPGA architectures with clustered logic blocks that have numerous architectural constraints. Experiments indicate that our technique produces results of extremely high quality.
Numerous approaches have been proposed to address the overwhelming modeling problems that result from the emergence of magnetic coupling as a dominant performance factor for ICs and packaging. Firstly, model order reduction (MOR) methods have been extended to robustly capture very high frequency behaviors for large RLC systems via methods such as PRIMA[8] with guaranteed passivity. In addition, new models of the magnetic couplings in terms of susceptance (inverse of inductance) have shown great promise for robust sparsification of otherwise intractable inductance coupling-matrix problems[3-5]. However, model order reduction via PRIMA for circuits that include susceptance elements does not guarantee passivity. Moreover, susceptance elements are incompatible with the path tracing algorithms that provide the fundamental runtime efficiency of RICE [10]. In this paper a novel MOR algorithm, SMOR, is proposed as an extension of ENOR [11] which exploits the matrix properties of susceptance-based circuits for runtime efficiency, and provides for a numerically stable, provably passive MOR using a new orthonormalization strategy.
The new concept of Multi-node Moment Matching (MMM) is introduced in this paper. The MMM technique simultaneously matches the moments at several nodes of a circuit using explicit moment matching around s=0. As compared to the well-known Single-point Moment Matching (SMM) techniques (such as AWE), MMM has several advantages. First, the number of moments required by MMM is significantly lower than SMM for a reduced order model of the same accuracy, which directly translates into computational efficiency. This higher computational efficiency of MMM as compared to SMM increases with the number of inputs to the circuit. Second, MMM has much better numerical stability as compared to SMM. This characteristic allows MMM to calculate an arbitrarily high order approximation of a linear system, achieving the required accuracy for systems with complex responses. Finally, MMM is highly suitable for parallel processing techniques especially for higher order approximations while SMM has to calculate the moments sequentially and cannot be adapted to parallel processing techniques.
Verification of contemporary integrated circuits requires accurate modeling of high-frequency effects in all passive component subsystems. Often, descriptions of those subsystems are only available in the frequency domain. In this paper, we propose a simple, scalar, constrained-passive rational approximation scheme that incorporates a grid-based test for strict positive realness. In contrast to similar recent work based on convex optimization and the positive real lemma, the methodology is potentially more efficient, since it does not introduce a quadratic number of auxiliary variables, and potentially more accurate, since pole locations can be re-adjusted during the optimization, but possibly less reliable, since it relies on the solution of optimization problems that are not convex. Therefore, because of the use of local constrained non-convex optimization, the generation of feasible initial guesses is also considered.
In this survey, we outline basic SAT and ATPG procedures as well as their applications in formal hardware verification. We attempt to give the reader a trace through the literature and provide a basic orientation concerning the problem formulations and known approaches in this active field of research.
The ultimate goal of logic synthesis is to explore implementation flexibility toward meeting design targets such as area, power, and delay. Traditionally, such flexibility is expressed using "don't cares", and we seek the best implementation that does not violate them. However, the calculation and storage of don't care information is CPU- and memory-intensive. In this paper, we give an overview of logic synthesis approaches based on techniques developed for Automatic Test Pattern Generation (ATPG). Instead of calculating and storing don't cares explicitly, ATPG-based logic synthesis techniques compute the flexibility implicitly. Low CPU and memory usage make these techniques applicable to practical industrial circuits. Also, the basic ATPG-based logic-level operations create predictable, small layout perturbations, making them an ideal foundation for efficient physical synthesis. Theoretical results show that an efficient yet simple add-a-wire-and-remove-a-wire operation covers all possible complex logic transformations.
The exploding complexity of new chips and the ever decreasing time-to-market window are forcing fundamental changes in the way systems are designed. The advent of Systems-on-Chip (SoC) based on pre-designed intellectual-property (IP) cores has become an absolute necessity for embedded systems companies to remain competitive. Designing an SoC, however, is extremely complex, as it encompasses a range of difficult problems in hardware and software design. This paper explains a wide range of SoC issues including market drivers and trends, technology and integration aspects, early architecture definition, methodology, hardware and software design and verification techniques.