We present a comprehensive methodology for the electrodynamic modeling of substrate noise coupling. A new, efficient method is introduced for calculating the Green's function; it accommodates arbitrary substrate doping profiles and thus facilitates substrate noise analysis using boundary element methods. In addition to a discussion of the application of the method and its validation in the context of substrate transfer resistance extraction, preliminary results from its application to frequency-dependent substrate noise modeling are also presented.
In mixed-signal designs, substrate noise originating in the digital part can seriously affect the functionality of the analog part. Accurately modeling the properties of the substrate as a noise propagator is therefore becoming ever more important. A model can be obtained through the Finite Element Method (FEM) or the Boundary Element Method (BEM). The FEM performs a full 3D discretization of the substrate, which makes it very accurate and flexible but also slow. The BEM discretizes only the contact areas on the boundary of the substrate, which makes it less flexible but significantly faster. A combination of BEM and FEM can be efficient when both flexibility and speed are needed. This paper briefly describes the BEM, the FEM, and their combination, but concentrates mainly on the theoretical validation of the combined method and on its experimental verification through implementation in the SPACE layout-to-circuit extractor and comparison with commercial BEM and FEM tools.
Full-wave analysis, based on rigorous solution of the differential or integral form of Maxwell's equations, is too slow for all but the smallest designs. Traditional on-chip extraction engines are, therefore, being pushed to extract inductance and provide accurate high-frequency interconnect modeling while maintaining computational efficiency and capacity. This paper describes further accuracy-improving enhancements to the commercial full-chip RLCK extraction engine Assura RLCX [1], based on the return-limited inductance formulation. Specifically, we incorporate substrate losses due to eddy currents and power-ground losses while, based on design-driven assumptions, avoiding explicit extraction of the power-ground network and the substrate. Results are validated on small test cases where comparison with a full-wave solution is practical.
Approaches to achieving low-power, high-speed VLSIs are described, with an emphasis on techniques that span multiple technology and design levels. To suppress leakage current in standby mode, Boosted Gate MOS (BGMOS), which couples the technology and circuit levels, is effective. To reduce power in active mode, VDD-hopping and VTH-hopping, cooperative approaches between circuit and software, are promising. Power consumed in the interconnect system can be reduced by a cooperative approach between application and layout, as in bus shuffling. Other low-power design approaches are also discussed.
Controlling VDD and VTH is essential for low-power, high-speed CMOS design. In this paper, it is shown that these two parameters can be treated by designers as objectives of design optimization to find better trade-offs between power and speed. A quantitative analysis of the trade-offs between power and speed is presented. Popular circuit techniques and design examples for controlling VDD and VTH are introduced. A simple theory for computing optimum multiple VDD's and VTH's is described. Scaling scenarios of variable and/or multiple VDD's and VTH's are discussed to indicate future technology directions.
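As a concrete illustration of the trade-off being optimized, a standard alpha-power-law model (our assumption, not the paper's exact formulation) makes the coupling between the two knobs explicit:

```latex
% Illustrative power/delay model. a, C, f, I_0, S, alpha denote activity,
% switched capacitance, frequency, leakage scale, subthreshold swing, and
% the velocity-saturation index -- all assumed, not taken from the paper.
\begin{aligned}
P(V_{DD},V_{TH}) &= a\,C\,V_{DD}^{2} f \;+\; V_{DD}\,I_{0}\,10^{-V_{TH}/S},\\
t_{d}(V_{DD},V_{TH}) &\propto \frac{C\,V_{DD}}{\bigl(V_{DD}-V_{TH}\bigr)^{\alpha}},
\qquad \alpha \approx 1.3 .
\end{aligned}
```

Lowering VDD cuts dynamic power quadratically but must be paired with a lower VTH to hold delay, which raises leakage exponentially; the optimum VDD/VTH pair balances these two sensitivities.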
This paper presents methods for efficient power minimization at the circuit and micro-architectural levels. The potential energy savings are strongly related to the energy profile of a circuit. These savings are obtained by using gate sizing, supply voltage, and threshold voltage optimization to minimize energy consumption subject to a delay constraint. True power minimization is achieved when the energy-reduction potentials of all tuning variables are balanced. We derive the sensitivity of energy to delay for each tuning variable, connecting its energy-saving potential to the physical properties of the circuit. This helps develop an understanding of optimization performance and identify the most efficient techniques for energy reduction. The optimizations are applied to examples that span typical circuit topologies, including inverter chains, SRAM decoders, and adders. At a delay 20% larger than the minimum, energy savings of 40% to 70% are possible, indicating that achieving peak performance is expensive in terms of energy. Energy savings of about 50% can be achieved without delay penalty by balancing sizes, supplies, and thresholds.
We propose a novel framework for multilevel routing that considers both routability and performance. The two-stage multilevel framework consists of coarsening followed by uncoarsening. Unlike previous multilevel routing, we integrate global routing, detailed routing, and resource estimation at each level of the framework, leading to more accurate routing-resource estimation during coarsening and thus facilitating solution refinement during uncoarsening. Further, the exact routing information obtained at each level makes our framework more flexible in dealing with various routing objectives (e.g., crosstalk and power). Experimental results show that our approach obtains significantly better routing solutions than previous works. For example, for a set of 11 commonly used benchmark circuits, our approach achieves 100% routing completion for all circuits, while the previous multilevel routing, three-level routing, and hierarchical routing complete routing for only 3, 0, and 3 circuits, respectively. Moreover, our router uses fewer routing layers. We have also performed experiments on timing-driven routing, with very promising results.
In this paper, we present several novel techniques that make the recently published multilevel routing scheme [19] more effective and complete. Our contributions include: (1) resource reservation for local nets during the coarsening process, (2) congestion-driven, graph-based Steiner tree construction during the initial routing and the refinement process, and (3) multi-iteration refinement considering the congestion history. The experiments show that each of these techniques helps to improve the completion rate considerably. Compared to [19], the new routing system reduces the number of failed nets by 2x to 18x, with less than 50% increase in runtime in most cases.
Routing is one of the most complex stages in the back-end design process. Simple routing algorithms based on two stages, global routing and detailed routing, do not offer appropriate opportunities to address problems arising from signal delay, crosstalk, and process constraints. An intermediate stage of track assignment between global and detailed routing proves to be an ideal place to address these problems: global routing information can be used to address them efficiently and to aid the detailed router in achieving wiring completion. In this paper we formulate routing as a three-stage process: global routing, track assignment, and detailed routing. We describe the intermediate track assignment problem and suggest an efficient heuristic for its solution. We introduce cost metrics to model basic effects arising from connectivity and discuss extensions that include signal integrity and process constraints. We propose a heuristic based on weighted bipartite matching as a core routine; to improve its performance, additional heuristics based on lookahead and segment splitting are also suggested. Experimental results highlight the efficacy of the track assignment stage in the routing process.
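A minimal sketch of the bipartite-matching core (the cost below is an assumed proxy, distance from the globally assigned track, not the paper's connectivity-based metrics):

```python
# Track assignment as weighted bipartite matching: segments in a routing
# panel are matched to tracks at minimum total cost.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_tracks(ideal_tracks, tracks):
    """ideal_tracks[i]: preferred track of segment i (from global routing)."""
    cost = np.abs(np.subtract.outer(ideal_tracks, tracks)).astype(float)
    seg_idx, trk_idx = linear_sum_assignment(cost)   # min-cost matching
    return dict(zip(seg_idx, (tracks[j] for j in trk_idx)))

# three segments preferring tracks 0, 0, 2 among four tracks:
print(assign_tracks([0, 0, 2], [0, 1, 2, 3]))        # e.g. {0: 0, 1: 1, 2: 2}
```

In the paper's setting the costs additionally model connectivity effects (and can be extended with signal-integrity terms); lookahead and segment splitting then repair cases a single matching cannot resolve.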
Design ECOs commonly happen in industry due to constraint or target changes from manufacturing, marketing, reliability, or performance. At each step, designers usually want to modify the existing solution incrementally and keep the design as close as possible to the existing one. In this paper, we address the PSO (Power rail - Signal wire Overlap) problem, which resolves overlaps between power rails and signal wires caused by changes to the power rail design on the top layer of a multiple-layer routing region. PSO problems are frequently caused by changes in the power delivery system or the package design. The new routing solution satisfies the following constraints: 1) The routing of power rails in the new design is kept unchanged. 2) Only the routing of the top two layers is changed. 3) Horizontal (vertical) signal wire segments on the top layer can only move up/down (left/right). At the same time, the new routing solution keeps the routing pattern unchanged. This requires: a) If one end point of a horizontal (vertical) wire segment on the top layer is a fixed pin, this segment cannot move. b) If the vertical (horizontal) projections of two horizontal (vertical) signal wire segments overlap, their up/down (left/right) relationship must not change. c) If two horizontal (vertical) segments belonging to different nets are on the same track, their left/right (up/down) relationship must not change as long as the two segments still exist in the new solution. 4) For each signal wire segment, the deviation (i.e., the difference between its new position and the old one) must not exceed a user-defined allowable deviation bound; different bounds can be set on different segments. We propose two algorithms to solve the PSO problem. Both are guaranteed to find a feasible solution whenever one exists. One is faster, while the other also strives to minimize the total deviation as well as the maximum deviation. According to time and quality requirements, users can choose the appropriate algorithm. For a set of industrial test circuits, we were able to remove all overlaps between power rails and signal wires with minimal wire deviation.
Solving a system of linear equations has been widely used to compute seeds for LFSR reseeding to compress test patterns. However, as chip size grows, solving linear equations requires a number of computations proportional to n^3. This paper proposes a new scan chain architecture and algorithm so that the order of computation is proportional to the number of scan cells in a chip. The new architecture is a methodology change that does not require complex Design-for-Testability (DFT) structures as proposed in previous techniques. Instead of solving linear equations, the proposed seed computation algorithm determines seeds for test vectors topologically. The compression ratio may be slightly lower than that of other approaches, but the proposed approach can handle larger designs in a reasonable amount of time. Computation analysis shows that, for a one-million-scan-cell design, if we assume the proposed technique takes 1 ms to compute seeds, techniques that solve linear equations would take more than 14 minutes.
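For context, the conventional approach solves a linear system over GF(2) relating seed bits to the care bits of a test cube, and the Gaussian elimination is where the n^3 cost comes from. A minimal sketch of that baseline (illustrative only; the paper replaces it with a topological method):

```python
# Baseline seed computation: solve A x = b over GF(2), where row i of A
# maps the seed bits to the i-th specified (care) bit of the test cube.
import numpy as np

def gf2_solve(A, b):
    A = np.array(A, dtype=np.uint8) % 2
    b = np.array(b, dtype=np.uint8) % 2
    rows, cols = A.shape
    pivot_cols, r = [], 0
    for c in range(cols):
        hit = next((i for i in range(r, rows) if A[i, c]), None)
        if hit is None:
            continue
        A[[r, hit]], b[[r, hit]] = A[[hit, r]].copy(), b[[hit, r]].copy()
        for i in range(rows):            # eliminate column c from other rows
            if i != r and A[i, c]:
                A[i] ^= A[r]
                b[i] ^= b[r]
        pivot_cols.append(c)
        r += 1
        if r == rows:
            break
    if b[r:].any():                      # inconsistent system: no seed exists
        return None
    x = np.zeros(cols, dtype=np.uint8)   # free variables fixed to 0
    for row, col in enumerate(pivot_cols):
        x[col] = b[row]
    return x
```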
We provide a definition of undetectable faults in partial scan circuits under a test application scheme where a test consists of primary input vectors applied at-speed between scan operations. We also provide sufficient conditions for a fault to be undetectable under this test application scheme. We present experimental results on finite-state machine benchmarks to demonstrate the effectiveness of these conditions in identifying undetectable faults.
This work presents several new techniques for enhancing the performance of deterministic test pattern generation for VLSI circuits. The techniques introduced are called dynamic decision ordering, conflict-driven recursive learning, and conflict learning. An important feature shared by all these techniques is that they are triggered by the occurrence of a conflict in the generation of tests; hence, they are active neither all the time nor for all faults. This feature allows the ATPG system that uses these techniques to resolve hard-to-resolve faults with far fewer backtracks while leaving the system as efficient as before in the absence of conflicts. We have incorporated these techniques into a commercial D-algorithm-based ATPG tool. The experimental results on full-scan versions of the ITC'99 benchmark circuits demonstrate improvements in both the number of aborted faults and the test generation time.
In current industrial practice, critical path selection is an indispensable step for AC delay test and timing validation. Traditionally, this step relies on the construction of a set of worst-case paths based upon discrete timing models. The assumption of discrete timing models can be invalidated by delay effects in the deep-submicron domain, where timing defects and process variation are statistical in nature. In this paper, we study the problem of optimizing critical path selection under both fixed-delay and statistical-delay assumptions. With a novel problem formulation and new theoretical results, we prove that the problem is computationally intractable in both cases. We then discuss practical heuristics and their theoretical performance bounds, and demonstrate that among all heuristics under consideration, only one is theoretically feasible. Finally, we provide consistent experimental results based upon defect-injected simulation using an efficient statistical timing analysis framework.
This paper presents a way of encoding certain kinds of dynamic reconfiguration behaviour in the interface portion of circuit descriptions. This has many advantages: the user of a reconfigurable circuit has some knowledge about the reconfigurable interface of the circuit; static analysis tools can make better decisions about how to schedule virtual hardware; and, most importantly, the compiler can automatically synthesize the required interface between the reconfigurable portions of the system and the regular portions of the design. Several existing models of dynamic reconfiguration from the literature are captured using our type system extension based on sum types. This is especially important in System-on-Chip (SoC) contexts where a reconfigurable IP block may have to communicate over a non-trivial IP bus like CoreConnect™.
Interconnects (wires, buffers, clock distribution networks, multiplexers, and busses) consume a significant fraction of total circuit power. In this work, we demonstrate the importance of optimizing on-chip interconnects for power during high-level synthesis and present a methodology to integrate interconnect power optimization into high-level synthesis. Our binding algorithm not only reduces power consumption in functional units and registers in the resultant register-transfer-level (RTL) architecture, but also optimizes interconnects for power, taking physical design information into account. To estimate interconnect power accurately for deep-submicron (DSM) technologies, wire coupling capacitance is taken into consideration. We observed significant spurious (i.e., unnecessary) switching activity in the interconnects and propose techniques to reduce it. Compared to interconnect-unaware power-optimized circuits, our experimental results show that interconnect power can be reduced by 53.1% on average, while reducing overall power by an average of 26.8% with 0.5% area overhead. Compared to area-optimized circuits, the interconnect power reduction is 72.9% and the overall power reduction is 56.0%, with 44.4% area overhead.
Predictability is the quantified form of accuracy. We propose a predictability-driven design methodology; the novelty lies in defining and using the idea of predictability. To illustrate the basic concepts, we focus on the low-power binding problem. The binding problem for low power was solved in [3], [5], but in the presence of inaccuracies, their claims of optimality are imprecise. Our experiments show that these inaccuracies can be as high as 33%. Our methodology reduces this unpredictability to as low as 11% with a minimal power penalty (7% on average).
We present an algorithm for simplifying the solution of conjunctive Boolean constraints over state and input variables, in the context of constrained random vector generation using BDDs. The basis of our approach is the extraction of "hold-constraints" from the constraint system. Hold-constraints are deterministic and trivially resolvable; in addition, they can be used to simplify the original constraints as well as refine the conjunctive partition. Experiments demonstrate significant reductions in the time and space needed for constructing the conjunction BDDs, and in the time spent in vector generation during simulation.
We address the verification of imprecise datapath circuits with sequential elements. Using the Arithmetic Transform (AT) and its extensions, we verify sequential datapath circuits of finite precision. An efficient formulation of precision verification is presented as a polynomial maximization search over Boolean inputs. Using a branch-and-bound search for the precision error and the block-level composition of ATs, we verify approximated, rounded, and truncated pipelined datapaths.
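The flavor of the search, in miniature: maximize a pseudo-Boolean polynomial by branch and bound. The bound used here (count every live positive term, and negative terms only when already forced) is a simple admissible bound of our own; the paper's AT-based bounds are tighter.

```python
# Branch-and-bound maximization of a polynomial over Boolean inputs.
# A polynomial is a dict {frozenset_of_vars: coefficient};
# frozenset() is the constant term.
def max_over_booleans(poly, variables):
    best = [float("-inf")]

    def upper_bound(assign):
        total = 0.0
        for mon, c in poly.items():
            vals = [assign.get(v) for v in mon]
            if 0 in vals:
                continue                 # term already killed
            if c > 0 or None not in vals:
                total += c               # optimistic for undecided positives
        return total

    def search(assign, rest):
        if not rest:
            best[0] = max(best[0], sum(c for mon, c in poly.items()
                                       if all(assign[v] == 1 for v in mon)))
            return
        if upper_bound(assign) <= best[0]:
            return                       # prune this subtree
        v, tail = rest[0], rest[1:]
        for bit in (1, 0):
            assign[v] = bit
            search(assign, tail)
        del assign[v]

    search({}, list(variables))
    return best[0]

# maximize 3x - 2xy + y over {0,1}^2:
print(max_over_booleans({frozenset({"x"}): 3, frozenset({"x", "y"}): -2,
                         frozenset({"y"}): 1}, ["x", "y"]))   # -> 3
```

In the paper's setting, the polynomial being maximized encodes the precision error (e.g., of a truncated or rounded datapath block) via its Arithmetic Transform.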
An essential problem in component-based design is how to compose components designed in isolation. Several approaches have been proposed for specifying component interfaces that capture behavioral aspects such as interaction protocols, and for verifying interface compatibility. Likewise, several approaches have been developed for synthesizing converters between incompatible protocols. In this paper, we introduce the notion of adaptability as the property that two interfaces have when they can be made compatible by communicating through a converter that meets specified requirements. We show that verifying adaptability and synthesizing an appropriate converter are two faces of the same coin: adaptability can be formalized and solved using a game-theoretic framework, and then the converter can be synthesized as a strategy that always wins the game. Finally we show that this framework can be related to the rectification problem in trace theory.
As technology scales, subthreshold leakage currents grow exponentially and become an increasingly large component of total power dissipation. CAD tools to help model and manage subthreshold leakage currents will be needed for developing ultra low power and high performance integrated circuits. This paper gives an overview of current research to control leakage currents, with an emphasis on areas where CAD improvements will be needed. The first part of the paper explores techniques to model subthreshold leakage currents at the device, circuit, and system levels. Next, circuit techniques such as source biasing, dual Vt partitioning, MTCMOS, and VTCMOS are described. These techniques reduce leakage currents during standby states and minimize power consumption. This paper also explores ways to reduce total active power by limiting leakage currents and optimally trading off between dynamic and leakage power components.
One of the bottlenecks in the recent movement of hardware synthesis from behavioral C programs is the difficulty of reasoning about runtime pointer values at compile time. The pointer analysis problem has been investigated in the compiler community for two decades, yielding efficient, polynomial-time algorithms for context-insensitive analysis. However, at the accuracy level desired for hardware synthesis, namely context- and flow-sensitive analysis, the time and space complexity of the best reported algorithms grows exponentially with program size. In this paper, we present a first step towards a new analysis technology that potentially leads to almost-linear time complexity and sub-linear space complexity even for the most accurate analysis. The key idea behind this efficiency is to implicitly encode the points-to relation in the Boolean domain, thereby capturing the procedure transfer function completely, compactly, and canonically. This represents a wide departure from traditional techniques, all of which explicitly capture the points-to relation using variations of the points-to graph, which must be re-evaluated for different calling contexts. Experiments with our first flow-insensitive algorithm on common benchmarks show promising results.
While previous compiler research indicates that significant improvements in energy efficiency may be possible if properly optimized code is used, the energy constraints under which a given application code should be optimized may not always be available at compile time. More importantly, these constraints may change dynamically during the course of execution. In this work, we present a dynamic recompilation/linking framework with which the energy behavior of a given application can be optimized while the application is executing. Our preliminary experiments indicate that large energy gains are possible through dynamic code recompilation/linking at the expense of a relatively small increase in execution time.
Partitioning an embedded system application among a microprocessor and custom hardware has been shown to improve the performance, power, or energy of numerous examples. The advent of single-chip microprocessor/FPGA platforms makes such partitioning even more attractive. Previous partitioning approaches have partitioned sequential program source code, such as C or C++. We introduce a new approach that partitions at the software binary level. Although source code partitioning is preferable from a purely technical viewpoint, binary-level partitioning provides several very practical benefits for commercial acceptance. We demonstrate that binary-level partitioning yields competitive speedup results compared to source-level partitioning, achieving an average speedup of 1.4 compared to 1.5 for eight benchmarks partitioned on a single-chip microprocessor/FPGA device.
Keywords: Hardware/software partitioning, synthesis, binary translation, decompilation, low power, assembly language, FPGA, codesign.
Net weighting for timing-driven placement has been popular in both industry and academia; it has various advantages, such as low complexity, high flexibility, and ease of implementation. Existing net weighting algorithms, however, are often ad hoc, and no provably good net weighting algorithm is known. In this paper, we present a novel net weighting algorithm based on the concept of path counting and apply it to timing-driven FPGA placement. To our knowledge, this is the first accurate all-path-counting algorithm. Experimental data shows that, compared with the weighting algorithm used in the state-of-the-art FPGA placement package VPR [1], the new algorithm reduces the longest path delay by up to 38.8% (15.6% on average), with no runtime overhead and only a 4.1% increase in total wirelength.
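A sketch of the all-path-counting idea (our generic reconstruction, not the paper's exact weighting): the number of source-to-sink paths through a timing edge is the product of forward and backward path counts at its endpoints.

```python
# Count source-to-sink paths through every edge of a timing DAG and use
# the normalized counts as net weights.
from collections import defaultdict

def path_count_weights(nodes_topo, edges):
    """nodes_topo: nodes in topological order; edges: list of (u, v)."""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
    fwd, bwd = {}, {}
    for n in nodes_topo:                       # paths from any input to n
        fwd[n] = sum(fwd[u] for u in pred[n]) or 1
    for n in reversed(nodes_topo):             # paths from n to any output
        bwd[n] = sum(bwd[v] for v in succ[n]) or 1
    total = sum(fwd[n] for n in nodes_topo if not succ[n])  # all full paths
    return {(u, v): fwd[u] * bwd[v] / total for u, v in edges}
```

Path counts grow exponentially with circuit depth, which Python's arbitrary-precision integers absorb; a placer can then derive a net's weight from the counts of its timing edges, scaled by criticality.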
Design hierarchy plays an important role in timing-driven placement for large circuits. In this paper, we present a new methodology for delay-budgeting-based timing-driven placement. A novel slack assignment approach is described, as well as its application to delay budgeting with design hierarchy information. The proposed timing-driven placement flow is implemented in a placement tool named Dragon (timing-driven mode) and evaluated using an industrial place-and-route flow. Compared to Cadence QPlace, timing-driven Dragon generates placement results with a shorter clock cycle and better routability.
In this paper we present multi-objective hMetis partitioning for simultaneous cutsize and circuit delay minimization. We change the partitioning process itself by introducing a new objective function that incorporates a truly path-based delay component for the most critical paths. To prevent semi-critical paths from becoming critical, the traditional slack-based delay component is also included in the cost function. The proposed timing-driven partitioning algorithm is built on top of the highly efficient hMetis algorithm. Simulation results show that a 14% average delay improvement can be obtained, and our algorithm allows a smooth trade-off between cutsize and delay.
This paper introduces a new hybrid ASIC/FPGA chip architecture that is being developed in collaboration between IBM and Xilinx, and highlights some of the design challenges this offers for designers and CAD developers. We will review recent data from both the ASIC and FPGA industries, including technology features, and trends in usage and costs. This background data indicates that there are advantages to using standard ASICs and FPGAs for many applications, but technical and financial considerations are increasingly driving the need for a hybrid ASIC/FPGA architecture at specific volume tiers and technology nodes. As we describe the hybrid chip architecture we will point out evolving tool and methodology issues that will need to be addressed to enable customers to effectively design hybrid ASIC/FPGAs. The discussion will highlight specific automation issues in the areas of logic partitioning, logic simulation, verification, timing, layout and test.
This paper discusses Voltage Islands, a system architecture and chip implementation methodology that can be used to dramatically reduce active and static power consumption in System-on-Chip (SoC) designs. As technology scales for increased circuit density and performance, the need to reduce power consumption grows in significance as designers strive to utilize the advancing silicon capabilities. The consumer product market further drives the need to minimize chip power consumption. Effective use of Voltage Islands for meeting SoC power and performance requirements, while meeting turnaround-time (TAT) demands, requires novel approaches throughout the design flow as well as special circuit components and chip powering structures. This paper outlines methods being used today to design Voltage Islands in a rapid-TAT product development environment, and discusses the industry EDA advances needed to create an industry-wide Voltage Island design capability.
Future high performance microprocessor design with technology scaling beyond 90nm will pose two major challenges: (1) energy and power, and (2) parameter variations. Design practice will have to change from deterministic design to probabilistic and statistical design. This paper discusses circuit techniques and design automation opportunities to overcome the challenges.
A novel circuit topology for inductive coupling between interconnecting wires is presented. The model is local, i.e., only coupling between neighboring wires is explicitly modeled. However, the topology accounts for long-range coupling by propagating the vector potential from one wire to the next. Examples of model calibration, both directly from layout and as model-order reduction of a given inductance matrix, are presented for simple wiring structures.
We develop INDUCTWISE, a robust, efficient, and accurate tool that integrates inductance extraction and simulation. This paper advances state-of-the-art inductance extraction and simulation techniques and contains two major parts. In the first part, the INDUCTWISE extractor, we show that the recently proposed inductance matrix sparsification algorithm, the K-method [1], despite its great efficiency benefits, has a major stability flaw; we provide both a counterexample and a remedy. A window selection algorithm is also presented to preserve the accuracy of the sparsification method. The second part, the INDUCTWISE simulator, demonstrates the efficiency of integrating the nodal analysis formulation with the improved K-method. Experimental results show that INDUCTWISE achieves over 250x speedup compared to SPICE3. The proposed sparsification algorithm accelerates the simulator by another 175x and speeds up the extractor by 23.4x within 0.1% error. INDUCTWISE can extract and simulate a 118K-conductor RKC circuit within 18 minutes. It has been well tested and released on the web for public use (http://vlsi.ece.wisc.edu/Inductwise.htm).
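To illustrate the kind of windowing involved (a generic sketch under our own simplifications, not INDUCTWISE's algorithm): for each conductor, invert only the small inductance sub-matrix on a window of nearby conductors and keep that row of the inverse, yielding a sparse approximation of K = L^-1.

```python
# Window-based sparsification of K = L^{-1} (illustrative sketch).
import numpy as np

def windowed_K(L, windows):
    """L: n x n inductance matrix; windows[i]: indices of conductors near i
    (must contain i). Returns a sparse (mostly-zero) approximation of K."""
    n = L.shape[0]
    K = np.zeros_like(L, dtype=float)
    for i in range(n):
        w = sorted(windows[i])
        inv = np.linalg.inv(L[np.ix_(w, w)])   # local inversion only
        K[i, w] = inv[w.index(i), :]           # keep row i's window entries
    return 0.5 * (K + K.T)                     # symmetrize the result
```

Note that symmetrizing does not by itself guarantee positive definiteness; preserving stability is exactly the issue the abstract flags, and it is what the paper's window selection and remedy address.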
The simulation of on-chip inductance using PEEC-based circuit analysis methods often requires the solution of a subproblem where an extracted inductance matrix must be multiplied by a current vector, an operation with a high computational cost. This paper presents a highly accurate technique, based on a precorrected-FFT approach, that speeds up this calculation. Instead of computing the inductance matrix explicitly, the method exploits the properties of the inductance calculation procedure while implicitly considering the effects of all of the inductors in the layout. An optimized implementation of the method has been applied to accurately simulate large industrial circuits with up to 121,000 inductors and nearly 7 billion mutual inductive couplings in about 20 minutes. Techniques for trading off the CPU time with the accuracy using different approximation orders and grid constructions are also illustrated. Comparisons with a block diagonal sparsification method in terms of accuracy, memory and speed demonstrate that our method is an excellent approach for simulating on-chip inductance in a large circuit.
This paper describes the similarities and differences between two widely publicized methods for analyzing oscillator phase behavior, presented in [3] and [6]. The two methods are closely related: while the one in [3] can be shown to be mathematically more exact, the approximate method in [6] is somewhat simpler, facilitating its use for analysis and design. In this paper, we show that, for stationary input noise sources, both methods produce equal results for the oscillator's phase noise behavior. When considering injection locking, however, the methods yield different results, with the approximation in [6] unable to predict the locking behavior. In general, when the input signal causing the oscillator phase perturbations is non-stationary, the exact model produces the correct results, while results obtained using the approximate model break down.
Circuit-level simulation of ΔΣ modulators is a time-consuming task (taking one or more days for meaningful results). While there is a great variety of techniques and tools that speed up the simulation of discrete-time (DT) ΔΣ modulators, there is no rigorous methodology, implemented in a tool, to efficiently simulate and design the continuous-time (CT) counterpart. Yet, given today's low-power, high-accuracy and/or very high-speed demands on A-to-D converters, designers are often forced to resort to CT ΔΣ topologies. In this paper, we present a method for the high-level simulation of continuous-time ΔΣ modulators that is based on behavioral models and exhibits the best trade-off between accuracy, speed, and extensibility compared to other possible techniques, which are reviewed briefly in this work. A user-friendly tool implementing this methodology is then presented. Non-idealities such as finite gain, finite GBW, and output impedance, as well as nonlinearities such as clipping and harmonic distortion and the important effect of jitter, are modeled. Finally, experiments were carried out using the tool, exploring important design trade-offs.
Fourier-envelope algorithms are an important component of the mixed-signal/RF verification toolbox. In this paper, we address the unpredictability and lack of robustness that have been reported for these algorithms. We show that the problem stems from fast oscillations in envelopes that are expected to be slowly varying, and demonstrate that this is related to the fact that the envelope equations are always stiff, whether or not the underlying system is. We show that careful choice of envelope initial conditions is necessary to obtain useful solutions, and propose two techniques for finding good initial conditions. Applying these, and solving the envelope equations with stiffly-stable numerical methods, we improve the robustness and reliability of Fourier-envelope methods. We illustrate the new methods with a direct-downconversion mixer circuit.
Shrinking process geometries and the increasing use of IP components in SoC designs give rise to new problems in routing and buffer insertion. A particular concern is that cross-chip routing will require multiple clock cycles. Another is the integration of independently clocked components. This paper explores simultaneous routing and buffer insertion in the context of single and multiple clock domains. We present optimal and efficient polynomial algorithms that can be used to estimate communication overhead for interconnect and resource planning in single and multi-clock domain systems. Experimental results verify the correctness and practicality of our approach.
As VLSI technology scales down, electromigration becomes one of the major concerns in high-performance IC design, for both power networks and signal interconnects. For a uniform-width metal interconnect, the current flowing through the driving point is much larger than that flowing through the fan-out point, since much of the current bypasses to ground through the parasitic capacitance. This causes the lifetime of the driving point to be much shorter than that of the fan-out point due to electromigration. To avoid breakdown at the driving point, wire sizing is an effective solution. We therefore present a wire shape for which the current density, and hence the lifetime, is uniform along the wire. SPICE simulation results confirm the uniformity of the current density of this wire shape. Under the same current density bound, we demonstrate that chip area and power consumption are significantly reduced for this wire shape compared to a uniform-width wire. The wire shape functions we derive are continuous; however, it is not necessary to reproduce the continuous shape ultra-accurately on silicon, since rounding the continuous shape to the nearest available litho width does not degrade the uniformity of the current density.
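To see why a non-uniform shape can equalize current density, consider a simplified distributed model in which the shunt (capacitive) current per unit length is proportional to the local width w(x) (our simplification; the paper's derivation uses the full parasitics). With metal thickness t and target density J_0:

```latex
% k: shunt current drawn per unit width per unit length (assumed constant)
\frac{dI}{dx} = -k\,w(x),\qquad
J(x) = \frac{I(x)}{t\,w(x)} \equiv J_0
\;\Longrightarrow\;
\frac{dI}{dx} = -\frac{k}{J_0\,t}\,I(x)
\;\Longrightarrow\;
w(x) = w(0)\,e^{-kx/(J_0 t)} .
```

That is, the width tapers continuously from the driver toward the fan-out, shrinking in proportion to the remaining current, consistent with the continuous shape functions the paper derives.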
We propose to introduce redundant interconnects for manufacturing yield and reliability improvement. By introducing redundant interconnects, the potential for open faults is reduced at the cost of an increased potential for short faults; overall, manufacturing yield and fault tolerance can be improved. We focus on a post-processing, tree augmentation approach which can be easily integrated in current physical design flows. Our contributions are as follows:
- We formulate the problem as a variant of the classical 2-edge-connectivity augmentation problem in which we take into account such practical issues as the wirelength increase budget, routing obstacles, and the use of Steiner points.
- We show that an optimum solution can always be found on the Hanan grid defined by the terminals and the corners of the feasible routing region.
- We give a compact integer program formulation which, for nets with up to 100 terminals, is solved in practical runtime by the commercial optimization package CPLEX.
- We give a well-scaling greedy algorithm which has practical runtime up to 1,000 terminals and comes on average within 1-2% of the optimum computed by CPLEX (see the sketch following this abstract).
- We give a comprehensive experimental study comparing the solution quality and runtime of our methods with the best reported methods for 2-edge-connectivity augmentation, including a sophisticated heuristic based on minimum-weight branchings [9] and a recent genetic algorithm [14].
Experiments on randomly generated and industry testcases show that our greedy augmentation method achieves a significant increase in reliability (as measured by the percentage of biconnected tree edges) with a very small increase in wirelength. For example, on 1,000-terminal nets the average percentage of biconnected tree edges is 34.19% for a wirelength increase of only 1%, and 87.73% for a wirelength increase of 20%. SPICE simulations on industry routed nets show that non-tree routing has the additional benefit of reducing maximum sink delay by an average of 28.26% compared to Steiner routing, and by an average of 3.72% compared to timing-optimized routing. SPICE simulations further imply that non-tree routing has smaller delay variation due to process variability.
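The promised sketch of a greedy augmentation loop (illustrative, not the paper's exact algorithm): repeatedly add the candidate non-tree edge that biconnects the most still-bridging tree edges per unit of added wirelength, within a wirelength budget. Candidate edges and their lengths are inputs here; the paper restricts them to the Hanan grid of the feasible routing region.

```python
import networkx as nx

def greedy_augment(tree, candidates, length, budget):
    """tree: nx.Graph routed tree; candidates: list of (u, v) non-tree edges;
    length[(u, v)]: wirelength of a candidate; budget: max added wirelength."""
    G = tree.copy()
    tree_edges = {frozenset(e) for e in tree.edges()}
    spent = 0.0
    while True:
        bridges = {frozenset(e) for e in nx.bridges(G)} & tree_edges
        best, best_gain = None, 0.0
        for u, v in candidates:
            l = length[(u, v)]
            if G.has_edge(u, v) or spent + l > budget:
                continue
            G.add_edge(u, v)               # tentatively add, measure healing
            healed = bridges - {frozenset(e) for e in nx.bridges(G)}
            G.remove_edge(u, v)
            gain = len(healed) / max(l, 1e-9)
            if healed and gain > best_gain:
                best, best_gain = (u, v), gain
        if best is None:
            return G                       # no improving edge fits the budget
        G.add_edge(*best)
        spent += length[best]
```

The fraction of tree edges that are no longer bridges of the augmented graph corresponds to the "biconnected tree edges" metric reported above.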
For many years, CMOS process scaling has allowed a steady increase in the operating frequency and integration density of integrated circuits. Only recently, however, have we reached a point where it takes several clock cycles for global signals to traverse a complex digital system such as a modern microprocessor. Thus, interconnect latency must be taken into account in current and future design tools at the architectural as well as the synthesis level. To this purpose, this work proposes a new latency-aware technique for the performance-driven concurrent insertion of flip-flops and repeaters in VLSI circuits. Overwhelming evidence is also presented showing an exponential increase in the number of pipelined interconnects with process scaling, for high-performance microprocessors as well as high-end ASICs. This increase signals the need for a radical change in current design methodologies to cope with this emerging problem.
As the scale of system integration continues to grow, on-chip communication is becoming the ultimate bottleneck of system performance and the primary determinant of system architecture. In this paper we propose a throughput-driven synthesis methodology for on-chip communication fabrics based on optimized bus models. Compared with traditional delay-driven, wire-by-wire planning methods, the throughput-driven methodology provides a feasible and accurate system-level solution that addresses delay and congestion problems simultaneously during early-phase design planning. Unlike conventional methods based on rather inaccurate RC models and simplistic delay metrics, in our methodology the communication fabrics are characterized in terms of realistic Partial Element Equivalent Circuits (PEEC) extracted from the multi-layer interconnects and transistor-level transient analysis via SPICE-like tools. The characterized models facilitate a flexible interconnect fabric optimization engine that can be embedded into a system planner for throughput-driven synthesis. Engineering trade-offs involving repeater area and interconnect power consumption are also considered as part of this methodology.
As technology advances towards billion-transistor systems, the cost of complex wire networks will require area-efficient wiring methodologies. This paper explores the trade-offs between wire latency, throughput, and area for deep-submicron (DSM) interconnect technologies. From basic physical models, optimal wire sizings for repeater networks are rigorously derived and compared to HSPICE simulations. Key case studies from 250nm to 70nm technologies reveal that significant wire area reduction (20-50%) can be achieved with optimal wire sizing to maximize the throughput per unit wire area.
With increasing design sizes and the adoption of System-on-a-Chip (SoC) methodology, design synthesis and test automation tools are hitting capacity and performance bottlenecks. Currently, hierarchical synthesis flows for large designs lack complete design-for-test (DFT) support. In this paper, we present a solution, involving the introduction of test models into a traditional DFT synthesis flow, that we term Hierarchical DFT Synthesis (HDS). We discuss the use of Core Test Language (CTL) based test models combined with physical and timing models to provide a complete flow for chip-level DFT. In doing so we address some challenges the new flow presents, such as Design Rule Checking (DRC), DFT architecting, and optimization. We describe methods to overcome these challenges, thereby presenting a new methodology for handling complex next-generation designs.
We present a new method of built-in self-test (BIST) for sequential circuits and systems-on-a-chip (SoC) using characteristic faults and circuit-specific spectral information in the form of one or more Hadamard coefficients. The Hadamard coefficients are extracted from the test sequences for a small set of characteristic faults of the circuit. We show that detecting these characteristic faults is sufficient to detect the vast majority of the remaining faults in the circuit. The small number of characteristic faults allows us to reduce the number of coefficients necessary for BIST. State relaxation is performed on the compacted test sequences to further reduce the spectral noise. Since we target only a very small number of characteristic faults, the execution times for computing the spectra are greatly reduced. Our experimental results show that the new method achieves high BIST coverage with lower computational effort and storage, using very few characteristic faults.
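The spectral extraction step itself is standard: a fast Walsh-Hadamard transform of a ±1 encoding of a scan input's bit-stream, whose largest-magnitude coefficients pick out the dominant Hadamard basis vectors. A minimal sketch (how the coefficients then drive BIST pattern generation follows the paper):

```python
# In-place fast Walsh-Hadamard transform (length must be a power of two).
def fwht(bits):
    x = [1 if b else -1 for b in bits]   # 0/1 stream -> +/-1 encoding
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            for j in range(i, i + h):
                x[j], x[j + h] = x[j] + x[j + h], x[j] - x[j + h]
        h *= 2
    return x

print(fwht([1, 0, 1, 1, 0, 1, 1, 1]))    # 8-point spectrum of a bit-stream
```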
Scan-based testing methodologies remedy the testability problem of sequential circuits; yet they suffer from prolonged test time and excessive test power due to numerous shift operations. The high density of the unspecified bits in test data enables the utilization of the test response data captured in the scan chain for the generation of the subsequent test stimulus, thus reducing both test time and test data volume. The proposed scan-based test scheme accesses only a subset of scan cells for loading the subsequent test stimulus while freezing the remaining scan cells with the response data captured, thus decreasing the scan chain transitions during shift operations. The experimental results confirm the significant reductions in test application time, test data volume and test power achieved by the proposed scan-based testing methodology.
This paper describes an optimization technique able to optimize a complete wireless receiver architecture in a reasonable amount of time. The optimizer alternates between SPICE-level optimizations of simple building blocks and a full-architecture optimization of the whole, based on accurate models of the building blocks. The models of the building blocks are interpolated over the data points acquired in the SPICE-level simulations. The technique has been applied to the optimization of an architecture for a GPS receiver; the optimal design has been implemented in a standard 0.25µm CMOS process.
The relentless move toward single-chip integration of RF, analog, and digital blocks results in challenging noise coupling effects that must be controlled. In this paper, we propose a methodology that uses a suite of commercial tools in combination with a high-speed extractor based on an innovative semi-analytical method to deal with noise coupling problems and enable RF designers to achieve first-silicon success for their chips. The integration of the methodology into a typical RF design flow is illustrated, and its successful application to the single-chip integration of a transceiver is demonstrated.
In this paper we present a method for the design of analog-to-digital converters (ADCs). This method computes the sizes of the different components (transistors, capacitors, etc.) in a predefined ADC topology so that the design specifications are met in the desired process technology.
The method is based on formulating the ADC design constraints, such as specifications on power, signal-to-noise ratio (SNR), area, and sampling frequency, in special convex form in terms of the component sizes of the ADC and intermediate design variables. More specifically, we cast the problem of sizing the components of the ADC as a geometric program, so all design constraints are formulated as posynomial inequality or monomial equality constraints. Very efficient numerical algorithms are then used to solve the resulting geometric program and to compute the component sizes of an ADC that meets the desired specifications. The synthesis method is fast and determines the globally optimal design; in particular, the final solution is completely independent of the starting point (which can even be infeasible), and infeasible specifications are unambiguously detected.
This paper introduces the concept of hierarchical problem formulation within a geometric programming framework. This modular formulation allows a high degree of re-use of the ADC posynomial model.
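A toy geometric program in the same spirit, solved with CVXPY's geometric-programming mode (the variables, constants, and constraint forms below are invented for illustration; the paper's ADC model is far richer):

```python
# Minimize power subject to posynomial/monomial constraints.
import cvxpy as cp

gm = cp.Variable(pos=True)        # transconductance
C = cp.Variable(pos=True)         # sampling capacitor
power = 2.0 * gm                  # monomial power model, P ~ gm
constraints = [
    1e9 * C / gm <= 1.0,          # settling: bandwidth gm/C >= 1e9 rad/s
    4e-12 / C <= 1.0,             # kT/C noise: C >= 4 pF for target SNR
    C <= 10e-12,                  # area proxy: cap no larger than 10 pF
]
prob = cp.Problem(cp.Minimize(power), constraints)
prob.solve(gp=True)               # log-log convex (geometric) program
print(gm.value, C.value, prob.value)
```

Because a geometric program is convex after a log-log change of variables, the solver either returns the global optimum or certifies infeasibility, which is exactly the starting-point independence the abstract highlights.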
Modeling the exponentially varying current distributions in conductor interiors associated with high-frequency interconnect behavior causes a rapid increase in the computation time and memory required even by recently developed fast electromagnetic analysis programs. In this paper we describe a procedure to numerically generate a set of basis functions which efficiently represent conductor current variation, thus improving solver efficiency. The method is based on solving a sequence of template problems and is easily generalized to arbitrary conductor cross-sections. Results are presented to demonstrate that the numerically computed basis functions are seven to twenty times more efficient than the commonly used piecewise-constant basis functions.
Categories & Subject Descriptors: B.7.2 Simulation, B.8.2 Performance Analysis & Design Aids, I.6 Simulation & Modeling.
General Terms: Algorithms, Performance.
Keywords: Skin Effect, Proximity Effect, Parasitic extraction, Interconnect analysis.
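The flavor of the construction, with the template solves abstracted away (a generic sketch, not the paper's procedure): stack the current distributions from the template problems as columns of a snapshot matrix and keep the leading singular vectors as the reduced basis.

```python
import numpy as np

def current_basis(snapshots, tol=1e-3):
    """snapshots: (n_filaments, n_templates) template current solutions.
    Keep singular vectors until the discarded energy falls below tol."""
    U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(energy, 1.0 - tol)) + 1
    return U[:, :k]          # orthonormal basis with k << n_filaments
```

Expanding each conductor's current in this basis replaces many piecewise-constant filament unknowns with a handful of coefficients, which is where the reported 7-20x efficiency gain comes from.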
We investigate appropriate regimes for transmission-line propagation of signals on digital integrated circuits. We start from exact solutions to the transmission line equations proposed by Davis and Meindl and make appropriate modifications due to finite rise time; these affect the delay calculation and the hypotheses pertaining to the constancy of the electromagnetic parameters, and we study these effects in detail. To find the domain of physical variables where transmission-line behavior is feasible, we pose the problem as a nonlinear minimization problem in a space spanned by two continuous variables, with four parameters. From the resulting solutions, and employing monotonicity properties of the functional, we extract regimes of validity. These regimes happen to be commensurate with what is reachable and doable with today's leading technologies. We complete this study with a qualitative analysis of driver insertion in the presence of transmission lines. The resulting configurations are suitable for the development of an improved clock design discipline.
In this paper, we present a novel wire-duplication-based interconnect modeling technique. The proposed technique exploits the sparsity of the L^-1 matrix, where L is the inductance matrix, and constructs a sparse and stable equivalent RLC circuit by windowing the original inductance matrix. The model avoids matrix inversions and, most importantly, is more accurate and more efficient than many existing techniques.
The challenge of extending Moore's Law past the physical and economic barriers of present semiconductor technologies calls for novel nanoelectronic solutions. Circuits composed of mixed silicon semiconductors and nanoelectronics can provide a means for gradually switching technology paradigms. We suggest a design methodology to accompany this concept. Furthermore, we explore design tradeoffs for a nanoscale crossbar technology that supports CMOS/nano co-design.
Reversible or information-lossless circuits have applications in digital signal processing, communication, computer graphics and cryptography. They are also a fundamental requirement in the emerging field of quantum computation. We investigate the synthesis of reversible circuits that employ a minimum number of gates and contain no redundant input-output line-pairs (temporary storage channels). We prove constructively that every even permutation can be implemented without temporary storage using NOT, CNOT and TOFFOLI gates. We describe an algorithm for the synthesis of optimal circuits and study the reversible functions on three wires, reporting distributions of circuit sizes. Finally, in an application important to quantum computing, we synthesize oracle circuits for Grover's search algorithm, and show a significant improvement over a previously proposed synthesis algorithm.
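A small sketch of optimal synthesis by breadth-first search over the NOT/CNOT/TOFFOLI library on three wires (the gate encoding as permutations of the eight states is ours; the paper's synthesizer adds much stronger pruning). On three wires this library reaches every permutation of the eight states, so the search tabulates the size distribution mentioned above.

```python
from itertools import combinations

def gates(n=3):
    gs = {}
    for t in range(n):                                 # NOT(t)
        gs[f"NOT{t}"] = tuple(s ^ (1 << t) for s in range(1 << n))
    for c in range(n):                                 # CNOT(c -> t)
        for t in range(n):
            if c != t:
                gs[f"CNOT{c}{t}"] = tuple(
                    s ^ (1 << t) if s >> c & 1 else s for s in range(1 << n))
    for c1, c2 in combinations(range(n), 2):           # TOFFOLI(c1,c2 -> t)
        for t in range(n):
            if t not in (c1, c2):
                gs[f"TOF{c1}{c2}{t}"] = tuple(
                    s ^ (1 << t) if (s >> c1 & 1) and (s >> c2 & 1) else s
                    for s in range(1 << n))
    return gs

def bfs_all(n=3):
    gs = gates(n)
    ident = tuple(range(1 << n))
    dist = {ident: []}                  # permutation -> minimal gate list
    frontier = [ident]
    while frontier:
        nxt = []
        for p in frontier:
            for name, g in gs.items():
                q = tuple(g[p[s]] for s in range(1 << n))  # append gate g
                if q not in dist:
                    dist[q] = dist[p] + [name]
                    nxt.append(q)
        frontier = nxt
    return dist
```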
As the design of integrated MicroElectroMechanical Systems (MEMS) matures, there is an increasing need for verification of MEMS layouts. This requires a mixed-domain LVS (layout-versus-schematic) methodology capable of extracting an integrated schematic from the mixed-domain layout and verifying it against the designed schematic. This paper reports on a prototype implementation of MEMS LVS and a MEMS extractor which, in addition to reconstructing the extracted schematic, also captures the domain-specific parasitics in the individual devices. This schematic is then used by a custom schematic-versus-schematic comparator to match the connectivity of various elements between the designed and extracted schematics. Finally, simulation of the extracted schematic also helps in capturing the true behavior of the system.
Keywords: MEMS extraction, parasitics, MEMS LVS, verification, integrated MEMS
Schematic-based, lumped, parameterized behavioral modeling and simulation methodologies have become available since the emergence of analog HDLs. They greatly ease iterative hierarchical multi-domain simulation, which is critical to the design of MEMS. NODAS is one such tool, with models written in VerilogA and simulation performed within the Cadence framework. This paper focuses on several key modeling issues in NODAS, including schematic representation, element communication, linear, nonlinear, and multi-domain modeling, and extensibility to new physical effects, processes, and physical domains. A nonlinear beam model and an electrostatic gap model are discussed as examples. Comparison of simulations to finite element analyses and experimental data verifies the accuracy of the models and validates the simulation methodology.
Keywords: MEMS, schematic-based, lumped, parameterized, behavioral, nonlinear beam, electrostatic gap
This paper presents a novel enumerative approach, with provable and efficient pruning techniques, for dual threshold voltage (Vt) assignment at the transistor level. Since the use of low Vt may entail a substantial increase in leakage power, we formulate the problem as one of combined optimization for leakage-delay tradeoffs under Vt optimization and sizing. Based on an analysis of the effects of these two transforms on the delay and leakage, we justify a two-step procedure for performing this optimization. Results are presented on the ISCAS85 benchmark suite favorably comparing our approach with an existing sensitivity-based optimizer.
Due to increasing clock speeds, increasing design sizes, and shrinking technologies, it is becoming more and more challenging to distribute a single global clock throughout a chip. In this paper we study the effect of using a Globally Asynchronous, Locally Synchronous (GALS) organization for a superscalar, out-of-order processor, in terms of both power and performance. To this end, we propose a novel modeling and simulation environment for multiple-clock cores with static or dynamically variable voltages for each synchronous block. Using this design exploration environment, we assess the power/performance trade-offs available for Multiple Clock, Single Voltage (MCSV) as well as Multiple Clock, Dynamic Voltage (MCDV) cores. Our results show that MCSV processors are 10% more power efficient than single-clock, single-voltage designs, with a performance penalty of about 10%. By exploiting the flexibility of independent dynamic voltage scaling of the various clock domains, the power efficiency of GALS designs can be improved by 12% on average, and by up to 20% more in select cases. The power efficiency of MCDV cores becomes comparable to that of Single Clock, Dynamic Voltage (SCDV) cores, while being up to 8% better in some cases. Overall, MCDV cores consume 22% less power at an average 12% performance loss.
An effective way to compare logic techniques, logic families, or cell libraries is by means of power (or area) versus delay plots, since the efficiency of achieving a particular delay is of crucial significance. In this paper we describe a method of producing an optimized power versus delay curve for a combinational circuit. We then describe a method for comparing the relative merits of a set of power versus delay curves for a circuit, each generated with a different cell library. Our results indicate that very few combinational functions need to be in a cell library, at most 11. The power-delay points achieved by Design Compiler from Synopsys using the state-of-the-art Artisan Sage-X library compare unfavorably to our approach. In terms of minimum energy-delay product, our approach is superior by 79% on average. Our approach yields the same delay points with a 107% savings in power consumption, on average. We also show that the specified VDD for a process technology should only be used for the absolute fastest implementations of a circuit.
In this paper, we present Forge, an optimal algorithm for gate sizing using the Elmore delay model. The algorithm utilizes Lagrangian relaxation with a fast gradient-based pre-processing step that provides an effective set of initial Lagrange multipliers. Compared to the previous Lagrangian-based approach, Forge is considerably faster and does not have the inefficiencies due to difficult-to-determine initial conditions and constant factors. We compared the two algorithms on 30 benchmark designs, on a Sun UltraSparc-60 workstation. On average Forge is 200 times faster than the previously published algorithm. We then improved Forge by incorporating a slew-rate-based convex delay model, which handles distinct rise and fall gate delays. We show that Forge is 15 times faster, on average, than the AMPS transistor-sizing tool from Synopsys, while achieving the same delay targets and using similar total transistor area.
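A condensed sketch of the Lagrangian machinery on an inverter chain (our simplification to a single delay constraint with fixed-step subgradient updates; Forge handles general circuits, distinct rise/fall delays, and, crucially, a gradient-based initialization of the multipliers):

```python
# Minimize sum_i x_i + lam*(D(x) - T) for an inverter chain with
# Elmore-style stage delays d_i = (r/x_i)*load(i); outer loop updates lam.
import math

def size_chain(n, T, r=1.0, c=1.0, cL=5.0, iters=300, step=0.2):
    x = [1.0] * n                     # gate sizes (area cost = sum of sizes)
    lam = 1.0
    load = lambda i: c * x[i + 1] if i + 1 < n else cL  # cap seen by stage i
    for _ in range(iters):
        for i in range(n):            # closed-form coordinate minimization
            own = 1.0                 # d(area)/dx_i
            if i > 0:
                own += lam * r * c / x[i - 1]   # x_i also loads stage i-1
            x[i] = math.sqrt(lam * r * load(i) / own)   # stationary point
        D = sum(r / x[i] * load(i) for i in range(n))
        lam = max(1e-6, lam + step * (D - T))           # subgradient step
    return x, D

sizes, delay = size_chain(n=4, T=6.0)
```

A fixed-step subgradient loop like this converges slowly and is sensitive to the starting multiplier, which is precisely the inefficiency that Forge's gradient-based pre-processing of the initial Lagrange multipliers targets.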
In this paper, we present a novel sequence generator based on a Markov chain model. Specifically, we formulate the problem of generating a sequence of vectors with given average input probability p, average transition density d, and spatial correlation s as a transition matrix computation problem, in which the matrix elements are subject to constraints derived from the specified statistics. We also give a practical heuristic that computes such a matrix and generates a sequence of l n-bit vectors in O(nl + n^2) time. Derived from a strongly mixing Markov chain, our generator yields binary vector sequences with accurate statistics, high uniformity, and high randomness. Experimental results show that our sequence generator can cover more than 99% of the parameter space. Sequences of 2,000 48-bit vectors are generated in less than 0.05 seconds, with average deviations of the signal statistics p, d, and s equal to 1.6%, 1.8%, and 2.8%, respectively.
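For a single bit, the construction can be made fully explicit. Writing a = P(0→1) and b = P(1→0), stationarity gives p = a/(a+b), and the transition density is d = p·b + (1−p)·a = 2ab/(a+b); solving yields a = d/(2(1−p)) and b = d/(2p). A per-bit sketch follows (spatial correlation s, which the paper's transition-matrix construction handles, is ignored here):

```python
import random

def markov_bits(p, d, length, seed=0):
    """Requires d <= 2*(1-p) and d <= 2*p so flip probabilities stay <= 1."""
    rng = random.Random(seed)
    a, b = d / (2 * (1 - p)), d / (2 * p)    # P(0->1), P(1->0)
    bit = 1 if rng.random() < p else 0       # start from the stationary dist.
    out = []
    for _ in range(length):
        out.append(bit)
        if rng.random() < (a if bit == 0 else b):
            bit ^= 1
    return out
```

Generating one such stream per input gives an n-bit vector sequence whose per-bit p and d match the targets exactly in the stationary limit.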
Our generator enables the detailed study of power macromodeling. Using our tool and the ISCAS-85 benchmark circuits, we have assessed the sensitivity of power dissipation to the three input statistics p, d, and s. Our investigation reveals that power is most sensitive to transition density, while only occasionally exhibiting high sensitivity to signal probability and spatial correlation. Our experiments also show that input signal imbalance can cause estimation errors as high as 100% in extreme cases, although errors are usually within 25%.
This paper proposes a circuit power estimation method using Bayesian inference and neural networks. Based on the statistical distributions of circuit leakage power and switching energy, the entire state and transition space of a circuit is classified by neural networks into a limited number of classes that represent different average power consumption values. This technique enables efficient table lookup of circuit power over the entire state and transition space. The theoretical basis of Bayesian inference and feature extraction for neural networks of circuit leakage power and switching energy are discussed. Experiments on a wide range of circuit topologies demonstrate the robustness of the proposed method for estimating the circuit leakage power of all possible states and the switching energy of all possible transitions.
Delay due to capacitive coupling of interconnects has become an important reliability issue in the design of nanometer circuits. In this paper we present a probabilistic approach to analyzing the impact of capacitive coupling noise on signal delay. The variation in the delay is due to the variation in the relative arrival times of the aggressors and the victim. We derive expressions for the moments of the victim voltage in the presence of noise, and from these we compute estimates of the earliest and latest possible arrival times of the victim. We compare the analytical results with Monte Carlo simulations using SPICE. Even though the analytical calculations are 200 times faster than the Monte Carlo simulations, the differences in the estimates of the mean and standard deviation of the arrival time are no more than 2.8%. In addition, the width of the timing intervals using the proposed approach is reduced by as much as 48% at a confidence level of 0.984; that is, 98.4% of the Monte Carlo simulations result in an arrival time that falls within the derived interval, which is 48% shorter.
Every 18 to 24 months, the areal density of VLSI doubles and the predicted date for the End Of CMOS Scaling is pushed out approximately 18 to 24 months. This rate of growth has been controlled mainly by the increasing capabilities of lithographic patterning. However, the rate of improvement of lithography systems in key physical parameters such as illumination wavelength has begun to slow. To make up the shortfall, lithography has increasingly turned to using CAD tools to transform the geometric shapes in designs into shapes on photomasks which have been compensated for systematic pattern-distorting effects. As the requirements for this compensation grow, however, it becomes increasingly difficult to hide in post-tapeout data preparation, and must be considered in back-end design tools and methodologies. We describe a number of the challenges to lithographic patterning, highlighting the factors that limit "physical scaling" and introduce the layout-to-mask shape transformations that compensate for these limitations. We describe the implementation of these transformations in general-purpose and specialized CAD tools, pointing out challenges like growth of computation effort. Finally, we describe how limitations of post-tapeout compensation drive the need for "litho-aware" physical design tools, showing examples in cell design, place-and-route, and layout migration.
New electronics technologies are emerging which may carry us beyond the limits of lithographic processing down to molecular-scale feature sizes. Devices and interconnects can be made from a variety of molecules and materials including bistable and switchable organic molecules, carbon nanotubes, and single-crystal semiconductor nanowires. They can be self-assembled into organized structures and attached onto lithographic substrates. This tutorial reviews emerging molecular-scale electronics technology for CAD and system designers and highlights where ICCAD research can help support this technology.
Within the verification community, there has been a recent increase in interest in Quantified Boolean Formula evaluation (QBF) as many interesting sequential circuit verification problems can be formulated as QBF instances. A closely related research area to QBF is Boolean Satisfiability (SAT). Recent advances in SAT research have resulted in some very efficient SAT solvers. One of the critical techniques employed in these solvers is Conflict Driven Learning. In this paper, we adapt conflict driven learning for application in a QBF setting. We show that conflict driven learning can be regarded as a resolution process on the clauses. We prove that under certain conditions, tautology clauses obtained from resolution in QBF also obey the rules for implication and conflicts of regular (nontautology) clauses; and therefore they can be treated as regular clauses and used in future search. We have implemented this idea in a new QBF solver called Quaffle and our initial experiments show that conflict driven learning can greatly speed up the solution process for most of the benchmarks we tested.
Optimized solvers for the Boolean Satisfiability (SAT) problem have many applications in areas
such as hardware and software verification, FPGA routing, planning, etc. Further uses are
complicated by the need to express "counting constraints" in conjunctive normal form (CNF).
Expressing such constraints by pure CNF leads to more complex SAT instances. Alternatively,
those constraints can be handled by Integer Linear Programming (ILP), but generic ILP solvers
may ignore the Boolean nature of 0-1 variables. Therefore specialized 0-1 ILP solvers extend
SAT solvers to handle these so-called "pseudo-Boolean" constraints.
This work provides an update on the on-going competition between generic ILP techniques and
specialized 0-1 ILP techniques. To make a fair comparison, we generalize recent ideas for fast
SAT-solving to more general 0-1 ILP problems that may include counting constraints and
optimization. Another aspect of our comparison is evaluation on 0-1 ILP benchmarks that
originate in Electronic Design Automation (EDA), but that cannot be directly solved by a SAT
solver. Specifically, we solve instances of the Max-SAT and Max-ONEs optimization problems, which seek to maximize the number of satisfied clauses and the number of "true" values over all satisfying assignments, respectively. Those problems have straightforward applications to SAT-based
routing and are additionally important due to reductions from Max-Cut, Max-Clique, and Min
Vertex Cover. Our experimental results show that specialized 0-1 techniques tend to outperform
generic ILP techniques on Boolean optimization problems as well as on general EDA SAT
problems.
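The compactness argument can be seen in miniature. The sketch below is our illustration, not the paper's solver: a counting constraint is stored directly as coefficient/bound pairs instead of being expanded into CNF, and a tiny Max-ONEs instance is solved by exhaustive search, where a real 0-1 ILP solver would use SAT-style search with learning.

```python
from itertools import product

def pb_satisfied(coeffs, bound, x):
    """Pseudo-Boolean constraint sum(c_i * x_i) <= bound over 0-1 vars,
    kept in arithmetic form rather than expanded into clauses."""
    return sum(c * x[i] for i, c in coeffs) <= bound

def clause_sat(clause, x):
    """CNF clause given as a tuple of signed, 1-indexed literals."""
    return any(x[abs(l) - 1] == (l > 0) for l in clause)

clauses = [(1, -2), (2, 3), (-1, -3)]
counting = ([(0, 1), (1, 1), (2, 1)], 2)     # x1 + x2 + x3 <= 2

# Max-ONEs: maximize the number of true variables over all assignments
# satisfying the clauses plus the counting constraint.
best = None
for x in product([0, 1], repeat=3):
    if all(clause_sat(c, x) for c in clauses) and pb_satisfied(*counting, x):
        if best is None or sum(x) > sum(best):
            best = x
print("Max-ONEs optimum:", best)   # -> (1, 1, 0)
```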
This paper introduces a new method for performing time-frame expansion based on writing the number of time frames in terms of powers of two. In the proposed method, the behavior of a circuit for t time frames, where 0 ≤ t < n, is modeled by unrolling the circuit 2^0, 2^1, 2^2, ..., 2^(log2 n - 1) times and combining them. This formulation of the problem makes it possible to prune the search space quickly when the problem is infeasible. To show the advantage of this method, we have used it to model the state justification problem and solve the problem using a SAT-solver. Experimental results show several orders of magnitude speedup for some non-trivial infeasible problems. Furthermore, in most cases the CPU time requirement grows linearly in terms of the number of time frames.
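The decomposition underlying the method is just the binary representation of the frame count. The bookkeeping can be sketched in a few lines (ours; the paper's contribution is the circuit unrolling and SAT formulation built on top of it):

```python
def unrolling_blocks(t, n):
    """Return the power-of-two block sizes whose chain models t time
    frames, 0 <= t < n, using unrollings of sizes 2^0 .. 2^(log2(n)-1).
    n is assumed to be a power of two in this sketch."""
    assert 0 <= t < n and n & (n - 1) == 0
    blocks = [1 << k for k in range(n.bit_length() - 1) if (t >> k) & 1]
    assert sum(blocks) == t      # binary expansion recovers t exactly
    return blocks

# 13 = 8 + 4 + 1 time frames: chain the 2^0, 2^2 and 2^3 unrollings.
print(unrolling_blocks(13, 16))   # -> [1, 4, 8]
```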
Computer simulation is an important tool for improving our understanding of biomolecule electrostatics, in part to aid in drug design. However, the numerical techniques used in these simulation tools do not exploit fast solver approaches widely used in analyzing integrated circuit interconnects. In this paper we describe one popular formulation used to analyze biomolecule electrostatics, present an integral formulation of the problem, and apply the precorrected-FFT method to accelerate the solution of the integral equations.
We present efficient computational methods for scattered point and meshless analysis of
electrostatic microelectromechanical systems (MEMS). Electrostatic MEMS are governed by
coupled mechanical and electrostatic energy domains. A self-consistent analysis of electrostatic
MEMS is implemented by combining a finite cloud method based interior mechanical analysis
with a boundary cloud method based exterior electrostatic analysis. Lagrangian descriptions are
used for both mechanical and electrostatic analyses. Meshless finite cloud and boundary cloud
methods combined with fast algorithms and Lagrangian descriptions are flexible, efficient and
attractive alternatives compared to conventional finite element/boundary element methods for
self-consistent electromechanical analysis. Numerical results are presented for an electrostatic
comb drive device.
Keywords:
coupled electro-mechanical analysis, meshless, finite cloud method, boundary cloud
method, Lagrangian electrostatics
In this paper we present a fast and efficient program for extraction of the frequency dependent inductance of structures with permeable materials. The program, FastMag, uses a magnetic surface charge formulation, efficient techniques for evaluating the required integrals, and a preconditioned GMRES method to solve the resulting linear system. Results from examples are presented to demonstrate the accuracy and versatility of the FastMag program.
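The linear-algebra core of such a solver can be illustrated with off-the-shelf components. The sketch below is generic, not FastMag's actual kernels: a dense, diagonally dominant matrix stands in for the discretized integral operator, and a Jacobi (diagonal) preconditioner stands in for the problem-specific one.

```python
import numpy as np
from scipy.sparse.linalg import gmres, LinearOperator

rng = np.random.default_rng(0)
N = 200
# Dense, diagonally dominant stand-in for a discretized integral operator.
A = rng.standard_normal((N, N)) / N + np.diag(np.full(N, 2.0))
b = rng.standard_normal(N)

# Jacobi preconditioner: apply the inverse of diag(A) as a LinearOperator.
d_inv = 1.0 / np.diag(A)
M = LinearOperator((N, N), matvec=lambda v: d_inv * v)

x, info = gmres(A, b, M=M, atol=1e-10)
print("converged:", info == 0, " residual:", np.linalg.norm(A @ x - b))
```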
We present a novel approach to optimization-based variation-tolerant analog circuit sizing. Using formal methods based on affine arithmetic, we calculate guaranteed bounds on the worst-case behavior and deterministically find the global optimum of the sizing problem by means of branch-and-bound optimization. To solve the nonlinear circuit equations with parameter variations, we define a novel affine-arithmetic Newton operator that gives a significant improvement in computational efficiency over an implementation using interval arithmetic. The calculation of guaranteed worst-case bounds and the global optimization are demonstrated by a prototype implementation.
This paper presents a simulator for the statistical analysis of MOS integrated circuits affected by
mismatch effect. The tool is based on a rigorous formulation of circuit equations including
random current sources to take into account technological tolerances. The simulator requires a simulation time several orders of magnitude lower than that of Monte Carlo analysis, while ensuring good accuracy.
Keywords:
Device mismatch, stochastic simulation, MNA, MOS ICs, non-Monte Carlo analysis.
The traditional way of approaching device-level placement problems for analog layout is to explore a huge search space of absolute placement representations, where cells are allowed to illegally overlap during their moves [2, 7, 8]. This paper presents a novel analog placement technique operating on the set of binary tree representations of the layout [3], where the typical presence of an arbitrary number of symmetry groups of devices is directly taken into account during the exploration of the solution space. The efficiency of the novel approach is due to a data structure called segment tree, mainly used in computational geometry.
Existing layout optimization methods often assume a set of interconnects with given RLC crosstalk bounds in a routing region. RLC crosstalk bound partitioning is critical for effectively applying these methods at the full-chip level. In this paper, we develop an optimal crosstalk budgeting scheme based on a linear programming (LP) formulation, and apply it to shield insertion and net ordering at the full-chip level. Experimental results show that, compared to the best alternative approach, the LP-based method reduces the total routing area by up to 7.61% and also uses less runtime. To the best of our knowledge, this is the first in-depth work that studies the RLC crosstalk budgeting problem.
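A minimal version of such a budgeting LP fits in a few lines of scipy.optimize.linprog. The numbers and the linear shielding-cost model below are our simplifications, not the paper's formulation: each net i gets a noise bound b_i between a floor and its unshielded noise u_i, the bounds on a victim must sum to a total budget B, and shielding area is modeled as w_i per volt of tightening.

```python
import numpy as np
from scipy.optimize import linprog

u = np.array([0.30, 0.25, 0.40])    # noise with no shielding (V)
w = np.array([5.0, 2.0, 8.0])       # shield area per volt of tightening
lo = np.array([0.05, 0.05, 0.05])   # tightest achievable bound (V)
B = 0.55                            # total noise budget on the victim (V)

# Shield area = sum_i w_i * (u_i - b_i); minimizing it is the same as
# maximizing w . b, so pass -w to the (minimizing) LP solver.
res = linprog(c=-w, A_ub=[np.ones(3)], b_ub=[B],
              bounds=list(zip(lo, u)), method="highs")
b = res.x
print("bounds:", b.round(3), " shield area:", float(w @ (u - b)))
```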
The challenges for developing an ESD (Electro-static Discharge) layout extractor originate from unconventional layout patterns of ESD protection devices, parasitic ESD device extraction and device count reduction. This paper reports a new technology-independent layout extractor, ESDExtractor, which is capable of extracting all types of ESD devices and answers the demands for ESD design verification. General methodology to extract both intentional and parasitic ESD devices, specific algorithms and implementation methods for efficiency-enhancement are presented, followed by a design example.
Electron projection lithography (EPL) is a leading candidate for next generation lithography (NGL) in VLSI production. The membrane mask used in EPL is divided into sub-fields by struts for structural support. A layout must be partitioned into these sub-fields on mask and then stitched back together by the EPL tool on wafer. To minimize possible stitching errors, partitioning of a mask layout should minimize cuts of layout features in the overlapping area between two adjacent sub-fields. This paper presents the first formulation of the mask layout partitioning problem for EPL as a graph problem. The graph formulation is optimally solved with a shortest path approach. Two other techniques are also presented to speed up computation. Experimental runs on data from a real industry design show excellent results.
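The shortest-path view is easy to reproduce on a one-dimensional toy version of the problem (ours; the paper partitions full two-dimensional mask layouts). Here cut[i] counts the features a cut at position i would sever, sub-field widths must lie in [wmin, wmax], and a dynamic program, which is exactly a shortest path in the implied DAG of feasible cut positions, picks the cheapest cut sequence.

```python
def partition_cost(cut, wmin, wmax):
    """cut[i] = features severed by cutting at position i; cut[0] and
    cut[-1] are the free layout edges.  Choose interior cuts so every
    sub-field width lies in [wmin, wmax], minimizing severed features."""
    L = len(cut) - 1
    INF = float("inf")
    best = [INF] * (L + 1)       # best[i]: min cost with a cut at i
    best[0] = 0
    prev = [None] * (L + 1)
    for i in range(1, L + 1):
        for j in range(max(0, i - wmax), i - wmin + 1):
            if best[j] + cut[i] < best[i]:
                best[i], prev[i] = best[j] + cut[i], j
    cuts, i = [], L
    while i is not None:         # walk the shortest path backwards
        cuts.append(i)
        i = prev[i]
    return best[L], cuts[::-1]

# Hypothetical severing counts along a 12-unit strip of layout.
cost, cuts = partition_cost([0, 3, 1, 4, 0, 2, 5, 1, 0, 2, 3, 1, 0],
                            wmin=3, wmax=5)
print("cost:", cost, " cut positions:", cuts)   # -> 0, [0, 4, 8, 12]
```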
In this paper we present an advanced functional extraction tool for automatic generation of high-
level RTL from switch-level circuit netlist representation. The tool is called FEV-Extract and is
part of a comprehensive Formal Equivalence Verification (FEV) system developed at Intel to
verify modern microprocessor designs. FEV-Extract employs a powerful hierarchical analysis
procedure, and advanced and generic algorithms for automatic recognition of logical primitives,
to cope with a variety of circuit design styles and their complexity. Logic equations are then extracted to generate a behavioral RTL model described in industry-standard HDLs, to be used in the formal equivalence verification, logic simulation, synthesis and testability flows.
Categories and Subject Descriptors
D.3 [VERIFICATION, MODELING AND SIMULATION]:
Formal verification techniques. Switch, logic and high-level
simulation, design validation, HW/SW co-simulation,
combinational and sequential equivalence checking. Model
checking. Theorem proving.
General Terms
Algorithms for design verification.
Keywords:
Switch Level Analysis, Functional Abstraction, Formal Equivalence Verification
(FEV), Design For Testability (DFT), Synthesis, Logic simulation, Binary Decision Diagrams
(BDDs), Satisfiability procedures, Hardware Description Languages (HDL).
Circuits can be simplified for combinational equivalence checking by transforming internal
functions, while preserving their ranges. In this paper, we investigate how to effectively apply
the idea to improve equivalence checking. We propose new heuristics to identify groups of nets
in a cut, and elaborate detailed aspects of the new equivalence checking method. With a given
miter, we identify a group of nets in a cut and transform the function of each net into a more
compact representation with fewer variables. These new compact parametric representations
preserve the range of nets as well as of the cut. This transformation significantly reduces the size
of intermediate BDDs and enables the verification to be conclusive for many designs which
state-of-the-art equivalence checkers fail to verify. Iterative groupings and transformations are
performed until no grouping is possible for a cut. Then we proceed to the next cut and continue
until the compare point is reached. Our experimental results show the effectiveness of our
strategy and new grouping heuristics on the new method.
Categories and Subject Descriptors
J.6 [Computer-aided engineering]: Verification
General Terms
Algorithms, Experimentation, Verification
Keywords:
Combinational verification, equivalence checking
Generalized Symbolic Trajectory Evaluation (GSTE) [17, 18, 19] is a very significant extension of STE that has the power to verify all ω-regular properties but at the same time preserves the benefits of the original STE [16]. It also extends the symbolic quaternary model used by STE to support seamless model refinement for efficiency and accuracy trade-off in GSTE model checking. In this paper, we present a case study on FIFO verification to illustrate the strength of GSTE and demonstrate its methodology in specifying and verifying large scale designs.
A regular circuit structure called a Whirlpool PLA (WPLA) is proposed. It is suitable for the implementation of finite state machines as well as combinational logic. A WPLA is logically a four-level Boolean NOR network. By arranging the four logic arrays in a cycle, a compact layout is achieved. Doppio-ESPRESSO, a four-level logic minimization algorithm, is developed for WPLA synthesis. No technology mapping, placement or routing is necessary for the WPLA. There is no area-delay trade-off, because these two goals are usually compatible in WPLA synthesis.
Routability, or wiring congestion, in a VLSI chip is becoming increasingly important as chip complexity increases. Congestion has a significant impact on performance, yield and chip area. Although advances in placement algorithms have attempted to alleviate this problem, the inherent structure of the logic netlist has a significant impact on routability irrespective of the placement algorithm used. Placement algorithms find an optimal assignment of locations to the logic and do not have the ability to change the netlist structure. Significant decisions regarding the circuit structure are made early in synthesis, such as during the technology-independent logic optimization step. Optimizations in this step use literal count as a metric and do not adequately capture the intrinsic entanglement of the netlist: two circuits with identical literal counts may have significantly different post-placement congestion characteristics. In this paper, we argue that a property of the network structure called adhesion can contribute significantly to routing congestion. We then provide a metric to measure this property. We also show that adhesion, as measured by this metric, can be used in addition to literal counts to estimate and optimize post-routing congestion early in the design flow.
We discuss the simplification of non-deterministic MV networks and their internal nodes using internal flexibilities. Given the network structure and its external specification, the flexibility at a node is derived as a non-deterministic MV relation. This flexibility is used to simplify the node representation and enhance the effect of Boolean resubstitution. We show that the flexibility derived is maximum. The proposed approach has been implemented and tested in MVSIS [16]. Experimental results show that it performs well on a variety of MV and binary benchmarks.
With the increasing cost of global communication on-chip, high-performance designs for data-
intensive applications require architectures that distribute hardware resources (computing logic,
memories, interconnect, etc.) throughout a chip, while restricting computations and
communications to geographic proximities. In this paper, we present a methodology for high-
level synthesis (HLS) of distributed logic-memory architectures, i.e., architectures that have
logic and memory distributed across several partitions in a chip. Conventional HLS tools are
capable of extracting parallelism from a behavior for architectures that assume a monolithic
controller/datapath communicating with a memory or memory hierarchy. This work provides
techniques to extend the synthesis frontier to more general architectures that can extract both
coarse- and fine-grained parallelism from data accesses and computations in a synergistic
manner. Our methodology considers many possible ways of organizing data and computations,
carefully examines the trade-offs (i.e., communication overheads, synchronization costs, area
overheads) in choosing one solution over another, and utilizes conventional HLS techniques for
intermediate steps.
We have evaluated the proposed framework on several benchmarks by generating register-
transfer level (RTL) implementations using an existing commercial HLS tool with and without
our enhancements, and by subjecting the resulting RTL circuits to logic synthesis and layout.
The results show that circuits designed as distributed logic-memory architectures using our
framework achieve significant (up to 5.31X, average of 3.45X) performance improvements over
well-optimized conventional designs with small area overheads (up to 19.3%, 15.1% on
average).
Multiport memories are extensively used in modern system designs because of the performance advantages they offer. The increased memory access throughput could lead to significantly faster schedules in behavioral synthesis. However, they also have an associated area and energy penalty. We describe a technique for mapping data accesses to multiport memories during behavioral synthesis that results in significantly better energy characteristics than an unoptimized multiport design. The technique consists of an initial colouring of the array access nodes in the data flow graph based on spatial locality, followed by attempts to consecutively access memory locations with the same colour on the same port. Our experiments on several applications indicate a significant reduction in address bus switching activity, leading to an overall energy reduction over an unoptimized design, while still maintaining a performance advantage over a single-port solution.
Data transfer intensive applications consume a significant amount of energy in memory access.
The selection of a memory location from a memory array involves driving row and column
select lines. A signal transition on a row select line often consumes significantly more energy
than a transition on a column select line. In order to exploit this difference in energy
consumption of row and column select lines, we propose a novel address assignment
methodology that aims to minimize high energy row transitions by assigning spatially and
temporally local data items to the same row. The problem of energy efficient address assignment
has been formulated as a multi-way graph partitioning problem and solved with a heuristic. Our
experiments demonstrate that our methodology achieves row transition counts very close to the
optimum and that the methodology can, for some examples, reduce row transition count by 40-
70% over row major mapping. Moreover, we also demonstrate that our methodology is capable
of handling access sequences with over 15 million accesses in moderate time.
Categories and Subject Descriptors
B.5.1 [Register-Transfer-Level Implementation]: Design—Memory design; B.5.2 [Register-Transfer-Level Implementation]: Design Aids—Automatic synthesis; Optimization
General Terms
Algorithms, Design, Experimentation, Performance
Keywords
address assignment, data layout, memory synthesis
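The flavor of the row-assignment problem can be shown with a small greedy sketch (ours; the paper formulates it as multi-way graph partitioning with a dedicated heuristic). Consecutive accesses to items in different rows cost a row transition, so items that are frequently adjacent in the trace are packed into the same row when capacity permits.

```python
from collections import Counter

def assign_rows(trace, row_size):
    """Greedy address assignment: count how often item pairs are adjacent
    in the access trace, then co-locate heavy pairs in the same row."""
    adj = Counter()
    for a, b in zip(trace, trace[1:]):
        if a != b:
            adj[frozenset((a, b))] += 1
    row_of, fill = {}, Counter()
    for pair, _ in adj.most_common():
        for item in pair:
            if item not in row_of:
                partner = next(iter(pair - {item}))
                r = row_of.get(partner)
                # Join the partner's row if it has space, else open a row.
                if r is not None and fill[r] < row_size:
                    row_of[item] = r
                else:
                    row_of[item] = max(fill, default=-1) + 1
                fill[row_of[item]] += 1
    return row_of

trace = [0, 1, 0, 1, 2, 3, 2, 3, 0, 1, 4, 5, 4, 5]
rows = assign_rows(trace, row_size=2)
transitions = sum(rows[a] != rows[b] for a, b in zip(trace, trace[1:]))
print(rows, " row transitions:", transitions)
```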
For crosstalk noise calculation, computing the switching windows of a net helps us identify noise sources accurately. Traditional approaches use a single continuous switching window per net. Under this model, signal switching is assumed to happen at any time within the window. Although conservative and sound, this model can result in too much pessimism, since in reality the exact timing of signal switching is determined by a path delay up to the net, i.e. the underlying circuit structure does not always allow signal switching at arbitrary times within the continuous switching window. To address this inherent inaccuracy of the continuous switching window, we propose a refinement of the traditional approaches, where signal switching is characterized by a set of discontinuous switching windows instead of a single continuous window. Each continuous switching window is divided into multiple windows, called time slots, and the signal switching activity of each slot is analyzed separately to calculate the maximum noise with more accuracy. By controlling the size of a time slot we can trade off accuracy and runtime, which makes this approach highly scalable. We have confirmed by experiments on industrial circuits that up to 90% of the noise violations detected by the traditional approach can be spurious.
Noise analysis has become a critical concern in advanced chip designs. Traditional methods suffer from two common issues. First, noise that is propagated through the driver of a net is combined with noise injected by capacitively coupled aggressor nets using linear summation. Since this ignores the non-linear behavior of the driver gate, the noise that develops on a net can be significantly underestimated. We therefore propose a new linear model that accurately combines propagated and injected noise on a net and which maintains the efficiency of linear simulation. After the propagated and injected noise are correctly combined on a victim net, it is necessary to determine if the noise can result in a functional failure. This is the second issue that we discuss in this paper. Traditionally, noise failure criteria have been based on unity gain points of the DC or AC transfer curves. However, we will show that for digital designs, these approaches can result in a pessimistic analysis in some cases, while in other cases, they allow circuit operation that is extremely close to regions that are unstable and do not allow sufficient margin for error in the analysis. In this paper, we compare the effectiveness of the discussed noise failure criteria and also present a propagation-based method, which is intended to overcome these drawbacks. The proposed methods were implemented in a noise analysis tool, and we demonstrate results on industrial circuits.
This paper describes a fast method to estimate crosstalk noise in the presence of multiple aggressor nets for use in physical design automation tools. Since noise estimation is often part of the inner-loop of optimization algorithms, very efficient closed-form solutions are needed. Previous approaches have typically used simple lumped 3-4 node circuit templates. One aggressor net is modeled at a time assuming that the coupling capacitances to all quiet aggressor nets are grounded. They also model the load from interconnect branches as a lumped capacitor and use a dominant pole approximation to solve the template circuit. While these approximations allow for very fast analysis, they result in significant underestimation of the noise. In this paper, we propose a new and more comprehensive fast noise estimation model. We use a 6 node template circuit and propose a novel reduction technique for modeling quiet aggressor nets based on the concept of coupling point admittance. We also propose a reduction method to replace tree branches with effective capacitors which models the effect of resistive shielding. Finally, we propose a new double pole approach to solve the template circuit. We tested the proposed method on noise-prone interconnects from an industrial high performance processor. Our results show a worst-case error of 7.8% and an average error of 2.7%, while allowing for very fast analysis.
This paper presents a heuristic scheduling algorithm for heterogeneous specifications, those formed by operations of different types and widths. The algorithm extracts the common operative kernel of the operations, and afterwards binds operations to cycles with the aim of distributing the number of bits calculated per cycle uniformly. In consequence, operations may be fragmented and executed during a set of not necessarily consecutive cycles, and over a set of several linked simple hardware resources. The proposed algorithm, in combination with allocation algorithms able to guarantee bit-level reuse of hardware resources, obtains considerably smaller datapaths than those produced by conventional synthesis algorithms. In the datapaths produced, the type, number, and width of the hardware resources are independent of the type, number, and width of the specification operations and variables.
Ultra deep submicron (UDSM) technology and system-on-chip (SoC) have resulted in a considerable portion of power dissipated on buses, in which the major sources of the power dissipation are (1) the transition activities on the signal lines and (2) the coupling capacitances of the lines. However, there has been no easy way of optimizing (1) and (2) simultaneously at an early stage of the synthesis process. In this paper, we propose a new (on-chip) bus synthesis algorithm to minimize the total sum of (1) and (2) in the microarchitecture synthesis. Specifically, unlike the previous approaches in which (1) and (2) are minimized sequentially without any interaction between them, or only one of them is minimized, we, given a scheduled dataflow graph to be synthesized, minimize (1) and (2) simultaneously by formulating and solving the two important issues in an integrated fashion: binding data transfers to buses and determining a (physical) order of signal lines in each bus, both of which are the most critical factors that affect the results of (1) and (2). Experimental results on a number of benchmark problems show that the proposed integrated low-power bus synthesis algorithm reduces power consumption by 24.8%, 40.3% and 18.1% on average over those in [12] (for minimizing (1) only), [1] (for (2) only) and [12, 1] (for (1) and then (2)), respectively.
In deep submicron (DSM) technology, the interconnects are as important as, or more important than, the logic gates. In particular, to achieve timing closure in DSM technology, it is critical to consider interconnect delay at an early stage of the synthesis process. It is well known that resource sharing in high-level synthesis is one of the major synthesis tasks that greatly affect the final synthesis/layout results. In this paper, we propose a new layout-driven resource sharing approach to overcome some of the limitations of previous works, in which the effects of layout on synthesis have either not been taken into account, been considered only in local and limited ways, or required excessive computation time. The proposed approach consists of two steps: (Step 1) we relax the integrated resource sharing and placement problem into an efficient linear programming (LP) formulation based on the concept of discretizing the placement space; (Step 2) we derive a feasible solution from the solution obtained in Step 1. We then employ an iterative mechanism based on the two steps to tightly integrate the resource sharing and placement tasks so that slack time violations due to interconnect delay (determined by placement) as well as logic delay (determined by resource sharing) are minimized. Experiments using a set of benchmark designs show that the approach is effective and efficient, completely removing the slack time violations produced by conventional methods.
Physical design optimizations such as placement, interconnect synthesis, floorplanning, and routing require fast and accurate analysis of RC networks. Because of its simple closed form and fast evaluation, the Elmore delay metric has been widely adopted. The recently proposed delay metrics PRIMO and H-gamma match the first three circuit moments to the probability density function of a Gamma statistical distribution. Although these methods demonstrate impressive accuracy compared to other delay metrics, their implementations tend to be challenging. As an alternative to matching to the Gamma distribution, we propose to match the first two circuit moments to a Weibull distribution. The result is a new delay metric called Weibull-based Delay (WED). The primary advantages of WED over PRIMO and H-gamma are its efficiency and ease of implementation. Experiments show that WED is robust and has satisfactory accuracy at both near- and far-end nodes.
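The two-moment match is simple enough to sketch. We assume (our reading, not necessarily the paper's exact normalization) that mu and r2 are the mean and second raw moment of the RC impulse response; the Weibull shape beta is recovered by root-finding on the moment ratio, which depends on beta alone, and the 50% step-response delay is then the matched Weibull's median.

```python
from math import gamma, log
from scipy.optimize import brentq

def weibull_delay(mu, r2, threshold=0.5):
    """Match mean mu and second raw moment r2 of the impulse response to
    a Weibull(lam, beta) and return the time at which the step response
    crosses `threshold`, i.e. the corresponding Weibull quantile."""
    ratio = r2 / mu**2                       # Gamma(1+2/b) / Gamma(1+1/b)^2
    f = lambda b: gamma(1 + 2 / b) / gamma(1 + 1 / b) ** 2 - ratio
    beta = brentq(f, 0.2, 20.0)              # ratio is monotone in beta
    lam = mu / gamma(1 + 1 / beta)           # scale from the first moment
    return lam * (-log(1 - threshold)) ** (1 / beta)

# Exponential special case (beta = 1): mu = 1, r2 = 2 gives delay = ln 2.
print(weibull_delay(1.0, 2.0))   # ~0.6931
```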
Existing static timing analyzers make several assumptions about circuits, implicitly trading off accuracy for speed. In this paper we examine the validity of these assumptions, notably the slope approximation to waveforms, single-input transitions, and the choice of a propagating signal based on a single voltage-time point. We provide data on static CMOS gates that show delays obtained in this way can be optimistic by more than 30%. We propose a new approach, Waveform-based Timing Analysis that employs a state-of-the-art circuit simulator as the underlying delay modeler. We show that such an approach can achieve more accurate delays than slope-based timing analyzers at a computation cost that still allows iterations between design modification and delay analysis.
The paper presents a simple yet powerful general theoretical framework and efficient
implementation for removal of clock network timing pessimism. We address pessimism in static
timing analysis (STA) tools caused by considering delay variation along common segments of
clock paths. The STA tools compute setup (hold) timing slack based on conservative
combinations of late (early) launching and early (late) capturing arrival times. To avoid
exponential-time path-based analysis, the STA tools use both early and late arrival times on gates common to both the launching and capturing paths. Such a combination cannot occur in a real circuit, and the resulting slack underestimate is observed as clock network pessimism in STA. Our approach supports any kind of delay variation, though the typical causes of the pessimism are process, voltage, and temperature on-chip variation and reconvergence in the clock network. We propose a new theoretical framework that makes it possible to apply known graph algorithms instead of time-consuming forward and backward multi-pass tracing algorithms and heuristics that are limited to certain network topologies [4]. The new graph-based
framework supports clock networks of virtually any size and type, e.g., tree, mesh, hybrid, clock
gating, chains of multipliers and dividers, loops in such chains, etc. The implementation based on
the proposed framework has proven its strength in a commercial sign-off static timing analyzer
and thus is helping hundreds of designers to achieve faster clock speeds of their chips.
Keywords:
Clock network reconvergence, static timing analysis, process, voltage and
temperature delay variation, deep sub-micron.
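The core bookkeeping of pessimism removal is small. In this sketch (our toy version; the framework in the paper also covers meshes, clock gating, and divider chains), the clock network is a tree given as a parent map with early/late delays per edge, and the slack of a launch/capture pair is credited with the late-minus-early delay accumulated along the common prefix of the two clock paths.

```python
def clock_path(parent, node):
    path = []
    while node is not None:
        path.append(node)
        node = parent.get(node)
    return path[::-1]                       # root .. node

def cppr_credit(parent, early, late, launch, capture):
    """Pessimism credit = sum of (late - early) delays over edges common
    to both clock paths; STA used both extremes on those shared edges,
    a combination that cannot occur in a real circuit."""
    lp, cp = clock_path(parent, launch), clock_path(parent, capture)
    credit = 0.0
    for a, b in zip(lp[1:], cp[1:]):        # skip the shared root node
        if a != b:
            break
        credit += late[a] - early[a]        # delay of edge parent[a] -> a
    return credit

# Tiny clock tree: root -> b1 -> {ff1, ff2}; delays keyed by sink node.
parent = {"b1": "root", "ff1": "b1", "ff2": "b1", "root": None}
early = {"b1": 0.9, "ff1": 0.4, "ff2": 0.5}
late = {"b1": 1.1, "ff1": 0.5, "ff2": 0.6}
print("credit:", cppr_credit(parent, early, late, "ff1", "ff2"))  # 0.2
```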
Efficiency and flexibility are critical, but often conflicting, design goals in embedded system design.
The recent emergence of extensible processors promises a favorable tradeoff between efficiency
and flexibility, while keeping design turnaround times short. Current extensible processor design flows automate several tedious tasks, but typically require designers to manually select the parts of
the program that are to be implemented as custom instructions.
In this work, we describe an automatic methodology to select custom instructions to augment an
extensible processor, in order to maximize its efficiency for a given application program. We
demonstrate that the number of custom instruction candidates grows rapidly with program size,
leading to a large design space, and that the quality (speedup) of custom instructions varies
significantly across this space, motivating the need for the proposed flow. Our methodology
features cost functions to guide the custom instruction selection process, as well as static and
dynamic pruning techniques to eliminate inferior parts of the design space from consideration.
Further, we employ a two-stage process, wherein a limited number of promising instruction
candidates are first selected, and then evaluated in more detail through cycle-accurate instruction
set simulation and synthesis of the corresponding hardware, to identify the custom instruction
combinations that result in the highest program speedup or maximize speedup under a given area
constraint.
We have evaluated the proposed techniques using a state-of-the-art extensible processor
platform, in the context of a commercial design flow. Experiments with several benchmark
programs indicate that custom processors synthesized using automatic custom instruction
selection can result in large improvements in performance (up to 5.4X, average of 3.4X), energy (up to 4.5X, average of 3.2X), and energy-delay product (up to 24.2X, average of 12.6X), while
speeding up the design process significantly.
Application-specific instructions can significantly improve the performance, energy, and code size of configurable processors. A common approach used in the design of such instructions is to convert application-specific operation patterns into new complex instructions. However, processors with a fixed instruction bitwidth cannot accommodate all the potentially interesting operation patterns, due to the limited code space afforded by the fixed instruction bitwidth. We present a novel instruction set synthesis technique that employs an efficient instruction encoding method to achieve maximal performance improvement. We build a library of complex instructions with various encoding alternatives and select the best set of complex instructions while satisfying the instruction bitwidth constraint. We formulate the problem using integer linear programming and also present an effective heuristic algorithm. Experimental results using our technique generate instruction sets that show improvements of up to 38% over the native instruction set for several realistic benchmark applications running on a typical embedded RISC processor.
Embedded system programs tend to spend much time in small loops. Introducing a very small
loop cache into the instruction memory hierarchy has thus been shown to substantially reduce
instruction fetch energy. However, loop caches come in many sizes and variations; using the configuration that is best on average may actually result in worsened energy for a specific program.
We therefore introduce a loop cache exploration tool that analyzes a particular program's profile,
rapidly explores the possible configurations, and generates the configuration with the greatest
power savings. We introduce a simulation-based approach and show the good energy savings
that a customized loop cache yields. We also introduce a fast estimation-based approach that
obtains nearly the same results in seconds rather than tens of minutes or hours.
Keywords:
Low power, low energy, tuning, loop cache, embedded systems, instruction fetching,
customized architectures, memory hierarchy, estimation, synthesis, architecture tuning.
The communication sub-system of complex IC systems is increasingly critical for achieving system performance. Given this, it is important that the on-chip communication architecture should be included in any quantitative evaluation of system design during design space exploration. While there are several mature methodologies for the modeling and evaluation of architectures of processing elements, there is relatively little work done in modeling of an extensive range of on-chip communication architectures, and integrating this into a single modeling and simulation environment combining processing element and on-chip communication architectures. This paper describes a modeling framework with accompanying simulation tools that attempts to fill this gap. Based on an analysis of a wide range of on-chip communication architectures, we describe how a specific hierarchical class library can be used to develop new on-chip communication architectures, or variants of existing ones with relatively little incremental effort. We demonstrate this through three case studies including two commercial on-chip bus systems and an on-chip packet switching network. Here we show that through careful analysis and construction it is possible for the modeling environment to support the common features of these architectures as part of the library and permit instantiation of the individual architectures as variants of the library design. As part of this methodology we also show how different levels of abstraction of the model can be supported and viewed as different variants that can be used in an accuracy versus simulation time trade-off.
This paper presents an in-depth study of the theory and algorithms for SPFD-based (Set of Pairs of Functions to be Distinguished) rewiring, and explores the flexibility in the SPFD computation. Our contributions are in the following two areas: (1) We present a theorem and a related algorithm for more precise characterization of feasible SPFD-based rewiring. Extensive experimental results show that for LUT-based FPGAs, the rewiring ability of our new algorithm is 70% higher than SPFD-based local rewiring algorithms (SPFD-LR) [19][21] and 18% higher than the recently developed SPFD-based global rewiring algorithm (SPFD-GR) [20]. (2) In order to achieve more rewiring ability on certain selected wires used in various optimizations, we study the impact of using different atomic SPFD pair assignment methods during the SPFD-based rewiring. We develop several heuristic atomic SPFD pair assignment methods for area or delay minimization and show that they lead to 10% more selected rewiring ability than the random (or arbitrary) assignment methods. When combining (1) and (2), we can achieve 38.1% higher general rewiring ability.
SPFDs, a mechanism for expressing flexibility during logic synthesis, were first introduced for FPGA synthesis. They were then extended to general, combinational Boolean networks and later the concept of sequential SPFDs was introduced. In this paper, we explore the idea of using SPFDs for functional decomposition. A new type of functional decomposition called topologically constrained decomposition is introduced. An algorithm is provided for solving this problem using SPFDs. Preliminary experimental results are encouraging and indicate the feasibility of the approach. A scheme is also presented for generating instances of the topologically constrained decomposition problem.
We apply recently introduced constructive multi-level synthesis in the resynthesis loop targeting convergence of industrial designs. The incremental ability of the resynthesis approach allows more predictable circuit implementations while still permitting aggressive optimization. The approach is based on a very general symbolic decomposition template for logic synthesis that uses information-theoretic properties of a function to infer its decomposition patterns (rather than more conventional measures such as literal counts). Using this template, the decomposition is done in a Boolean domain unrestricted by the representation of the function, enabling superior implementation choices driven by additional technological constraints. The symbolic optimization is applied to the resynthesis of industrial circuits with tight timing constraints, yielding much improved timing properties.
The paper describes a folding method for logic functions that reduces the size of the memories keeping the functions. The folding is based on relations between fractions of logic functions. We show that the fractions of the full-adder function are related by bit-wise NOT and bit-wise OR, so the memory size is halved (to 8 bits). We propose a new 3-1 LUT with folding mechanisms that can implement a full adder with one LUT. A fast carry propagation line is introduced for multi-bit addition. The folding and fast carry propagation mechanisms are shown to be useful for implementing other multi-bit operations and general 4-input functions without extra hardware resources; more than half of all 4-input functions can be implemented with two 3-1 LUTs. Comparing the area consumption of our 3-1 LUTs with that of 4-1 LUTs on several benchmark circuits, the area can be reduced by up to 56%.
This paper presents an approach to the analysis of task sets implemented on multiprocessor systems, when the task execution times are specified as generalized probability distributions. Because of the extreme complexity of the problem, an exact solution is practically impossible to obtain even for toy examples. Therefore, our methodology is based on approximating the generalized probability distributions of execution times by Coxian distributions of exponentials. Thus, we transform the generalized semi-Markov process corresponding to the initial problem into a continuous-time Markov chain (CTMC) which, however, is extremely large and, hence, most often impossible to store in memory. We have elaborated a solution that allows us to generate and analyze the CTMC in an efficient way, such that only a small part has to be stored at any given time. Several experiments investigate the impact of various parameters on complexity, in terms of time and memory, as well as the trade-offs regarding the accuracy of the generated results.
This paper is concerned with the problem of maximizing capacity utilization of the battery power source in a portable electronic system under latency and loss rate constraints. First, a detailed stochastic model of a power-managed, battery-powered electronic system is presented. The model, which is based on the theories of continuous-time Markovian decision processes and stochastic networks, captures two important characteristics of today's rechargeable battery cells, i.e., the current rate-capacity characteristic and the relaxation-induced recovery. Next, the battery-aware dynamic power management problem is formulated as a policy optimization problem and solved exactly by using a linear programming approach. Experimental results show that the proposed method outperforms existing heuristic methods for battery management by as much as 17% in terms of the average energy delivered per unit weight of battery cells.
In this paper, we study leakage power reduction using power gating in the forms of the Virtual power/ground Rails Clamp (VRC) and Multi-threshold CMOS (MTCMOS) techniques. We apply power gating to two circuit types: memory-based units and datapath components. Using a microarchitecture-level power simulator, as well as power and timing models derived from detailed circuit designs, we further study leakage power modeling and reduction at the system level for modern high-performance VLIW processors. We show that the leakage power can be over 40% of the total power for such processors. Moreover, we propose time-out scheduling of VRC, which reduces L2 cache power by up to 85.65%. This saving corresponds to close to one third of the total power dissipation of the VLIW processors we study.
Dynamic voltage scaling (DVS) reduces the power consumption of processors when peak performance is unnecessary. However, the achievable power savings by DVS alone is becoming limited as leakage power increases. In this paper, we show how the simultaneous use of adaptive body biasing (ABB) and DVS can be used to reduce power in high-performance processors. Analytical models of the leakage current, dynamic power, and frequency as functions of supply voltage and body bias are derived and verified with SPICE simulation. We then show how to determine the correct trade-off between supply voltage and body bias for a given clock frequency and duration of operation. The usefulness of our approach is evaluated on real workloads obtained using real-time monitoring of processor utilization for four applications. The results demonstrate that application of simultaneous DVS and ABB results in an average energy reduction of 48% over DVS alone.
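The shape of this trade-off can be reproduced with textbook device models. All constants below are illustrative, not the paper's fitted values: frequency follows the alpha-power law, body bias Vbs shifts Vth linearly, subthreshold leakage grows exponentially as Vth drops, and a grid search picks the (Vdd, Vbs) pair that meets a target frequency at the lowest total power.

```python
import numpy as np

# Illustrative constants (not the paper's fitted values).
ALPHA, VTH0, K_BODY, S = 1.5, 0.35, 0.25, 0.1    # alpha-power and Vth model
K_FREQ, C_EFF, I_LEAK0 = 2.0e9, 1e-9, 1e-3       # Hz scale, F, A at VTH0

def freq(vdd, vbs):
    vth = VTH0 - K_BODY * vbs       # forward body bias (vbs > 0) lowers Vth
    return K_FREQ * (vdd - vth) ** ALPHA / vdd if vdd > vth else 0.0

def power(vdd, vbs, f):
    vth = VTH0 - K_BODY * vbs
    dyn = C_EFF * vdd ** 2 * f                          # switching power
    leak = vdd * I_LEAK0 * np.exp(-(vth - VTH0) / S)    # subthreshold leakage
    return dyn + leak

f_target, best = 8e8, None
for vdd in np.arange(0.6, 1.31, 0.01):
    for vbs in np.arange(-0.5, 0.51, 0.01):             # reverse..forward bias
        if freq(vdd, vbs) >= f_target:
            p = power(vdd, vbs, f_target)
            if best is None or p < best[0]:
                best = (p, vdd, vbs)
print("P = %.3g W at Vdd = %.2f V, Vbs = %+.2f V" % best)
```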
Voltage scheduling is indispensable for exploiting the benefit of variable voltage processors. Though extensive research has been done in this area, current processor limitations such as transition overhead and voltage level discretization are often considered insignificant and are typically ignored. We show that for hard, real-time applications, disregarding such details can lead to sub-optimal or even invalid results. We propose two algorithms that guarantee valid solutions. The first is a greedy yet simple approach, while the second is more complex but significantly reduces energy consumption under certain conditions. Through experimental results on both real and randomly generated systems, we show the effectiveness of both algorithms, and explore what conditions make it beneficial to use the complex algorithm over the basic one.
This paper describes a dynamic voltage and frequency scaling (DVFS) technique for MPEG decoding to reduce the energy consumption while maintaining a quality of service (QoS) constraint. The computational workload for an incoming frame is predicted using a frame-based history so that the processor voltage and frequency can be scaled to provide the exact amount of computing power needed to decode the frame. More precisely, the required decoding time for each frame is separated into two parts: a frame-dependent (FD) part and a frame-independent (FI) part. The FD part varies greatly according to the type of the incoming frame whereas the FI part remains constant regardless of the frame type. In the DVFS scheme presented in this paper the FI part is used to compensate for the prediction error that may occur during the FD part such that a significant amount of energy can be saved while meeting the frame rate requirement. The proposed DVFS algorithm has been implemented on a StrongArm-1110 based evaluation board. Measurement results demonstrate a higher than 50% CPU energy saving as a result of DVFS.
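The scaling decision itself reduces to a few lines. In this sketch (our simplification of the scheme, with hypothetical frequency levels), the FD cycles of the next frame are predicted from the history of frames of the same type, and any FD misprediction is absorbed by re-picking the frequency for the constant FI part so the frame still meets its deadline.

```python
def pick_freq(cycles, time_left, levels):
    """Lowest available frequency that finishes `cycles` in `time_left`."""
    need = cycles / time_left
    return min((f for f in levels if f >= need), default=max(levels))

def decode_frame(ftype, fd_actual, fi_cycles, history, deadline, levels):
    # Predict FD cycles as the mean over past frames of the same type.
    past = history.get(ftype, [fd_actual])
    fd_pred = sum(past) / len(past)
    f_fd = pick_freq(fd_pred + fi_cycles, deadline, levels)
    t_fd = fd_actual / f_fd            # time actually spent on the FD part
    # The FI part compensates: re-pick so the frame meets its deadline.
    f_fi = pick_freq(fi_cycles, max(deadline - t_fd, 1e-9), levels)
    history.setdefault(ftype, []).append(fd_actual)
    return f_fd, f_fi, t_fd + fi_cycles / f_fi

history = {"P": [9e6, 11e6]}
f_fd, f_fi, t = decode_frame("P", fd_actual=12e6, fi_cycles=4e6,
                             history=history, deadline=1 / 30,
                             levels=[2e8, 3e8, 4e8, 5e8, 6e8])
print(f"FD at {f_fd:.0e} Hz, FI at {f_fi:.0e} Hz, frame time {t*1000:.1f} ms")
```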
This paper presents a new congestion minimization technique for standard cell global placement. The most distinct feature of this approach is that it does not follow the traditional "estimate-then-eliminate" strategy. Instead, it avoids the excessive usage of routing resources by the "local" nets so that more routing resources are available for the uncertain "global" nets. The experimental results show that our new technique, SPARSE, achieves better routability than the traditional total wire length (bounding box) guided placers, which had been shown to deliver the best routability results among the placers optimizing different cost functions [2]. Another feature of SPARSE is its capability of allocating white space implicitly. SPARSE exploits the well-known empirical Rent's rule and is able to improve routability even more in the presence of white space. Compared to the most recent academic routability-driven placer, Dragon [8], SPARSE is able to produce solutions with equal or better routability.
IP blocks and large macro cells are increasingly prevalent in physical design, actually causing an increase in the available free space for the dust logic. We observe that top-down placement based on recursive bisection with multilevel partitioning performs poorly on these porous designs. However, analytic solvers have the ability to find the natural distribution of cells in the layout. Consequently, we propose an enhancement to cut-based placement called Analytic Constraint Generation (ACG). ACG utilizes an analytic engine to set constraints for the multi-level partitioner. We show that for real industry designs, ACG significantly improves the performance of cut-based placement, as implemented within a state-of-the-art industrial placer.
This paper presents an algorithm to update the placement of logic elements given an incremental netlist change. Specifically, the algorithm is targeted at incrementally placing logic elements created by layout-driven circuit restructuring techniques. The incremental placement engine assumes that the restructuring algorithms provide a list of new logic elements along with preferred locations for each of these new elements. It then tries to shift non-critical logic elements in the original placement out of the way to satisfy the preferred location requests. Our algorithm considers modern FPGA architectures with clustered logic blocks that have numerous architectural constraints. Experiments indicate that our technique produces results of extremely high quality.
Numerous approaches have been proposed to address the overwhelming modeling problems that result from the emergence of magnetic coupling as a dominant performance factor for ICs and packaging. Firstly, model order reduction (MOR) methods have been extended to robustly capture very high frequency behaviors for large RLC systems via methods such as PRIMA[8] with guaranteed passivity. In addition, new models of the magnetic couplings in terms of susceptance (inverse of inductance) have shown great promise for robust sparsification of otherwise intractable inductance coupling-matrix problems[3-5]. However, model order reduction via PRIMA for circuits that include susceptance elements does not guarantee passivity. Moreover, susceptance elements are incompatible with the path tracing algorithms that provide the fundamental runtime efficiency of RICE [10]. In this paper a novel MOR algorithm, SMOR, is proposed as an extension of ENOR [11] which exploits the matrix properties of susceptance-based circuits for runtime efficiency, and provides for a numerically stable, provably passive MOR using a new orthonormalization strategy.
The new concept of Multi-node Moment Matching (MMM) is introduced in this paper. The MMM technique simultaneously matches the moments at several nodes of a circuit using explicit moment matching around s=0. As compared to the well-known Single-point Moment Matching (SMM) techniques (such as AWE), MMM has several advantages. First, the number of moments required by MMM is significantly lower than SMM for a reduced order model of the same accuracy, which directly translates into computational efficiency. This higher computational efficiency of MMM as compared to SMM increases with the number of inputs to the circuit. Second, MMM has much better numerical stability as compared to SMM. This characteristic allows MMM to calculate an arbitrarily high order approximation of a linear system, achieving the required accuracy for systems with complex responses. Finally, MMM is highly suitable for parallel processing techniques especially for higher order approximations while SMM has to calculate the moments sequentially and cannot be adapted to parallel processing techniques.
Verification of contemporary integrated circuits requires accurate modeling of high-frequency effects in all passive component subsystems. Often, descriptions of those subsystems are only available in the frequency domain. In this paper, we propose a simple, scalar, constrained-passive rational approximation scheme that incorporates a grid-based test for strict positive realness. In contrast to similar recent work based on convex optimization and the positive real lemma, the methodology is potentially more efficient, since it does not introduce a quadratic number of auxiliary variables, and potentially more accurate, since pole locations can be re-adjusted during the optimization, but possibly less reliable, since it relies on the solution of optimization problems that are not convex. Therefore, because of the use of local constrained non-convex optimization, the generation of feasible initial guesses is also considered.
In this survey, we outline basic SAT and ATPG procedures as well as their applications in formal hardware verification. We attempt to give the reader a trace through the literature and provide a basic orientation concerning the problem formulations and known approaches in this active field of research.
The ultimate goal of logic synthesis is to explore implementation flexibility toward meeting design targets such as area, power, and delay. Traditionally, such flexibility is expressed using "don't cares", and we seek the best implementation that does not violate them. However, the calculation and storage of don't care information is CPU- and memory-intensive. In this paper, we give an overview of logic synthesis approaches based on techniques developed for Automatic Test Pattern Generation (ATPG). Instead of calculating and storing don't cares explicitly, ATPG-based logic synthesis techniques compute the flexibility implicitly. Low CPU and memory usage make these techniques applicable to practical industrial circuits. Also, the basic ATPG-based logic-level operations create predictable, small layout perturbations, making them an ideal foundation for efficient physical synthesis. Theoretical results show that an efficient yet simple add-a-wire-and-remove-a-wire operation covers all possible complex logic transformations.
The exploding complexity of new chips and the ever decreasing time-to-market window are forcing fundamental changes in the way systems are designed. The advent of Systems-on-Chip (SoC) based on pre-designed intellectual-property (IP) cores has become an absolute necessity for embedded systems companies to remain competitive. Designing an SoC, however, is extremely complex, as it encompasses a range of difficult problems in hardware and software design. This paper explains a wide range of SoC issues including market drivers and trends, technology and integration aspects, early architecture definition, methodology, hardware and software design and verification techniques.