FPGA'99 ABSTRACTS

Sessions: [1] [2] [3] [4] [5] [Panel] [6] [7] [8] [9] [Poster Paper Abstracts]


Session 1: Commercial FPGA Architectures

Chair: Jonathan Rose, University of Toronto
1.1 A new high density and very low cost reprogrammable FPGA architecture [p. 3]
Sinan Kaptanoglu, Greg Bakker, Arun Kundu, Ivan Corneillet, Actel Corporation; Ben Ting, BTR Inc.

A new reprogrammable FPGA architecture is described which is specifically designed to be of very low cost. It covers a range of 35K to a million usable gates. In addition, it delivers high performance and is synthesis-efficient. This architecture is loosely based on an earlier reprogrammable Actel architecture named ES. By changing the structure of the interconnect and by making other improvements, we achieved an average cost reduction by a factor of three per usable gate. The first member of the family based on this architecture is fabricated in a 2.5V standard 0.25u CMOS technology with a gate count of up to 130K, and also includes 36K bits of two-port RAM. The gate count of this part is verified in a fully automatic design flow starting from a high level description followed by synthesis, technology mapping, place and route, and timing extraction.

1.2 Hybrid Product Term and LUT Based Architectures Using Embedded Memory Blocks [p. 13]
Frank Heile, Andrew Leaver, Altera Corporation

The Embedded System Block (ESB) of the APEX20K programmable logic device family from Altera Corporation includes the capability of implementing product term macrocells in addition to flexibly configurable ROM and dual port RAM. In product term mode, each ESB has 16 macrocells built out of 32 product terms with 32 literal inputs. The ability to reconfigure memory blocks in this way represents a new and innovative use of resources in a programmable logic device, requiring creative solutions in both the hardware and software domains. The architecture and features of this Embedded System Block are described.
Keywords: Product terms, RAM, heterogeneous architecture.

1.3 An Innovative, Segmented High Performance FPGA Family with Variable-Grain-Architecture and Wide-gating Functions [p. 17]
Om Agrawal, Herman Chang, Brad Sharpe-Geisler, Nick Schmitz, Bai Nguyen, Jack Wong, Giap Tran, Fabiano Fontana, Bill Harding, Vantis Corporation

This paper describes the Vantis VF1 FPGA architecture, an innovative architecture based on 0.25u (drawn) (0.18u Leff)/4-metal technology. It was designed from scratch for high performance, routability and ease-of-use. It supports system level functions (including wide gating functions, dual-port SRAMs, high speed carry chains, and high speed IO blocks) with a symmetrical structure. Additionally, the architecture of each of the critical elements, including variable-grain logic blocks, variable-length interconnects, dual-port embedded SRAM blocks, I/O blocks and on-chip PLL functions, will be described.


Session 2: Mapping, Packing and Floorplanning

Chair: Margaret Marek-Sadowska, University of California, Santa Barbara
2.1 Cut Ranking and Pruning: Enabling a General and Efficient FPGA Mapping Solution [p. 29]
Jason Cong, Chang Wu, University of California, Los Angeles, Yuzheng Ding, Lucent Technologies

Cut enumeration is a common approach used in a number of FPGA synthesis and mapping algorithms for consideration of various possible LUT implementations at each node in a circuit. Such an approach is very general and flexible, but often suffers from high computational complexity and poor scalability. In this paper, we develop several efficient and effective techniques for cut enumeration, ranking and pruning. These techniques lead to much better runtime and scalability of cut-enumeration based algorithms; they can also be used to compute a tight lower bound on the size of an area-minimum mapping solution. For area-oriented FPGA mapping, experimental results show that the new techniques lead to over 160X speed-up over the original optimal duplication-free mapping algorithm, and achieve mapping solutions with 5-21% smaller area for heterogeneous FPGAs compared to those by Chortle-cfr [6], MIS-pga-new [9], and TOS-TUM [4], yet with over 100X speed-up over MIS-pga-new [9] and TOS-TUM [4].
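
As an illustration of the cut-enumeration step the paper builds on, the following sketch merges the cuts of each node's fanins and prunes any cut wider than K. Function names and the size-based ranking are this sketch's simplifications; the paper's contribution lies in much stronger ranking and pruning criteria than shown here.

```python
from itertools import product

def enumerate_cuts(fanins, order, K=4):
    """Enumerate K-feasible cuts for each node of a DAG.

    fanins: dict node -> list of fanin nodes (empty for primary inputs)
    order:  nodes in topological order
    Returns dict node -> list of cuts (frozensets of leaf nodes).
    """
    cuts = {}
    for n in order:
        if not fanins[n]:                  # primary input: only the trivial cut
            cuts[n] = [frozenset([n])]
            continue
        merged = set()
        # Merge one cut from each fanin; prune cuts wider than K.
        for combo in product(*(cuts[f] for f in fanins[n])):
            c = frozenset().union(*combo)
            if len(c) <= K:
                merged.add(c)
        merged.add(frozenset([n]))         # the trivial cut {n}
        cuts[n] = sorted(merged, key=len)  # rank by size (a crude cost proxy)
    return cuts
```

Without pruning, the number of cuts per node can grow combinatorially; the size bound K (and, in the paper, cost-based pruning) is what keeps enumeration tractable.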

2.2 Using Cluster-Based Logic Blocks and Timing-Driven Packing to Improve FPGA Speed and Density [p. 37]
Alexander Marquardt, Vaughn Betz, Jonathan Rose, University of Toronto

In this paper, we investigate the speed and area-efficiency of FPGAs employing "logic clusters" containing multiple LUTs and registers as their logic block. We introduce a new, timing-driven tool (T-VPack) to "pack" LUTs and registers into these logic clusters, and we show that this algorithm is superior to an existing packing algorithm. Then, using a realistic routing architecture and sophisticated delay and area models, we empirically evaluate FPGAs composed of clusters ranging in size from one to twenty LUTs, and show that clusters of size seven through ten provide the best area-delay trade-off. Compared to circuits implemented in an FPGA composed of size one clusters, circuits implemented in an FPGA with size seven clusters have 30% less delay (a 43% increase in speed) and require 8% less area, and circuits implemented in an FPGA with size ten clusters have 34% less delay (a 52% increase in speed), and require no additional area.
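
A connectivity-driven greedy packer along these lines can be sketched as follows. This is a simplified illustration, not T-VPack itself: the real tool's attraction function also weights connections on the critical path, and real cluster legality checks count inputs and clocks separately, whereas here a single net-count limit stands in for the input constraint.

```python
def greedy_pack(lut_nets, cluster_size, max_inputs):
    """Greedily pack LUTs into clusters of at most cluster_size LUTs.

    lut_nets: dict lut -> set of net ids it touches.
    max_inputs: cap on distinct nets a cluster may touch (crude stand-in
    for the cluster input-pin limit).
    """
    unpacked = set(lut_nets)
    clusters = []
    while unpacked:
        # Seed each cluster with the most-connected remaining LUT.
        seed = max(unpacked, key=lambda l: len(lut_nets[l]))
        cluster, nets = [seed], set(lut_nets[seed])
        unpacked.remove(seed)
        while len(cluster) < cluster_size:
            best, gain = None, -1
            for l in unpacked:
                shared = len(lut_nets[l] & nets)   # attraction: shared nets
                if shared > gain and len(nets | lut_nets[l]) <= max_inputs:
                    best, gain = l, shared
            if best is None:
                break
            cluster.append(best)
            nets |= lut_nets[best]
            unpacked.remove(best)
        clusters.append(cluster)
    return clusters
```

Absorbing shared nets inside a cluster is what reduces both routing demand and delay, which is why the attraction function rewards shared connections.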

2.3 A Methodology for Fast FPGA Floorplanning [p. 47]
John Emmert, Dinesh Bhatia, University of Cincinnati

Floorplanning is an important problem in FPGA circuit mapping. As FPGA capacity grows, new innovative approaches will be required for efficiently mapping circuits to FPGAs. In this paper we present a macro based floorplanning methodology suitable for mapping large circuits to large, high density FPGAs. Our method uses clustering techniques to combine macros into clusters, and then uses a tabu search based approach to place clusters while enhancing both circuit routability and performance. Our method is capable of handling both hard (fixed size and shape) macros and soft (fixed size and variable shape) macros. We demonstrate our methodology on several macro based circuit designs and compare the execution speed and quality of results with commercially available CAE tools. Our approach shows a dramatic speedup in execution time without any negative impact on quality.
Key Words: Floorplanning, Placement, FPGA, Clustering, Tabu Search


Session 3: FPGA Architecture Studies

Chair: Tim Southgate, Altera
3.1 FPGA Routing Architecture: Segmentation and Buffering to Optimize Speed and Density [p. 59]
Vaughn Betz, Jonathan Rose, University of Toronto

In this work we investigate the routing architecture of FPGAs, focusing primarily on determining the best distribution of routing segment lengths and the best mix of pass transistor and tri-state buffer routing switches. While most commercial FPGAs contain many length 1 wires (wires that span only one logic block) we find that wires this short lead to FPGAs that are inferior in terms of both delay and routing area. Our results show instead that it is best for FPGA routing segments to have lengths of 4 to 8 logic blocks. We also show that 50% to 80% of the routing switches in an FPGA should be pass transistors, with the remainder being tri-state buffers. Architectures that employ the best segmentation distributions and the best mixes of pass transistor and tri-state buffer switches found in this paper are not only 11% to 18% faster than a routing architecture very similar to that of the Xilinx XC4000X but also considerably simpler. These results are obtained using an architecture investigation infrastructure that contains a fully timing-driven router and detailed area and delay models.

3.2 Balancing Interconnect and Computation in a Reconfigurable Computing Array (or, why you don't really want 100% LUT utilization) [p. 69]
André DeHon, University of California, Berkeley

FPGA users often view the ability of an FPGA to route designs with high LUT (gate) utilization as a feature, leading them to demand high gate utilization from vendors. We present initial evidence from a hierarchical array design showing that high LUT utilization is not directly correlated with efficient silicon usage. Rather, since interconnect resources consume most of the area on these devices (often 80-90%), we can achieve more area-efficient designs by allowing some LUTs to go unused - allowing us to use the dominant resource, interconnect, more efficiently. This extends the "Sea-of-gates" philosophy, familiar from mask-programmable gate arrays, to FPGAs. Also introduced in this work is an algorithm for "depopulating" the gates in a hierarchical network to match the limited wiring resources.


Session 4: Rapid Reconfiguration

Chair: Mike Butts, Synopsys
4.1 Configuration Cloning: Exploiting Regularity in Dynamic DSP Architectures [p. 81]
S.R. Park, W. Burleson, University of Massachusetts

Existing FPGAs have fairly simple and inefficient configuration mechanisms due to the relative infrequency of reconfiguration. However, a large class of dynamically configurable architectures for DSP and communications can benefit from special-purpose configuration mechanisms which allow significant savings in configuration speed, power and memory. Lightweight configuration mechanisms allow much finer-grained dynamic reconfiguration of DSP and communications functions, tuning algorithm and architecture parameters incrementally to track data and environment variations. These adaptive techniques exploit the time-varying nature of many DSP applications and avoid the power costs associated with worst-case design. In this paper we develop a new FPGA configuration method called configuration cloning, by analogy with biological cloning. The main motivation behind configuration cloning is to exploit spatial and temporal regularity and locality in algorithms and architectures by copying and operating on the configuration bit-stream already resident in an FPGA. This results in a speed and power improvement over off-chip partial reconfiguration techniques but does require additional interconnect and control hardware on the FPGA. In contrast to multiple-context and striped configuration approaches, however, cloning requires only a small amount of hardware overhead to greatly increase the on-chip configuration bandwidth. Details of the configuration cloning mechanism are described. Two familiar DSP applications, motion estimation and FIR filtering, are used to demonstrate order-of-magnitude reductions in configuration time and power compared to existing configuration techniques. Resource recycling is also presented as a generic method of reclaiming freed-up logic resources and using them as local memory to greatly reduce I/O requirements.

4.2 Don't Care Discovery for FPGA Configuration Compression [p. 91]
Zhiyuan Li, Scott Hauck, Northwestern University

One of the major overheads in reconfigurable computing is the time it takes to reconfigure the devices in the system. The configuration compression algorithm presented in our previous paper [Hauck98c] is one efficient technique for reducing this overhead. In this paper, we develop an algorithm for finding Don't Care bits in configurations to improve the compressibility of the configuration data. With the help of the Don't Cares, higher configuration compression ratios can be achieved using our modified configuration compression algorithm, improving compression ratios to a factor of 7, where our original algorithm achieved only a factor of 4.


Session 5: Computational Applications

Chair: Brad Hutchings, Brigham Young University
5.1 An FPGA-Based Hardware Implementation of Generalized Profile Search Using Online Arithmetic [p. 101]
Emeka Mosanya, Eduardo Sanchez, Swiss Federal Institute of Technology

This paper describes the hardware implementation of the Generalized Profile Search algorithm using online arithmetic and redundant data representation. This is part of the GenStorm project, aimed at providing a dedicated computer for biological sequence processing based on reconfigurable hardware using FPGAs. The serial evaluation of the result, made possible by a redundant data representation, leads to a significant increase in data throughput in comparison with standard non-redundant data coding.

5.2 Procedural Texture Mapping on FPGAs [p. 112]
Andy G. Ye, David M. Lewis, University of Toronto

Procedural textures can be effectively used to enhance the visual realism of computer rendered images. Procedural textures can provide higher realism for 3-D objects than traditional hardware texture mapping methods which use memory to store 2-D texture images. This paper proposes a new method of hardware texture mapping in which texture images are synthesized using FPGAs. This method is very efficient for texture mapping procedural textures of more than two input variables. By synthesizing these textures on the fly, the large amount of memory required to store their multidimensional texture images is eliminated, making texture mapping of 3-D textures and parameterized textures feasible in hardware. This paper shows that using FPGAs, procedural textures can be synthesized at high speed, with a small hardware cost. Data on the performance and the hardware cost of synthesizing procedural textures in FPGAs are presented. This paper also presents the FPGA implementations of two Perlin-noise-based 3-D procedural textures.


Panel: FPGAs in the Era of Systems-on-a-Chip [p. 121]

Moderator: Scott Hauck, Northwestern University

Session 6: FPGAs for Custom Computing

Chair: Carl Ebeling, University of Washington
6.1 HSRA: High-Speed, Hierarchical Synchronous Reconfigurable Array [p. 125]
William Tsu, Kip Macy, Atul Joshi, Randy Huang, Norman Walker, Tony Tung, Omid Rowhani, Varghese George, John Wawrzynek, André DeHon, University of California, Berkeley

There is no inherent characteristic forcing Field Programmable Gate Array (FPGA) or Reconfigurable Computing (RC) Array cycle times to be greater than processors in the same process. Modern FPGAs seldom achieve application clock rates close to their processor cousins because (1) resources in the FPGAs are not balanced appropriately for high-speed operation, (2) FPGA CAD does not automatically provide the requisite transforms to support this operation, and (3) interconnect delays can be large and vary almost continuously, complicating high frequency mapping. We introduce a novel reconfigurable computing array, the High-Speed, Hierarchical Synchronous Reconfigurable Array (HSRA), and its supporting tools. This package demonstrates that computing arrays can achieve efficient, high-speed operation. We have designed and implemented a prototype component in a 0.4um logic design on a DRAM process which will support 250MHz operation for CAD mapped designs.

6.2 A Reconfigurable Arithmetic Array for Multimedia Applications [p. 135]
Alan Marshall, Hewlett Packard; Jean Vuillemin, Ecole Normale Supérieure; Tony Stansfield, Igor Kostarnov, Hewlett Packard; Brad Hutchings, Brigham Young University

In this paper we describe a reconfigurable architecture optimised for media processing, and based on 4-bit ALUs and interconnect.
Keywords: Reconfigurable computing, multimedia, 4-bit ALU, FPGA.

6.3 Memory Interfacing and Instruction Specification for Reconfigurable Processors [p. 145]
Jeffrey A. Jacob, Paul Chow, University of Toronto

As custom computing machines evolve, it is clear that a major bottleneck is the slow interconnection architecture between the logic and memory. This paper describes the architecture of a custom computing machine that overcomes the interconnection bottleneck by closely integrating a fixed-logic processor, a reconfigurable logic array, and memory into a single chip, called OneChip-98. The OneChip-98 system has a seamless programming model that enables the programmer to easily specify instructions without additional complex instruction decoding hardware. As well, there is a simple scheme for mapping instructions to the corresponding programming bits. To allow the processor and the reconfigurable array to execute concurrently, the programming model utilizes a novel memory-consistency scheme implemented in the hardware. To evaluate the feasibility of the OneChip-98 architecture, a 32-bit MIPS-like processor and several performance enhancement applications were mapped to the Transmogrifier-2 field programmable system. For two typical applications, the 2-dimensional discrete cosine transform and the 64-tap FIR filter, we were capable of achieving a performance speedup of over 30 times that of a stand-alone state-of-the-art processor.
Keywords: Reconfigurable computer, FPGA, reconfigurable processor, memory interfacing, memory coherence


Session 7: Placement and Routing

Chair: Martine Schlag, University of California, Santa Cruz
7.1 Trading Quality for Compile Time: Ultra-Fast Placement for FPGAs [p. 157]
Yaska Sankar, Jonathan Rose, University of Toronto

The demand for high-speed FPGA compilation tools has arisen for three reasons: first, as FPGA device capacity has grown, the computation time devoted to placement and routing has grown more dramatically than the compute power of the available computers. Second, there exists a subset of users who are willing to accept a reduction in the quality of result in exchange for a high-speed compilation. Third, high-speed compile has been a long-standing desire of users of FPGA-based custom computing machines, since their compile time requirements are ideally closer to those of regular computers. This paper focuses on the placement phase of the compile process, and presents an ultra-fast placement algorithm targeted to FPGAs. The algorithm is based on a combination of multiple-level, bottom-up clustering and hierarchical simulated annealing. It provides superior area results over a known high-quality placement tool on a set of large benchmark circuits, when both are restricted to a short run time. For example, it can generate a placement for a 100,000-gate circuit in 10 seconds on a 300 MHz Sun UltraSPARC workstation that is only 33% worse than a high-quality placement that takes 524 seconds using a pure simulated annealing implementation. In addition, operating in its fastest mode, this tool can provide an accurate estimate of the wirelength achievable with good quality placement. This can be used, in conjunction with a routing predictor, to very quickly determine the routability of a given circuit on a given FPGA device.
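
The simulated-annealing core that such placers build on can be illustrated with a flat, toy version. Function and parameter names here are hypothetical; the paper's contribution is the multi-level clustering and hierarchical annealing schedule layered on top of this basic loop, which is shown only as a sketch.

```python
import math
import random

def anneal_placement(nets, n_blocks, grid, iters=20000, t0=5.0, alpha=0.999, seed=0):
    """Toy flat simulated-annealing placer.

    nets:  list of lists of block ids
    grid:  (rows, cols) with rows*cols >= n_blocks
    Cost = sum over nets of the half-perimeter bounding box (HPWL).
    """
    rng = random.Random(seed)
    rows, cols = grid
    slots = [(r, c) for r in range(rows) for c in range(cols)]
    rng.shuffle(slots)
    pos = {b: slots[b] for b in range(n_blocks)}   # random initial placement

    def cost():
        # Recomputed from scratch for clarity; real placers update incrementally.
        total = 0
        for net in nets:
            xs = [pos[b][0] for b in net]
            ys = [pos[b][1] for b in net]
            total += (max(xs) - min(xs)) + (max(ys) - min(ys))
        return total

    t, cur = t0, cost()
    for _ in range(iters):
        a, b = rng.sample(range(n_blocks), 2)
        pos[a], pos[b] = pos[b], pos[a]            # propose a swap
        new = cost()
        if new <= cur or rng.random() < math.exp((cur - new) / t):
            cur = new                              # accept (Metropolis criterion)
        else:
            pos[a], pos[b] = pos[b], pos[a]        # reject: undo the swap
        t *= alpha                                 # geometric cooling
    return pos, cur
```

Annealing on clusters first, then refining within cluster regions, is what lets the paper's tool trade placement quality for orders of magnitude in compile time.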

7.2 Satisfiability-Based Layout Revisited: Detailed Routing of Complex FPGAs Via Search-Based Boolean SAT [p. 167]
Gi-Joon Nam, Karem A. Sakallah, University of Michigan; Rob A. Rutenbar, Carnegie Mellon University

Boolean-based routing transforms the geometric FPGA routing task into a single, large Boolean equation with the property that any assignment of input variables that "satisfies" the equation (that renders the equation identically "1") specifies a valid routing. The formulation has the virtue that it considers all nets simultaneously, and the absence of a satisfying assignment implies that the layout is unroutable. Initial Boolean-based approaches to routing used Binary Decision Diagrams (BDDs) to represent and solve the layout problem. BDDs, however, limit the size and complexity of the FPGAs that can be routed, leading these approaches to concentrate only on individual FPGA channels. In this paper, we present a new search-based Satisfiability (SAT) formulation that can handle entire FPGAs, routing all nets concurrently. The approach relies on a recently developed SAT engine (GRASP) that uses systematic search with conflict-directed non-chronological backtracking, capable of handling very large SAT instances. We present the first comparisons of search-based SAT routing results to other routers, and offer the first evidence that SAT methods can actually demonstrate the unroutability of a layout. Preliminary experimental results suggest that this approach to FPGA routing is more viable than earlier BDD-based methods.
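
The flavor of the Boolean formulation can be conveyed with a toy single-channel example: variable x[n][t] = 1 means net n uses track t, each net must get a track, and two nets whose column spans overlap may not share one. This sketch searches candidate assignments directly instead of emitting CNF for a solver like GRASP, and is purely illustrative of the encoding, not the paper's whole-chip formulation.

```python
from itertools import product

def route_channel(spans, n_tracks):
    """Toy SAT-style channel router.

    spans: list of (left, right) column intervals, one per net.
    Returns a track assignment per net, or None if no assignment
    satisfies the constraints (i.e., the channel is unroutable).
    """
    def overlap(a, b):
        # Two closed intervals overlap unless one ends before the other starts.
        return not (a[1] < b[0] or b[1] < a[0])

    n = len(spans)
    for assign in product(range(n_tracks), repeat=n):   # candidate "model"
        ok = all(not (assign[i] == assign[j] and overlap(spans[i], spans[j]))
                 for i in range(n) for j in range(i + 1, n))
        if ok:
            return list(assign)          # a satisfying assignment = a routing
    return None                          # UNSAT: proof of unroutability
```

The key property the paper exploits is the UNSAT side: exhausting the search space without a satisfying assignment is a proof that no routing exists, something rip-up-and-reroute heuristics cannot provide.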
Keywords: Boolean satisfiability, FPGA routing, conflict-directed search

7.3 Multi-Terminal Net Routing for Partial Crossbar-Based Multi-FPGA Systems [p. 176]
Abdel Ejnioui, University of South Florida; N. Ranganathan, University of Texas

Multi-FPGA systems are used as custom computing machines to solve compute-intensive problems and also in the verification and prototyping of large circuits. In this paper, we address the problem of routing multi-terminal nets in a multi-FPGA system that uses partial crossbars as interconnect structures. First, we model the multi-terminal routing problem as a partitioned bin packing problem and formulate it as an integer linear programming problem where the number of variables is exponential. A fast heuristic is applied to compute an upper bound on the routing solution. Then, a column generation technique is used to solve the linear relaxation of the initial master problem in order to obtain a lower bound on the routing solution. This is followed by an iterative branch-and-price procedure that attempts to find a routing solution between the two established bounds. In this regard, the proposed algorithm guarantees an exact routing solution by searching a branch-and-price tree. Due to the tightness of the bounds, the branch-and-price tree is small, resulting in short execution times. Experimental results are provided for different netlists and board configurations in order to demonstrate the algorithm's performance. The obtained results show that the algorithm finds an exact routing solution in a very short time.
Keywords: FPGA routing, FPGA architecture, layout synthesis, interconnect optimization, branch-and-price, integer programming.


Session 8: DPGAs and Pipeline Configurable FPGAs

Chair: David Lewis, University of Toronto
8.1 Circuit Partitioning for Dynamically Reconfigurable FPGAs [p. 187]
Huiqun Liu, D.F. Wong, University of Texas

Dynamically reconfigurable FPGAs have the potential to dramatically improve logic density by time-sharing a physical FPGA device. This paper presents a network-flow based partitioning algorithm for dynamically reconfigurable FPGAs based on the architecture in [2]. Experiments show that our approach outperforms the enhanced force-directed scheduling method in [2] in terms of communication cost.

8.2 Fast Compilation for Pipelined Reconfigurable Fabrics [p. 195]
Mihai Budiu, Seth Copen Goldstein, Carnegie Mellon University

In this paper we describe a compiler which quickly synthesizes high quality pipelined datapaths for pipelined reconfigurable devices. The compiler uses the same internal representation to perform synthesis, module generation, optimization, and place and route. The core of the compiler is a linear time place and route algorithm more than two orders of magnitude faster than traditional CAD tools. The key behind our approach is that we never backtrack, rip-up, or re-route. Instead, the graph representing the computation is preprocessed to guarantee routability by inserting lazy noops. The preprocessing step provides enough information to make a greedy strategy feasible. The compilation speed is approximately 3000 bit-operations/second (on a PII/400MHz) for a wide range of applications. The hardware utilization averages 60% on the target device, PipeRench.

8.3 Configuration Caching vs. Data Caching for Striped FPGAs [p. 206]
Deepali Deshpande, Arun K. Somani, Akhilesh Tyagi, Iowa State University

A striped FPGA [1], or pipeline-reconfigurable FPGA, provides hardware virtualization by supporting fast run-time reconfiguration. In this paper we show that the performance of a striped FPGA depends on the reconfiguration pattern, i.e., the run-time scheduling of configurations through the FPGA. We study two main configuration scheduling approaches: Configuration Caching and Data Caching. We present a quantitative analysis of these scheduling techniques to compute their total execution cycles, taking into account the overhead caused by I/O with the external memory. Based on this analysis we can determine which scheduling technique works better for a given application and given hardware parameters.
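
The trade-off being analyzed can be illustrated with a deliberately simple cycle-count model. All parameter names here (C configurations, D data items, r cycles per reconfiguration, m cycles per external-memory data transfer, B items of on-chip data buffer) are this sketch's assumptions, not the paper's actual model, which accounts for the striped pipeline in far more detail.

```python
def config_caching_cycles(C, D, r, m):
    """Configuration caching: load each of the C configurations once
    (r cycles), then stream all D data items through it from external
    memory (m cycles per item)."""
    return C * (r + D * m)

def data_caching_cycles(C, D, r, m, B):
    """Data caching: load data in on-chip blocks of B items (B*m cycles),
    then sweep all C configurations over the resident block (C*r cycles
    per sweep)."""
    blocks = -(-D // B)                  # ceil(D / B)
    return blocks * (B * m + C * r)
```

Under this model, data caching wins when data I/O dominates (large D*m relative to C*r) and a usefully large block fits on chip, while configuration caching wins when reconfiguration is the expensive operation, which mirrors the qualitative conclusion the abstract describes.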


Session 9: Applications

Chair: Ray Andraka, Andraka Consulting Group
9.1 String Matching on Multicontext FPGAs using Self-Reconfiguration [p. 217]
Reetinder P.S. Sidhu, Alessandro Mei, Viktor K. Prasanna, University of Southern California

FPGAs can perform better than ASICs if the logic mapped onto them is optimized for each problem instance. Unfortunately, this advantage is often canceled by the long time needed by CAD tools to generate problem instance dependent logic and the time required to configure the FPGAs. In this paper, a novel approach for runtime mapping is proposed that utilizes self-reconfigurability of multicontext FPGAs to achieve very high speedups over existing approaches. The key idea is to design and map logic onto a multicontext FPGA that in turn maps problem instance dependent logic onto other contexts of the same FPGA. As a result, CAD tools need to be used just once for each problem and not once for every problem instance as is usually done. To demonstrate the feasibility of our approach, a detailed implementation of the KMP string matching algorithm is presented which involves runtime construction of a finite state machine. We implement the KMP algorithm on a conventional FPGA (Xilinx XC6216) and use it to obtain accurate estimates of performance on a multicontext device. Speedups in mapping time of ~10^6 over CAD tools and more than 1800 over a program written specifically for FSM generation were obtained. Significant speedups were obtained in overall execution time as well, including a speedup ranging from 3 to 16 times over a software implementation of the KMP algorithm running on a Sun Ultra 1 Model 140 workstation.
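
For reference, the pattern-dependent structure that must be built at runtime is the classic KMP failure (prefix) function, which the paper realizes in hardware as a finite state machine. The software sketch below shows the standard algorithm only; how it is mapped onto FPGA contexts is the paper's subject and is not reproduced here.

```python
def kmp_failure(pattern):
    """Compute the KMP failure (prefix) function: f[i] is the length of
    the longest proper prefix of pattern[:i+1] that is also a suffix."""
    f = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = f[k - 1]                 # fall back along the failure links
        if pattern[i] == pattern[k]:
            k += 1
        f[i] = k
    return f

def kmp_search(text, pattern):
    """Stream the text through the pattern FSM; return match start indices."""
    f, k, out = kmp_failure(pattern), 0, []
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = f[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            out.append(i - k + 1)        # record the match's start position
            k = f[k - 1]                 # continue for overlapping matches
    return out
```

Because the failure function depends only on the pattern, each new pattern requires a new FSM; this is exactly the per-problem-instance logic generation whose cost the self-reconfiguration approach attacks.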

9.2 Reduction of Latency and Resource Usage in Bit-Level Pipelined Data Paths for FPGAs [p. 227]
P. Kollig, B.M. Al-Hashimi, Staffordshire University

Pipelining of data path structures increases the throughput rate at the expense of enlarged resource usage and latency unless architectures optimized towards specific applications are used. This paper describes a novel methodology for the design of generic bit-level pipelined data paths that have the low resource usage and latency of specifically tailored architectures but still allow the flexible trade-off between speed and resource requirements inherent in generic circuits. This is achieved through the elimination of all skew and alignment flip-flops from the data path whilst still maintaining the original pipelining scheme, hence allowing more compact structures with decreased circuit delays. The resulting low latency is beneficial in the realization of all recursive signal-processing applications, and the reduced resource usage enables, in particular, the efficient FPGA realization of high performance signal processing functions. The design process is illustrated through the high level synthesis-based FPGA realization of a 9th-order wave digital filter, demonstrating that high performance and efficient resource usage are possible. For example, the implementation of a wave digital filter with 10-bit signal word length and 6-bit coefficients using a Xilinx XC4013XL-1 device supports sample rates of 2.5 MHz.
Keywords: Bit-level pipelined, circuit latency, FPGA, recursive algorithms

9.3 Exploiting FPGA-Features during the Emulation of a Fast Reactive Embedded System [p. 235]
Karlheinz Weiß, Thorsten Steckstor, Gernot Koch, Wolfgang Rosenstiel, University of Tübingen

This paper presents the emulation of an embedded system with hard real-time constraints and response times of about 220ms. We show that for such fast reactive systems, the software overhead of a Real Time Operating System (RTOS) becomes a limiting factor, consuming up to 77% of the total execution performance. We analyze features of different FPGA architectures in order to resolve this system performance bottleneck. We show that moving functionality from software to hardware, by exploiting the fine-grained on-chip SRAM capability of the Xilinx XC4000 architecture, eliminates the RTOS overhead at the cost of only a slight increase, about 28%, in the FPGA CLB resources used. These investigations were conducted using our own emulation environment, SPYDER-CORE-P1.


Poster Paper Abstracts [p. 243]