FPGA'99 ABSTRACTS
Chair: Jonathan Rose, University of Toronto
-
1.1 A new high density and very low cost reprogrammable FPGA architecture [p. 3]
- Sinan Kaptanoglu, Greg Bakker, Arun Kundu, Ivan Corneillet, Actel
Corporation; Ben Ting, BTR Inc.
A new reprogrammable FPGA architecture is described which is specifically
designed for very low cost. It covers a range of 35K to a million usable
gates. In addition, it delivers high performance and is synthesis
efficient. This architecture is loosely based on an earlier reprogrammable
Actel architecture named ES. By changing the structure of the interconnect
and by making other improvements, we achieved an average cost reduction by
a factor of three per usable gate. The first member of the family based
on this architecture is fabricated on a 2.5V standard 0.25u CMOS technology
with a gate count of up to 130K, which also includes 36K bits of two-port
RAM. The gate count of this part is verified in a fully automatic design
flow starting from a high-level description followed by synthesis, technology
mapping, place and route, and timing extraction.
-
1.2 Hybrid Product Term and LUT Based Architectures Using Embedded Memory
Blocks [p. 13]
- Frank Heile, Andrew Leaver, Altera Corporation
The Embedded System Block (ESB) of the APEX20K programmable logic device
family from Altera Corporation includes the capability of implementing
product term macrocells in addition to flexibly configurable ROM and dual
port RAM. In product term mode, each ESB has 16 macrocells built out of
32 product terms with 32 literal inputs. The ability to reconfigure
memory blocks in this way represents a new and innovative use of resources
in a programmable logic device, requiring creative solutions in both
the hardware and software domains. The architecture and features of this
Embedded System Block are described.
Keywords: Product terms, RAM, heterogeneous architecture.
-
1.3 An Innovative, Segmented High Performance FPGA Family with
Variable-Grain-Architecture and Wide-gating Functions [p. 17]
- Om Agrawal, Herman Chang, Brad Sharpe-Geisler, Nick Schmitz, Bai Nguyen,
Jack Wong, Giap Tran, Fabiano Fontana, Bill Harding, Vantis Corporation
This paper describes the Vantis VF1 FPGA architecture, an innovative
architecture based on 0.25u (drawn) (0.18u Leff)/4-metal technology.
It was designed from scratch for high performance, routability and
ease-of-use. It supports system level functions (including wide gating
functions, dual-port SRAMs, high speed carry chains, and high speed
IO blocks) with a symmetrical structure. Additionally, the architecture
of each of the critical elements, including variable-grain logic blocks,
variable-length interconnects, dual-port embedded SRAM blocks, I/O blocks
and on-chip PLL functions, is described.
Chair: Margaret Marek-Sadowska, University of California, Santa Barbara
-
2.1 Cut Ranking and Pruning: Enabling a General and Efficient FPGA Mapping
Solution [p. 29]
- Jason Cong, Chang Wu, University of California, Los Angeles,
Yuzheng Ding, Lucent Technologies
Cut enumeration is a common approach used in a number of FPGA synthesis
and mapping algorithms for consideration of various possible LUT
implementations at each node in a circuit. Such an approach is very general
and flexible, but often suffers from high computational complexity and poor
scalability. In this paper, we develop several efficient and effective
techniques on cut enumeration, ranking and pruning. These techniques
lead to much better runtime and scalability of the cut-enumeration based
algorithms; they can also be used to compute a tight lower-bound on the size
of an area-minimum mapping solution. For area-oriented FPGA mapping,
experimental results show that the new techniques lead to over 160X speed-up
over the original optimal duplication-free mapping algorithm, achieve
mapping solutions with 5-21% smaller area for heterogeneous FPGAs compared
to those by Chortle-cfr [6], MIS-pga-new [9], and TOS-TUM [4], yet with
over 100X speed-up over MIS-pga-new [9] and TOS-TUM [4].
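The cut-enumeration idea this abstract builds on can be sketched in a few
lines. The Python toy below is illustrative only (the DAG encoding and
function name are this sketch's invention, and it omits the ranking and
pruning techniques that are the paper's actual contribution): it enumerates
all K-feasible cuts of each node, i.e. the candidate K-input LUT boundaries.

```python
from itertools import product

def enumerate_cuts(fanins, K):
    """Enumerate all K-feasible cuts for each node of a DAG.

    fanins maps node -> list of fanin nodes (empty for primary inputs);
    nodes must be listed in topological order. A cut is a frozenset of
    nodes covering the node, i.e. a candidate K-input LUT boundary.
    """
    cuts = {}
    for node, ins in fanins.items():
        node_cuts = {frozenset([node])}          # the trivial cut
        if ins:
            # Merge one cut from each fanin; keep only K-feasible unions.
            for combo in product(*(cuts[i] for i in ins)):
                c = frozenset().union(*combo)
                if len(c) <= K:
                    node_cuts.add(c)
        cuts[node] = node_cuts
    return cuts
```

For a two-gate circuit (d = AND(a, b), e = AND(d, c)) this yields three
cuts at e: {e}, {d, c}, and {a, b, c}; the combinatorial blow-up of this
merge step on real circuits is exactly what the paper's ranking and
pruning keep in check.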
-
2.2 Using Cluster-Based Logic Blocks and Timing-Driven Packing to Improve
FPGA Speed and Density [p. 37]
- Alexander Marquardt, Vaughn Betz, Jonathan Rose, University of Toronto
In this paper, we investigate the speed and area-efficiency of
FPGAs employing "logic clusters" containing multiple LUTs and
registers as their logic block. We introduce a new, timing-driven
tool (T-VPack) to "pack" LUTs and registers into these logic
clusters, and we show that this algorithm is superior to an existing
packing algorithm. Then, using a realistic routing architecture and
sophisticated delay and area models, we empirically evaluate
FPGAs composed of clusters ranging in size from one to twenty
LUTs, and show that clusters of size seven through ten provide the
best area-delay trade-off. Compared to circuits implemented in an
FPGA composed of size one clusters, circuits implemented in an
FPGA with size seven clusters have 30% less delay (a 43% increase
in speed) and require 8% less area, and circuits implemented in an
FPGA with size ten clusters have 34% less delay (a 52% increase in
speed), and require no additional area.
-
2.3 A Methodology for Fast FPGA Floorplanning [p. 47]
- John Emmert, Dinesh Bhatia, University of Cincinnati
Floorplanning is an important problem in FPGA circuit mapping.
As FPGA capacity grows, new innovative approaches will be required
for efficiently mapping circuits to FPGAs. In this paper
we present a macro based floorplanning methodology suitable for
mapping large circuits to large, high density FPGAs. Our method
uses clustering techniques to combine macros into clusters, and
then uses a tabu search based approach to place clusters while
enhancing both circuit routability and performance. Our method
is capable of handling both hard (fixed size and shape) macros
and soft (fixed size and variable shape) macros. We demonstrate
our methodology on several macro based circuit designs and compare
the execution speed and quality of results with commercially
available CAE tools. Our approach shows a dramatic speedup in
execution time without any negative impact on quality.
Key Words: Floorplanning, Placement, FPGA, Clustering, Tabu
Search
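The tabu search at the heart of this methodology can be illustrated with
a toy version. In the Python sketch below (the function name, HPWL cost
model, and parameters are assumptions of this sketch, not the authors'
tool), cluster slots are swapped to shorten nets while a short tabu list
blocks recently tried moves unless they beat the best cost seen:

```python
import random

def tabu_place(nets, positions, iters=200, tabu_len=10, seed=0):
    """Toy tabu-search placement: swap cluster slots to shorten nets.

    nets: list of cluster-name lists; positions: cluster -> (x, y) slot.
    Cost is total half-perimeter wirelength (HPWL). Recently swapped
    pairs are tabu unless the move improves the best cost seen.
    """
    rng = random.Random(seed)
    def hpwl(pos):
        total = 0
        for net in nets:
            xs = [pos[c][0] for c in net]; ys = [pos[c][1] for c in net]
            total += (max(xs) - min(xs)) + (max(ys) - min(ys))
        return total
    best = dict(positions); best_cost = hpwl(best)
    cur = dict(best); tabu = []
    names = list(positions)
    for _ in range(iters):
        a, b = rng.sample(names, 2)
        move = frozenset((a, b))
        cur[a], cur[b] = cur[b], cur[a]
        cost = hpwl(cur)
        if move in tabu and cost >= best_cost:
            cur[a], cur[b] = cur[b], cur[a]      # tabu move: undo it
            continue
        tabu.append(move); tabu = tabu[-tabu_len:]
        if cost < best_cost:
            best, best_cost = dict(cur), cost
    return best, best_cost
```

Accepting non-improving swaps while remembering the best placement is
what lets tabu search climb out of local minima that a purely greedy
swapper would get stuck in.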
Chair: Tim Southgate, Altera
-
3.1 FPGA Routing Architecture: Segmentation and Buffering to Optimize Speed
and Density [p. 59]
- Vaughn Betz, Jonathan Rose, University of Toronto
In this work we investigate the routing architecture of FPGAs,
focusing primarily on determining the best distribution of routing
segment lengths and the best mix of pass transistor and tri-state
buffer routing switches. While most commercial FPGAs contain
many length 1 wires (wires that span only one logic block), we find
that wires this short lead to FPGAs that are inferior in terms of both
delay and routing area. Our results show instead that it is best for
FPGA routing segments to have lengths of 4 to 8 logic blocks. We
also show that 50% to 80% of the routing switches in an FPGA
should be pass transistors, with the remainder being tri-state buffers.
Architectures that employ the best segmentation distributions
and the best mixes of pass transistor and tri-state buffer switches
found in this paper are not only 11% to 18% faster than a routing
architecture very similar to that of the Xilinx XC4000X but also
considerably simpler. These results are obtained using an architecture
investigation infrastructure that contains a fully timing-driven
router and detailed area and delay models.
-
3.2 Balancing Interconnect and Computation in a Reconfigurable Computing
Array (or, why you don't really want 100% LUT utilization) [p. 69]
- André DeHon, University of California, Berkeley
FPGA users often view the ability of an FPGA to route designs with
high LUT (gate) utilization as a feature, leading them to demand
high gate utilization from vendors. We present initial evidence
from a hierarchical array design showing that high LUT utilization
is not directly correlated with efficient silicon usage. Rather, since
interconnect resources consume most of the area on these devices
(often 80-90%), we can achieve more area-efficient designs by allowing
some LUTs to go unused - allowing us to use the dominant
resource, interconnect, more efficiently. This extends the "Sea-of-gates"
philosophy, familiar to mask programmable gate arrays, to
FPGAs. Also introduced in this work is an algorithm for "depopulating"
the gates in a hierarchical network to match the limited
wiring resources.
Chair: Mike Butts, Synopsys
-
4.1 Configuration Cloning: Exploiting Regularity in Dynamic DSP Architectures
[p. 81]
- S.R. Park, W. Burleson, University of Massachusetts
Existing FPGAs have fairly simple and inefficient configuration
mechanisms due to the relative infrequency of reconfiguration.
However a large class of dynamically configurable architectures
for DSP and communications can benefit from special-purpose
configuration mechanisms which allow significant savings in
configuration speed, power and memory. Lightweight configuration
mechanisms allow much finer-grained dynamic reconfiguration
of DSP and communications functions that tune algorithm
and architecture parameters incrementally to track data and environment
variations. These adaptive techniques exploit the
time-varying nature of many DSP applications and avoid
the power costs associated with worst-case design.
In this paper we develop a new FPGA configuration
method called configuration cloning from the analogy of
biological cloning. The main motivation behind configuration
cloning is to exploit spatial and temporal regularity and
locality in algorithms and architectures by copying
and operating on the configuration bit-stream already resident
in an FPGA. This results in a speed and power improvement
over off-chip partial reconfiguration techniques,
but it does require additional interconnect and control hardware
on the FPGA. In contrast to multiple-context and striped
configuration approaches, however, cloning requires only a small amount
of hardware overhead to greatly increase the on-chip configuration
bandwidth. Details of the configuration cloning mechanism are
described. Two familiar DSP applications, motion estimation and FIR
filtering, are explained to demonstrate order of magnitude reductions
in configuration time and power compared to existing configuration
techniques. Resource recycling is also presented as a generic
method of reclaiming freed up logic resources and using them as
local memory to greatly reduce I/O requirements.
-
4.2 Don't Care Discovery for FPGA Configuration Compression [p. 91]
- Zhiyuan Li, Scott Hauck, Northwestern University
One of the major overheads in reconfigurable computing is the time it takes
to reconfigure the devices in the system. The configuration compression
algorithm presented in our previous paper [Hauck98c] is one efficient
technique for reducing this overhead. In this paper, we develop an
algorithm for finding Don't Care bits in configurations to improve the
compatibility of the configuration data. With the help of the Don't
Cares, higher configuration compression ratios can be achieved by using our
modified configuration compression algorithm. This improves compression
ratios to a factor of 7, where our original algorithm achieved only a
factor of 4.
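Why Don't Care bits raise compression ratios can be seen with a toy
run-length encoder (an illustration only: the authors' algorithm is the
LZ-based scheme of [Hauck98c], not run-length coding, and the 'x' bit
encoding here is this sketch's own). A configuration bit marked 'x' may
take whichever value extends the surrounding run, so runs get longer and
fewer codes are needed:

```python
def rle_with_dontcares(bits):
    """Run-length encode a configuration bitstream in which 'x' marks a
    Don't Care bit that may be absorbed into the surrounding run.

    bits: string over {'0', '1', 'x'}. Returns a list of (value, length)
    runs; a run of leading 'x's takes the value of the first concrete bit.
    """
    runs = []
    val, length = None, 0
    for b in bits:
        if b == 'x' or b == val:
            length += 1                    # extend the current run
        elif val is None:
            val, length = b, length + 1    # first concrete bit seen
        else:
            runs.append((val, length))     # run ends; start a new one
            val, length = b, 1
    if length:
        runs.append((val if val is not None else '0', length))
    return runs
```

For instance '111x110x' collapses to just two runs, ('1', 6) and
('0', 2), whereas forcing concrete values on the 'x' positions could
split it into four.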
Chair: Brad Hutchings, Brigham Young University
-
5.1 An FPGA-Based Hardware Implementation of Generalized Profile Search Using
Online Arithmetic [p. 101]
- Emeka Mosanya, Eduardo Sanchez, Swiss Federal Institute of Technology
This paper describes the hardware implementation of the Generalized
Profile Search algorithm using online arithmetic and redundant data
representation. This is part of the GenStorm project, aimed at providing
a dedicated computer for biological sequence processing based on
reconfigurable hardware using FPGAs. The serial evaluation of the result
made possible by a redundant data representation leads to a significant
increase of data throughput in comparison with standard non-redundant
data coding.
-
5.2 Procedural Texture Mapping on FPGAs [p. 112]
- Andy G. Ye, David M. Lewis, University of Toronto
Procedural textures can be effectively used to enhance the visual realism
of computer rendered images. Procedural textures can provide
higher realism for 3-D objects than traditional hardware texture
mapping methods which use memory to store 2-D texture images.
This paper proposes a new method of hardware texture mapping in
which texture images are synthesized using FPGAs. This method is
very efficient for texture mapping procedural textures of more than
two input variables. By synthesizing these textures on the fly, the
large amount of memory required to store their multidimensional
texture images is eliminated, making texture mapping of 3-D textures
and parameterized textures feasible in hardware. This paper
shows that using FPGAs, procedural textures can be synthesized at
high speed, with a small hardware cost. Data on the performance
and the hardware cost of synthesizing procedural textures in FPGAs
are presented. This paper also presents the FPGA implementations
of two Perlin-noise-based 3-D procedural textures.
Moderator: Scott Hauck, Northwestern University
Chair: Carl Ebeling, University of Washington
-
6.1 HSRA: High-Speed, Hierarchical Synchronous Reconfigurable Array [p. 125]
- William Tsu, Kip Macy, Atul Joshi, Randy Huang, Norman Walker, Tony
Tung, Omid Rowhani, Varghese George, John Wawrzynek, André DeHon,
University of California, Berkeley
There is no inherent characteristic forcing Field Programmable Gate
Array (FPGA) or Reconfigurable Computing (RC) Array cycle times to be
greater than those of processors in the same process. Modern
FPGAs seldom achieve application clock rates close to their
processor cousins because (1) resources in the FPGAs are not balanced
appropriately for high-speed operation, (2) FPGA CAD does
not automatically provide the requisite transforms to support this
operation, and (3) interconnect delays can be large and vary almost
continuously, complicating high frequency mapping. We introduce
a novel reconfigurable computing array, the High-Speed, Hierarchical
Synchronous Reconfigurable Array (HSRA), and its supporting
tools. This package demonstrates that computing arrays can achieve
efficient, high-speed operation. We have designed and implemented
a prototype component in a 0.4um logic design on a DRAM process
which will support 250MHz operation for CAD mapped designs.
-
6.2 A Reconfigurable Arithmetic Array for Multimedia Applications [p. 135]
- Alan Marshall, Hewlett Packard; Jean Vuillemin, Ecole Normale
Supérieure; Tony Stansfield, Igor Kostarnov, Hewlett Packard; Brad Hutchings,
Brigham Young University
In this paper we describe a reconfigurable architecture optimised
for media processing, and based on 4-bit ALUs and interconnect.
Keywords:
Reconfigurable computing, multimedia, 4-bit ALU, FPGA.
-
6.3 Memory Interfacing and Instruction Specification for Reconfigurable
Processors [p. 145]
- Jeffrey A. Jacob, Paul Chow, University of Toronto
As custom computing machines evolve, it is clear that a major bottleneck
is the slow interconnection architecture between the logic and memory.
This paper describes the architecture of a custom computing machine that
overcomes the interconnection bottleneck by closely integrating a
fixed-logic processor, a reconfigurable logic array, and memory into a
single chip, called OneChip-98.
The OneChip-98 system has a seamless programming model that enables the
programmer to easily specify instructions without additional complex
instruction decoding hardware. As well, there is a simple scheme for
mapping instructions to the corresponding programming bits. To allow the
processor and the reconfigurable array to execute concurrently, the
programming model utilizes a novel memory-consistency scheme implemented
in the hardware.
To evaluate the feasibility of the OneChip-98 architecture, a 32-bit
MIPS-like processor and several performance enhancement applications
were mapped to the Transmogrifier-2 field programmable system. For two
typical applications, the 2-dimensional discrete cosine transform and
the 64-tap FIR filter, we were capable of achieving a performance
speedup of over 30 times that of a stand-alone state-of-the-art
processor.
Keywords: Reconfigurable computer, FPGA, reconfigurable processor,
memory interfacing, memory coherence
Chair: Martine Schlag, University of California, Santa Cruz
-
7.1 Trading Quality for Compile Time: Ultra-Fast Placement for FPGAs [p. 157]
- Yaska Sankar, Jonathan Rose, University of Toronto
The demand for high-speed FPGA compilation tools has arisen
for three reasons: first, as FPGA device capacity has grown, the
computation time devoted to placement and routing has grown
more dramatically than the compute power of the available computers.
Second, there exists a subset of users who are willing to
accept a reduction in the quality of result in exchange for a high-speed
compilation. Third, high-speed compile has been a long-standing
desire of users of FPGA-based custom computing
machines, since their compile time requirements are ideally closer
to those of regular computers.
This paper focuses on the placement phase of the compile process,
and presents an ultra-fast placement algorithm targeted to FPGAs.
The algorithm is based on a combination of multiple-level, bottom-up
clustering and hierarchical simulated annealing. It provides
superior area results over a known high-quality placement tool on a
set of large benchmark circuits, when both are restricted to a short
run time. For example, it can generate a placement for a 100,000-gate
circuit in 10 seconds on a 300 MHz Sun UltraSPARC workstation
that is only 33% worse than a high-quality placement that
takes 524 seconds using a pure simulated annealing implementation.
In addition, operating in its fastest mode, this tool can provide
an accurate estimate of the wirelength achievable with good quality
placement. This can be used, in conjunction with a routing predictor,
to very quickly determine the routability of a given circuit on a
given FPGA device.
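The multiple-level, bottom-up clustering step that makes such ultra-fast
placement possible can be sketched as a greedy merge loop. This Python
toy (its netlist encoding and gain function are assumptions of the
sketch, not the published algorithm) repeatedly merges the two clusters
that share the most nets, subject to a capacity limit:

```python
from itertools import combinations

def cluster_bottom_up(nets, blocks, max_size):
    """Greedy bottom-up clustering for fast placement.

    nets: list of block-name lists; blocks: all block names. Repeatedly
    merges the pair of clusters sharing the most nets while respecting
    the cluster capacity max_size. Returns a list of block sets.
    """
    clusters = [{b} for b in blocks]
    while True:
        best, best_gain = None, 0
        for i, j in combinations(range(len(clusters)), 2):
            if len(clusters[i]) + len(clusters[j]) > max_size:
                continue
            # Gain = number of nets touching both candidate clusters.
            gain = sum(1 for net in nets
                       if set(net) & clusters[i] and set(net) & clusters[j])
            if gain > best_gain:
                best, best_gain = (i, j), gain
        if best is None:
            return clusters          # no profitable merge remains
        i, j = best
        clusters[i] |= clusters[j]
        del clusters[j]
```

Placing clusters instead of individual blocks shrinks the annealing
problem by roughly the cluster size, which is where the large compile
time savings come from.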
-
7.2 Satisfiability-Based Layout Revisited: Detailed Routing of Complex
FPGAs Via Search-Based Boolean SAT [p. 167]
- Gi-Joon Nam, Karem A. Sakallah, University of Michigan; Rob A. Rutenbar,
Carnegie Mellon University
Boolean-based routing transforms the geometric FPGA routing task into a
single, large Boolean equation with the property that any assignment of
input variables that "satisfies" the equation (that renders the equation
identically "1") specifies a valid routing. The formulation has the
virtue that it considers all nets simultaneously, and the absence of a
satisfying assignment implies that the layout is unroutable. Initial
Boolean-based approaches to routing used Binary Decision Diagrams (BDDs)
to represent and solve the layout problem. BDDs, however, limit the size
and complexity of the FPGAs that can be routed, leading these approaches
to concentrate only on individual FPGA channels. In this paper, we
present a new search-based Satisfiability (SAT) formulation that can
handle entire FPGAs, routing all nets concurrently. The approach relies
on a recently developed SAT engine (GRASP) that uses systematic search
with conflict-directed non-chronological backtracking, capable of
handling very large SAT instances. We present the first comparisons of
search-based SAT routing results to other routers, and offer the first
evidence that SAT methods can actually demonstrate the unroutability of
a layout. Preliminary experimental results suggest that this approach to
FPGA routing is more viable than earlier BDD-based methods.
Keywords: Boolean satisfiability, FPGA routing,
conflict-directed search
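In miniature, the satisfiability formulation amounts to asking whether
every net can be assigned a routing resource compatible with its
neighbors. The toy below (a brute-force track assigner invented for this
sketch; a real router emits a CNF formula for a SAT engine such as GRASP
rather than enumerating assignments) shows the key property the abstract
highlights: when no satisfying assignment exists, unroutability is proved.

```python
from itertools import product

def route_by_sat(nets, tracks, conflicts):
    """Tiny SAT-style routing illustration.

    Assign each net one track; conflicts lists (net_a, net_b) pairs
    that may not share a track. Exhaustively searches assignments and
    returns a net -> track dict, or None when the instance is
    unroutable (no satisfying assignment exists).
    """
    names = list(nets)
    for assign in product(tracks, repeat=len(names)):
        sol = dict(zip(names, assign))
        if all(sol[a] != sol[b] for a, b in conflicts):
            return sol
    return None
```

Three mutually conflicting nets on two tracks come back None, a
certificate of unroutability; adding a third track makes the same
instance satisfiable.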
-
7.3 Multi-Terminal Net Routing for Partial Crossbar-Based Multi-FPGA Systems
[p. 176]
- Abdel Ejnioui, University of South Florida; N. Ranganathan,
University of Texas
Multi-FPGA systems are used as custom computing machines to
solve compute intensive problems and also in the verification
and prototyping of large circuits. In this paper, we address the
problem of routing multi-terminal nets in a multi-FPGA system
that uses partial crossbars as interconnect structures. First, we
model the multi-terminal routing problem as a partitioned bin
packing problem and formulate it as an integer linear
programming problem where the number of variables is
exponential. A fast heuristic is applied to compute an upper
bound on the routing solution. Then, a column generation
technique is used to solve the linear relaxation of the initial
master problem in order to obtain a lower bound on the routing
solution. This is followed by an iterative branch-and-price
procedure that attempts to find a routing solution somewhere
between the two established bounds. In this regard, the proposed
algorithm guarantees an exact routing solution by searching a
branch-and-price tree. Due to the tightness of the bounds, the
branch-and-price tree is small resulting in shorter execution
times. Experimental results are provided for different netlists
and board configurations in order to demonstrate the algorithm's
performance. The obtained results show that the algorithm finds
an exact routing solution in a very short time.
Keywords:
FPGA routing, FPGA architecture, layout synthesis,
interconnect optimization, branch-and-price, integer
programming.
Chair: David Lewis, University of Toronto
-
8.1 Circuit Partitioning for Dynamically Reconfigurable FPGAs [p. 187]
- Huiqun Liu, D.F. Wong, University of Texas
Dynamically reconfigurable FPGAs have the potential to
dramatically improve logic density by time-sharing a physical
FPGA device. This paper presents a network-flow based
partitioning algorithm for dynamically reconfigurable FPGAs
based on the architecture in [2]. Experiments show
that our approach outperforms the enhanced force-directed
scheduling method in [2] in terms of communication cost.
-
8.2 Fast Compilation for Pipelined Reconfigurable Fabrics [p. 195]
- Mihai Budiu, Seth Copen Goldstein, Carnegie Mellon University
In this paper we describe a compiler which quickly synthesizes
high quality pipelined datapaths for pipelined reconfigurable
devices. The compiler uses the same internal representation to
perform synthesis, module generation, optimization, and place
and route. The core of the compiler is a linear time place
and route algorithm more than two orders of magnitude faster
than traditional CAD tools. The key behind our approach
is that we never backtrack, rip-up, or re-route. Instead, the
graph representing the computation is preprocessed to guarantee
routability by inserting lazy noops. The preprocessing
step provides enough information to make a greedy strategy
feasible. The compilation speed is approximately 3000
bit-operations/second (on a PII/400MHz) for a wide range of
applications. The hardware utilization averages 60% on the
target device, PipeRench.
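The lazy-noop preprocessing can be illustrated on a small DAG: whenever
an edge would skip a pipeline stage, pass-through nodes are chained in
so that every fanin arrives from the immediately preceding stage and a
greedy placer never needs to backtrack. This Python sketch (the node
naming and graph encoding are this sketch's own, not the compiler's
internal representation) performs that padding:

```python
def insert_lazy_noops(fanins):
    """Pad a computation DAG so every edge spans exactly one stage.

    fanins maps node -> fanin list, in topological order. Edges that
    skip pipeline stages get a chain of pass-through 'noop' nodes, so
    a greedy, no-backtrack placer can route every edge locally.
    """
    depth, out, fresh = {}, {}, 0
    for node, ins in fanins.items():
        depth[node] = 1 + max((depth[i] for i in ins), default=-1)
        padded = []
        for i in ins:
            prev, gap = i, depth[node] - depth[i]
            for _ in range(gap - 1):           # insert gap-1 noops
                noop = f"noop{fresh}"; fresh += 1
                out[noop] = [prev]
                prev = noop
            padded.append(prev)
        out[node] = padded
    return out
```

After padding, every edge connects adjacent stages, which is what makes
the linear-time greedy place-and-route described in the abstract safe.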
-
8.3 Configuration Caching Vs Data Caching for Striped FPGAs [p. 206]
- Deepali Deshpande, Arun K. Somani, Akhilesh Tyagi, Iowa State University
A striped FPGA [1], or pipeline-reconfigurable FPGA, provides
hardware virtualization by supporting fast run-time
reconfiguration. In this paper we show that the performance
of a striped FPGA depends on the reconfiguration
pattern, i.e. the run-time scheduling of configurations through
the FPGA. We study two main configuration scheduling
approaches: Configuration Caching and Data Caching. We
present a quantitative analysis of these scheduling techniques
to compute their total execution cycles, taking into
account the overhead caused by the I/O with external
memory. Based on this analysis we can determine which
scheduling technique works better for a given application
and the given hardware parameters.
Chair: Ray Andraka, Andraka Consulting Group
-
9.1 String Matching on Multicontext FPGAs using Self-Reconfiguration [p. 217]
- Reetinder P.S. Sidhu, Alessandro Mei, Viktor K. Prasanna, University
of Southern California
FPGAs can perform better than ASICs if the logic mapped onto
them is optimized for each problem instance. Unfortunately, this
advantage is often canceled by the long time needed by CAD tools
to generate problem instance dependent logic and the time required
to configure the FPGAs.
In this paper, a novel approach for runtime mapping is proposed
that utilizes self-reconfigurability of multicontext FPGAs to achieve
very high speedups over existing approaches. The key idea is to
design and map logic onto a multicontext FPGA that in turn maps
problem instance dependent logic onto other contexts of the same
FPGA. As a result, CAD tools need to be used just once for each
problem and not once for every problem instance as is usually done.
To demonstrate the feasibility of our approach, a detailed implementation
of the KMP string matching algorithm is presented which
involves runtime construction of a finite state machine. We implement
the KMP algorithm on a conventional FPGA (Xilinx XC
6216) and use it to obtain accurate estimates of performance on
a multicontext device. Speedups in mapping time of ~10^6 over CAD
tools and more than 1800 over a program written specifically
for FSM generation were obtained. Significant speedups were obtained
in overall execution time as well, including a speedup ranging
from 3 to 16 times over a software implementation of the KMP
algorithm running on a Sun Ultra 1 Model 140 workstation.
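The finite state machine the authors construct at runtime is the classic
KMP automaton, which is summarized by the pattern's failure function. A
plain-software sketch of both pieces follows (illustrative only: the
paper's contribution is generating the equivalent logic on the FPGA
itself, via self-reconfiguration, rather than running it in software):

```python
def kmp_fsm(pattern):
    """Build the KMP failure function for pattern: the table the
    runtime-generated finite state machine encodes in logic."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]          # fall back to the longest border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    """Stream text through the FSM; yield each match's start index."""
    fail = kmp_fsm(pattern)
    k = 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            yield i - k + 1          # full match ending at position i
            k = fail[k - 1]
```

Because each input character advances the automaton in constant time,
the hardware version consumes the text at one character per clock, which
is what the reported overall speedups rest on.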
-
9.2 Reduction of Latency and Resource Usage in Bit-Level Pipelined
Data Paths for FPGAs [p. 227]
- P. Kollig, B.M. Al-Hashimi, Staffordshire University
Pipelining of data path structures increases the throughput rate at
the expense of enlarged resource usage and latency unless
architectures optimized towards specific applications are used.
This paper describes a novel methodology for the design of
generic bit-level pipelined data paths that have the low resource
usage and latency of specifically tailored architectures but still
allow the flexible trade-off between speed and resource
requirements inherent in generic circuits. This is achieved through
the elimination of all skew and alignment flip-flops from the data
path whilst still maintaining the original pipelining scheme, hence
allowing more compact structures with decreased circuit delays.
The resulting low latency is beneficial in the realization of all
recursive signal-processing applications and the reduced resource
usage enables particularly the efficient FPGA realization of high
performance signal processing functions. The design process is
illustrated through the high level synthesis-based FPGA
realization of a 9th-order wave digital filter, demonstrating that
high performance and efficient resource usage are possible. For
example, the implementation of a wave digital filter with 10-bit
signal word length and 6-bit coefficients using a Xilinx
XC4013XL-1 device supports sample rates of 2.5 MHz.
Keywords:
Bit-level pipelined, circuit latency, FPGA, recursive algorithms
-
9.3 Exploiting FPGA-Features during the Emulation of a Fast Reactive
Embedded System [p. 235]
- Karlheinz Weiß, Thorsten Steckstor, Gernot Koch, Wolfgang Rosenstiel,
University of Tübingen
This paper presents the emulation of an embedded system with hard real
time constraints and response times of about 220ms. We show that for
such fast reactive systems, the software overhead of a Real Time
Operating System (RTOS) becomes a limiting factor, consuming up to 77%
of the total execution performance. We analyze features of different
FPGA architectures in order to resolve this system performance
bottleneck. We show that moving functionality from software to hardware
by exploiting the fine-grained on-chip SRAM capability of the Xilinx
XC4000 architecture eliminates the RTOS overhead at the cost of only a
slight increase, about 28%, in the used FPGA CLB resources. These
investigations have been conducted using our own emulation environment,
SPYDER-CORE-P1.