Abstract

The routing architecture of an FPGA consists of the length of the wires, the type of switch used to connect wires (buffered, unbuffered, fast or slow) and the topology of the interconnection of the switches and wires. FPGA routing architecture has a major influence on the logic density and speed of FPGA devices. Previous work [1], based on a 0.35um CMOS process, suggested an architecture consisting of length 4 wires (where the length of a wire is measured in terms of the number of logic blocks it passes before being switched) in which half of the programmable switches are active buffers and half are pass transistors. In that work, however, the topology of the routing architecture prevented buffered tracks from connecting to pass-transistor tracks. This restriction prevents the creation of interconnection trees for high fanout nets that have a mixture of buffers and pass transistors. Electrical simulations suggest that connections closer to the leaves of interconnection trees are faster using pass transistors, but that it is essential to buffer closer to the source. This latter effect is well known in regular ASIC routing [2].

In this work we propose a new routing architecture that allows liberal switching between buffered and pass transistor tracks. We explore various versions of the architecture to determine the density-speed trade-off. We show that one version of the new architecture results in FPGAs with a 10% faster critical path delay while using nearly the same area as the previous architecture that does not allow such switching. We also show that the new architecture allows a useful area-speed trade-off, and that several versions of the new architecture result in FPGAs with an 8% better area-delay product than the previous architecture that does not allow the switching.

1 Introduction

The routing of an FPGA consumes most of the chip area and is the dominating factor of the overall circuit delay [3]. The routing architecture of an FPGA consists of:

1. The length of each routing wire in the FPGA, measured in terms of the number of logic blocks that it passes before being switched.
2. The type and quantity of switches attached to each routing wire: pass transistors, multiplexors, or buffers.
3. The sizes of transistors that are used to build pass transistor and buffer switches.
4. The topology of the interconnection of the switches and routing wires in the switch blocks and connection blocks [3].
5. The routing wire width and spacing.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

FPGA 2001, February 11-13, 2001, Monterey, California, USA.
Copyright 2001 ACM 1-58113-341-3/01/0002...$5.00.

In this work we profile the delay properties of a previously developed routing architecture [1] in order to determine how it might be made faster and more area-efficient. The delay profile shows that some critical nets implemented in the previous architecture formed a large portion of the critical path delay because too many pass transistors were used in series. This occurred because the previous architecture had no way to easily mix buffers and pass transistors in a single source-sink connection. This drove us to design an architecture which allows buffers and pass transistors to be mixed within a single net. We will show that this new architecture delivers superior performance and density.

Several commercial architectures allow mixing of buffers and pass transistors, including the Xilinx 4000X architecture [11]. The XC4000X switch block for quad lines has one buffer and six pass transistors available, allowing the router to choose between buffering or pass-transistor connections.

This paper is organized as follows: Section 2 outlines the experimental CAD flow which is used to profile the previous architecture and produce comparisons between different routing architectures. Section 3 describes the previous routing architecture [1] and its delay profile. Section 4 describes the new architecture that allows more liberal buffer-pass transistor mixing. Section 5 presents the comparison between several versions of the new architecture and the previous architectures, and Section 6 concludes.

2 Experimental Methodology and Basic Architecture

Figure 2 illustrates the empirical methodology and CAD flow used to evaluate routing architectures. It is taken from [1][4]. Each input benchmark circuit goes through technology-independent logic optimization using SIS [6] and is technology mapped to 4-input lookup tables (LUTs) using Flowmap and Flowpack [7]. Then T-VPACK [8] is used to group 4-input LUTs and registers into “clustered” logic blocks. Following [1], we use clusters that contain four 4-LUTs, each with a flip-flop, and assume a symmetric, island-style architecture. The total number of inputs to the cluster is set at 10. VPR [4] is used to do timing-driven placement and timing-driven routing of the circuit. A brief discussion of VPR’s timing-driven routing algorithm is presented in Section 4.1. One key output from this flow is the critical path delay of the circuit, which is determined by the timing analyzer within VPR. The routing delay is determined by calculating the Elmore delay [12] of the RC-tree network(s) of each net. Resistance arises from the wires, the pass transistors and the tri-state buffer outputs. Capacitance arises from the metal wires (both per unit and fringing capacitance), and from the parasitic capacitance of the buffers and pass transistors. Under the Elmore delay model, the signal delay from the source node $s_0$ to destination node $u$ is given by the following two equations, the first for a path through pass transistors and the second for a tri-state buffer $b$ driving node $v$:

$$d_{elmore}(s_0, u) = \sum_{e, v \in Path(s_0, u)} r_e \left( \frac{C_e}{2} + C_v \right)$$

$$d_{buffer}(v, b) = d_b + r_b C_v$$

- $C_e$: Capacitance of the pass transistor switch and the metal wire
- $r_e$: Resistance of the pass transistor switch
- $C_v$: Total downstream capacitance not isolated by buffers
- $d_b$: Intrinsic delay of the tri-state buffer switch
- $r_b$: Output resistance of the tri-state buffer switch
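
As an illustration only, the two delay terms can be evaluated on a single source-to-sink chain by walking from the sink back to the source while accumulating the downstream capacitance that is not isolated by a buffer. Each buffer contributes its intrinsic delay $d_b$ plus $r_b C_v$. The Python sketch below makes that simplification (a chain rather than a full RC tree). The class names `RCStage` and `BufferStage`, the buffer input load `c_in`, and the unit-free numbers are assumptions made for the example; none of them come from VPR.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class RCStage:
    """A pass-transistor switch plus the wire it drives (r_e, C_e)."""
    r: float
    c: float


@dataclass
class BufferStage:
    """A tri-state buffer switch (d_b, r_b); c_in is a hypothetical input load."""
    d_b: float
    r_b: float
    c_in: float


Stage = Union[RCStage, BufferStage]


def path_delay(stages: Sequence[Stage], c_sink: float) -> float:
    """Elmore delay of a chain whose stages are listed from source to sink."""
    delay = 0.0
    c_down = c_sink  # C_v: downstream capacitance not isolated by a buffer
    for stage in reversed(stages):
        if isinstance(stage, RCStage):
            delay += stage.r * (stage.c / 2 + c_down)
            c_down += stage.c
        else:
            delay += stage.d_b + stage.r_b * c_down
            c_down = stage.c_in  # the buffer hides everything behind it
    return delay


# Illustrative, unit-free values (not taken from the paper).
two_pass = [RCStage(2.0, 4.0), RCStage(2.0, 4.0)]
with_buffer = [RCStage(2.0, 4.0), BufferStage(1.0, 1.0, 0.5), RCStage(2.0, 4.0)]
print(path_delay(two_pass, 1.0), path_delay(with_buffer, 1.0))  # 20.0 17.0
```

In this toy chain the buffer pays its own intrinsic delay but shields the first stage from the capacitance of the second, which is the effect the model is meant to capture.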

The delay of the logic elements (LUTs, flip-flops, intra-cluster routing multiplexers) is determined by SPICE-level design and simulation in a 0.18um CMOS process. Internal buffers and drivers are independently sized. Table 1 gives the delays of the different elements of the cluster illustrated in Figure 1, which itself is taken from [10].

The second key output from this flow is the total area required by each circuit in each architecture being evaluated. To do this, we first determine the minimum number of tracks per channel needed to successfully route each circuit, $W_{\text{min}}$, by repeatedly invoking the router with different track counts until the minimum is found. Clearly, varying the track count in this way isn't possible in real FPGAs, but we believe the result is meaningful as part of a logic density metric for an architecture. We call this a “high stress” routing since, at this track count, the circuit is barely routable. To measure the complete active area of the implementation of each benchmark circuit in each architecture, we employ the method described by Betz [1][4]. Each circuit element (e.g. LUT, multiplexor, buffer, inverter, pass transistor, configuration memory bit) is designed at SPICE level and sized at the transistor level for a reasonable area-delay tradeoff [4]. We measure the area of each circuit element in terms of the number of equivalent minimum-width transistor areas in the 0.18um technology; larger transistors are counted as an appropriate number of minimum-width transistors. Once the total number of clusters and the number of tracks per channel are known, the total number of minimum-width transistor areas can be calculated. While this metric does not measure metal area, our communication with FPGA vendors indicates that most layouts are active-area limited [1].

It is important to note that, since most designers will pick an FPGA device that has more than the minimum routing resources available, we re-route the circuit with the number of tracks per channel set to $1.2W_{\text{min}}$. The critical path delay and the total FPGA area required are based on this so-called “low stress” routing.
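
The search for $W_{\text{min}}$ and the derived low-stress track count can be sketched as follows. This is a minimal illustration, not the actual VPR procedure. The callback `routes(w)` is a hypothetical stand-in for one router invocation, the search assumes that a circuit routable with `w` tracks is also routable with more than `w`, and rounding $1.2W_{\text{min}}$ up to an integer is our assumption.

```python
from typing import Callable


def find_w_min(routes: Callable[[int], bool], w_start: int = 8) -> int:
    """Smallest track count for which routes(w) is True (assumed monotonic)."""
    lo, hi = 0, w_start  # lo is known to fail, hi is the current candidate
    while not routes(hi):  # grow until the circuit routes
        lo, hi = hi, hi * 2
    while hi - lo > 1:  # binary search between a failure and a success
        mid = (lo + hi) // 2
        if routes(mid):
            hi = mid
        else:
            lo = mid
    return hi


def low_stress_tracks(w_min: int) -> int:
    """ceil(1.2 * w_min) in integer arithmetic; the rounding is an assumption."""
    return (6 * w_min + 4) // 5


# Hypothetical circuit that needs at least 37 tracks per channel.
w_min = find_w_min(lambda w: w >= 37)
print(w_min, low_stress_tracks(w_min))  # 37 45
```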

3 Delay Profile of an Existing FPGA Architecture

Most of the critical path delay in FPGAs is due to routing in between logic blocks, or clusters. The first goal of this work is to identify those parts of the architecture that incur the most delay in a circuit after placement and routing. From that, we will try to improve the overall circuit speed (without sacrificing too much area) by proposing a modified architecture. We will profile the FPGA architecture proposed in [1][4], which is illustrated in Figure 3 and has the following attributes:

1. The architecture is implemented in 0.18um CMOS
2. The logic block cluster contains four 4-input lookup tables (LUTs) and flip-flops, and a total of 10 inputs and 4 outputs.
3. All routing wires span four logic blocks and are of minimum width.
4. Routing wire spacing is set to be double the minimum metal spacing allowed by the IC process.
5. The size of the pass transistor switch is set to be ten times the minimum-size transistor in the IC process.
6. The size of the routing buffer is set to be five times the minimum-size buffer.
7. 50% of the length 4 wires are switched by pass transistors and 50% are switched by buffers.
8. The switch block employs a purely “planar” (also called domain-based) topology, which means that once a path is connected to a pass-transistor-driven track, it can only connect to other tracks through pass transistors, and similarly for buffered tracks.
9. The flexibility of the switch block is $F_s = 3$ [1].
10. The flexibility of the connection block is $F_c(\text{input}) = 0.6W$ for inputs and $F_c(\text{output}) = 0.25W$ for outputs, where $W$ is the number of tracks [1].

For simplicity, only those routing switches that are connected to the left four horizontal routing tracks are shown in Figure 3. We name this routing architecture the 50-50_NO_MIX architecture, primarily to point out the percentages of pass transistors and buffers, and the fact that the two kinds of tracks cannot be inter-routed.

Figure 4 illustrates a simple routing of two nets, net A and net B in this architecture (for simplicity in the Figure we use unit-length segments rather than length 4 segments). Net A is routed by tri-state buffer switches and net B is routed by pass transistor switches.

We profile the delay of a given circuit when implemented in this architecture by measuring the portion of the total critical path delay that is attributable to the total logic block delay (delay within a cluster, including the muxing within the cluster) and the total routing delay. We further break the total routing delay into three components:

1. Source buffer delay — the delay of the buffer driving out of the logic block, and all downstream resistance and capacitance until the next buffer is encountered, either in the routing itself or at the terminating connection block. Each net has only one source buffer delay.
2. Routing buffer delay — the delay of all inside-the-routing buffers, and downstream resistance and capacitance of that buffer. Each net can have several routing buffer delays, which are summed to produce the total. If a net is routed only on pass transistor tracks then it will have zero routing buffer delay.
3. Input connection multiplexor delay — this is the delay of the multiplexors that take a net from the routing tracks into the inputs of the cluster. (A to B in Figure 1)

In Figure 4, net A has one source buffer delay, two routing buffer delays, and the input connection mux delay, while net B only has source buffer delay and the input connection mux delay since routing buffer switches are not used in the routing.

Table 1: Delays of Basic FPGA Circuit Elements in 0.18um CMOS

<table>
<thead>
<tr>
<th>Circuit Element</th>
<th>Input Connection MUX (A to B)</th>
<th>Intra-cluster Routing MUX (B to C or D to C)</th>
<th>4-input LUT (C to E)</th>
<th>Flip-Flop Setup Time</th>
<th>Flip-Flop Clock to Out (E to F)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Delay (ps)</td>
<td>377</td>
<td>301</td>
<td>401</td>
<td>295</td>
<td>242</td>
</tr>
</tbody>
</table>

3.1 Profile of 50-50_NO_MIX Switch Block

The delay profile for the 20 largest MCNC circuits [9] implemented in the 50-50_NO_MIX routing architecture was calculated as described above. Table 2 gives the profile. The first and second columns give the circuit name and its size in terms of the number of 4-input BLEs, respectively. The third column gives the total critical path delay, while the fourth column gives the percentage of the total delay due to the logic cluster. The fifth column gives the percentage of delay due to the extra-cluster routing delay, and the sixth column gives the percentage of delay due to the source buffer alone. The seventh column gives the percentage of delay due to the routing buffers, and the eighth column gives the percentage of delay due to the input connection multiplexor. The second-to-last row gives the geometric average of each column and the last row gives the arithmetic average.
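
As a quick check on how the two summary rows are formed, the sketch below recomputes the geometric and arithmetic averages of the critical path delay column. The delay values are copied from Table 2; the function names are ours.

```python
import math

# Total critical path delay (ns) of the 20 circuits, copied from Table 2.
delays_ns = [
    13.4, 15.1, 18.4, 7.8, 29.9, 11.6, 16.3, 5.9, 19.7, 30.7,
    12.6, 24.4, 14.2, 36.1, 30.5, 15.9, 12.6, 12.2, 22.8, 14.9,
]


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def geometric_mean(xs):
    # exp of the mean of the logs; all delays are strictly positive
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


print(f"arithmetic: {arithmetic_mean(delays_ns):.2f} ns")
print(f"geometric:  {geometric_mean(delays_ns):.2f} ns")
```

The geometric average is less sensitive to the few very slow circuits (such as pdc and ex1010), which is one reason it comes out lower than the arithmetic average.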

Notice that the “Routing Buffer Delay” column is 0% in four cases. This occurs because the entire critical path is routed on tracks that use only pass transistors, and therefore exhibits no routing buffer delay.

The key item to observe from the profile is that the source buffer delay accounts for more than 50% of the routing delay on average. When this delay is large, most of it comes from the source buffer driving pass-transistor-only routing trees, because in the 50-50_NO_MIX architecture, once the source buffer starts driving a pass transistor track, it isn’t possible to switch to a buffered track, or vice-versa. This can and does result in a large pass-transistor-only RC network, which exhibits quadratic delay in the distance traversed. The reader may question why this happens with a timing-driven router that should prevent the use of slow resources for critical nets. The reason critical nets are assigned to slower resources is that even more critical nets are given the fast resources, and these are all used up.
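
The quadratic growth follows directly from the Elmore model of Section 2: in a chain of $n$ identical pass-transistor stages, each stage must charge everything downstream of it, giving a total of $r c n^2 / 2$. A buffered chain grows only linearly with $n$. The sketch below illustrates this with made-up, unit-free values; it assumes identical stages and ignores the buffer input capacitance.

```python
def pass_chain_delay(n: int, r: float, c: float) -> float:
    """Elmore delay of n identical pass-transistor stages in series."""
    delay, c_down = 0.0, 0.0
    for _ in range(n):  # walk from the sink back to the source
        delay += r * (c / 2 + c_down)
        c_down += c
    return delay


def buffered_chain_delay(n: int, d_b: float, r_b: float, c: float) -> float:
    """n buffered stages; each buffer drives only its own segment."""
    return n * (d_b + r_b * c)


# Illustrative values only (not from the paper): r = c = d_b = r_b = 1.
for n in (1, 2, 4, 8):
    print(n, pass_chain_delay(n, 1.0, 1.0), buffered_chain_delay(n, 1.0, 1.0, 1.0))
```

With these toy values the pass-transistor chain is faster for one or two stages, ties at four, and is twice as slow at eight, which mirrors the qualitative argument that pass transistors suit short connections and buffers suit long ones.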

One solution is to provide more buffered tracks than the 50% used in the 50-50_NO_MIX architecture. Appendix A shows results for different mixes of buffers and pass transistors in non-mixed architectures. There are two reasons this solution is not good: 1. Buffers are far more expensive in silicon area than pass transistors, and so this would cost a great deal. 2. Pass transistors are faster for shorter connections; removing them means that critical nets that travel a short distance will be less likely to achieve good speed.

An alternative is to architect a routing fabric that allows liberal switching between tracks that are switched by pass transistors and tracks that are switched by buffers. This would permit the use of buffered connections near the source of a large fanout tree, and pass transistors near the destination, which is the best of both worlds. We propose such an architecture in the next section.

4 A New Architecture That Allows Pass Transistor and Buffer Mixing

We present a new routing architecture, called the Mixed Buffer-Pass routing architecture, which allows routes to switch between buffers and pass transistors within a single source-sink connection. Figure 5 illustrates the Mixed Buffer-Pass routing architecture. Note that, for simplicity, Figure 5 shows only the programmable connectivity of the wires entering the switch block from the left hand side. This architecture divides the routing tracks into the following three classes of track:

1. **Straight-Planar**: These tracks are switched using pass transistors in the planar switch topology, and have no programmable connections to the other two classes described below.
2. **Mixed-Buffer**: These tracks programmably connect to each other using tri-state buffer switches in a planar ($F_s = 3$) topology, and connect to the Mixed-Pass tracks (described below) using pass transistors.
3. **Mixed-Pass**: These tracks programmably connect to each other using pass transistor switches in a planar ($F_s = 3$) topology, and also connect to the Mixed-Buffer tracks (described above) using pass transistors.

The Mixed-Buffer and Mixed-Pass tracks are present in equal numbers in each channel so as to allow the creation of a simple pattern of interconnection between them. The connections between the Mixed-Buffer and Mixed-Pass tracks are two additional pass transistors per track that occur on every track at the point at which it switches. These are illustrated by the **bold** transistors in Figure 5. While regular tracks typically switch in three directions upon termination at a switch block, these two additional switches allow each track to turn only in the left and right directions. There is no additional “straight” connection using a pass transistor, in order to reduce the area cost and capacitive loading of the mixed buffer-pass tracks.

The Mixed-Pass tracks are more expensive, and are slightly more loaded (by the extra switches), than the straight-planar tracks. This is the reason that we have included the straight-planar tracks: they are cheaper and somewhat faster than the Mixed-Pass tracks. Experiments presented below explore the appropriate proportion of each of the three classes of tracks.
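
The connectivity rules among the three track classes can be summarized in a few lines of code. The sketch below is only a restatement of the rules given in this section; the names `TrackClass`, `switch_between` and `switches_per_track_end` are ours, and the sketch does not model which physical tracks are paired in Figure 5.

```python
from enum import Enum
from typing import Optional


class TrackClass(Enum):
    STRAIGHT_PLANAR = "straight-planar"
    MIXED_BUFFER = "mixed-buffer"
    MIXED_PASS = "mixed-pass"


def switch_between(src: TrackClass, dst: TrackClass) -> Optional[str]:
    """Switch type joining two track classes, or None if they cannot connect."""
    if src is dst:
        # Within a class, only Mixed-Buffer tracks use tri-state buffers.
        return "buffer" if src is TrackClass.MIXED_BUFFER else "pass"
    if TrackClass.STRAIGHT_PLANAR in (src, dst):
        return None  # straight-planar tracks are isolated from the mixed ones
    return "pass"  # Mixed-Buffer <-> Mixed-Pass uses the extra pass transistors


def switches_per_track_end(cls: TrackClass) -> int:
    """Planar F_s = 3 switches, plus two extra turns on the mixed tracks."""
    return 3 + (0 if cls is TrackClass.STRAIGHT_PLANAR else 2)


print(switch_between(TrackClass.MIXED_BUFFER, TrackClass.MIXED_PASS))  # pass
print(switches_per_track_end(TrackClass.MIXED_PASS))  # 5
```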

4.1 Timing-Driven Routing Algorithm

In this section we discuss a relevant feature of the timing-driven routing algorithm that is used to exploit the mixed buffer and pass-transistor tracks. We use the timing-driven router in VPR [4], which is based on the Pathfinder negotiated congestion router [13]. We will not describe the VPR router in detail but instead refer the reader to [4]. Briefly, it employs a directed maze-type expansion which uses a node costing function that accounts for congestion and delay. There are two portions of the delay calculation: 1. The determination of the delay from the source to the current wavefront expansion point. This delay can be calculated exactly because the resistance and capacitance to this point are exactly known. 2. The estimation of the delay from the current wavefront expansion point to the target sink. Since the routing isn’t complete, this RC network is unknown. The VPR router assumes that the subsequent route will use routing resources identical to the type employed at the wavefront point. For example, if the current wavefront point is a buffered segment of length 4 then the router assumes, for the purpose of calculating forward-looking delay, that the entire remainder of the route will consist of buffered segments of length 4. Similarly, if the segment is unbuffered, the forward-looking delay estimator assumes that all subsequent segments to the sink are connected with pass transistors. Clearly this approach is more accurate for non-mixed architectures. However, since in mixed architectures the forward-looking route is truly unknown, there is no better guess to make. Also, once any future point is reached, the exact calculation (described in 1 above) is correct. Depending on how directed the router is, the search will be either more breadth-first or more depth-first. The breadth-first expansion will be more accurate, as it always has the most correct delay calculation. Our empirical experience has shown that the VPR router does sufficient breadth-first searching to achieve a good quality answer. It typically uses buffers near the source of high-fanout nets and pass transistors close to the sink in the mixed architectures.

<table>
<thead>
<tr>
<th rowspan="2">Circuit Name</th>
<th rowspan="2"># of 4-Input BLEs</th>
<th rowspan="2">Total Critical Path Delay (ns)</th>
<th colspan="2">Breakdown of Total Delay</th>
<th colspan="3">Breakdown of Routing Delay</th>
</tr>
<tr>
<th>Logic Block Delay (%)</th>
<th>Routing Delay (%)</th>
<th>Source Buffer Delay (%)</th>
<th>Routing Buffer Delay (%)</th>
<th>Input Connection MUX Delay (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>alu4</td>
<td>1522</td>
<td>13.4</td>
<td>38.8</td>
<td>61.2</td>
</tr>
<tr>
<td>apex2</td>
<td>1878</td>
<td>15.1</td>
<td>34.4</td>
<td>65.6</td>
</tr>
<tr>
<td>apex4</td>
<td>1262</td>
<td>18.4</td>
<td>24.4</td>
<td>75.6</td>
</tr>
<tr>
<td>bigkey</td>
<td>1707</td>
<td>7.8</td>
<td>28.6</td>
<td>71.4</td>
</tr>
<tr>
<td>clma</td>
<td>8383</td>
<td>29.9</td>
<td>26.3</td>
<td>73.7</td>
</tr>
<tr>
<td>des</td>
<td>1591</td>
<td>11.6</td>
<td>32.6</td>
<td>67.4</td>
</tr>
<tr>
<td>diffeq</td>
<td>1497</td>
<td>16.3</td>
<td>61.1</td>
<td>38.9</td>
</tr>
<tr>
<td>dsip</td>
<td>1370</td>
<td>5.9</td>
<td>38.1</td>
<td>61.9</td>
</tr>
<tr>
<td>elliptic</td>
<td>3604</td>
<td>19.7</td>
<td>54.3</td>
<td>45.7</td>
</tr>
<tr>
<td>ex1010</td>
<td>4598</td>
<td>30.7</td>
<td>16.9</td>
<td>83.1</td>
</tr>
<tr>
<td>ex5p</td>
<td>1064</td>
<td>12.6</td>
<td>35.6</td>
<td>64.4</td>
</tr>
<tr>
<td>frisc</td>
<td>3556</td>
<td>24.4</td>
<td>63.8</td>
<td>36.2</td>
</tr>
<tr>
<td>misex3</td>
<td>1397</td>
<td>14.2</td>
<td>31.7</td>
<td>68.3</td>
</tr>
<tr>
<td>pdc</td>
<td>4575</td>
<td>36.1</td>
<td>14.4</td>
<td>85.6</td>
</tr>
<tr>
<td>s298</td>
<td>1931</td>
<td>30.5</td>
<td>32.6</td>
<td>67.4</td>
</tr>
<tr>
<td>s38417</td>
<td>6406</td>
<td>15.9</td>
<td>49.3</td>
<td>50.7</td>
</tr>
<tr>
<td>s38584</td>
<td>6447</td>
<td>12.6</td>
<td>52.5</td>
<td>47.5</td>
</tr>
<tr>
<td>seq</td>
<td>1750</td>
<td>12.2</td>
<td>37.0</td>
<td>63.0</td>
</tr>
<tr>
<td>spla</td>
<td>3690</td>
<td>22.8</td>
<td>22.8</td>
<td>77.2</td>
</tr>
<tr>
<td>tseng</td>
<td>1047</td>
<td>14.9</td>
<td>62.1</td>
<td>37.9</td>
</tr>
<tr>
<td>Geometric Average</td>
<td>2390</td>
<td>16.54</td>
<td>35.0</td>
<td>62.0</td>
</tr>
<tr>
<td>Arithmetic Average</td>
<td>2964</td>
<td>18.25</td>
<td>37.9</td>
<td>62.1</td>
</tr>
</tbody>
</table>

Table 2: Critical Path Delay Distribution of the 50-50_NO_MIX Architecture
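
The forward-looking delay estimate described in Section 4.1 can be sketched as follows. This is a simplified illustration under our own assumptions and is not VPR's code: the function and parameter names are hypothetical, all remaining segments are taken to be identical, and the electrical values are unit-free placeholders. The estimator assumes that every remaining segment is of the same kind as the one at the wavefront. For a pass-transistor continuation it adds the cost of charging the new capacitance through the already-known upstream resistance `r_up`.

```python
def segments_to_sink(blocks_remaining: int, seg_len: int = 4) -> int:
    """Number of length-seg_len wires needed to cover the remaining distance."""
    return (blocks_remaining + seg_len - 1) // seg_len


def lookahead_delay(kind: str, blocks_remaining: int, r_up: float, *,
                    seg_len: int = 4, r_pass: float = 1.0, c_seg: float = 1.0,
                    d_b: float = 1.0, r_b: float = 1.0) -> float:
    """Estimated delay to the sink, assuming the route keeps using `kind`."""
    n = segments_to_sink(blocks_remaining, seg_len)
    if kind == "buffered":
        # Each buffer isolates its segment, so the estimate is linear in n.
        return n * (d_b + r_b * c_seg)
    if kind == "pass":
        # Upstream resistance charges all n segments; the chain itself
        # adds the quadratic Elmore term r * c * n^2 / 2.
        return r_up * n * c_seg + r_pass * c_seg * n * n / 2
    raise ValueError(f"unknown segment kind: {kind}")


print(lookahead_delay("buffered", 8, r_up=5.0))  # 4.0
print(lookahead_delay("pass", 8, r_up=2.0))      # 6.0
```

Because the estimate depends on the segment type at the wavefront, two expansion points at the same location can receive different forward-looking delays, which is why the exact recalculation at each newly reached point matters in the mixed architectures.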

5 Experimental Results

In this section we explore which proportion of each type of track described in Section 4 provides the best area, delay and area-delay product for the new architecture. We use the same set of benchmark circuits, the 20 largest MCNC circuits, that we used to profile the 50-50_NO_MIX routing architecture in Section 3. We first present four figures of merit (track count, critical path delay, total area and area-delay product) of the new Mixed Buffer-Pass architecture as a function of the percentage of straight-planar tracks. Then we compare the figures of merit of several versions of the new architecture to several versions of the non-mixed architecture.

5.1 Properties of the Mixed Buffer-Pass Routing Architecture

Figure 6 plots the geometric average, over 20 circuits, of the minimum number of tracks per channel required to successfully route each circuit as a function of the percentage of straight-planar tracks. For routing architectures with a low percentage of straight-planar tracks, the total track count is lower because of the increased flexibility between mixed-buffer and mixed-pass tracks provided by the two additional switches. When there is a high percentage of straight-planar tracks, there is a slight increase in track count. This is likely because the timing-driven router [4] is forced to route nets in a more star-like pattern in order to achieve reasonable timing, which uses up more tracks than, for example, a Steiner tree.

Figure 7 is a plot of the geometric average, over all 20 circuits, of the total area (as described in Section 2), which includes the logic area and routing area, versus the percentage of straight-planar tracks. The total area decreases as the percentage of straight-planar tracks increases because tri-state buffer switches consume twice as much area as pass transistor switches [1]. In addition, mixed-pass tracks are more expensive than straight-planar tracks in terms of area because of the extra switches attached. Note that the track count increase observed in Figure 6 is not sufficient to offset the significantly higher area of mixed-buffer and mixed-pass tracks. Observe also that the use of mixed-buffer and mixed-pass tracks causes a significant increase in total area: over 40% more area compared to 100% straight-planar tracks.

Figure 8 is a plot of the geometric average of the critical path delay as a function of the percentage of straight-planar tracks. As the percentage of mixed buffer-pass tracks decreases, the critical path delay increases. This is expected, since there are then not enough buffered routing resources to route high fanout nets.

Figure 9 is a plot of the geometric average of the total area-delay product as a function of the percentage of straight-planar tracks. Notice that the total area-delay product reaches its minimum when the percentage of straight-planar tracks is approximately 70%.

5.2 Comparison of Mixing and Non-Mixing Architectures

In this section we compare several versions of the Mixed Buffer-Pass routing architecture (with differing amounts of straight-planar tracks) to several versions of the non-mixing architecture (which have different amounts of pass transistor tracks). Table 3 summarizes the comparison. Appendix A provides the plots of track count, area, critical path delay and area-delay product of the non-mixed architecture as a function of the percentage of pass-transistor tracks in the same 0.18um CMOS process. (Note that [1] and [4] work in 0.35um.)

<table>
<thead>
<tr>
<th>Comparison and Architecture</th>
<th>Area (x10^6 minimum-width transistor areas)</th>
<th>Critical Path Delay (ns)</th>
<th>Area-Delay Product</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Comparison 1</strong> (Best Delay)</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Mixed Buffer-Pass Routing Architecture (20% Straight-Planar tracks, 80% Mixed Buffer-Pass tracks)</td>
<td>5.38</td>
<td>14.94</td>
<td>6.51</td>
</tr>
<tr>
<td>NO_MIX Routing Architecture (20% pass transistor tracks, 80% buffered tracks)</td>
<td>6.20</td>
<td>14.31</td>
<td>7.33</td>
</tr>
<tr>
<td><strong>Comparison 2</strong> (Best Area-Delay Product)</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Mixed Buffer-Pass Routing Architecture (70% Straight-Planar tracks, 30% Mixed Buffer-Pass tracks)</td>
<td>4.35</td>
<td>17.48</td>
<td>5.72</td>
</tr>
<tr>
<td>NO_MIX Routing Architecture (80% pass transistor tracks, 20% buffered tracks)</td>
<td>4.38</td>
<td>18.89</td>
<td>6.25</td>
</tr>
<tr>
<td><strong>Comparison 3</strong> (Best new vs. 50-50_NO_MIX)</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Mixed Buffer-Pass Routing Architecture (20% Straight-Planar tracks, 80% Mixed Buffer-Pass tracks)</td>
<td>5.38</td>
<td>14.94</td>
<td>6.51</td>
</tr>
<tr>
<td>NO_MIX Routing Architecture (50% pass transistor tracks, 50% buffered tracks)</td>
<td>5.25</td>
<td>16.54</td>
<td>6.90</td>
</tr>
</tbody>
</table>

Table 3: Comparisons Between Mixed and Non-mixed Architectures

Figure 7: Total FPGA (Routing + Logic Block) Area (low stress routing) for Mixed Buffer-Pass Transistor Architecture

Figure 8: Critical Path Delay for Mixed Buffer-Pass Transistor Architecture

The first comparison (“Comparison 1” in Table 3) is between the fastest mixed architecture and the fastest non-mixed architecture. The Mixed Buffer-Pass architecture with 20% Straight-Planar tracks achieves almost the same critical path delay (just 4% higher) as the non-mixed architecture with 80% buffered tracks, yet uses 13% less area. The new architecture results in an 11% gain in area-delay product.

The second comparison (“Comparison 2” in Table 3) is between the architectures that achieve the best area-delay product among the mixed and non-mixed architectures. The Mixed Buffer-Pass architecture with 70% Straight-Planar tracks consumes the same area as the non-mixed architecture with 20% buffered tracks, yet is 8% faster in critical path delay. The new architecture results in an 8% gain in area-delay product.

The third comparison (“Comparison 3” in Table 3) is between a mixed architecture with 20% Straight-Planar tracks and the 50-50_NO_MIX architecture selected in [1]. The new architecture consumes almost the same area (just 2.4% more) as the 50-50_NO_MIX architecture, yet is 10% faster in critical path delay on average. The new architecture results in a 6% gain in area-delay product.
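
The percentages quoted in these comparisons can be reproduced from the entries of Table 3. The short sketch below does the arithmetic, taking the non-mixed architecture as the reference in each case. The helper name `pct_change` is ours, and the rounded figures in the text may differ in the last digit depending on which architecture is used as the reference.

```python
def pct_change(new: float, ref: float) -> float:
    """Relative change of `new` with respect to `ref`, in percent."""
    return (new - ref) / ref * 100.0


# (area, delay in ns, area-delay product) copied from Table 3.
table3 = {
    "Comparison 1": {"mixed": (5.38, 14.94, 6.51), "no_mix": (6.20, 14.31, 7.33)},
    "Comparison 2": {"mixed": (4.35, 17.48, 5.72), "no_mix": (4.38, 18.89, 6.25)},
    "Comparison 3": {"mixed": (5.38, 14.94, 6.51), "no_mix": (5.25, 16.54, 6.90)},
}

for name, row in table3.items():
    area, delay, adp = (
        round(pct_change(m, r), 1) for m, r in zip(row["mixed"], row["no_mix"])
    )
    # Negative numbers mean the mixed architecture is smaller or faster.
    print(f"{name}: area {area:+}%, delay {delay:+}%, area-delay {adp:+}%")
```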

5.3 Delay Profile of the Mixed Buffer-Pass Architecture

Table 4 gives the delay profile of a version of the Mixed Buffer-Pass architecture with 0% straight-planar tracks. Compared to the 50-50_NO_MIX delay profile presented in Table 2, this architecture is, on average, 11.6% faster. Observe also that the percentage of delay attributed to the source buffer is significantly reduced. For individual circuits, the speed gain (between this mixed architecture and the 50-50_NO_MIX architecture) ranges from -6.1% for the circuit elliptic to +47.1% for the circuit pdc. Notice that if a circuit does not have many high fan-out nets, the benefits of the new architecture diminish. This is because the new architecture pays the price of adding two pass transistor switches per mixed-buffer track, which increases the capacitive loading of each mixed-buffer track. If the number of high fan-out nets is relatively small, straight-planar tracks are very effective for routing the low fan-out nets.
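
The per-circuit speed gains quoted above are consistent with measuring the reduction in critical path delay relative to the 50-50_NO_MIX result. The sketch below checks the two extremes using the delays listed in Tables 2 and 4; the function name `speed_gain` is ours.

```python
def speed_gain(old_ns: float, new_ns: float) -> float:
    """Reduction in critical path delay relative to the old value, in percent."""
    return (old_ns - new_ns) / old_ns * 100.0


# circuit: (delay in Table 2, delay in Table 4), both in ns
delays = {
    "elliptic": (19.7, 20.9),
    "pdc": (36.1, 19.1),
}

for circuit, (old_ns, new_ns) in delays.items():
    print(f"{circuit}: {speed_gain(old_ns, new_ns):+.1f}%")
# elliptic: -6.1%
# pdc: +47.1%
```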

<table>
<thead>
<tr>
<th>Circuit Name</th>
<th># of 4-Input BLEs</th>
<th>Total Critical Path Delay (ns)</th>
<th>Breakdown of Total Delay</th>
<th>Breakdown of Routing Delay</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td>Logic Block Delay (%)</td>
<td>Source Buffer Delay (%)</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Routing Delay (%)</td>
<td>Routing Buffer Delay (%)</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>Input Connection MUX Delay (%)</td>
</tr>
<tr>
<td>alu4</td>
<td>1522</td>
<td>11.8</td>
<td>44.2</td>
<td>19.5</td>
</tr>
<tr>
<td>apex2</td>
<td>1878</td>
<td>14.2</td>
<td>41.5</td>
<td>29.2</td>
</tr>
<tr>
<td>apex4</td>
<td>1262</td>
<td>11.9</td>
<td>32.0</td>
<td>40.4</td>
</tr>
<tr>
<td>bigkey</td>
<td>1707</td>
<td>6.8</td>
<td>32.8</td>
<td>21.4</td>
</tr>
<tr>
<td>clma</td>
<td>8383</td>
<td>25.9</td>
<td>41.2</td>
<td>19.9</td>
</tr>
<tr>
<td>des</td>
<td>1591</td>
<td>11.6</td>
<td>38.9</td>
<td>34.2</td>
</tr>
<tr>
<td>diffeq</td>
<td>1497</td>
<td>16.4</td>
<td>60.7</td>
<td>35.9</td>
</tr>
<tr>
<td>dsip</td>
<td>1370</td>
<td>6.0</td>
<td>37.2</td>
<td>21.5</td>
</tr>
<tr>
<td>elliptic</td>
<td>3604</td>
<td>20.9</td>
<td>24.1</td>
<td>17.4</td>
</tr>
<tr>
<td>ex1010</td>
<td>4598</td>
<td>17.4</td>
<td>29.9</td>
<td>13.7</td>
</tr>
<tr>
<td>ex5p</td>
<td>1064</td>
<td>12.9</td>
<td>40.2</td>
<td>23.9</td>
</tr>
<tr>
<td>frisc</td>
<td>3556</td>
<td>25.1</td>
<td>62.1</td>
<td>42.5</td>
</tr>
<tr>
<td>misex3</td>
<td>1397</td>
<td>13.6</td>
<td>33.1</td>
<td>11.4</td>
</tr>
<tr>
<td>pdc</td>
<td>4575</td>
<td>19.1</td>
<td>30.9</td>
<td>18.7</td>
</tr>
</tbody>
</table>

Table 4: Critical Path Delay Distribution of Mixed Buffer-Pass Architecture

6 Conclusions

We have shown the importance of mixing the tri-state buffer switches and pass transistor switches that make up the inter-cluster routing connections. Routing architectures that allow the router to choose the switch type during the routing phase are faster or more area-efficient. A version of the new architecture with 20% straight-planar tracks and 80% mixed buffer-pass tracks results in a 10% gain in speed with almost no area penalty (2.4% more area) compared to the 50-50_NO_MIX architecture, or a 13% reduction in area with only a small critical path delay penalty (4%) compared to the 20-80_NO_MIX architecture.

7 Acknowledgments

We would like to thank Vaughn Betz and Alexander Marquardt for providing us with the CAD framework, VPR, upon which our work is built. We would also like to express special thanks to Elias Ahmed for providing us with the timing information of the logic blocks for the 0.18um CMOS technology.

8 References

Appendix A: Experimental Results for 0.18um Non-Mixed Architectures