# A 5-20 GHz, Low Power FPGA Implemented by SiGe HBT BiCMOS Technology

Chao You\* Rensselaer Polytechnic Institute 110 8<sup>th</sup> St, Troy NY 12180 1-518-276-2513 youc@rpi.edu

Kuan Zhou Rensselaer Polytechnic Institute 110 8<sup>th</sup> St, Troy NY 12180 1-518-276-2513 zhouk@rpiu.edu

Jong-Ru Guo\* Rensselaer Polytechnic Institute 110 8<sup>th</sup> St, Troy NY 12180 1-518-276-2513 quoj@rpi.edu

Michael Chu Rensselaer Polytechnic Institute 110 8<sup>th</sup> St, Troy NY 12180 1-518-276-2513 chum2@rpi.edu

Russell P. Kraft Rensselaer Polytechnic Institute 110 8<sup>th</sup> St, Troy NY 12180 1-518-276-2765 kraftr2@rpi.edu

John F. McDonald Rensselaer Polytechnic Institute 110 8<sup>th</sup> St, Troy NY 12180 1-518-276-2919 mcdonald@unix.cie.rpi.edu

#### Abstract

A high-speed, low-power FPGA design is presented in this paper. This gigahertz FPGA design has an improved XC6200 structure. Redundant multiplexers are eliminated from the critical signal path to improve on the performance of the previous design. By trading power consumption against performance, the simulated clock rate ranges from 5 GHz to 20 GHz and the power consumption ranges from 4 mW to 12 mW per cell in the IBM 7HP SiGe HBT BiCMOS process.

Categories & Subject Descriptors: VLSI, Gate Array

General Terms: design

Keywords: FPGA, CML, BC, BCII, Dynamic Routing

## **1. INTRODUCTION**

A Field Programmable Gate Array (FPGA) is a multipurpose device that can be configured to perform different tasks. Applications such as wireless communications, high-speed networks and control systems increasingly demand high-speed FPGAs. The first gigahertz 4 x 4 FPGA chip was introduced in 2000 at Rensselaer Polytechnic Institute by this research group. It uses Current Mode Logic (CML) multiplexers to implement a high-speed XC6200 structure [1], [2]. A drawback of CML is its high power consumption compared to CMOS. The total power consumption of a gate array can be calculated with the following equation.

$$P_{total} = (V \times I \times N_{CML}) \times N_{Cell}$$

'V' is the supply voltage, 'I' is the current in the CML tree, ' $N_{CML}$ ' is the number of current trees in a cell and ' $N_{Cell}$ ' is the total number of cells in a gate array.

GLSVLSI'03, April 28-29, 2003, Washington, DC, USA.

A Basic Cell (BC) of this first gigahertz FPGA consisted of 26 CML current trees, each drawing 0.7 mA from a 3.4 V power supply. At this level, a single BC consumes 62 mW and a scaled-up 64x64 gate array would consume a staggering 253 W! Reducing power consumption is therefore the primary goal in making a scaled-up gate array feasible. Various methods have been tried to lower the total CML power consumption. All of them focus on three key factors: the supply voltage, the number of current trees, and the current in each tree. A novel BC structure (BCII) is introduced in this paper. It shortens the signal path through a cell from 7 gate delays to 4. With fewer gate delays, a tree current of 0.4 mA can be used while still achieving a shorter total delay than the old design. The BCII also uses a different multiplexer structure from the old design, cutting the supply voltage from 3.4 V to 2 V [3].
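The power figures quoted for the original BC can be reproduced directly from the equation for $P_{total}$. The following is a minimal sketch; the function and variable names are illustrative and do not come from the design itself.

```python
# Sketch of P_total = (V * I * N_CML) * N_Cell with the figures quoted
# for the original BC. All names here are illustrative assumptions.

def cell_power_mw(supply_v: float, tree_current_ma: float, n_trees: int) -> float:
    """Power of one cell in mW (volts times milliamps gives milliwatts)."""
    return supply_v * tree_current_ma * n_trees


def array_power_w(cell_mw: float, rows: int, cols: int) -> float:
    """Power of a rows x cols gate array in W."""
    return cell_mw * rows * cols / 1000.0


bc_mw = cell_power_mw(supply_v=3.4, tree_current_ma=0.7, n_trees=26)
print(f"one BC: {bc_mw:.1f} mW")
print(f"64x64 array: {array_power_w(bc_mw, 64, 64):.0f} W")
```

Because the three factors multiply, lowering the supply voltage, the tree count, or the tree current each reduces the total proportionally.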

## 2. NEW MULTIPLEXER STRUCTURE

Multiplexers, the building blocks of a cell, are introduced first.

The previous 4:1 multiplexer is shown in Figure 1. Input signals are sent into BJT pairs. Selection bits are sent into a two-level selection tree. The MSB is one level lower than the LSB, as shown in Figure 1. These two levels of selection bits work as a decoder, so exactly one BJT pair at the top is selected at any time. Since the selection bits come in complementary pairs, a CML tree cannot be turned off without an additional control circuit.



Figure 1 Previous 4:1 multiplexer

Copyright 2003 ACM 1-58113-677-3/03/0004...\$5.00.

<sup>\*</sup> Both authors have the same contribution to this work.

The new 4:1 multiplexer is a single-level selection tree, as shown in Figure 2. The selection bits are all on the same level and have no complement signals. After configuration, at most one of the four selection bits is set, and the BJT pair above that selection bit is enabled. The new structure needs a separate decoder to set the selection bit. If none of the selection bits is set, the whole multiplexer is turned off. Dynamic routing takes advantage of this feature to turn off unused multiplexers.
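The selection behavior of the new multiplexer can be illustrated with a logic-level model. This is only a sketch: it ignores the differential signaling and currents of the real CML circuit, and the names `one_hot_mux` and `decode` are hypothetical.

```python
from typing import List, Optional, Sequence


def one_hot_mux(inputs: Sequence[int], select: Sequence[int]) -> Optional[int]:
    """Logic-level model of a single-level selection tree.

    Each input has its own selection bit. Returns the selected input, or
    None when no bit is set, which models the multiplexer being turned off.
    """
    if len(inputs) != len(select):
        raise ValueError("one selection bit is needed per input")
    if sum(1 for s in select if s) > 1:
        raise ValueError("at most one selection bit may be set")
    for value, sel in zip(inputs, select):
        if sel:
            return value
    return None


def decode(index: Optional[int], width: int) -> List[int]:
    """Separate decoder: turn an input index into one-hot selection bits.

    Passing None leaves every bit cleared, i.e. the multiplexer is off.
    """
    return [1 if i == index else 0 for i in range(width)]


print(one_hot_mux([0, 1, 1, 0], decode(2, 4)))     # third input selected
print(one_hot_mux([0, 1, 1, 0], decode(None, 4)))  # no bit set: mux is off
```

Since every input has its own selection bit, the same model works for any number of inputs, not only powers of two.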



Figure 2 New 4:1 multiplexer

The new structure saves one voltage level on the selection bits. This allows a lower supply voltage and thus less power. The tallest CML tree in a cell determines the chip supply voltage. In the previous design, the tallest tree is the three-level selection tree in an 8:1 multiplexer. In the new design, an 8:1 multiplexer requires no more levels than a 2:1 multiplexer. As a direct result of the new multiplexer structure, the power supply drops from 3.4 V to 2 V. About forty percent of the power is saved even without changing other parts of the design.
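The size of this saving follows from the power equation in Section 1. Assuming the tree current and the number of current trees are held fixed, cell power scales linearly with the supply voltage, so a quick check gives:

$$\frac{P_{2\,\mathrm{V}}}{P_{3.4\,\mathrm{V}}} = \frac{2\ \mathrm{V}}{3.4\ \mathrm{V}} \approx 0.59, \qquad 1 - 0.59 \approx 0.41$$

That is, roughly forty percent of the power is saved by the supply reduction alone.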

Another benefit of the new multiplexer structure is that multiplexers of any size can be implemented. A previous multiplexer requires  $2^n$  inputs, where n is the number of selection bits. The new structure allows an arbitrary number of inputs. For example, the 9:1 multiplexers needed later in this paper can be implemented easily with the new structure.

## **3. IMPROVED BCII STRUCTURE**

#### 3.1. BC logic description:

The original BC is shown in Figure 3. Each BC has two inputs from each direction. One is from its neighbor cell. The other is from a FastLANE, which is a shared bus for four cells in the same row or column.



Figure 3 Simplified basic cell

One factor that slows the XC6200 cell is that a signal passes through 5 multiplexers when traveling through a cell. The output of a BC is chosen from one of 3 possible signals, namely the combinational logic result, the sequential logic result or the redirected input of a neighboring cell. At the input side of a neighboring BC, those 3 signals, together with the FastLANE signal, are selected again. The goal is to shorten the signal path and eliminate this redundant selection by sending the outputs of a BC straight to the input of the next BC and using only the input-side multiplexers for selection.

#### 3.2. New BCII structure description:

Figure 4 shows the BCII structure, which can be broken into two parts: the output part and the input part.



**Figure 4 BCII structure** 

#### The Output Part:

The output part collects inputs from the "input part," computes the logic function results and sends them, together with the redirected signals, to the neighboring cells. The combinational logic result goes directly from the 2:1 multiplexer to neighboring cells. The sequential logic result goes directly from the Master-Slave latch to neighboring cells. The redirection multiplexer gets its inputs from three directions and selects one signal to pass to a neighboring cell. Because the combinational and sequential logic results are sent to the neighboring cells as soon as they are computed, each of them bypasses a CS multiplexer, a 4:1 multiplexer and an emitter follower.

A redirection multiplexer routes an input from one neighbor cell to another neighbor cell. It obtains inputs from three neighbor cells. Since each neighbor cell sends out three outputs now, the redirection multiplexer receives nine inputs from the neighbor cells and sends out one output. A 9:1 multiplexer can be implemented by the structure introduced in Section 2. Figure 5 shows the changes in the output part.

#### The Input Part:

The input part collects inputs from all neighbor cells, selects three signals and sends them to the "output part." The signals that the input part collects are the combinational results, sequential results, redirection results and FastLANEs from all four directions. Which signals are selected depends on what kind of function will be performed by the cell.

One of the three input multiplexers needs sixteen inputs, four from each direction. The other two multiplexers have an extra input from the sequential logic. In practice the gate delay of a 16:1 or 17:1 multiplexer is quite large. In the actual circuit, the 16:1 multiplexer is replaced by five 4:1 multiplexers, as shown in Figure 5(a), which together have less gate delay than a single 16:1 multiplexer. The 17:1 multiplexer is implemented by the circuit in Figure 5(b).
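One plausible way to arrange five 4:1 multiplexers into a 16:1 selection is a two-stage tree: four first-stage multiplexers followed by a fifth that picks among their outputs. The sketch below models that arrangement at the logic level only. The grouping of inputs and the function names are assumptions, and the exact wiring of Figure 5(a) may differ.

```python
from typing import Sequence


def mux4(inputs: Sequence[int], sel: int) -> int:
    """Logic-level 4:1 multiplexer; sel is the index of the chosen input."""
    if len(inputs) != 4 or not 0 <= sel < 4:
        raise ValueError("mux4 needs four inputs and a select in 0..3")
    return inputs[sel]


def mux16_from_mux4(inputs: Sequence[int], sel: int) -> int:
    """16:1 selection built from five 4:1 multiplexers in two stages."""
    if len(inputs) != 16 or not 0 <= sel < 16:
        raise ValueError("mux16 needs sixteen inputs and a select in 0..15")
    group, lane = divmod(sel, 4)
    # First stage: four 4:1 multiplexers, one per group of four inputs
    # (the grouping, e.g. one group per direction, is an assumption).
    stage1 = [mux4(inputs[4 * g:4 * g + 4], lane) for g in range(4)]
    # Second stage: the fifth 4:1 multiplexer picks one first-stage output.
    return mux4(stage1, group)


signals = list(range(100, 116))      # sixteen distinguishable input values
print(mux16_from_mux4(signals, 10))  # selects signals[10]
```

In this arrangement a signal passes through only two 4:1 stages on its way to the output.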



Figure 5 Actual multiplexer implementation

Even though the BCII structure is quite different from the original XC6200 cell, it preserves all the logic functionality and has three fewer gate delays than an XC6200 cell. The saved gate delays compensate for the speed lost by using a lower current in the current trees.

The RP multiplexer is merged into the master-slave latch, as shown in Figure 6, further reducing the number of CML trees. In the original design, the first stage of the MS-latch receives its signal from the RP multiplexer. To remove the RP multiplexer, two current trees are used in the first stage. The RP multiplexer selection bits are used as enable bits for those two trees. In practice, only one of the two trees is turned on at a time, so only the selected signal (P or R) goes through the first stage. Both P and R can be turned off to switch off the first stage of the MS-latch and save power.



Figure 6 First stage of the MS-latch

The second stage of the MS-latch has changed very little. One enable bit "MS" has been used to turn off the MS-latch. If both MS and CLR are cleared, the second stage of the MS-latch will be turned off.



Figure 7 Second stage of the MS-latch

## 4. SIMULATION RESULT

The first 4 x 4 gigahertz FPGA showed great potential for high-speed FPGA operation in SiGe. The continuation of this work focuses on high speed, low power, and small area, with high speed still the primary goal. The BCII structure has a shorter gate delay, allowing the use of a smaller tree current [3], [4]. The trade-off between performance and power consumption in the IBM 7HP process is shown in Figure 8. The peak  $f_T$  occurs at a collector current of 1.2 mA.



Figure 8 Ic versus f<sub>T</sub> in the IBM 7HP SiGe BiCMOS

Several tree currents are used to trade power against performance. In the IBM 7HP process, the original basic cell has a gate delay of 80 ps at 53 mW per cell (3.4 V power supply, 0.6 mA tree current, combinational logic). The BCII has a gate delay of 55 ps at 12 mW per cell (2 V power supply, 0.6 mA tree current, combinational logic). To save more power, a smaller current can be used while still maintaining high-speed performance. For example, with 0.4 mA in each current tree, the total power consumption is 8 mW for combinational logic and the gate delay is 70 ps, which is still faster than the BC.

As shown in Table 1, the BCII has a very low power consumption and good performance.

| Design      | Power (mW) | Delay (ps) |
|-------------|------------|------------|
| BC 0.6 mA   | 53         | 80         |
| BCII 0.8 mA | 16         | 46         |
| BCII 0.6 mA | 12         | 55         |
| BCII 0.4 mA | 8          | 70         |
| BCII 0.2 mA | 4          | 120        |

Table 1 Power and Delay Chart for Designs

An AND gate is simulated for design comparison.
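The entries of Table 1 can be compared against the original BC with a few lines of arithmetic. The sketch below uses only the values from the table; the function name and the three derived metrics (power saving, delay ratio, and power per mA of tree current) are illustrative choices, not figures of merit taken from the design flow.

```python
# (design, tree current in mA, power in mW, delay in ps), copied from Table 1.
TABLE_1 = [
    ("BC 0.6 mA", 0.6, 53, 80),
    ("BCII 0.8 mA", 0.8, 16, 46),
    ("BCII 0.6 mA", 0.6, 12, 55),
    ("BCII 0.4 mA", 0.4, 8, 70),
    ("BCII 0.2 mA", 0.2, 4, 120),
]


def summarize(rows):
    """Compare every design with the first row (the original BC)."""
    _, _, ref_power, ref_delay = rows[0]
    summary = {}
    for name, current_ma, power_mw, delay_ps in rows:
        summary[name] = {
            # Percentage of the reference power that is saved.
            "power_saving_pct": round(100 * (1 - power_mw / ref_power), 1),
            # Delay relative to the reference; below 1 means faster.
            "delay_ratio": round(delay_ps / ref_delay, 2),
            # Power per mA of tree current.
            "mw_per_ma": round(power_mw / current_ma, 1),
        }
    return summary


for design, metrics in summarize(TABLE_1).items():
    print(design, metrics)
```

For the BCII rows, the power per mA of tree current comes out the same at every operating point, which is consistent with the linear dependence on tree current in the power equation of Section 1.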

Figure 9 shows the power and delay trade-off of BCII in the IBM 7HP SiGe process. The best trade-off is at 0.4 mA per current tree. A current larger than 0.8 mA will give a shorter gate delay at the expense of increasing power consumption.



Figure 9 Power delay trade-off in the IBM 7HP Process



Figure 10 Simulation result of an AND gate

As shown in Figure 10, an AND gate is simulated in the IBM 7HP SiGe technology. The tree current is 0.6 mA and the gate delay is 55 ps. The simulation temperature is 25 °C and the voltage swing is 250 mV. One chip that contains four BCII ring oscillators has been sent out for fabrication. The four ring oscillators have different power consumption levels, which can serve as a trade-off reference in future work. The layout of this chip is shown in Figure 11.



Figure 11 BCII IBM 7HP layout

## 5. CONCLUSION AND FUTURE WORK

The BCII design focuses on low power consumption while keeping the best possible performance. One BCII chip with different power-delay trade-offs has been sent out for fabrication. Measured results from the chip will be reported later. Future work includes chip testing and a redesign for the faster IBM 8HP process.

Other improvements to the BCII structure include adding an adder circuit to the cell, thus reducing the number of cells needed in some applications. This further improved structure is called BCIII. It has the same gate delay but more functionality. Each BCIII is equivalent to three BCII cells when an adder circuit is required in an FPGA application.

## 6. REFERENCES

- [1] John F. McDonald and Bryan S. Goda, "Reconfigurable FPGA's in the 1-20GHz Bandwidth with HBT BiCMOS," *Proceedings of the First NASA/DoD Workshop on Evolvable Hardware*, pp. 188-192.
- [2] Bryan S. Goda, John F. McDonald, Stephen R. Carlough, Thomas W. Krawczyk Jr. and Russell P. Kraft, "SiGe HBT BiCMOS FPGAs for fast reconfigurable computing," *IEE Proc.-Comput. Digit. Tech.*, vol. 147, no. 3, pp. 189-194.
- [3] "IBM SiGe Designer's Manual," IBM Inc., Burlington, Vermont, 2001.
- [4] Harame, D., Crabbe, E., Cressler, J., Comfort, J., Sun, J., Stiffler, S., Kobeda, E., Burghartz, M., Gilbert, J., Malinowski, A., Dally, S., Rathanphanyarat, M., Saccamango, W., Cotte, J., Chu, C., Stork, J., "A High Performance Epitaxial SiGe-Base ECL BiCMOS Technology," *IEEE IEDM Tech. Digest*, 1992, pp. 2.1.1-2.1.4.