# Spec-based flip-flop and latch repeater planning 

Man Chung Hon<br>Intel Corportation<br>Santa Clara, CA 95054, USA<br>manch.c.hon@intel.com


#### Abstract

Shrinking process geometries and frequency scaling give rise to an increasing number of interconnects that require multiple clock cycles. This paper explores efficient techniques to insert flip-flops and latches to meet pre-determined latency and margin constraints at the receivers. Previous approaches push timing margins to either ends of interconnect. We present an $O(n \log n)$-time algorithm to insert flipflops that evens out timing margins across the entire interconnect, resulting in more robust designs and faster design convergence. An $O(n \log n)$-time extension to handle symmetric, two-phases latches is also presented. Experimental results verify the correctness and practicality of our approach.


## I. Introduction

Deep submicron process has created a number of new design challenges. Primary among them is the increasing dominance of interconnects. Current scaling trends show that the frequency of high-performance ICs approximately doubles and the die size grows by $25 \%$ with every process generation. Interconnect optimization techniques such as repeater insertion and gate sizing have proven effective in reducing interconnect delay. However, with such short clock cycles, the delay on global signals may be longer than one clock cycle even after being optimized with these techniques. Insertion of clocked repeaters such as flip-flops and latches become necessary on these signals.

In a typical design flow, microarchitects determine the target latencies for each driver-receiver pair in the design during the early phase of design cycle. Traditionally, flip-flops and latches are coded into RTL manually. It is difficult for the circuit designers to try out different configurations and placements. A recent study on the effect of process scaling on repeaters [9] shows that, not only is a clock cycle's worth of metal length shrinking faster than the scaling of geometries, it is also decreasing at a much faster rate than the repeater-to-repeater distance. It is predicted that at the $45-\mathrm{nm}$ process node, as many as one in every four repeaters is clocked [9].

There has been growing interests among researchers on the problem of multi-cycle global interconnects. Lu, Zhong, Koh and Chao [8] proposed an analytical formulation for the simultaneous insertion of flip-flops and
buffers to minimize latencies. Hassoun, Alpert and Thiagarajan [3] combined routing, buffer and flip-flop insertion in multiple clock domain systems. Both [8] and [3] only work with two-pin nets, and do not respect latency constraints specified a priori. Cocchini [2] extended van Ginneken's classical dynamic programming structure [12] to simultaneously insert flip-flops and buffers, and can be directed to either minimize latencies or to meet specified latency constraints. This algorithm was in turn extended by Seth, Zhao and Hu [10] to incorporate single phase level-sensitive latches. As pointed out by Akkiraju and Mohan [1], Cocchini's approach puts all margins, both positive and negative, to the first segment right after the driver. (On a 2 -pin net bounded by two flip-flops, margin is defined to be the difference between a clock cycle and the arrival time at the receiving flip-flop. On a multifanout net, we take the minimum from all the receivers.) Instead, [1] proposed another extension of van Ginneken to insert flip-flops in such a way that margins are evenly distributed in the driver and the receiver ends, while holding the middle segments timing-tight. Researchers have also investigated techniques other than clocked repeaters. Zhang, Hu and Chen [14] suggested a new global interconnect architecture using the wave-pipelining technique. Lin and Zhou [7] applied retiming to improve the clock speed on an initial flip-flop placement.
In this paper, we present an algorithm to insert flipflops to even out margins throughout the entire interconnect. This way of margin distribution is often what a circuit designer wants intuitively. In the case of positive margin, putting all at the extremities of the interconnect, as is done in Cocchini [2] and Akkiraju and Mohan [1], makes the middle segments timing-critical. A large number of tight segments could pose challenges to downstream design stages. This is especially problematic when designers seek to improve the chip's speed in minor revisions of a taped-out design. The middle segments, which barely pass the original frequency goal, may now come in with negative margins. In the case of negative margin, putting all at the ends of the interconnect makes the timing problem more difficult to fix. As an example, imagine a 2-pin net with 9 flip-flops in between, and a negative margin of -100 pico-seconds. Fixing 10 nets each with a -10 ps margin is arguably easier than fixing one net with a -100 ps margin (Cochinni), or two nets with -50 ps margin
(Akkiraju and Mohan)
The remainder of the paper is organized as follows. Section II sets up the flip-flop and latch insertion problems in terms of equivalent retiming problems. Section III presents an efficient $O(n \log n)$-time flip-flop insertion algorithm that evens out margins. Section IV extends the flop algorithm to handle 2-phase symmetric latches, which also runs in $O(n \log n)$ time. Both algorithms are asymtotically faster and easier to implement than previous work. Section V presents experimental results, and we conclude in Section VI.

## II. Preliminaries

In this paper, we model the topology of an interconnect as a tree $T(V, E) . V=\left\{v_{d} \cup S_{N} \cup S_{T N}\right\}$ is a set of $n$ nodes, with root $v_{d}$ and leaves $S_{N}$ of $T$ being the interconnect driver and receivers respectively. The intermediate Steiner nodes $S_{T N}$ contain candidate locations for flop and latch insertion. $E$ is a set of $|V|-1$ edges corresponding to wires. A stage is an ordered subset of edges $\left(u, u_{1}\right), \ldots,\left(u_{k}, v\right)$ in $E$ such that $u$ is either $v_{d}$ or contains a flop (latch), $v$ is either in $S_{N}$ or contains a flop (latch), and $u_{1}, \ldots, u_{k}$ are free of flops and latches. Each receiver $v_{r} \in S_{N}$ has a non-negative, integeral latency requirement $l\left(v_{r}\right)$, and a non-negative required margin $m\left(v_{r}\right) . l\left(v_{r}\right)$ specifies the number of flops, or in the case of latches, half the number of latches, on the unique path from $v_{d}$ to $v_{r}$. The margin requirement states that the signal must arrive at $v_{r}$ no later than $m\left(v_{r}\right)$ time-units before the falling clock edge. We model the delay inside a flop or a latch by a single, fixed real number $d_{\text {cell }} \geq 0$.

Our modeling of wire delay follows [1]. We assume the delay of a wire to be proportional to the square root of its RC-content. The maximum unrepeated distance for a wire was computed based on performance and reliability requirements. Buffers were inserted based on this distance and simulations were done to compute the delay across the wire. The delays were then curve-fitted to a linear equation with respect to the square root of the wire's total RC-content. This allows us to quickly estimate the wire delay $d(e)$, under the assumption that the wire is part of a buffered interconnect path.

We now describe our notion of proper latch timing. We follow closely the definition in [4]. In particular, we require the latches to hold the same values as in an identical circuit in which all elements, including both gates and wires, have zero propagation delay. The notion of proper latch timing is structural in nature, in the sense that the circuit should operate correctly regardless of the functions it computes.

Formally, we seek to solve the following two problems:

1. Flip-Flop Repeater Insertion Problem Assign flops to $S_{T N}$ such that: (i) margins are equally distributed among all stages; and (ii) latency and margin requirements at the receivers are satisfied.


Fig. 1. Top: Original graph model $T(V, E) . d\left(e_{1}\right), \cdots, d\left(e_{5}\right)=1$, $d\left(e_{6}\right), \cdots, d\left(e_{11}\right)=2$. Receivers have required margins $m\left(v_{5}\right)=3, m\left(v_{9}\right)=m\left(v_{11}\right)=2$, and required latencies $l\left(v_{5}\right)=2, l\left(v_{9}\right)=l\left(v_{11}\right)=3$. Bottom: Transformed graph model $T^{\prime}\left(V^{\prime}, E^{\prime}\right)$. Edges have implicit weight 0, unless otherwise specified. Numbers on nodes are associated delays.
2. Latch Repeater Insertion Problem Assign twophase, symmetric level-sensitive latches to $S_{T N}$ such that: (i) latency and margin requirements at the receivers are satisfied; and (ii) timing is structurally correct.

Let us break the circuit $T$ into $\left|v_{r}\right|$ sub-circuits $T_{r}$, each consisting of the unique 2-pin sub-net from the driver $v_{d}$ to a single receiver $v_{r}$. It is clear that distributing margins evenly on the 2-pin net $T_{r}$ amounts to minimizing the clock period of $T_{r}$. The clock period of $T$ is given by the relation $\operatorname{period}(T) \geq \max _{v_{r}} \operatorname{period}\left(T_{r}\right) \geq \operatorname{period}\left(T_{r}\right)$ for all $v_{r}$. Minimizing clock period of $T$ evens out margins by proxy.

Retiming [5] is a technique that relocates registers to reduce cycle time while preserving circuit functionality. As a first step, we convert each edge $e \in E$ into a vertex with propagation delay $d(e)$. Formally, we transform $T(V, E)$ into another tree $T^{\prime}\left(V^{\prime}, E^{\prime}\right)$, with $V^{\prime}=V \cup E$ and each edge $u \xrightarrow{e} v$ in the original edge set $E$ spawns two edges in $E^{\prime}: u \rightarrow e$ and $e \rightarrow v$.

There are three possibilities for $d\left(v^{\prime}\right), v^{\prime} \in V^{\prime}$. If $v^{\prime}$ comes from a receiver $v_{r}$ in the original vertex set $V$,
$d\left(v^{\prime}\right)=m\left(v_{r}\right)$. If $v^{\prime}$ comes from an edge $e \in E, d\left(v^{\prime}\right)=$ $d(e)$. For all other vertices, $d\left(v^{\prime}\right)=0$.

As in [5], sequential elements are represented implicitly by a non-negative integral weight $w\left(e^{\prime}\right)$ on $e^{\prime} \in E^{\prime}$. In our case, $w\left(e^{\prime}\right)=0$ except for the edges that are adjacent to a receiver $v_{r}$, in which case $w\left(e^{\prime}\right)=l\left(v_{r}\right)\left(2 l\left(v_{r}\right)\right.$ for latches $)$ (Fig. 1.) Note that the original flop or latch placement, if there is any, is ignored. Our algorithms operate on the transformed tree model $T^{\prime}$. For ease of exposition, we will refer to both as $T$ in the rest of the paper. Note that the asymptotic bounds derived for $T^{\prime}$ also hold for $T$.

A retiming solution can be viewed as an integral labeling $r$ of vertices. $r(v)$ represents the number of sequentials moved from $v$ 's outputs toward its inputs. In particular, if we require that $r\left(v_{d}\right)$ at the driver and $r\left(v_{r}\right)$ at all receivers to be equal, it is guaranteed that latencies at the receivers remain the same after retiming. Since we model delays across a clocked repeater as a constant $d_{\text {cell }}$, it does not factor in the selection of one retiming solution over another. For the rest of the paper we assume $d_{\text {cell }}=0$.

## III. Flip-Flop Insertion

A general framework to calculate the minimum period is to perform a binary search over a range of possible clock periods until a user-definied accuracy is reached. At each step in the binary search, a new clock period is tried. Retiming is performed to recalculate $r(v)$ at each node in an attempt to make the circuit work with the current clock period. The smallest period for which retiming succeeds is returned as the best clock period.
The efficiency of the binary search depends strongly on the retiming step in its inner loop. One algorithm for the retiming step is the relaxation method, modelled after the Bellman-Ford algorithm for the single source shortest path problem [5, 11]. Shown in Fig. 2, relaxation works on general graphs. We denote by $\delta(v)$ the latest arrival time at $v$ along any combinational path that terminates at $v$. It is obvious that the clock period $c$ is given by $c=\max _{v \in V} \delta(v)$. For a given set of $r(v)$, we calculate $\delta(v)$ for all $v$, which takes $O(E)$ time. The relaxation method thus runs in $O(V E)$ time.

An analysis by Yen $[13,6]$ explains why the $V-1$ loop in lines 2-7 of Graph_Flop_Clk_Feas (Fig. 2) works. The latest arrival time $\delta(v)$ converges to its final value if the edges along the longest path (in terms of propagation time) from the source node $v_{d}$ to $v$ are relaxed in order. The sequence of edges relaxed in the loop consists of $V-1$ copies of some ordering of $E$, and therefore contains every vertex-disjoint path as a subsequence. Since $T$ is a tree, there is a unique path from $v_{d}$ to $v$. There is thus no need for the $V-1$ loop. Instead, we combine the calculation of $\delta(v)$ with the relaxation of $r$ in a single topological tour of the nodes. The new linear-time relaxation algorithm is shown in Fig. 2.

We bound the binary search in Fig. 2 from above with

```
Graph_Flop_Clk_Feas (G, r, c)
/* Input: Graph G(V,E), label r, period c
    * Output: Feasibility of c */
        for each \(v \in V \quad r(v) \leftarrow 0\)
        for \(|V|-1\) \{
            Compute retimed edge weights
            Compute \(\delta(v)\) for all \(v \in V\)
            for all \(v \in V\) such that \(\delta(v)>c\)
                \(r(v) \leftarrow r(v)+1\)
        \}
        Compute retimed edge weights
        if \(\max _{v \in V} \delta(v)>c\)
            then no feasible retiming
            else the current \(r\) is legal
Tree_Flop_Clk_Feas(T,r, c)
/* Input: Tree \(T(V, E)\), label \(r\), period \(c\)
    * Output: Feasibility of \(c * /\)
        for each \(v \in V \quad r(v) \leftarrow 0\)
        for \(v \in V\) in topological order \(\{\)
            Compute \(\delta(v)\)
            if \(\delta(v)>c\)
                then \(r(v) \leftarrow r(\operatorname{parent}(v))+1\)
                    \(\delta(v) \leftarrow d(v)\)
                    else \(r(v) \leftarrow r(\operatorname{parent}(v))\)
        \}
        Compute retimed edge weights
        if \(\max _{v \in V} \delta(v)>c\)
            then no feasible retiming
            else the current \(r\) is legal
Tree_Flop_Insert (T, \(\epsilon\) )
/* Tree Flip-Flop Insertion
    * Input: Tree \(T(V, E)\), rel error \(\epsilon\)
    * Output: Retiming label r*/
        \(d_{\text {max }} \leftarrow \max _{v \in V} d(v)\)
        \(c_{h i} \leftarrow \max _{v_{r}} d\left(v_{d} \leadsto v_{r}\right)\)
        \(c_{l o} \leftarrow \max _{v_{r}}\left\{d\left(v_{d} \leadsto v_{r}\right) /\left(l\left(v_{r}\right)+1\right)\right\}\)
        \(r \leftarrow 0\)
        while \(c_{h i}-c_{l o}>\epsilon * d_{\max }\{\)
        \(c \leftarrow\left(c_{h i}+c_{l o}\right) / 2\)
        if Tree_Flop_Clk_Feas(T,r,c) = true
        then \(c_{h i} \leftarrow c\)
        else \(c_{l o} \leftarrow c\)
    \}
    return \(r\)
```

Fig. 2. Clock feasibility test in FF retiming for general graphs, clock feasibility test for trees and FF insertion for trees


Fig. 3. Two phase clocking scheme, reproduced from [4].


Fig. 4. The path $v_{4} \leadsto v_{8}$ satisfying Condition 1 implies $v_{5} \leadsto v_{7}$ meets the same condition. $d_{i}$ denotes the sum of vertex delays in the $i$-th stage.
$c_{h i}=\max _{v_{r}} d\left(v_{d} \leadsto v_{r}\right)$. This corresponds to the circuit being entirely combinational. The search is bounded from below by $c_{l o}=\max _{v_{r}}\left\{d\left(v_{d} \leadsto v_{r}\right) /\left(l\left(v_{r}\right)+1\right)\right\}$, which corresponds to margins being distributed perfectly evenly along every driver to receiver path. Note that for $c>=c_{l o}$, it is guaranteed that Tree_Flop_Clk_Feas leaves $r\left(v_{r}\right)=0=r\left(v_{d}\right)$ for all receivers $v_{r}$ : the latency requirement is satisfied throughout the binary search. Tree_Flop_Insert solves Problem 1 in $O(V \log V)$ time for any fixed $\epsilon$.

## IV. Latch Insertion

In a two-phase clocking scheme $\left\langle\phi_{0}, \gamma_{0}, \phi_{1}, \gamma_{1}\right\rangle$, two clocking phases are employed (Fig. 3). $\phi_{0}, \phi_{1}$ denote the duty cycle of the first and second phases, and $\gamma_{0}, \gamma_{1}$ denote the gaps. The period is given by $\phi_{0}+\gamma_{0}+\phi_{1}+\gamma_{1}$. A two-phase symmetric clocking scheme is characterized by equal duty cycles and gaps: $\phi_{0}=\phi_{1}=\phi$ and $\gamma_{0}=\gamma_{1}=\gamma$.

The conditions for proper timing of $T$ are based on considering the operation of $T$ when all propagation delays are 0 . It can be shown that $T$ is properly timed if and only if for every path $u \stackrel{p}{\sim} v$, the following condition is satisfied [4]:

$$
\begin{equation*}
d(p) \leq c\left(\frac{1+w(p)}{2}\right)+\phi \tag{1}
\end{equation*}
$$

Furthermore, we only need to make sure (1) is satisfied for maximal paths. These are paths that start right after a latch and ends right before another. To see this, note that the right hand side of (1) stays the same as we take in more nodes into the path, as long as the number of latches does not change. On the other hand, more nodes could potentially increase $d(p)$, leading to a tighter inequality. To reduce clutter on the formulas, we set $\phi=0$ for the rest of the paper.
(1) leads to a greedy algorithm. Nodes are visited in topological order staring with the driver. In Fig. 4, suppose all the maximal paths up until $v_{10}$ satisfy (1), and
we are considering whether a latch must be inserted between $v_{10}$ and $v_{11}$ to maintain timing feasibility. Applying (1) to the maximal paths that end with $v_{11}$, we consider whether the following inequalities are satisfied:

$$
\begin{align*}
y & \leq \frac{c}{2}-x \\
y & \leq \frac{c}{2} * 2-d_{4}-x \\
\vdots &  \tag{2}\\
y & \leq \frac{c}{2} * 5-d_{1}-d_{2}-d_{3}-d_{4}-x
\end{align*}
$$

Case 1 Not satisfied. In this case we must insert a latch. $x$ becomes the new $d_{5}$, the right hand sides of (2) adjusted, and a new inequality is added to the set:

$$
\begin{align*}
z & \leq \frac{c}{2}-y \\
\vdots & \\
z & \leq \frac{c}{2} * 6-d_{1}-d_{2}-d_{3}-d_{4}-d_{5}-y \tag{3}
\end{align*}
$$

Case 2 Satisfied. We can avoid adding a latch:

$$
\begin{align*}
& z \leq \frac{c}{2}-x-y \\
& \vdots \\
& z \leq \frac{c}{2} * 5-d_{1}-d_{2}-d_{3}-d_{4}-x-y \tag{4}
\end{align*}
$$

The inequality satisfaction tests can be done more efficiently by replacing (2) with a single inequality $y \leq d_{\text {min }}$, where $d_{\text {min }}$ is the minimum of the right hand sides in (2):

$$
d_{\min }=\min \left(\frac{c}{2}-x, \ldots, \frac{c}{2} * 5-d_{1}-d_{2}-d_{3}-d_{4}-x\right)
$$

Comparing the updated inequalities in (3) and (4) with (2), it is evident that $d_{\text {min }}$ can be computed efficiently. Starting with $d_{\min }\left(v_{d}\right)=\frac{c}{2}$ at the driver, $d_{\min }(v)$ is computed recursively from $d_{\text {min }}(u)$, where $u \rightarrow v$ is an edge in $E$, as follows:

$$
d_{\min }(v)=\left\{\begin{array}{c}
\frac{c}{2}+\min \left(0, d_{\min }(u)-d(u)\right)  \tag{5}\\
\text { if } u \rightarrow v \text { is latched } \\
d_{\min }(u)-d(u) \text { otherwise }
\end{array}\right.
$$

With (5), feasibility of any period $c$ can be determined in linear time (Fig. 5). Using this as a subroutine, a binary search over the range of possible periods is performed. Let $D=\max _{v \in V} d(v)$. The search is bounded by $|V| * D+2 * \phi$ from above, and $2 * \phi$ from below. This results in an $O(n \log n)$-time algorithm (Fig. 5).

## V. Experimental Results

We implemented three different spec-based sequential insertion algorithms in $\mathrm{C}++$, and compared them on a block from an Intel Xeon ${ }^{\mathrm{TM}}$ microprocessor designed on $90-\mathrm{nm}$ process technology.

TABLE I
Results of various sequential insertion algorithms. The flop number of $L A T C H$ is half the number of latches.

| Algorithm | Flops | RunTime (seconds) |
| :--- | ---: | ---: |
| SBFIA | 14989 | 3104.38 |
| RFLOP | 14907 | 11.54 |
| LATCH | 15028.5 | 11.92 |

- SBFIA - Spec-based FF insertion from [1].
- RFLOP - FF insertion algorithm from Section III.
- LATCH - Latch insertion algorithm from Section IV.

The design has 1769 multi-cycle nets, with fanouts ranging from 1 to 34 . The nets' topologies are discretized by breaking the wires with a candidate insertion node every 100 microns. The relative error $\epsilon$ is set to 0.001 . All experiments were run on a workstation with four Intel Xeon ${ }^{\mathrm{TM}} 2.8 \mathrm{GHz} / 512 \mathrm{~KB}$ processors and 8 GB of memory, although none of the three algorithms exploit the parallelism afforded by the machine.

As shown in Table I, the three algorithms used similar amount of sequential resources. For accounting purpose a latch is counted as half a flop. Note that even though minimizing flop usage is not an explicit goal of $R F L O P$, it used slightly fewer flops than SFBIA, while $L A T C H$ used $0.26 \%$ more. A more detailed study of the result shows that RFLOP was more flop-efficient than SBFIA in every fanout bucket.

The comparison on runtime is more pronounced. Both RFLOP and LATCH finished under 12 seconds, while the $O\left(n^{2} \log n\right)$-time SFBIA took over 50 minutes.

We observe that stage spread - the difference between maximum and minimum flop-to-flop propagation delay is smaller for RFLOP than SBFIA (Table II), validating our design goal of a more even margin distribution. Errors from discretization and imbalance among multiple receivers keep the spread positive. Perfectly balanced margins would have given a spread of zero.

Both RFLOP and LATCH improved negative slacks compared to SFBIA (Table III). In the case of LATCH, all the negative slacks were wiped out. The median gain on frequency over SFBIA ranges from $15.96 \%$ to $35.11 \%$ for RFLOP, and $22.42 \%$ to $44.27 \%$ for LATCH.

## VI. Conclusions

We have presented fast and practical algorithms for flop and latch insertion to facilitate design of pipelined interconnects within the latency constraints at the receivers. We argued that evening out timing margins across the entire interconnect is more beneficial than pushing them to either ends, and made it an explicit goal of our algorithms to distribute margins evenly. Experimental results

TABLE II
Stage spread is measured in picoseconds. The number of nets is given in parentheses next to the fanouts.

| Algorithm | No. <br> Flops | Stage Spread |  |
| :---: | :---: | :---: | :---: |
|  |  | Median | Average |
| 1-fanout (941) |  |  |  |
| SBFIA | 5766 | 162.12 | 175.96 |
| RFLOP | 5766 | 58.23 | 80.31 |
| 2-fanout (456) |  |  |  |
| SBFIA | 4968 | 249.78 | 253.46 |
| RFLOP | 4956 | 146.73 | 152.98 |
| 3-fanout (187) |  |  |  |
| SBFIA | 1877 | 269.94 | 256.91 |
| RFLOP | 1876 | 183.30 | 181.92 |
| 4-fanout to 6-fanout (145) |  |  |  |
| SBFIA | 1703 | 281.67 | 281.46 |
| RFLOP | 1671 | 205.34 | 200.73 |
| 7-fanout+ (40) |  |  |  |
| SBFIA | 675 | 333.72 | 319.72 |
| RFLOP | 638 | 230.98 | 224.37 |

TABLE III
Negative slack is measured in picoseconds. The number of nets is given in parentheses next to the fanouts.

|  | Negative Slack |  |  | Freq Gain |  |
| :---: | :---: | :---: | :---: | :---: | :---: |
|  | Total | Worst | Nets | Median | Average |
| 1-fanout (941) |  |  |  |  |  |
| SBFIA | -38242.2 | -213.4 | 707 | 0 | 0 |
| RFLOP | -253.2 | -16.6 | 192 | 17.51\% | 19.36\% |
| LATCH | 0.0 | 0.0 | 0 | 24.35\% | 29.67\% |
| 2-fanout (456) |  |  |  |  |  |
| SBFIA | -17859.8 | -220.0 | 413 | 0 | 0 |
| RFLOP | -69.1 | -6.3 | 59 | 17.47\% | 17.13\% |
| LATCH | 0.0 | 0.0 | 0 | 23.47\% | 25.53\% |
| 3 -fanout (187) |  |  |  |  |  |
| SBFIA | -7704.4 | -180.8 | 168 | 0 | 0 |
| RFLOP | -278.6 | -15.0 | 44 | 15.96\% | 17.45\% |
| LATCH | 0.0 | 0.0 | 0 | 22.42\% | 26.48\% |
| 4-fanout to 6-fanout (145) |  |  |  |  |  |
| SBFIA | -7756.8 | -189.9 | 142 | 0 | 0 |
| RFLOP | -335.8 | -15.8 | 60 | 16.24\% | 21.66\% |
| LATCH | 0.0 | 0.0 | 0 | 24.05\% | 29.27\% |
| 7-fanout+ (40) |  |  |  |  |  |
| SBFIA | -3312.1 | -169.0 | 40 | 0 | 0 |
| RFLOP | -119.4 | -13.2 | 22 | 35.11\% | 32.36\% |
| LATCH | 0.0 | 0.0 | 0 | 44.27\% | 39.58\% |

demonstrate the efficiency and efficacy of the algorithms on industrial test cases. unsrt

## References

[1] N. Akkiraju and M. Mohan. Spec based flip-flop and buffer insertion. In Proceedings of the 21st International Conference on Computer Design, pages 270-275. IEEE Computer Society, 2003.
[2] P. Cocchini. A methodology for optimal repeater insertion in pipelined interconnects. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 22(12):1613-1624, 2003.
[3] S. Hassoun, C. J. Alpert, and M. Thiagarajan. Optimal buffered routing path constructions for single and multiple clock domain systems. In Proceedings of the 2002 IEEE/ACM international conference on Computer-aided design, pages 247-253. ACM Press, 2002.
[4] A. T. Ishii, C. E. Leiserson, and M. C. Papaefthymiou. Optimizing two-phase, level-clocked circuitry. J. ACM, 44(1):148-199, 1997.
[5] C. Leiserson and J. Saxe. Retiming synchronous circuitry. Algorithmica, 6(1):5-35, 1991.
[6] C. E. Leiserson and J. B. Saxe. A mixed-integer linear programming problem which is efficiently solvable. $J$. Algorithms, 9(1):114-128, 1988.
[7] C. Lin and H. Zhou. Retiming for wire pipelining in system-on-chip. In Proceedings of the 2003 international conference on on Computer-aided design, pages 215-220. IEEE Computer Society, 2003.
[8] R. Lu, G. Zhong, C. Koh, and K. Chao. Flip-flop and repeater insertion for early interconnect planning. In Proceedings of the conference on Design, automation and test in Europe, page 690. IEEE Computer Society, 2002.
[9] P. Saxena, N. Menezes, P. Cocchini, and D. Kirkpatrick. Repeater scaling and its impact on cad. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 23(4):451-463, 2004.
[10] V. Seth, M. Zhao, and J. Hu. Exploiting level sensitive latches in wire pipelining. In Proceedings of the 2004 international conference on on Computer-aided design, pages 283-290. IEEE Computer Society, 2004.
[11] N. Shenoy. Retiming: theory and practice. Integr. VLSI J., 22(1-2):1-21, 1997.
[12] L. van Ginneken. Buffer placement in distributed rc-tree networks for minimal elmore delay. In Proceedings of the IEEE International Symposium on Circuits and Systems, volume 2, pages 865-868. IEEE Computer Society, 1990.
[13] J. Yen. An algorithm for finding shortest routes from all source nodes to a given destination in general networks. Quarterly of Applied Mathematics, 27(4):526-530, 1970.
[14] L. Zhang, Y. Hu, and C. Chen. Wave-pipelined on-chip global interconnect. In Proceedings of the Design Automation Conference Asia and South Pacific, page to appear. IEEE Computer Society, 2005.

