Physical Planning with Retiming*

Jason Cong and Sung Kyu Lim
UCLA Department of Computer Science, Los Angeles, CA 90095
{cong,limsk}@cs.ucla.edu

Abstract

In this paper, we propose a unified approach to partitioning, floorplanning, and retiming for effective and efficient performance optimization. The integration enables the partitioner to exploit more realistic geometric delay model provided by the underlying floorplan. Simultaneous consideration of partitioning and retiming under the geometric delay model enables us to hide global interconnect latency effectively by repositioning FF along long wires. Under the proposed geometric embedding based performance driven partitioning problem, our GEO algorithm performs multi-level top-down partitioning while determining the location of the partitions. We adopt the concept of sequential arrival time [14] and develop sequential required time in our retiming based timing analysis engine. GEO performs cluster-move based iterative improvement on top of multi-level cluster hierarchy [4], where the gain function obtained from the timing analysis is based on the minimization of cutsize, wirelength, and sequential slack. In our comparison to (i) state-of-the-art partitioner Metis [9] followed by retiming [11] and simulated annealing based slicing floorplanning [15], and (ii) state-of-the-art simultaneous partitioning with retiming HPM [7] followed by floorplanning [15], GEO obtains 35% and 23% better delay results while maintaining comparable cutsize, wirelength, and runtime results.

1 Introduction

The conventional objective of partitioning is to minimize the number of connections among the subcircuits. Under the new interconnect-centric design paradigm, however, partitioning is seen as the crucial step that defines the local and global interconnects [2]. To meet the performance requirement of today’s complex design, partitioners must consider the amount of interconnect induced by partitioning as well as its impact on performance. Cutsizes minimization helps to lower the possibility of critical paths crossing partition boundary multiple times, thus improving performance. However, cutsizes minimization alone is not enough since we need more rigorous approaches that address the impact of partitioning on delay minimization. Performance driven partitioning methods can be grouped into ones that are designed for combinational circuits [10, 13, 16], and for sequential circuits with retiming [14, 3, 7]. However, all of these works use a very simple delay model – unique delay value for gates, but single constant value for inter-block connection. This limitation is simply due to the fact that the location and dimension of blocks are not available during the partitioning.

Given a circuit consisting of both fixed or flexible blocks and a netlist interconnecting these blocks, floorplanning constructs a layout by determining the position and shape of each block such that all nets can be routed and total layout area is minimized. Traditionally, partitioning is performed first to generate blocks, followed by the subsequent floorplanning to determine the location of blocks. However, the separation between partitioning and floorplanning poses two major shortcomings for performance optimization; (i) partitioning suffers from non-realistic delay estimation as the location and dimension of blocks are not available, and (ii) the subsequent floorplanning may be constrained by the predefined local and global interconnects from the prior partitioning. Retiming [11] is a sequential logic optimization technique that exploits the flexibility provided by repositioning flip-flops (FFs) to minimize either the delay or the number of FFs in the circuit. Recently, retiming has become more attractive in handling global interconnects – it allows multiple clock cycles to propagate signals across the chip. Traditionally, retiming is performed during logic synthesis but suffers from the lack of routing delay estimation. Most of existing algorithms on performance driven partitioning [10, 13, 12, 16, 5] consider only combinational circuits and thus ignore retiming. Recent advance in combining retiming with partitioning [14, 3, 7] enables the partitioner to explore wider solution space that considers equivalent FF movement. However, these algorithms impose a restriction on retiming of global interconnects – we cannot place FFs along wires but only at the beginning of the wires.

In this paper, we propose a unified approach to partitioning, floorplanning, and retiming for effective and efficient performance optimization. The integration enables the partitioner to exploit more realistic geometric delay model provided by the underlying floorplan, which in turn promotes more effective performance driven refinement. Simultaneous consideration of partitioning and retiming under the geometric delay model enables us to hide global interconnect latency more effectively – finer-grained FF placement is possible since each cell is associated with certain geometric location. Our GEO algorithm is based on multi-level framework, where the given solution is iteratively improved by cluster moves at each level of hierarchy [4]. We adopt the concept of sequential arrival time (SAT) [14] and develop sequential required time (SRT) in our retiming based timing analysis engine. The maximum SAT (MSAT) for a given partitioning solution translates into the delay result obtained after retiming, and our objective is to minimize both MSAT and cutsizes. MSAT is used to derive SRT and its corresponding timing slack for each cell, and we use the slack information to guide cell move for effective delay and cutsize optimization. We demonstrate that our performance optimization becomes more effective based on the availability of geometry information from the underlying floorplan. In our comparison to (i) state-of-the-art partitioner Metis [9] followed by retiming [11] and simulated annealing based slicing floorplanning [15], and (ii) state-of-the-art simultaneous partitioning with retiming algorithm HPM [7] followed by floorplanning [15], GEO obtains 35% and 23% better delay results while maintaining comparable cutsize, wirelength, and runtime results.

*This research was partially supported by the MARCO/DARPA Gigascale Silicon Research Center (GSRC) and a grant from Intel Corporation. The authors would like to thank Dr. Chang Wu and Dr. Peichen Fan at Aplus Design Technologies, Inc. for their helpful comments for this manuscript.
The remainder of the paper is organized as follows. Section 2 provides the problem formulation. Section 3 discusses our retiming based timing analysis engine. Section 4 presents our GEP algorithm. Section 5 provides experimental results. Section 6 concludes the paper with our ongoing research.

2 Problem Formulation

Given a sequential gate-level netlist \( NL(C, N) \), let \( C \) denote cells consisting of gates and flip-flops, and \( N \) denote nets that connect cells. The purpose of Geometric Embedding based Performance Driven K-way Partitioning (GEPDP) is to assign cells in \( NL \) to \( K \) blocks while determining the location of the blocks under the prescribed area constraints. Given a GEPDP solution \( B \), our primary objective is to minimize delay \( \phi(B) \) (to be defined below). As secondary objectives, we minimize cutsize \( \theta(B) \) and wirelength \( l(B) \) induced by \( B \). The formal definition of the problem is as follows.

Definition 1 – The GEPDP Problem

The geometric embedding based performance driven K-way partitioning problem under the given area constraints \( A = (L_i, \ell_i) \) for \( 1 \leq i \leq K \) has a solution \( B \), where \( B = [B_1(x_1, y_1), B_2(x_2, y_2), \ldots, B_K(x_K, y_K)] \) denotes the set of blocks and their locations. \( B \) is optimal if it satisfies the following conditions: (1) \( B \subseteq C \) and \( L_i \leq |B_i| \leq U_i \) for \( 1 \leq i \leq K \), (2) \( B_1 \cup B_2 \cup \ldots \cup B_K = C \), (3) \( B_i \cap B_j = \emptyset \) for all \( i \neq j \), (4) \( \phi(B) \) is minimized.

2.1 Delay Objective

For delay calculation, we model \( NL \) with a directed, edge-weighted graph \( G = (V, E) \), where the vertex set \( V = \{v_1, v_2, \ldots, v_n\} \) represents cells, and the directed edge set \( E = \{e_1, e_2, \ldots, e_m\} \) represents signal directions in \( NL \). A directed edge \( e(u, v) \) denotes the connection from vertex \( u \) to vertex \( v \). In our geometric delay model, a directed edge \( e \) has a delay of \( d(e) \), and each edge \( (u, v) \) has a delay of \( d(e) \) that is linearly proportional to the Manhattan distance between \( u \) and \( v \), i.e., \( d(e) \propto |x_u - x_v| + |y_u - y_v| \), where \( u \in B_i \) and \( v \in B_j \). The delay of a path \( p(u, v) \) from \( u \) to \( v \), denoted \( d(p) \), is defined to be the sum of \( d(e) \) along the path. The formal definition of \( \phi(B) \) is as follows.

Definition 2 – Delay Without Retiming

The delay \( \phi(B) \) induced by a geometric embedding based partitioning solution \( B \) is the largest delay among the following 4 types of paths: \( \phi(B) = \max\{d(p(u, v))|u \in PI \text{ or } FF \text{ and } v \in PO \text{ or } FF\} \).

We use retiming graphs [11] for our retiming based timing analysis. A retiming graph \( R = (V, E, W) \) consists of a vertex set \( V = \{v_1, v_2, \ldots, v_n\} \) that represents gates, a directed edge set \( E = \{e_1, e_2, \ldots, e_m\} \) that represents signal directions in \( NL \), and edge weight set \( W \) that represents the number of flip-flops (FFs) between the two end-vertices of each edge. A retiming is a labeling of the vertices \( R : V \rightarrow Z \), where \( Z \) is the set of integers. The weight of an edge \( e = (u, v) \) after retiming is denoted by \( w'(e) \) and is given by \( w'(e(u, v)) + r(v) - r(u) \). The retiming label \( r(v) \) for a vertex \( v \in V \) represents the number of FFs moved from its output towards its inputs. A circuit is retimed to a delay \( \phi \) by a retiming \( r \) if the following conditions are satisfied: (i) \( w'(e) \geq 0 \), (ii) \( w'(p) \geq 1 \) for each path \( p \) such that \( d(p) > \phi \), where \( w'(p) = \sum_{e \in p} w'(e) \). Since \( \phi \) after retiming is usually smaller, retiming is used to further reduce the delay result of partitioning.

For a given GEPDP solution \( B \) and a target delay \( \phi \), the edge length of \( e = (u, v) \), denoted \( l(e) \), is defined to be \( -\phi - w'(e) + d(v) - d(u) \). The path length of \( p \), denoted \( l(p) \), is \( \sum_{e \in p} l(e) \). The sequential arrival time (SAT) of \( v \), denoted \( l(v) \), is the maximum path length from PIs to \( v \). If the SAT for all PIs are less than or equal to \( \phi \), the target delay \( \phi \) is called feasible. Let \( D_0 = \max\{d(v)|v \in V\} \). The delay objective considering retiming is as follows.

Definition 3 – Delay After Retiming

The delay \( \phi(B) \) induced by a geometric embedding based partitioning solution \( B \) after retiming for a given feasible target delay \( \phi \) is defined to be the minimum \( \phi + D_y \).

2.2 Cutsize and Wirelength Objective

For the cutsize and wirelength calculation, we model \( NL \) with a hypergraph \( H = (V, E_H) \), where the vertex set \( V = \{v_1, v_2, \ldots, v_n\} \) represents cells, and the hyperedge set \( E_H = \{h_1, h_2, \ldots, h_m\} \) represents nets in \( NL \). Each hyperedge \( h \) is a nonempty subset of \( V \). The cutsize induced by \( B \), denoted \( \theta(B) \), is the number of hyperedges connecting vertices in different blocks. The \( x \)-span of a hyperedge \( h \), denoted \( h_x \), is defined as \( h_x = \max_{e \in h} \{x_e \in E_H\} - \min_{e \in h} \{x_e \in E_H\} \). The \( y \)-span of \( h \), denoted \( h_y \), is similarly defined using \( y \)-coordinates instead. The sum of \( x \)-span and \( y \)-span of each hyperedge \( h \) is the Half-Perimeter of the Bounding Box (HPBB) of cells in \( h \). The wirelength induced by \( B \), denoted \( l(B) \), is the sum of HPBB of all hyperedges.

3 Retiming based Timing Analysis

This section presents our retiming based timing analysis engine. We adopt the concept of sequential arrival time (SAT), which was first introduced in [14] and later on used in [3, 7] for partitioning with retiming. We extend SAT to introduce sequential required time and sequential slack. The slack information is used to derive \( e \)-network that identifies timing critical cells in the circuit. In GEP, the derivation of \( e \)-network plays a key role in determining the cell move gain.

3.1 Basic Concepts

The concepts of arrival time, required time, and slack are essential in timing analysis of circuits. The arrival time of a vertex \( v \in V \) is defined to be the maximum path delay from PIs to \( v \). For combinational circuits, we can compute arrival time of all vertices by visiting them in a topological order. We can assign the required time of a vertex \( v \in V \) to specify timing constraint for \( v \), i.e., the delay from PIs to \( v \) is required to be a certain value. Then, we can compute required time of all upstream vertices, i.e., all vertices reachable from PIs before arriving at \( v \), by examining fan-out vertices. For combinational circuits, we can compute required time of all vertices by visiting them in a reverse

[1]For the conventional non-geometric embedding based partitioning with \( d(e) = D \) for all inter-block edges, the authors of [14] showed that \( B \) can be retimed to a delay less than \( \phi + D \) if \( \phi \) is feasible. A straightforward extension of this delay bound \( \phi + D \) for our GEPDP problem is \( \phi + D_y \), where \( D_y = \max\{d(v)|v \in E_H\} \). However, our FF placement phase in Section 4.3 reduces this bound to \( \phi + D_y \), where \( D_y < D_x \) according to Table 1.
topological order. The slack of a vertex \( v \in V \) is defined to be the difference between the required time and arrival time of \( v \). The timing slack is used to determine the timing criticality of vertices in \( V \) – the smaller the slack is, the more timing critical \( v \) is, and vice versa. The \( \epsilon \)-network is defined to be a subgraph of \( R \) consisting of vertices whose slack is smaller than or equal to \( \epsilon \). Thus, \( \epsilon \)-network consists of timing critical vertices that deserve attention for delay optimization.

In order to handle sequential circuits directly – as opposed to the conventional approach, where FFs are removed to make the circuit combinational – we define the sequential arrival time (SAT) of \( v \in V \) in terms of fan-in vertices as follows:

\[
I(v) = \max \{I(u) - \phi \cdot w(e) + d(e) + d(v)|e(u, v) \in E\}
\]

In a similar way, we define the sequential required time (SRT) of \( v \in V \) in terms of fan-out vertices as follows:

\[
q(v) = \min \{q(u) + \phi \cdot w(e) - d(e) - d(v)|e(v, u) \in E\}
\]

The slack of \( v \), denoted \( s(v) \), is given by \( q(v) - I(v) \). Our Bellman-Ford variant shortest path algorithm presented in the next section uses these recursive definitions to derive the \( \epsilon \)-network for given sequential circuits.

SAT and retiming are closely related. In fact, the computation of SAT and retiming can be performed at the same time. Consider a path \( p \) that starts from a PI and ends at vertex \( v \). If we want to retiming \( p \) to satisfy the timing constraint \( \phi \), there must be at least \( |I(p) - q(p)| \) FFs on \( p \). Since there exists \( w(p) \) FFs on \( p \), we can set the retiming value \( r(v) \) as \( |I(p) - q(p)| + 1 - w(p) \). Thus, \( r(v) = \max(|I(p) - q(p)| - 1 - w(p)) \). After rewriting, we get \( r(v) = \max|I(p) - q(p)| - 1 \).

3.2 Timing Analysis Algorithm

Our retiming based timing analysis algorithm RTA uses a feasible target delay \( \phi \) to compute SAT, SRT, and retiming all at the same time. RTA algorithm determines if the target delay \( \phi \) is feasible. If so, RTA returns SAT, SRT, and retiming values of all vertices in \( V \). Since \( R \) can possibly contain loops with positive weights depending upon target delay \( \phi \), RTA essentially performs single source longest path algorithm as in Bellman-Ford algorithm [1]. In RTA, SAT for all PIs are set to zero while all others are set to \(-\infty\). SRT for all PIs are set to \( \phi \) while all others are set to \( \phi \). Then, we can iteratively update SAT and SRT until they converge to their maximum and minimum values, respectively. From these SAT and SRT values, we can compute timing slack and its corresponding \( \epsilon \)-network. We can also obtain retiming values for all vertices as a byproduct of this step as discussed in Section 3.1.

The complexity of Bellman-Ford type longest path algorithm is \( O(n^2) \) in the worst case since it requires \( O(n) \) number of relaxation until all values converge to their maximum. If the circuit contains no positive loops, RTA needs only one iteration of relaxation if the vertices are relaxed in topological order starting with PIs. For a circuit with positive loops, however, a topological ordering is not defined. We observe from the related experiments [6] that the number of iterations is usually small compared to \( |V| \). Thus, the complexity of RTA in practice is \( O(n) \). Note that the retiming graph \( R \) has multiple sources – all PIs. In order to perform single source longest path algorithm on \( R \), we add two special vertices \( v_s \) and \( v_t \) to \( V \) and call them the source and sink vertex. We add directed edges from \( v_s \) to all PIs and from all POs to \( v_t \). The delay of \( v_s \), \( v_t \), and edges incident to them are set to zero.

Figure 1: Description of RTA algorithm that computes the sequential arrival time \( I(v) \), sequential required time \( q(v) \), retiming \( r(v) \), and predecessor \( \pi(v) \) for all \( v \) in \( V \). RTA also determines the feasibility of the target delay \( \phi \).

PIs, and from all POs to \( v_t \). The delay of \( v_s \), \( v_t \), and edges incident to them are set to zero.

Figure 1 shows the description of our RTA algorithm. The initialization for sequential arrival time \( I(v) \), sequential required time \( q(v) \), retiming \( r(v) \), and predecessor \( \pi(v) \) for each vertex is done (line 1-6). The Boolean flag done is used to check if update of any vertex is done (line 8) – if not, we conclude that the target delay \( \phi \) is feasible (line 23). We visit each vertex once in each iteration (line 9), and temporary \( I(v) \) and \( q(v) \) are computed from examining \( v \)'s fan-in (line 10) and fan-out (line 11) vertices. We also remember the fan-in vertex that causes the update of \( I(v) \) in order to keep track of the longest path to \( v \) (line 12). If we find that the sequential arrival time of any PO is larger than the target delay \( \phi \), we conclude that \( \phi \) is not feasible (line 13-14). Otherwise, if newly computed \( I(v) \) is larger than the current value, we update \( I(v) \), \( r(v) \), and \( \pi(v) \) (line 15-18) and conclude that we need additional iteration (line 19). Also, we update \( q(v) \) and conclude that we need additional iteration if newly computed \( q(v) \) is smaller than the current value (line 20-22). If RTA does not converge after \( O(V) \) iterations, we conclude \( \phi \) is not feasible (line 35).

4. GEO Partitioning Algorithm

We present our algorithm GEO in this section. We provide an overview of GEO algorithm followed by the discussion of two main phases of GEO, construction and FF-placement.

4.1 Overview of GEO Algorithm

GEO is a multi-level geometric embedding based partitioner, which essentially performs a multi-level block placement in
a top-down manner. GEO generates blocks and determines their locations at each level of cluster hierarchy, where the number of blocks increases we go down the cluster hierarchy. GEO consists of two phases, namely, construction and FF placement phase. During the construction phase, multi-level partitioning with floorplanning is performed. First, a multi-level cluster hierarchy is generated, and the partitioning of top-level clusters is obtained. We use cell move based iterative improvement partitioning for the refinement of this initial solution. The gain value is determined by our retiming based timing analysis engine. The top-level partitioning information is then projected onto the next cluster hierarchy, and cell move is performed on the next level for further refinement. This top-down partitioning based on the multi-level cluster hierarchy continues until we obtain partitioning solution of the bottom level (= original circuit). The horizontal and vertical cuts are alternating in breadth-first manner in GEO so that the first horizontal cut divides the chip area into top and bottom block, and the second vertical cuts further divide the chip area into top-left, top-right, bottom-left, and bottom-right block, etc. During the FF placement phase, both coarse-grained and finer-grained repositioning of FFS are performed. The coarse-grained FF repositioning determines which edge gets FFS, and this is done via standard retiming. The finer-grained FF positioning determines at what point of the edge FF is to be located, and this is done with FF placement.

4.2 Construction Phase

During the construction phase of GEO, multi-level cluster hierarchy is generated and the corresponding geometric embedding based partitioning is computed in a top-down manner. At each level of the multi-level cluster hierarchy, cell move based iterative improvement partitioning is performed to improve the current partitioning solution. Figure 2 shows the description of the CONSTRUCT algorithm. The input to CONSTRUCT is a sequential netlist NL, the number of blocks desired K, and area constraint A. Then, CONSTRUCT divides cells in NL into K blocks, determines the location of blocks, and computes the corresponding minimum target delay. The height of cluster hierarchy is set to log2(K) (line 1) so that we perform 2-way partitioning on the top level, and 4-way partitioning on the next level, etc., until we perform K-way partitioning on the bottom level. In this way, we need to perform multi-level clustering only once for single K-way partitioning solution. We use ESC multi-level clustering algorithm [4] in GEO (line 2).

Let B(i) denote K/2-way geometric embedding based partitioning of clusters on level i. Let R^c denote retiming graph of NL at level i, whereas R is the retiming graph of the original circuit. We generate a random initial 2-way partitioning and its corresponding minimum φ (line 3). We need to perform binary search to find the minimum φ, which requires O(log n) number of calls of RTA. This initial φ is used for all the subsequent call to RTA during CONSTRUCT. On each level i starting from the top (= h = 1) to bottom level (= 0) (line 4), CONSTRUCT performs cell move based iterative improvement partitioning. Depending on the current partitioning, each RTA call gives different c-network. We perform RTA on the original circuit (= R) to obtain its c-network R^c (line 7). Then, we derive R^c, the c-network at the current level from R^c (line 8). We increase the weight of edges in R^c (line 9), and the cell move gain is determined in such a way to minimize the total weighted edge lengths (line 10). By representing the cell move gain this way, we can use bucket structure to maintain linear time complexity during each pass. Since full timing analysis on R is costly (= O(n)), GEO performs RTA once per each pass of cell move. We obtain the next level partitioning from the current level (line 12). We randomly divide cells in each block into two and let CONSTRUCT improve this solution during the cell move of the clusters on the next level. At the bottom level, CONSTRUCT performs binary search once again to obtain the minimum feasible delay for the partitioning solution of the bottom level using RTA (line 13). Note that CONSTRUCT can generate multiple solutions and select the one that corresponds to the minimum φ to feed to the subsequent FF placement phase. Due to O(log n) number of calls to O(n) time RTA, the complexity of CONSTRUCT is O(n log n).

4.3 FF Placement Phase

We note that retiming performs coarse-grained FF placement – it determines which edge gets which FF, but not the exact location on the edge. The optimal position of i-th FF computed by retiming on a path p is every point where the propagation delay equals to i · φ for 1 ≤ i ≤ w(p). Thus, it is possible that the location is occupied by a cell. Under the non-geometric partitioning, we do not have a choice but to place f either at Bu or Bv, for f ∈ FF repositioned to e(u, v) via retiming. However, finer-grained FF placement is possible under geometric embedding based partitioning since each cell is associated with certain geometric location. Note that the potential improvement can be as big as the delay of the global interconnect itself depending on the retiming result.

Our strategy to find out the exact location is as follows: we recompute the sequential arrival time for each vertex after retiming. Then, l(v) for each vertex v represents the maximum delay from PUs to v. Thus, as depicted in Figure 3, the optimal location of f can be determined based on l(u) and the target delay ϕ. We move f from u towards v by a distance of ϕ − l(u). In other words, we find a position x where l(x) = ϕ on e = (u, v). Assuming that u, u_v, and v_v, respectively denote x-coordinate of f, u, and v (similar for y, y_v, and v_y), we obtain the following equations:

|x − u_v| : |v_v − x| = ϕ − l(u) : d(e) − ϕ + l(u)
wirelength is the sum of half-perimeter of bounding box dimension of 1cm for the current 18mm technology is used for ind/4 [8]. Cutsizes is based on cost-1 metric, and wirelength is the sum of half-perimeter of bounding box of nets. Runtime is in seconds and collected from ULTRA60 at 360MHz. The bipartitioning area balance skew is set to [.49, .51] for hMetis to enforce tight area balance among blocks. We assume that all gates have unit area and unit delay, while primary inputs, primary outputs and flip-flops in each circuit, and next 2 columns respectively denote the shortest and longest edge delay d(e), assuming gate delay is 1.

Table 1: Benchmark circuit characteristics. First 5 columns respectively denote the name, total number of gates, primary inputs, primary outputs, and flip-flops in each circuit, and next 2 columns respectively denote the shortest and longest edge delay d(e), assuming gate delay is 1.

<table>
<thead>
<tr>
<th>name</th>
<th>#GA</th>
<th>#FI</th>
<th>#PO</th>
<th>#FF</th>
<th>min,</th>
<th>max,</th>
</tr>
</thead>
<tbody>
<tr>
<td>s9234</td>
<td>1290</td>
<td>28</td>
<td>39</td>
<td>135</td>
<td>0.3</td>
<td>4.6</td>
</tr>
<tr>
<td>s5378</td>
<td>1443</td>
<td>35</td>
<td>49</td>
<td>163</td>
<td>0.3</td>
<td>4.8</td>
</tr>
<tr>
<td>s3207</td>
<td>3146</td>
<td>59</td>
<td>152</td>
<td>486</td>
<td>0.5</td>
<td>7.1</td>
</tr>
<tr>
<td>s1550</td>
<td>3784</td>
<td>76</td>
<td>150</td>
<td>515</td>
<td>0.6</td>
<td>7.8</td>
</tr>
<tr>
<td>bigkey</td>
<td>8599</td>
<td>228</td>
<td>197</td>
<td>224</td>
<td>0.8</td>
<td>11.8</td>
</tr>
<tr>
<td>s36584</td>
<td>13209</td>
<td>38</td>
<td>304</td>
<td>1423</td>
<td>1.0</td>
<td>14.4</td>
</tr>
<tr>
<td>clpa</td>
<td>30552</td>
<td>61</td>
<td>82</td>
<td>33</td>
<td>1.6</td>
<td>22.0</td>
</tr>
<tr>
<td>ind1</td>
<td>29780</td>
<td>2630</td>
<td>3242</td>
<td>603</td>
<td>1.6</td>
<td>22.7</td>
</tr>
<tr>
<td>ind2</td>
<td>26060</td>
<td>2772</td>
<td>6242</td>
<td>1755</td>
<td>1.6</td>
<td>21.9</td>
</tr>
<tr>
<td>ind3</td>
<td>52197</td>
<td>2801</td>
<td>3070</td>
<td>2001</td>
<td>2.1</td>
<td>29.3</td>
</tr>
<tr>
<td>ind4</td>
<td>101531</td>
<td>4155</td>
<td>4547</td>
<td>8333</td>
<td>2.9</td>
<td>40.7</td>
</tr>
</tbody>
</table>

Figure 3: Placement of f ∈ FF along edge e = (u, v). (a) f is placed at the position where the sequential arrival time equals to φ, (b) optimal location of f, where φ = 10. f is placed at the point where the distance between u and f is 2 = φ - l(u) = 10 - 8.

<table>
<thead>
<tr>
<th>y - u_y</th>
<th>v_y - y</th>
<th>= φ - l(u) : d(e) - φ + l(u)</th>
</tr>
</thead>
</table>

After rewriting these equations, we get;

\[
x = \frac{v_x \cdot [φ - l(u)] + u_x \cdot [d(e) - φ + l(u)]}{d(e)}
\]

\[
y = \frac{v_y \cdot [φ - l(u)] + u_y \cdot [d(e) - φ + l(u)]}{d(e)}
\]

Moreover, placement of FFs along the longest paths from PIs to POs will suffice in order to retine the circuit. We establish the following theorem as discussed in Section 2.1.

Theorem 1 For a given geometric embedding based partitioning solution B with a target delay φ, FF placement can retine B to a delay less than φ + D_0, if the sequential arrival time for all POs are less than or equal to φ.

Figure 4 illustrate impact of FF placement on delay. We observe that the optimal positioning of FFs improve the final delay result obtained after retiming. We emphasize that the finer-grained FF placement is possible only through the geometric embedding based partitioning. Finally, the complexity of GEO is that of CONSTRUCT and FF placement, which is O(n log n + k) = O(n log n).

Our experimental results in Section 5 confirm that the runtime of GEO indeed is comparable to that of other linear time algorithms such as hMetis [9].

5 Experimental Results

We implemented our algorithms in C++/STL, compiled with gcc v2.4, and tested on SUN ULTRA SPARC60 at 360MHz. We obtained the latest binary executable of hMetis [9] (v1.5.3) for the evaluation. The benchmark set consists of 7 ISCAS circuits and 4 large scale industrial designs provided by our industrial sponsor. Detailed statistics of the circuits are shown in Table 1. We report delay, cutsizes, wirelength, and runtime from 64-way partitioning embedded onto 8 x 8 grid. The grid length is computed by α√(V)/64, where α = 1/14. We obtain the constant α based on the assumption that the delay ratio among the gate, the shortest, and the longest edge is 1:3:40 in our biggest benchmark circuit ind/4 as shown in Table 1. The underlying chip dimension of 1cm x 1cm for the current 18mm technology is used for ind/4 [8]. Cutsizes is based on cost-1 metric, and wirelength is the sum of half-perimeter of bounding box of nets. Runtime is in seconds and collected from ULTRA60 at 360MHz. The bipartitioning area balance skew is set to [.49, .51] for hMetis to enforce tight area balance among
able - fastest among the algorithms used in this experiment.

6 Conclusions and Ongoing Works

In this paper, we proposed a unified approach to partitioning, floorplanning, and retiming for effective and efficient performance optimization. The integration enables partitioning to exploit geometric delay model provided by the underlying floorplan. Simultaneous consideration of partitioning and retiming based on the geometric delay model enables us to hide global interconnect latency more effectively. We are currently trying to improve the cutsize and wirelength is the sum of half-perimeter of bounding box of nets. Runtime is in seconds.

References