ε-Optimal Minimum-Delay/Area Zero-Skew Clock Tree Wire-Sizing in Pseudo-Polynomial Time∗

Jeng-Liang Tsai, Tsung-Hao Chen
Electrical and Computer Engineering
University of Wisconsin-Madison
1415 Engineering Drive
Madison, WI 53706
{jlt, tchen}@cae.wisc.edu

Charlie Chung-Ping Chen ♦
Graduate Institute of Electronics Engineering &
Department of Electrical Engineering
National Taiwan University
Taipei 106, Taiwan
cchen@cc.ee.ntu.edu.tw

ABSTRACT

In 21st-Century VLSI design, clocking plays crucial roles for both performance and timing convergence. Due to their non-convex nature, optimal minimum-delay/area zero-skew wire-sizing problems have long been considered intractable. None of the existing approaches can guarantee optimality for general clock trees to the authors’ best knowledge. In this paper, we present an ε-optimal zero-skew wire-sizing algorithm, ClockTune, which guarantees zero-skew with delay and area within ε distance to the optimal solutions in pseudo-polynomial time. Extensive experimental results show that our algorithm executes very efficiently in both runtime and memory usage. For example, ClockTune takes less than two minutes and 35MB memory to size an industrial clock tree with 3101 sink nodes within 2% to the optimal solution on a 533MHz Pentium III PC. Our algorithm can also be used to achieve useful clock skew to facilitate timing convergence and to incrementally adjust clock tree for design convergence and explore delay/power tradeoffs during design cycles. ClockTune is available on the web [13].

Categories and Subject Descriptors
B.8.2 [Performance and Reliability]: Performance Analysis and Design Aids

General Terms
Algorithms

Keywords
Zero-skew, clock tree, wire-sizing, incremental refinement, ε-optimal, pseudo-polynomial

∗This work was partially supported by the National Science Foundation under grants CCR-0093309 and CCR-0204468.

†This work was done when the author was at department of Electrical and Computer Engineering, University of Wisconsin-Madison.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

ISPD’03, April 6–9, 2003, Monterey, California, USA.
Copyright 2003 ACM 1-58113-650-103/0004 ...$5.00.

1. INTRODUCTION

As the feature size keeps shrinking and clock frequency increasing, clock design has come to play a crucial role in determining the chip performance and facilitating timing and design convergence. First, clock skew directly affects chip performance in close to a one-to-one ratio since it has to be counted as cycle time penalty. Second, incremental clock tree adjustment enables fast design convergence by avoiding the potentially divergent design iterations [1]. Since designs are subjected to change on a daily basis, the clock trees need to be incrementally adjusted accordingly with minimum changes to ensure acceptable clock skew. Third, since interconnect delay dominates over gate delay, timing plans often cannot be met due to some physical effects. Recently, useful-skew [2] concepts have also been widely proposed to speed up time convergence in order to compensate for the timing uncertainties resulting from physical layout. From the above analysis, it is crucial to develop clock tuning algorithms that can balance clock skew with minimum adjustments. One of the most effective ways to adjust clock skew is through wire-sizing since it involves minimum routing modifications.

In [3], discrete wire-sizing for general routing-trees is assumed and a bottom-up dynamic programming approach is used to propagate optimal wire-sizing combinations toward the root node. The designer can then determine the wire-sizing by making a tradeoff between delay and power consumption. In [4] [5], the wire-sizing problem is formulated as optimization problems, in which the maximum delay of each sink node is constrained, and the delay can be minimized efficiently. These methods perform wire-sizing without modifying the clock routing, but do not guarantee zero-skew.

Recent works [6] [7] integrate wire-sizing into the Deferred-Merging Embedding (DME) algorithm [8], which allow a zero-skew clock tree to benefit from wire-sizing. However, the zero-skew property is maintained by moving the merging points, and the clock routing might be changed to accommodate the skew caused by design changes, which might affect the detail routing and there is no guarantee of optimality for these algorithms. In conclusion, there is no optimal wire-sizing algorithm which can guarantee the use of minimum power or area to simultaneously achieve minimum delay and zero-skew to the best of authors’ knowledge.

In this paper, we propose a novel wire-sizing algorithm,
ClockTune, which guarantees zero-skew with a given delay and power consumption solution within ε distance to the optimal solution in psuedo-polynomial time. ClockTune first calculates the feasible delay and capacitance load information for each node in a bottom-up fashion. After the desired delay and capacitance load is chosen from the feasible region, a wire-sizing is determined in a top-down fashion.

To achieve the ε-optimality we use sampling techniques to capture the whole design space and prune out infeasible solutions. Our algorithm takes pseudo-polynomial time, and experimental results show that our algorithm can efficiently re-size large clock trees. Although we focus on achieving zero-skew, ClockTune can also be used to achieve useful-skew to tackle timing problems.

The rest of this paper is organized as follows. In Section 2, we formulate the problem and introduce the framework of our algorithm. In Section 3, we derive the formulae and give more details of the algorithm. Section 4 contains the complexity and optimality analyses. Experimental results are shown in Section 6, and Section 7 is our conclusion.

2. PRELIMINARY

Assuming the Elmore delay under π-model [9], the wire-sizing problem is defined as below.

**Problem Definition 1. Minimum-Delay / Area Zero-Skew Wire-Sizing (min-ZSWS) Problem**

Given a clock tree T, find a set of feasible wire widths that achieve the target delay/area such that the zero-skew constraint is satisfied and the area/delay is also minimized.

Table 1 lists the notations used throughout this paper. In Table 1, T_v is a binary tree. However, if the wire length is allowed to be zero, then any tree structure can be represented as a binary tree.

| T_v | A clock tree with given routing rooted at node v
| v_l | The left child of node v
| v_r | The right child of node v
| e_v | The edge between node v and its parent node
| l_e | The length of edge e_v
| w(e_v) | The wire width of edge e_v
| r(e_v) | The resistance of wire e_v, r(e_v) = l_e / w(e_v)
| c(e_v) | The capacitance of wire e_v, c(e_v) = l_e w(e_v) c_0
| w_m(e_v) | Minimum feasible wire width for edge e_v
| w_M(e_v) | Maximum feasible wire width for edge e_v
| d_v | Elmore delay from node v to any leaf node in T_v
| c_v | Total down-stream capacitance seen at node v
| r_0 | Wire resistance per μm²
| c_0 | Wire capacitance per μm²
| w_m | Minimum wire width defined by user
| w_M | Maximum wire width defined by user

Table 1: Notations for the min-ZSWS problem

2.1 The concept of DC Region

For most general routing-tree wire-sizing problems, only the maximum delay is constrained. Due to the convex nature of these problems, bottom-up approaches that keep the lower-left portion of the delay-area tradeoff curve can solve these problems optimally. On the contrary, the min-ZSWS problem is non-convex and the whole design space has to be kept to solve the problem optimally.

A set of wire widths which makes the clock tree T_v zero-skew is an embedding. Each embedding determines a pair of delay value and total capacitance load value of T_v. The former relates to the clock skew due to process variation, and the latter is proportional to the dynamic power consumption, P_d = fC V^2.

The entire design space can be captured by mapping all embeddings on the D-C plane, the X-Y plane with delay value on Y-axis and capacitance load value on X-axis.

**Definition 1. DC Region**

The DC region of node v, Ω_v = {(d_v, c_v)}, is a set of points on the D-C plane which represents all feasible embeddings of T_v.

The DC region represents the solution space mapped on the D-C plane. Figure 1 gives examples for different types of DC regions. For level-2 and above nodes, the optimal solutions lie on the lower-left edge of the DC region. However, in dealing with the min-ZSWS problem, pruning out suboptimal solutions during bottom-up propagation might lead to failure in finding a solution at the root node. Moreover, incremental refinement is not possible if only optimal solution sets are preserved because the optimal solution sets are likely to become infeasible after design changes. Thus new algorithms that guarantee optimality and are able to achieve incremental refinement must be able to capture the whole DC region.

2.2 Algorithm overview

We propose a dynamic programming algorithm, ClockTune, to solve the min-ZSWS problem. ClockTune is composed of two phases. In the first phase, a bottom-up approach is used to obtain the DC regions of all nodes. In the second phase, a top-down approach determines the widths of all wire segments. The dynamic programming algorithm ClockTune() is given as a pseudocode in Algorithm 1, and Figure 2 illustrates its procedures.
3. $\varepsilon$-OPTIMAL ZERO-SKEW WIRE-SIZING ALGORITHM

In this section, the details of ClockTune are described. In 3.1, the procedures to obtain the DC regions of different types of nodes are examined. It is then followed by the top-down wire-sizing procedure in 3.2.

3.1 Bottom-up Phase

In this subsection, we first introduce the definition of branch DC region and the associated $\lambda$ operator to facilitate our discussion. The calculation of DC regions for each type of node is then followed.

Definition 2. Branch DC Region

The branch DC region of node $v$, $\Omega_v^+ = \{(d_v^+, c_v^+)\}$, is a set of points on the D-C plane which represents all feasible embeddings of $T_v^+ = \{v, T_v\}$ such that $d_v^+$ and $c_v^+$ are the Elmore delay and capacitance load seen at the root node of $T_v^+$. $\Omega_v^+$ can be obtained from $\Omega_v$ by the transformation $\varphi_v : \mathbb{R}^3 \rightarrow \mathbb{R}^2$ described as follows.

$$d_v^+ = d_v + \frac{k_v r_v w(e_v) c_0}{w(e_v)} \frac{1}{2} + c_v$$
$$c_v^+ = c_v + k_v w(e_v) c_0,$$
where $(d_v, c_v) \subset \Omega_v, w_m \leq w(e_v) \leq w_M$

Definition 3. $\lambda$ Operator

The DC region of $v$ can be generated from the branch DC regions of $v_i$ and $v_r$ by $\lambda$ operator. The operation is defined as follows:

$$\Omega_v = \Omega_{v_i}^+ \land \Omega_{v_r}^+$$

where $(d_v, c_v) \subset \Omega_v$ \iff $\exists(d_{v_i}^+, c_{v_i}^+) \subset \Omega_{v_i}^+, (d_{v_r}^+, c_{v_r}^+) \subset \Omega_{v_r}^+$

$$s.t. \ d_v = d_{v_i}^+ = d_{v_r}^+, c_v = c_{v_i}^+ + c_{v_r}^+.$$

The $\lambda$ operation combines all feasible embeddings with same delays on both subtrees. Therefore, the DC region of the root node can be obtained by recursion. To use more accurate delay models such as AWE [10] method, one can use a fudge factor approach to approximate the exact delay and capacitance load [11]. ClockTune_DC() is given in Algorithm 2 and the details on how to obtain the DC regions for each type of nodes are given in the following subsections.

3.1.1 Leaf Nodes with Zero-Skew and Useful-Skew Constraints

For a leaf node $v$, $c_v$ is the load capacitance and hence a constant. If zero-skew is desired, we can set $d_v$ to 0 for all leaf nodes. $\Omega_v = \{(d_v, c_v)\}$ is a single point on the D-C plane. In designing critical components, time borrowing between consecutive logic blocks is achieved by useful-skew. Useful-skews are often considered in the early design stage, however, they might also be used to fix the timing problems for free. For example, if there is a path timing failure but there are still timing budgets left in the prior or next stage, useful-skews can be introduced to fix the timing problem. Although our focus is on the min-ZSWS problem, ClockTune can accept arbitrary skew values by simply assigning a different $d_v$ to each leaf node and find an embedding if it exists. This is because ClockTune operates on DC regions, and whether the leaf nodes require zero-skews or useful-skews does not make any difference. ClockTune can also be extended to accept bounded-skew constraints, where $\Omega_v$ becomes a vertical segment. If intentional skew is desired, [2] [12] can be used for initial routing.

3.1.2 Level-1 Nodes

A level-1 node $v$ has two leaf children, $v_i$ and $v_r$. We know that

$$d_{v_i}^+ = d_v + \frac{k_v r_v w(e_v) c_0}{w(e_v)} \frac{1}{2} + d_{v_i}$$
$$d_{v_r}^+ = d_v + \frac{k_v r_v w(e_v) c_0}{w(e_v)} \frac{1}{2} + d_{v_r}$$

Observing zero-skew and wire width constraints, we have the following equations.

$$d_v = d_{v_i} \land d_{v_r}^+$$
$$w_m \leq w(e_v) \leq w_M$$
$$w_m \leq w(e_v) \leq w_M$$

From (1)(2)(3), we can derive the relation between $w(e_v)$ and $w(e_v)$.

$$w(e_v) = \frac{k_v r_v w_{e_v}}{(d_v - d_{v_r}) + \frac{k_v r_v w_{e_v}}{w_{e_v}} (t_{e_v} - t_{e_v}) + \frac{k_v r_v w_{e_v}}{w_{e_v}} (t_{e_v} - t_{e_v})}$$

Thus, $\Omega_v = \{(d_v, c_v)\}$ can be written with a single variable
to calculate and store the DC region for level-2 and above bounds of nodes. The first step is to find out the upper and lower delay. Equations (7) (8) with boundary conditions (11) (12) and always positive for $v$

$$d_v(w(e_{v_i})) = d_{v_i} + \frac{l_{v_i} r_0 c_0}{2} + \frac{l_{v_i} r_0 c_{v_i}}{w(e_{v_i})}, \quad (7)$$
$$c_v(w(e_{v_i})) = c_{v_i} + c_{v_j} + l_{v_j} w(e_{v_j}) c_0 + \frac{l_{v_j}^2 r_0 c_0}{2} \frac{c_{v_j}}{w(e_{v_i})}, \quad (8)$$

From (4) (5) (6), the range of $w(e_{v_i})$ and $w(v_{e_i})$ are

$$\max \left( \frac{l_{v_i} r_0 c_{v_i}}{w(e_{v_i})}, w_m \right), \quad (9)$$
$$\min \left( \frac{l_{v_i} r_0 c_{v_i}}{w(e_{v_i})}, w_m \right), \quad (10)$$

From (7) (8), the first derivative is always negative for $d_v$ and always positive for $c_v$, thus

$$d_v(w(e_{v_i})) \leq d_v \leq d_v(w_m(e_{v_i})), \quad (11)$$
$$c_v(w_m(e_{v_i})) \leq c_v \leq c_v(w(e_{v_i})). \quad (12)$$

Equations (7) (8) with boundary conditions (11) (12) completely capture $\Omega_v$.

### 3.1.3 Level-2 Nodes

![Figure 3: DC regions of Level-2 nodes](image)

As shown in Figure 3, the DC regions of level-2 nodes become more complex and are difficult to solve analytically. To overcome this problem, we use line sampling technique to calculate and store the DC region for level-2 and above nodes. The first step is to find out the upper and lower delay bounds of $\Omega_v$.

**Calculate Delay Range**

A Level-2 node $v$ has at least one level-1 child. Assuming $v_1$ is a level-1 node and $w_a = w(e_{v_1})$, $\Omega_{v_1}^+$ can be derived from (7)(8) as

$$d_{v_1}^+(w(e_{v_1}), w_a) = d_{v_1}(w_a) + \frac{l_{v_1} r_0 c_0}{2} + \frac{l_{v_1} r_0 c_{v_1}(w_a)}{w(e_{v_1})}, \quad (13)$$
$$c_{v_1}^+(w(e_{v_1}), w_a) = c_{v_1}(w_a) + l_{v_1} w(e_{v_1}) c_0. \quad (14)$$

Equation (13) describes a delay surface which is strictly decreasing along $w(e_{v_1})$ direction ($\partial d_{v_1}^+ / \partial w(e_{v_1}) < 0$), and is a convex function along $w_a$ direction ($\partial^2 d_{v_1}^+ / \partial w^2_a > 0$). The maximum delay could be at two corners as specified in (15). The minimum delay could be at two corners or the minimum of the convex function, if the minimum is within the defined rectangle, as specified in (16).

$$d_{v_1,max}^+ = \max \left( d_{v_1}^+(w_m, w_m(e_{v_1})), d_{v_1}^+(w_m, w_M(e_{v_1})) \right), \quad (15)$$
$$d_{v_1,min}^+ = \min \left( d_{v_1}^+(w_M, w_m(e_{v_1})), d_{v_1}^+(w_M, w_M(e_{v_1})) \right), \quad (16)$$

Equation (16) would lead to a fourth-order equation that can be solved using iterative equation solvers such as GNU Scientific Library. The delay range of $\Omega_{v_1}^+$ can be calculated in the same way, or by (2)(5) if $v_r$ is a leaf node.

**Level-2 Nodes with Two Level-1 Children**

For a level-2 node with two level-1 children, it is difficult to merge the branch DC regions described by complex equations with the $\Lambda$ operator. We choose to use line sampling technique to overcome the problem. By taking delay samples from $\{d_{v_i}^+ \cap d_{v_j}^+ \}$ and calculating the range of $c_{v_i}^+$, for each sample, $\Omega_v$ can be obtained by merging sampled $\Omega_v$ and $\Omega_{v_r}^+$.

The maximum and minimum of $d_v$ in $\Omega_v$, $d_v,min$ and $d_v,max$, can be obtained from 3.1.3. Assuming that the set of $p$ delay samples between $d_v,min$ and $d_v,max$ is $D_v = \{d_v^i, i = 1,...,p\}$, and substituting $d_v^i$ back to (13), we have

$$w(e_{v_i})^i = \frac{l_{v_i} r_0 c_{v_i}(w_a)}{d_v^i - d_v^i(w_a) - \frac{l_{v_i} r_0 c_{v_i}}{2}}. \quad (17)$$

Combined with (4) (14), the range of $c_{v_i}^{i+}$ can be derived. Again, we apply sampling technique on $w_a$ because the range is difficult to solve analytically.

Assuming that the set of $q$ wire width samples between $w_{m}(e_{v_i})$ and $w_{M}(e_{v_i})$ is $W_v = \{w_{a,j}, j = 1,...,q\}$, for each delay $d_{v_i}$, we substitute $w_{a,j}$ back to (13) and get

$$w(e_{v_i})^j = \frac{l_{v_i} r_0 c_{v_i}(w_{a,j})}{d_v^i - d_v^i(w_{a,j}) - \frac{l_{v_i} r_0 c_{v_i}}{2}}. \quad (18)$$

Substitute $w_{a,j}$ and $w(e_{v_i})^j$ into (14), we have

$$c_{v_i}^{i,j} = c_{v_i}(w_{a,j}) + c_{v_i}(w(e_{v_i})^j). \quad (19)$$

The intervals $c_{v_i}^{i,j}$ sweep through give the range of $c_{v_i}^{i+}$. If we sample the delay evenly, the sampled DC region can be written as

$$\Omega_v = \{ (d_v^i, c_{v_i}^+), i = 1,...,p \},$$

where

$$d_v^i = d_v,min + \frac{i - 1}{p - 1}(d_v,max - d_v,min) \quad (20)$$

$$c_{v_i,min}^+ + c_{v_i,min}^+ \leq c_{v_i}^+ \leq c_{v_i,max}^+ + c_{v_i,max}^+. \quad (21)$$

There might be more than one range for $c_{v_i}^{i+}$ and $c_{v_i}^{i-}$, and all combinations should be considered. However, the ranges will quickly merge together as the DC regions propagate toward the root node.

**Level-2 Nodes with One Leaf Child**

For level-2 nodes with level-1 child on the left and leaf child
on the right, $\Omega^+_{v_i}$ and $\Omega^-_{v_i}$ can be obtained by the method in 3.1.3, (2), (5) respectively. The merging procedure remains the same, except that $c_{v_i}^{\pm}$ can be solved directly.

3.1.4 Level-3 and Above Nodes

The merging process described in 3.1.3 can also be applied to level-3 and above nodes. Assuming that the DC region of the left child is in sampled form. By combining (1) and (20), we get

$$d_{v_i}^{+k} = d_{v_i}^e + \frac{l_{v_i}^2 r_{0} c_0}{2} + \frac{l_{v_i} r_{0} c_0}{w(e_{v_i})}, \quad k = 1, \ldots, p'$$

(21)

$$c_{v_i}^{+k} = c_{v_i}^e + k + l_{v_i} w(e_{v_i}) c_{u}, \quad k = 1, \ldots, p'$$

(22)

where $w_m \leq w(e_{v_i}) \leq w_M$ and $p'$ is the number of delay samples in $\Omega_{v_i,s}$. Thus,

$$d_{v_i,max}^+ = \max \left\{ d_{v_i}^{+k} \mid w(e_{v_i}) = w_m, \quad k = 1, \ldots, p' \right\}$$

(23)

$$d_{v_i,min}^+ = \min \left\{ d_{v_i}^{+k} \mid w(e_{v_i}) = w_M, \quad k = 1, \ldots, p' \right\}$$

(24)

and $d_{v_i,max}^+$, $d_{v_i,min}^+$ can be obtained in the same way. Taking a set of sampled delay, $D_v = \{d_{v_i}', i = 1, \ldots, p\}$, from $(d_{v_i}^e \cap d_{v_i}^s)$ and substituting $d_{v_i}'$ into (21), we have

$$w(e_{v_i})_{\pm} = \frac{l_{v_i} r_{0} c_0}{d_{v_i}' - d_{v_i}^e - \frac{l_{v_i}^2 r_{0} c_0}{2}}, \quad k = 1, \ldots, p'$$

(25)

Substituting (25) into (22), we get

$$c_{v_i}^{\pm k} = c_{v_i}^e + \frac{l_{v_i}^2 r_{0} c_0}{d_{v_i}' - d_{v_i}^e - \frac{l_{v_i}^2 r_{0} c_0}{2}}$$

(26)

Thus $c_{v_i}^{\pm k}$ can be calculated by overlapping the $p'$ intervals from (26), and $\Omega^+_{v_i,s} = \{(d_{v_i}^{+k}, c_{v_i}^{+k}), i = 1, \ldots, p\}$ is obtained. Using the same way to obtain $\Omega^-_{v_i,s}$, $\Omega_{v_i,s} = \Omega^+_{v_i,s} \cup \Omega^-_{v_i,s}$ can be obtained.

3.2 Top-Down Phase

The top-down phase is straightforward. We first select a pair of target delay and capacitance load values $(d_t, c_t)$ from $\Omega_r$. Since a point in $\Omega_r$ represents at least one embedding, it is guaranteed to find an embedding that achieves the specified target delay and capacitance load. The capacitance load is further divided into $c_{u_l}$ and $c_{u_r}$ such that $c_{u_l} + c_{u_r} = c_t$, $(d_t, c_{u_l}) \subset \Omega^+_{v_i}$, and $(d_t, c_{u_r}) \subset \Omega^-_{v_i}$. If $v_l$ is a leaf node, then $w(e_{v_l})$ is determined by (1). If $v_l$ is a level-1 node, the feasible range of $w(e_{v_l})$ can be obtained by (13). If $v_l$ is a level-2 or above node, then the DC region of $v_l$ is in sampled form. For each sample in $\Omega_{v_i,s}$, the range of $w(e_{v_l})$ can be obtained by (1). Once $w(e_{v_l})$ is chosen and the target delay and capacitance load of $\Omega_{v_i}$ are determined, we can proceed to determine the wire widths in $T_{v_l}$. Same approach applies to $v_r$. $\text{ClockTune}_{\text{Embed}}()$ is given in Algorithm 3.

4. OPTIMALITY AND COMPLEXITY

We use sampling techniques to calculate DC regions, and the embeddings that do not map on the sampled delay will
Property 1. Error Composition

\[ \xi(\Omega_{v,i,s}) \leq \xi_0(\Omega_{v,i,s}) + \xi_s(\Omega_{v,i,s}) + \xi_w(\Omega_{v,i,s}), \]

where \( \xi_0(\Omega_{v,i,s}) \) is the propagated error from \( \xi(\Omega_{v,i,s}) \); \( \xi_s(\Omega_{v,i,s}) \) is the delay sampling error, and \( \xi_w(\Omega_{v,i,s}) \) is the additional wire width sampling error after delay sampling.

\( \Omega_v \) is generated from \( \Omega_v^+ \) and \( \Omega_v^- \) by the \( \lambda \) operator, which does not introduce additional errors. Thus we have the following property.

Property 2. Error Bound

The error of a sampled DC region \( \Omega_{v,i,s} \) is bounded by \( \xi(\Omega_{v,i,s}) \leq \xi_0(\Omega_{v,i,s}) + \xi_s(\Omega_{v,i,s}), \)

### 4.1.1 Error Propagation

The first source of error comes from lower level nodes. Rewrite \( d_{vi}^l \) and \( c_{vi}^l \) into matrix form:

\[
\begin{bmatrix}
  d_{vi}^l \\
  c_{vi}^l
\end{bmatrix} =
\begin{bmatrix}
  1 & \frac{r_{vi}a}{w_{l}(e_{vi})} \\
  0 & 1
\end{bmatrix}
\begin{bmatrix}
  d_{vi} \\
  c_{vi}
\end{bmatrix} +
\begin{bmatrix}
  \frac{\partial^2 r_{coa}}{l_{vi} w_{l}(e_{vi}) c_{0}} \\
  \frac{r_{vi}a}{l_{vi} w_{l}(e_{vi}) c_{0}}
\end{bmatrix}
\]

(27)

The spectral norm of the matrix is

\( \lambda_{vi} = 1 + \frac{l_{vi} r_{0a}}{2 w_m} \left( l_{vi} r_{0a} \frac{w_m(e_{vi})}{w_m} + \sqrt{4 + \left( l_{vi} r_{0a} \frac{w_m(e_{vi})}{w_m} \right)^2} \right). \)

(28)

\( \xi_0(\Omega_{v,i,s}) \) being the error propagated from lower level DC regions, we have

\[ \xi_0(\Omega_{v,i,s}) \leq \lambda_{vi} \xi(\Omega_{v,i,s}). \]

(29)

The upper bound of \( \xi_l(\Omega_{v,i,s}) \) is determined by the error in the lower level and \( \lambda_{vi} \). As the wire length increases, \( \lambda_{vi} \) also increases.

### 4.1.2 Sampling Error of Level-2 DC Regions

The other sources of error come from the sampling process. By fixing \( w_m(e_{vi}) \leq w_a \leq w_M(e_{vi}) \), (13) (14) form a subset of \( \Omega_{v,i} \). The difference between the maximum and minimum \( d_{vi}^l \) of a given \( w_a \) is lower bounded by \( d_{vs} = l_{vi} r_{0c_{oa}}(w_m(e_{vi}))(\frac{1}{w_m} - \frac{1}{w_m}) \). If we make the delay sampling interval \( \delta_d < d_{vs} \), at least one point in each subset of \( \Omega_{v,i,s} \) is on \( \Omega_{v,i,s} \). Due to the convex property of subsets of \( \Omega_{v,i,s} \), for every \( (d_s, c_s) \in \Omega_{v,i,s} \), we can find a \( (d_s, c_s) \in \Omega_{v,i,s} \) such that \( |d_s - d_a| < \delta_d \) and \( |c_s - c_a| < \delta_c \), where

\[
\delta_c = \max \left\{ \left| \frac{\partial c_{vi}^l}{\partial d_{vi}^l} (w(e_{vi}), w_a) / \partial w(e_{vi}) \right| \right\}
\]

\[
= \frac{c_{vi}^l}{r_{o_{vi}c_{oa}}(w_m(e_{vi}))}.
\]

Thus the delay sampling error is bounded by

\[ \xi_d(\Omega_{v,i,s}) \leq \delta_d \sqrt{1 + \left( \frac{c_{vi}^l}{r_{o_{vi}c_{oa}}(w_m(e_{vi}))} \right)^2} = \delta_d k_{dv}, \]

(31)

For delay sample \( d_{vi}^l \), substitute (17) back to (14), we have

\[
c_{vi}^l(w_a) = c_{vi}^l(w_a) + l_{vi} r_{0c_{oa}}(w_a) l_{vi} - d_{vi}^l(w_a) - \frac{l_{vi} r_{0c_{oa}}}{2}.
\]

(32)

The error caused by wire width sampling with sampling interval \( \delta_w \) is bounded by

\[
\xi_w(\Omega_{v,i,s}) \leq \max \left\{ \left| \frac{\partial c_{vi}^l}{\partial w_a} (w_a) \right| \right\} \delta_w \]

\[
= \delta_w k_{wv}, \]

(33)

For a given clock routing, \( k_{wv} \) is a constant determined by \( \epsilon_{vij}, c_{vij}, l_{vij}, l_{vij}, \) and \( l_i \). For level-2 nodes, no error is propagated from either child, and \( \xi(\Omega_{v,i}) \leq \delta_d(k_{dv} + k_{dv}) + \delta_w(k_{wv} + k_{wv}). \)

### 4.1.3 Sampling Error at Level-3 and above Nodes

The analyses of the error for level-2 nodes and the error propagation in a clock tree can be used to measure the sampling error at level-3 and above nodes. Figure 5 illustrates the process of obtaining sampled DC-regions at level-3 and above nodes.

Figure 5: The process of obtaining sampled DC-regions at level-3 and above nodes

The upper bound of \( \xi_l(\Omega_{v,i,s}) \) is determined by the error in the lower level and \( \lambda_{vi} \). As the wire length increases, \( \lambda_{vi} \) also increases.

The ClockTune Algorithm is a \( \epsilon \)-optimal

\[ \xi(\Omega_{v,s}) \leq \delta_d(k_{dv} + k_{dv}) + \delta_w(k_{wv} + k_{wv}) + \lambda_{v} \xi(\Omega_{v,s}) + \lambda_{v} \xi(\Omega_{v,s}) \]

(35)

The proof of the upper bound of the total sampling error of the DC region at root node \( v \) can be obtained by the following recursion inequality:

\[ \xi(\Omega_{v,s}) \leq \delta_d(k_{dv} + k_{dv}) + \delta_w(k_{wv} + k_{wv}) + \lambda_{v} \xi(\Omega_{v,s}) + \lambda_{v} \xi(\Omega_{v,s}) \]
By reducing the sampling intervals for delay and wire width at every node to one kth of the original values, the worst-case total sampling error can also be reduced to one kth of the value. The sampling error can be reduced to an arbitrarily small value; the algorithm is ϵ-optimal.

4.2 Complexity Analysis

Assuming a clock tree \( T_n \) has \( n \) nodes, and we always take \( p \) samples for delay and \( q \) samples for wire width. In the bottom-up phase, we need \( O(1) \) time to construct the DC regions for leaf and level-1 nodes. Level-2 nodes require \( O(pq) \) time for delay and wire width sampling. The other nodes need \( O(p^2) \) time to combine at most \( p \) ranges for each delay sample. Thus, the complexity for the bottom-up phase is \( O(max(p,q)pn) \). In the top-down phase, each wire width can be determined in \( O(p) \) time, and the complexity is \( O(pn) \). The overall runtime complexity is \( O(max(p,q)pn) \). Since we only need to store the maximum and minimum values of the capacitance load of each delay sample, the memory requirement is \( O(pn) \). From \( (35) \), the required sample numbers \( p, q \) in achieving the specified accuracy is determined by some polynomial of the input parameters, thus \( \text{ClockTune} \) takes pseudo-polynomial runtime and memory usage.

5. EXPERIMENTAL RESULTS

We implement our algorithm in C++ and run the program on a 128MB 533Mhz Pentium III PC. The benchmarks r1-r5 are taken from [9]. In this section, we investigate the runtime and optimality tradeoff and determine the sampling rate for reasonable accuracy. An example that demonstrates the incremental refinement capability of our algorithm is followed.

5.1 Optimality and Runtime Tradeoff

![Figure 6: (a) Runtime and (b) memory usage with difference number of samples](image)

![Figure 7: (a) Minimum delay and (b) minimum load error](image)

We set \( p, q \) to 128, 256, 512, 1024 in each run and record the minimum delay and capacitance load of the DC regions. All simulations use \( r_0 = 0.03 \), \( c_0 = 2 \times 10^{-16}/\mu m^2 \), and acceptable wire width from 1μm to 4μm. We choose the BB+DME [8] algorithm for routing generation, thus the capacitance load of the initial routing is the global optimal. The optimal delay is estimated by nonlinear curve fitting using \( (35) \). Figure 6 shows the runtime and memory usage, and Figure 7 shows relative errors between minimal and optimal delays and loads. The results show that \( \text{ClockTune} \) is very efficient. For r5, taking 256 samples is sufficient to find an embedding within 1% of the optimal solution and the runtime is less than 6 minutes. For 128 samples, the maximum error is less than 2% and the runtime is further reduced to 1.2 minutes while consuming only 34 MB of memory.

Figure 8(a) shows the DC region of r5 with \( p, q = 256 \), and Figure 8(b) with \( p, q = 1024 \). When \( p, q = 256 \), the errors are below 1%, thus \( p, q \) are fixed to 256 for the rest of the simulations. Note that all delay values are the Elmore delay values multiplied by \( l n 2 \).

Table 2 lists the delay and capacitance load of initial routings and minimum delay/load embeddings by \( \text{ClockTune} \). On average, \( \text{ClockTune} \) achieves 3.3X improvement on delay with only 22% increase on wire capacitance. All embeddings are verified by SPICE and the maximum skew is 12ps.

5.2 Incremental Refinement

![Figure 8: The DC regions of r5 with (a) \( p, q = 1024 \) (b) \( p, q = 256 \)](image)

![Figure 9: Routing examples (a) initial routing (b) incremental refinement for local changes](image)

We will now proceed to show how incremental refinement is achieved by \( \text{ClockTune} \). Figure 9(a) shows a clock routing with 16 sink nodes. The numbers in the circles are capacitance values in \( fF \). Wire width and length are in μm. The \( dc \)-pairs followed by nodes A-G are the delay values in \( 10^{-12}s \) and capacitance load values in \( 10^{-12}F \) of the current embedding.
Table 2: Comparison of initial routings and minimum delay/load embeddings. The gains are measured by initial delays divided by minimum delays.

Figure 9(b) shows three changes in the upper-right portion of the routing: one of the sink capacitance and the lengths of two wire segments connecting to sink nodes are increased. We first re-construct the DC regions for the new clock routing. Since the dc-pair at node C does not reside in its new DC region, we could not find an embedding for TC to achieve the same delay and capacitance load. However, the dc-pair at node B does fall in its new DC region, we are able to find an embedding for TB that yields the same dc-pair while fixing the wire width of BD. By choosing such an embedding, only BC and the wire segments in TC need to be re-sized, and the rest of the clock routing were left intact.

If the designer chooses an optimal or near optimal embedding for the clock routing in the first place, the dc-pairs of the original embedding are more likely to fall out of the new DC regions as design changes occur. The changes might result in re-sizing many wire segments in order to re-balance the clock routing. In these cases, it might be desirable to choose a sub-optimal embedding for a subtree of the clock routing if future modifications are likely to occur. Our algorithm preserves the whole design space on the D-C plane and allows the designers to choose either optimal or sub-optimal embeddings depending on design requirements.

6. CONCLUSION AND FUTURE WORK

We present an $\epsilon$-optimal zero-skew clock tree wire-sizing algorithm, ClockTune. The algorithm is guaranteed to find an $\epsilon$-optimal embedding for the target delay or capacitance load requirement. For acceptable wire width ranging from 1\textmu m to 4\textmu m, the algorithm achieves 3.3X improvement in delay with only 22% increase in wire area over the BB+DME algorithm. The algorithm is also capable of re-balancing the clock routing by local refinement should design changes occur. Moreover, the algorithm only takes pseudo-polynomial runtime and memory usage.

We plan to incorporate wire-sizing and buffer-insertion simultaneously into our algorithm framework. We will also investigate the inductance effect and adopt more accurate delay models in our future work.

7. REFERENCES