# Low Area Pipelined Circuits by Multi-clock Cycle Paths and Clock Scheduling 

Bakhtiar Affendi Rosdi, Atsushi Takahashi<br>Department of Communications and Integrated Systems, Tokyo Institute of Technology<br>2-12-1-S3-58 Ookayama, Meguro-ku, Tokyo 152-8552 Japan<br>Tel:+81-3-5734-2665 Fax:+81-3-5734-2902<br>E-mail:\{fendi, atushi\}@lab.ss.titech.ac.jp


#### Abstract

A new algorithm is proposed to reduce the number of intermediate registers of a pipelined circuit using a combination of multi-clock cycle paths and clock scheduling. The algorithm analyzes the pipelined circuit and determines the intermediate registers that can be removed. An efficient subsidiary algorithm is presented that computes the minimum feasible clock period of a circuit containing multi-clock cycle paths. Experiments with a pipelined adder and multiplier verify that the proposed algorithm can reduce the number of intermediate registers without degrading performance, even when delay variations exist.


## I. Introduction

The sustained progress of VLSI technology has led to increasing wire delays, shrinking clock period and growing chip size. Circuit pipelining is one technique that has been used in order to shrink the clock period. Pipelining is a method in which a circuit is divided into a small number of stages and intermediate registers are inserted between stages to store the intermediate data. With this method, extra circuit area is required to situate the additional intermediate registers and the size of the clock tree is also increased.

Recently, to overcome this problem, several studies have been carried out on wave pipelining [1], which is a method of speeding up the circuit without the insertion of intermediate registers. However, wave pipelining requires tighter timing constraints. In wave pipelining, there may exist a number of 'waves' of data in a circuit at any given time. Therefore, to avoid data collisions, delay balancing is required, which increases the circuit area.

This paper presents a new algorithm to reduce the number of intermediate registers of a pipelined circuit by using a combination of multi-clock cycle paths and clock scheduling. A multi-clock cycle path is a path from register to register where data transmission takes more than one clock period. Note that in wave pipelining, all paths are multi-clock cycle paths. Introducing a multi-clock cycle path into a pipelined circuit allows some intermediate registers to be removed. However, as mentioned above, certain timing constraints must be satisfied. Therefore, there is a tradeoff between area reduction from register removal and area increase from delay balancing. Clock scheduling is a technique in which the clock skew of a register is intentionally introduced to improve circuit performance by relaxing the timing constraints. Using clock scheduling, more intermediate registers can be removed,
without the need for delay balancing.
The minimum feasible clock period in terms of clock scheduling is obtained by linear programming [2], by graph-theoretic approaches with binary search $[3,4]$, or by graph-theoretic approaches without binary search [5]. Graph-theoretic approaches are based on construction of a constraint graph that represents various constraints and which can handle a circuit of practical size. The constraints are feasible if and only if the constraint graph contains no negative cycle. In graph-theoretic approaches with binary search [4], the Bellman-Ford shortest path algorithm is used to decide whether the graph contains a negative cycle and a simple negative cycle detection method is employed to increase speed. The algorithm proposed in [4] is for a circuit that contains single-clock cycle paths only. However, when the algorithm [4] is applied to a circuit containing multi-clock cycle paths, there are some cases for which the minimum feasible clock period cannot be determined. The clock period for such a circuit is bounded above, unlike the situation for a circuit containing only single-clock cycle paths. This range of feasible clock periods has to be taken into account in clock-schedule design.

In this paper, we initially discuss the constraints on a circuit containing multi-clock cycle paths. These constraints take into account the range of feasible clock periods required to make the circuit tolerant of clock jitter. Using the constraints, we enhance the algorithm in [4] to find the minimum feasible clock period of a circuit that contains multi-clock cycle paths. The enhanced minimum clock-period algorithm uses the existence of a negative cycle to narrow the binary search interval efficiently. A negative cycle exists whenever the constraints are infeasible. Then, we propose an algorithm to reduce the number of intermediate registers of a pipelined circuit by introducing multi-clock cycle paths with clock scheduling. In the proposed algorithm, all intermediate registers of the pipelined circuit are initially removed. Then the minimum feasible clock period of the resulting circuit is computed by the proposed minimum clock-period algorithm. If this value is too high, i.e. greater than the target minimum clock period, then intermediate registers are repeatedly inserted into the multi-clock cycle paths until the minimum feasible clock period has been sufficiently reduced.

Experiments with a pipelined adder and multiplier verify that, given a particular target clock-period range, the proposed algorithm can reduce the number of intermediate registers, even when delay variations are present.

## II. Preliminaries

We consider a circuit with a single clock consisting of registers linked by combinatorial circuits. The clock timing $s(v)$ of register $v$ is the difference in clock signal arrival time between $v$ and an arbitrarily chosen (perhaps hypothetical) reference register. The set of clock timings is called a clock-schedule.

We make the basic assumption that a circuit works correctly if the following two types of constraint are satisfied for each register pair with signal propagation $[2,6]$ :

## Setup Constraint

$$
\begin{equation*}
s(u)-s(v) \leq \beta_{u, v} T-d_{\max }(u, v) \tag{1}
\end{equation*}
$$

## Hold Constraint

$$
\begin{equation*}
s(v)-s(u) \leq d_{\min }(u, v)-\alpha_{u, v} T \tag{2}
\end{equation*}
$$

where $T$ is the clock period, $d_{\max }(u, v)\left(d_{\min }(u, v)\right)$ is the maximum (minimum) propagation delay from register $u$ to register $v$ along a combinatorial circuit, and $\beta_{u, v}$ and $\alpha_{u, v}$ are given constants ( $\beta_{u, v}>\alpha_{u, v} \geq 0$ ). Note that for a pair of registers with a single-clock cycle path, $\beta_{u, v}$ and $\alpha_{u, v}$ are given by 1 and 0 , respectively. This formulation is sufficiently general to deal with multi-clock cycle paths, a mixture of positive-edge and negative-edge registers, latch based circuitry, and multi-clocks that have different periods.

If $\alpha_{u, v}$ is 0 for every pair, the feasible clock period has no upper bound, i. e. if the clock period $T$ is feasible then any $T^{\prime}$ ( where $T^{\prime} \geq T$ ) is feasible. However, the feasible clock period is bounded above if $\alpha_{u, v}$ is not 0 for some pair ( $u, v$ ).

From the above constraints, when the clock schedule and the signal propagation delay are known, the minimum and maximum feasible clock period, $T_{\min }$ and $T_{\max }$, can be determined from the setup and hold constraints, respectively.

If the clock timing is not fixed, then $T_{\text {min }}$ and $T_{\text {max }}$ depend on each other. $T_{\text {min }}$ has to be minimized under the constraint that the circuit works correctly throughout a certain clock-period range, in order for the circuit to tolerate clock jitter. The above constraints become:

## Setup Constraint

$$
\begin{equation*}
s(u)-s(v) \leq \beta_{u, v} T_{\min }-d_{\max }(u, v) \tag{3}
\end{equation*}
$$

## Hold Constraint

$$
\begin{equation*}
s(v)-s(u) \leq d_{\min }(u, v)-\alpha_{u, v} T_{\min }-\alpha_{u, v} \delta \tag{4}
\end{equation*}
$$

where $\delta$ is the clock-period range, i.e. $\delta=T_{\max }-T_{\min }$. Therefore if $\delta$ is given, then, by using the above constraints, clock timings can be determined so that the circuit works correctly for a clock period between $T_{\text {min }}$ and $T_{\min }+\delta$. In the following, our target is to minimize $T_{\min }$ under the constraint that the circuit is feasible throughout the given clock-period range, i.e. that constraints (3) and (4) hold.

These constraints are represented by the constraint graph $G(V, E)$ of the circuit, which is defined as follows: a vertex $v \in V$ corresponds to a register; a directed edge $(u, v) \in E$ corresponds to either type of constraint; an edge ( $u, v$ ) corresponding to the setup (hold) constraint


Fig. 1. (a) Constraint graph $G^{1}$. (b) Circuit graph $C G^{1}$ (containing multi-clock cycle paths). (c) Min clock-period computation by the algorithm shown in [4]. (d) Min clock-period computation by proposed algorithm.
is called a Z-edge (D-edge), and the weight $w(u, v)$ of $(u, v)$ is $\beta_{u, v} T-d_{\max }(u, v)\left(d_{\min }(u, v)-\alpha_{u, v} \delta-\alpha_{u, v} T\right)$. The constraint graph $G$ corresponding to clock period $t$ is denoted by $G_{t}$.

In a constraint graph $G$, for any cycle $C$, the cycle weight $w(C)$ is the sum of edge weights over the cycle. The cycle weight can be expressed as $k T+w$, where T is the clock period and $k$ and $w$ are constants.

Definition 1 In the constraint graph $G$, a cycle $C$ for which $w(C)=k T+w$ is said to be of type 0, type $\boldsymbol{P}$, or type $M$, as $k=0, k>0$, or $k<0$, respectively.

Theorem 1 Let $C$ be a negative cycle in the constraint graph $G_{t}$. If $C$ is of type $\mathbf{0}$, then for any $t^{\prime}$, there exists a negative cycle in the constraint graph $G_{t^{\prime}}$.

Theorem 2 Let $C$ be a cycle in the constraint graph $G$ such that $w(C)=k T+w$. If $C$ is of type $\boldsymbol{P}$, then for any $t<w / k$, there exists a negative cycle in the constraint graph $G_{t}$, whereas, if $C$ is of type $M$, then for any $t>$ $-w / k$, there exists a negative cycle in the constraint graph $G_{t}$.

Definition 2 Let $C$ be a cycle in the constraint graph $G$ such that $w(C)=k T+w$. Then $\operatorname{Bound}(C)=w / k,-w / k$, or $\infty$, according to whether $C$ is of type P, type M, or type 0 , respectively.

Example: For the constraint graph $G^{1}$ shown in Fig. 1 (a), the cycle $C_{1}=(u, w 2, v 2, u)$ with $w\left(C_{1}\right)=$ $4 T-20$ is of type $\mathbf{P}$ and $\operatorname{Bound}\left(C_{1}\right)=5$. The cycle $C_{2}=(u, v 1, w 1, u)$ with $w\left(C_{2}\right)=-T+6$ is of type $\mathbf{M}$ and $\operatorname{Bound}\left(C_{2}\right)=6$.

Note that, in a constraint graph of a circuit that contains just single-clock cycle paths, only type $\mathbf{P}$ and type $\mathbf{0}$ cycles can exist, whereas in a constraint graph of a circuit that contains multi-clock cycle paths, all three cycle types can be present.

## III. Minimum Feasible Clock Period

## A. A circuit that contains just single-clock cycle paths

The minimum feasible clock period of a circuit that contains just single-clock cycle paths can be determined by graph-theoretic approach with binary search [4]. The maximum signal propagation delay from a register to the same register gives a lower bound of feasible clock period.


Fig. 2. (a) Constraint graph $G^{2}$. (b) Circuit graph $C G^{2}$ (containing just single-clock cycle paths). (c) Min clock-period computation by the algorithm shown in [4].

The difference of the maximum and minimum signal propagation delay from a register to another register gives also a lower bound of feasible clock period. They adopt the larger of these two lower bounds as an initial lower bound $L$ of the binary search. They adopt the maximum signal propagation delay between registers as an initial upper bound $U$ since it gives a feasible clock period even in zero clock-skew framework.

In the algorithm [4], the initial lower bound $L$ and upper bound $U$ are initially checked. If $L$ is feasible the algorithm is stopped and output $L$ as the minimum feasible clock period. While, if $L$ is infeasible, $U$ is checked to confirm there is no negative cycle of type $\mathbf{0}$. If there exists a negative cycle of type $\mathbf{0}$, the circuit is infeasible and the algorithm is stopped. If $U$ is feasible, the algorithm does binary search by adjusting the lower and upper bounds to determine the minimum feasible clock period.

Using the algorithm shown in [4] let us determine the minimum feasible clock period of the circuit shown in Fig. 2 (b). Note that, throughout this paper, the precision used is 1 . In this example, initial lower bound $L=2$ and initial upper bound $U=6$. So, the algorithm does binary search between 2 and 6 as follows:

Check $L(L=2): G_{2}^{2}$ is infeasible, cycle $C_{1}=(u, w, v, u)$ with $w\left(C_{1}\right)=3 T-10$ is negative, so next step is check $U$.

Check $U(U=6): G_{6}^{2}$ is feasible, so next step is check $M=(U+L / 2)$.
Check $M(M=4=(6+2 / 2))$ : $G_{4}^{2}$ is feasible, so 4 becomes new upper bound $U$.
Check $M(M=3=(4+2 / 2))$ : $G_{3}^{2}$ is infeasible, cycle $C_{1}=(u, w, v, u)$ with $w\left(C_{1}\right)=3 T-10$ is negative. Since $U-L=1$, output 4 as the minimum feasible clock period $T$.

The flow when we apply the algorithm is shown in Fig. 2 (c).

## B. A circuit that contains multi-clock cycle paths

For a circuit that contains just single-clock cycle paths, if the circuit is feasible then the circuit is feasible at the initial upper bound $U$, otherwise the circuit is infeasible. However, for a circuit that contains multi-clock cycle paths, even if the circuit is infeasible at initial upper bound $U$, there are some possibilities that the circuit is feasible at clock period $t(t<U$ or $t>U)$.

Input: circuit graph $C G$, target clock-period range $\delta$
Output: minimum feasible clock period $T$.

1. Construct constraint graph $G$ by $C G$ and $\delta$.
2. $L_{\text {self }}:=\max _{(u, u) \in E_{h o l d}} d_{\max }(u, u)$.
$L_{\text {diff }}:=\max _{(u, v) \in E_{h o l d}}\left(d_{\max }(u, v)-d_{\min }(u, v)\right)$.
$L:=\max \left\{L_{\text {self }}, L_{\text {diff }}\right\}$.
$L:=\max _{(u, v) \in E_{\text {hold }}}\left\{\left(d_{\max }(u, v)+s(u)-s(v)\right) / \beta_{u, v}\right\}$.
3. Check whether $G_{L}$ is feasible.
if there is no negative cycle in $G_{L}$ return $L$.
else if there exists a negative cycle $C$ of type 0 or type M return $\infty$.
4. Check whether $G_{U}$ is feasible.
if there exists a negative cycle $C$
case $C$ is of type 0 return $\infty$.
case $C$ is of type P then repeat the following
$L:=\operatorname{Bound}(C)$.
if there is no negative cycle in $G_{L}$ return $L$.
else if there exists a negative cycle $C^{\prime}$
if $C^{\prime}$ is of type 0 or type M return $\infty$.
else $C \leftarrow C^{\prime}$. case $C$ is of type M then $U:=\operatorname{Bound}(C)$ and if $U<L$ return
$\infty$.
5. While $(U-L>\varepsilon)$ do $M:=(U+L) / 2$. check whether $G_{M}$ is feasible if there is no negative cycle in $G_{M}$ then $U:=M$.
else let $C$ be the negative cycle.
case $C$ is of type 0 return $\infty$.
case $C$ is of type P then
$L:=$ Bound $(C)$.
if there is no negative cycle in $G_{L}$ return $L$.
else if there exists a negative cycle $C^{\prime}$ of type 0 or type M return $\infty$. case $C$ is of type M then $U:=\operatorname{Bound}(C)$. endwhile.
6. $T:=U$. return $T$.

Fig. 3. Minimum feasible clock period algorithm of the circuit that contains multi-clock cycle paths.

When the initial upper bound $U$ is infeasible, the algorithm [4] concludes that the circuit is infeasible at any clock period and stop. For example, let us determine the minimum feasible clock period of the circuit shown in Fig. 1 (b). Initial lower bound $L=4$ and initial upper bound $U=8$. So, the algorithm does binary search between 4 and 8 as follows:

Check $L(L=4): G_{4}^{1} \quad$ is infeasible, cycle $C_{1}=$ $(u, w 2, v 2, u)$ with $w\left(C_{1}\right)=4 T-20$ is negative, so next step is check $U$.

Check $U(U=8)$ : $G_{8}^{1} \quad$ is infeasible, cycle $C_{2}=$ $(u, v 1, w 1, u)$ with $w\left(C_{2}\right)=-T+6$ is negative and the algorithm is stopped.

The flow when we apply the algorithm is shown in Fig. 1 (c). As you can see from the above example, the algorithm [4] is stopped after checking the initial upper bound $U$ and concludes that the circuit is infeasible at any clock period.
However as we mentioned earlier, the conclusion is correct for a circuit that contains just single-clock cycle paths, while for the circuit that contains multi-clock cycle paths the conclusion might be incorrect. Therefore, the above approach might miss the minimum feasible clock period. In fact, in this case, the algorithm cannot determine the minimum feasible clock period which is 5 .

We enhance the algorithm that has been introduced in [4] to determine the minimum feasible clock period of a circuit that contains multi-clock cycle paths. The algorithm does binary search between lower and upper bounds


Fig. 4. (a) If $C$ is of type $\mathbf{P}$ then $L=\operatorname{Bound}(C)$ and check $L$. (b) If $C$ is of type $\mathbf{M}$ then $U=\operatorname{Bound}(C)$.
same as in the algorithm shown in [4]. We extend the algorithm [4] by introducing checking the type of cycle when there exists a negative cycle in the constraint graph. If the circuit is infeasible at given clock period, there exists a negative cycle. The lower and upper bounds are adjusted based on the type of negative cycle and the Bound value.

The new algorithm to determine the minimum feasible clock period of a circuit that contains multi-clock cycle paths is shown in Fig. 3.

For the initial value of lower bound $L$ and upper bound $U$ of the binary search, we adopt the same approach as in the algorithm shown in [4]. Initial lower bound $L$ will be checked whether it is feasible or not, if $L$ is feasible, then output $L$ as the minimum feasible clock period. Otherwise, there exists a negative cycle $C$. If $C$ is of type $\mathbf{0}$ or type M, the circuit is infeasible and the algorithm is stopped. While, if $C$ is of type $\mathbf{P}$ then an initial upper bound $U$ will be checked whether it is feasible or not.

If the initial upper bound $U$ is feasible, then the algorithm does binary search to determine the minimum feasible clock period. Otherwise, there exists a negative cycle $C$. In case $C$ is of type $\mathbf{0}$, the circuit is infeasible and the algorithm is stopped. In case $C$ is of type $\mathbf{P}$, Bound $(C)$ is our new lower bound $L$ and $L$ will be checked whether it is feasible or not. If our new lower bound $L$ is feasible then output $L$ as the minimum feasible clock period. Otherwise, repeat until $L$ is feasible or $C$ is of type $\mathbf{0}$ or type M, where the circuit is infeasible and the algorithm is stopped. In case $C$ is of type M, $\operatorname{Bound}(C)$ is our new upper bound $U$ and the algorithm will check whether $U<L$ or not. If $U<L$, then the circuit is infeasible and the algorithm is stopped. Otherwise, our algorithm does binary search by adjusting the lower and upper bounds to determine the minimum feasible clock period.

In binary search, the algorithm will check whether the constraint graph $G_{M}(M=(U+L) / 2)$ containing any negative cycle or not. If there are no negative cycles in $G_{M}$, then $M$ is our new upper bound $U$ and continue do binary search. Otherwise, if there exists a negative cycle in $G_{M}$ then the algorithm will check the type of it. In case $C$ is of type $\mathbf{0}$, the circuit is infeasible and the algorithm is stopped. From Theorem 2, in case $C$ is of type $\mathbf{P}$ then $\operatorname{Bound}(C)$ is our new lower bound $L$ and $L$ will be checked whether it is feasible or not (Refer Fig. 4 (a)). If our new lower bound $L$ is feasible then output $L$ as the minimum feasible clock period, otherwise, continue do binary search. In case $C$ is of type $\mathbf{M}$ then $\operatorname{Bound}(C)$ is our new upper bound $U$ (Refer Fig. 4 (b)), and continue do binary search.

Using the proposed algorithm, let us find the minimum feasible clock period of the circuit shown in Fig. 1 (b). Our target clock-period range $\delta$ is 0 . Initial lower bound $L=4$ and initial upper bound $U=8$. So, the algorithm
does binary search between 4 and 8 as follows:
Check $L(L=4)$ : $G_{4}^{1} \quad$ is infeasible, cycle $C_{1}=$ $(u, w 2, v 2, u)$ with $w\left(C_{1}\right)=4 T-20$ is negative and of type $\mathbf{P}$. So next step is check $U$.

Check $U(U=8): G_{8}^{1} \quad$ is infeasible, cycle $\quad C_{2} \quad=$ $(u, v 1, w 1, u)$ with $w\left(C_{2}\right)=-T+6$ is negative and of type $\mathbf{M}$ and $\operatorname{Bound}\left(C_{2}\right)=6$, therefore 6 is our new $U$.

Check $M(M=5=(6+4 / 2))$ : $G_{5}^{1}$ is feasible. Since $U-$ $L=1$, output 5 as the minimum feasible clock period $T$.

The flow when we apply the algorithm is shown in Fig. 1 (d). The algorithm can determine the minimum feasible clock period of the circuit which is 5 .

## IV. Reduction on the number of intermediate REGISTERS

In this paper we consider a problem on how to reduce the number of intermediate registers of a pipelined circuit, subject to the minimum feasible clock period is lower than or equal to the original circuit and works at target clockperiod range

In the proposed algorithm, all intermediate registers of the pipelined circuit are initially removed. Then the minimum feasible clock period is computed using the algorithm shown in Fig. 3. When the intermediate registers are removed, the minimum delay is reduced compared with the original circuit. Therefore, the hold constraints which correspond to the D-edges in the cycles become tight. If the circuit is infeasible at any clock period, there exists a negative cycle in the constraint graph. As we mentioned earlier, the hold constraints are the reason why the timing constraints become tight. Therefore to make the circuit feasible at target clock-period range, an intermediate register which corresponds to a D-edge contained in the negative cycle is inserted to the multi-clock cycle path.

Our target is to get a circuit, where the minimum feasible clock period is lower than or equal to the original circuit. If the minimum feasible clock period of the obtained circuit is higher than the original circuit, the minimum feasible clock period needs to be reduced. Our proposed algorithm find a critical cycle in the constraint graph that decide the minimum feasible clock period of the obtained circuit. Then, in order to reduce the minimum feasible clock period, an intermediate register which corresponds to a D-edge contained in the critical cycle is inserted to the multi-clock cycle path. The details of our proposed algorithm is omitted here due to the space constraint.

To explain the behavior of the algorithm, we implemented the algorithm to the pipelined circuit shown in Fig. 5. In this example, the timing of each I/O pin is fixed at 0 , while the timing of each register is scheduled. We also assume that Setup and Hold Time for registers are 0 and the minimum and maximum delay of the intermediate registers are 1 and 2 , respectively. Our target clock-period range $\delta$ is 4 . Note that in the constraint graph, vertices In and Out are the same vertex because we fix the timings of $\mathrm{I} / \mathrm{O}$ pins to 0 . In the figure, vertices In and Out are draw in different vertex to make it easy



Fig. 5. Pipelined circuit with intermediate registers and the corresponding constraint graph $G^{i n} . T_{\min }\left(G^{i n}\right)=10$.


Fig. 6. Pipelined circuit after removing all intermediate registers and the corresponding constraint graph $G^{0}$ (D-edge weight of the multi-clock cycle paths in $G^{0}$ is reduced 4 compared with the corresponds minimum delay in the circuit).
to understand. For the original circuit with zero clockskew, the minimum feasible clock period $T_{\min }\left(G^{i n}\right)$ is 10 . The circuit after removing the intermediate registers $v_{0}$, $v_{1}, v_{2}$ is shown in Fig. 6. Note that the D-edge weight of the multi-clock cycle paths in the constraint graph is reduced 4 compared with the corresponds minimum delay in the circuit because $\delta$ is 4 . When the minimum clock period of constraint graph $G^{0}$ is computed, there exists a negative cycle $C^{0}=\left(\right.$ in, $u_{1}, w_{1}$, out $)$ which is of type $\mathbf{M}$ in the constraint graph $G_{9}^{0}$. The minimum clock period algorithm concludes that it is infeasible. This means that the circuit works at clock period 6 but $\delta$ is not secured. Since path $\left(u_{1}, w_{1}\right)$ is a D-edge which is multi-clock cycle path, intermediate register $v_{1}$ is inserted. The circuit after inserting the intermediate register $v_{1}$ is shown in Fig. 7. $T_{\min }\left(G^{1}\right)=11>T_{\min }\left(G^{i n}\right)$, therefore the minimum fea-


Fig. 7. Pipelined circuit after inserting the intermediate register $v_{1}$ and the corresponding constraint graph $G^{1}$ (D-edge weight of the multi-clock cycle paths in $G^{1}$ is reduced 4 compared with the corresponds minimum delay in the circuit). $\left.T_{\min }\left(G^{1}\right)=11\right)>\left(T_{\min }\left(G^{i n}\right)=10\right.$.


Fig. 8. Pipelined circuit after inserting the intermediate registers $v_{1}$ and $v_{2}$, and the corresponding constraint graph $G^{2} . T_{\min }\left(G^{2}\right)=7$.

TABLE I
Statistics of adder and multiplier

|  |  | Circuit delay [ps] |  |  |  |
| :--- | :---: | :---: | :---: | :---: | :---: |
| circuit | \# FF | 1st stage |  | 2nd stage |  |
|  |  | min max | min max |  |  |

sible clock period of the obtained circuit need to be reduced. The critical cycle $C^{1}=(u 2, w 2, v 1, u 2)$ is found in the constraint graph $G_{11}^{1}$. Since path $\left(u_{2}, w_{2}\right)$ is a D-edge which is multi-clock cycle path, intermediate register $v_{2}$ is inserted. The circuit after inserting the intermediate register $v_{2}$ is shown in Fig. 8. $T_{\min }\left(G^{2}\right)=7<T_{\text {min }}\left(G^{i n}\right)$, so the algorithm stop and output the circuit and clock timings as shown in Fig. 8.

## V. Experiments



Fig. 9. Relation between $T_{\min }(p s)$ and \# Int. FF of a 16 bit multiplier. $\delta=500 \mathrm{ps}$. Delay variation $=0 \%$.

The proposed algorithms were written in $\mathrm{C}++$ and implemented on a Pentium 4 (CPU 3 GHz , memory 513764 kb ). Since there are no benchmark examples of pipelined circuits, two simple examples, briefly described below, were constructed for our experiments.

TABLE II
Experimental Results

|  |  | Delay variation $=0 \%$ |  |  | Delay variation $=20 \%$ |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Circuit | $\begin{gathered} \hline \delta \\ (p s) \\ \hline \end{gathered}$ | $\begin{gathered} \hline T_{\min } \\ (p s) \quad(\%)^{\mathrm{a}} \end{gathered}$ | Int. FF <br> \# (\%) |  | (ps) | $T_{\text {min }}$ $(\%)^{\mathrm{a}}$ | $(\%)^{b}$ | $\begin{aligned} & \hline \text { Int. } \mathrm{FF} \\ & \# \quad(\%) \end{aligned}$ |  |
| $\begin{aligned} & \hline \text { 4bit } \\ & \text { add } \end{aligned}$ | Ori. | 2840 (100) |  | (100) | 3093 | (110) | (100) | 10 | (100) |
|  | 0 | 1422 (50) |  | (0) | 2047 | (72) | (66) | 0 |  |
|  | 100 | 1522 (54) |  | (0) | 2147 | (76) | (69) | 0 |  |
|  | 200 | 1622 (57) |  | (0) | 2247 | (79) | (73) | 0 |  |
|  | 300 | 1722 (61) |  | (0) | 2347 | (83) | (76) | 0 |  |
|  | 400 | 1822 (64) |  | (0) | 2447 | (86) | (79) | 0 |  |
|  | 500 | 1922 (68) | 0 | (0) | 2547 | (90) | (82) | 0 | (0) |
| 8bit <br> add | Ori. | 4474 (100) |  | (100) | 4890 | (110) | (100) | 18 | (100) |
|  | 0 | 1702 (38) |  | (0) | 2629 | (59) | (54) | 0 |  |
|  | 100 | 1802 (40) |  | (0) | 2729 | (61) | (56) | 0 | (0) |
|  | 200 | 1902 (43) |  | (0) | 2829 | (63) | (58) | 0 | (0) |
|  | 300 | 2002 (45) |  | (0) | 2929 | (66) | (60) | 0 |  |
|  | 400 | 2102 (47) |  |  | 3029 | (68) | (62) | 0 |  |
|  | 500 | 2202 (49) | 0 | (0) | 3129 | (70) | (64) | 0 | (0) |
| 16bit add | Ori. | 7634 (100) | 34 | (100) | 8366 | (110) | (100) | 34 | (100) |
|  | 0 | 3024 (40) |  |  | 4533 | (59) | (54) | 0 | (0) |
|  | 100 | 3124 (41) |  |  | 4633 | (61) | (55) | 0 | (0) |
|  | 200 | 3224 (42) |  |  | 4733 | (62) | (57) | 0 | (0) |
|  | 300 | 3324 (44) |  |  | 4833 | (63) | (58) | 0 | (0) |
|  | 400 | 3424 (45) |  |  | 4933 | (65) | (59) | 0 |  |
|  | 500 | 3524 (46) | 0 | (0) | 5033 | (66) | (60) | 0 | (0) |
| 16bit <br> mul | Ori. | 5075 (100) | 56 | (100) | 5551 | (110) | (100) | 56 | (100) |
|  | 0 | 4722 (93) |  | (86) | 3590 | (71) | (65) |  | (88) |
|  | 100 | 4822 (95) |  | (86) | 3590 | (71) | (65) | 49 | (88) |
|  | 200 | 4922 (97) |  | (86) | 3590 | (71) | (65) |  | (88) |
|  | 300 | 5022 (99) |  | (86) | 3590 | (71) | (65) |  | (88) |
|  | 400 | 3086 (61) |  | (88) | 3670 | (72) | (66) |  | (88) |
|  | 500 | 3086 (61) |  | (88) | 3770 | (74) | (68) | 49 | (88) |

[^0]- $n$-bit $(n=4,8,16)$ add: A 2 -stage adder that added four $n$-bit numbers (A, B, C and D) [7]. The first stage computed the partial sum $\mathrm{A}+\mathrm{B}$ and $\mathrm{C}+\mathrm{D}$ and the second stage computed the final sum. Each adder was of ripple-carry type.
- 16-bit mul: A 2 -stage multiplier that multiplied two 16 -bit numbers. The first stage used a carry-save adder with Wallace tree structure [8] and the second stage used a carry-look-ahead adder.

The statistics of the circuits are shown in Table I. The ROHM 0.35 process library was used for these experiments. The timing of each I/O pin was scheduled as well as the timing for each register.

Table II shows the results when the algorithm shown in section IV was implemented. Ori. is the original circuit containing the intermediate registers and with the clock timing of all registers fixed at 0 (zero clock-skew). " $\delta$ " and " $T_{\text {min }}(p s)$ " are the target clock-period range and output
minimum feasible clock period, respectively. "Int. FF (\#)" is the number of intermediate registers, and "Int. FF (\%)' is the percentage of the number of intermediate registers present compared with the total in the original circuit. Delay variation was $20 \%$, i.e. the delay variation for each gate and register was set at $\pm 10 \%$.

The results show that by a combination of multi-clock cycles and clock scheduling, the number of intermediate registers and the minimum feasible clock period can be reduced, even in the presence of delay variations in gates and registers.

The relation between the minimum feasible clock period and the number of intermediate registers of the 16 bit multiplier is shown in Fig. 9. In the graph, the label "Algorithm" indicates results using the proposed algorithm for insertion of the intermediate registers, while "Random1-4" labels indicate results when the intermediate registers are inserted randomly. The graph shows that the proposed algorithm can construct an equivalent circuit using fewer registers and with a smaller minimum feasible clock period.

## VI. Conclusion

It has been shown that the number of intermediate registers of a pipelined circuit can be reduced by implementing a multi-clock cycle path technique together with clock scheduling. The proposed algorithm inserts intermediate registers without considering delay balancing in order to make the circuit work correctly throughout the target clock-period range. We believe that by using delay balancing together with intermediate register insertion, circuit area can be further reduced. This is a topic for future investigation.

## Acknowledgements

This work is supported by VLSI Design and Education Center (VDEC), the University of Tokyo in collaboration with Synopsys, Inc., Cadence Design Systems, Inc., Rohm Corporation and Toppan Printing Corporation.

## References

[1] W. J. Kim and Y. Kim, "Clocking for correct functionality on wave pipelined circuits," in Proc. IEEE International ASIC/SOC Conference, 2003, pp. 161-164.
[2] J. P. Fishburn, "Clock skew optimization," IEEE Trans. on Computers, vol. 39, no. 7, pp. 945-951, 1990.
[3] R. B. Deokar and S. S. Sapatnekar, "A graph-theoretic approach to clock skew optimization," in Proc. International Symposium on Circuits and Systems (ISCAS), vol. 1, 1994, pp. 407-410.
[4] A. Takahashi, "Practical fast clock schedule design algorithms," in Proc. 18th Karuizawa Workshop, 2005, pp. 515-520.
[5] A. Takahashi and Y. Kajitani, "Performance and reliability driven clock scheduling of sequential logic circuits," in Proc. Asia and South Pacific Design Automation Conference (ASP-DAC), 1997, pp. 37-42.
[6] B. A. Rosdi and A. Takahashi, "Reduction on the usage of intermediate registers for pipelined circuits," in Proc. the Workshop on Synthesis and System Integration of Mixed Technologies (SASIMI 2004), 2004, pp. 333-338.
[7] S.Malik, K.J.Singh, R.K.Brayton, and A.Sangiovanni-Vincentelli, "Performance optimization of pipelined circuits," in Proc. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 1990, pp. 410-413.
[8] C. Wallace, "A suggestion for fast multiplier," IEEE Trans. on Electronic Computers, vol. 13, no. 2, pp. 14-17, 1964.


[^0]:    ${ }^{b}$ Compared with original circuit with zero clock-skew and delay variation $=20 \%$.
    ${ }^{a}$ Compared with original circuit with zero clock-skew and delay variation $=0 \%$.

