Optimized System Synthesis of Complex RT Level Building Blocks from Multirate Dataflow Graphs

Jens Horstmannshoff and Heinrich Meyr
Integrated Signal Processing Systems
Aachen, Germany

Abstract

In order to cope with the ever increasing complexity of today's application specific integrated circuits, a building block based design methodology is established. The system is composed of high level building blocks of which some are reused from previous designs while others might have been created by behavioral synthesis. In data flow oriented designs, these blocks usually have complex non-matching interface properties, making it necessary to generate additional interfacing and controlling hardware to integrate them into an operable system.

In this paper, an RTL-HDL code generation from a synchronous data flow representations is introduced, that efficiently automates the generation of the required additional hardware. While existing code generation approaches provide strong limitations concerning the building block interfacing properties, our method enables the integration of components that access their ports periodically with arbitrary patterns. In order to reduce interface register cost, a minimum-area retiming approach is taken to determine optimum building block activation times, which is known to have polynomial time complexity. The code generation methodology is compared to an existing approach using a simple case study.

1 Introduction

Today's ASIC designers face the problem of an exploding design complexity. At the same time, the development cycles are getting shorter in order to achieve a minimum time-to-market of the product. This pressure leads to a shift in ASIC design paradigm to speed up productivity while guaranteeing that only one design iteration is necessary to develop an operable product (first-silicon-success). This new paradigm is based on the combination of high level building blocks that are taken from different origins [3]. To a certain degree these blocks can be reused from previous designs or purchased from third party vendors. Other blocks have to be custom made for the specific application. This involves the usage of advanced block design methods like behavioral synthesis [1] or data-path generation. In general, the high-level building blocks have non-matching complex interfacing properties that require the designer to write additional interfacing and controlling hardware that combines all blocks into an operable system.

In order to ensure a first-silicon-success of the building block based design it is important to use a seamless design flow from the algorithmic system specification to hardware implementation, that enables the joint development of the algorithm and architecture and a stepwise refinement of the initial algorithmic model. This flow allows the verification of the refined model to the more abstract algorithmic specification.

The ideal design environment to automate the seamless development of building-block based data-path oriented ASIC designs is RTL-HDL code generation from synchronous data-flow graphs [6]. Here, the algorithmic input specification is given by the data flow graph representation of the system which contains no notion of time. In the course of the code generation, the purely functional data-flow actors are mapped to complex RTL building blocks that are taken from a library. Furthermore, additional interfaces and controllers are generated to glue the given building-blocks into an operable system.

In this paper, we present a new approach of generating a building-block based target RTL architecture from synchronous data flow graph specifications. Here, a minimum area retiming transformation is used to reduce interfacing register cost. The paper is organized as follows: Section 2 summarises the existing HDL code generation approaches from synchronous data flow representations. In Section 3 an overview of the tasks performed by our HDL code generation tool is given. After discussing the timing specification of the building blocks in Section 4, the algorithms involved in the code generation are presented in Section 5. Finally, a simple case study is presented in Section 6.

2 Existing Approaches

Several approaches of generating hardware from synchronous data flow graphs are known to date. In [10], portions of the data flow graph are grouped into hardware execution units for which asynchronous communication is generated. Here, the granularity of the actors is restricted, so the desired combination of complex building blocks is not supported.

In [11], a library based HDL code generation method is presented that enables the integration of more complex building blocks. This approach, however, is strongly restricted concerning the I/O properties of the building blocks. Each block is assumed to read and write samples equidistantly, meaning that a fixed number of clock cycles elapses between each sample being read or written. This assumption is not valid for a vast number of building-block architectures which are integrated in today's communication systems.

The code generation approach presented in this paper is based on the work discussed in [2], where the building blocks are allowed to have arbitrary periodic port access patterns. Here, a straightforward approach is taken to generate the additional interfacing and controlling hardware, which can lead to a high overhead in interfacing registers. In this paper we will present a code generation approach that tremendously reduces this overhead.
3 Code Generation Overview

Algorithmic simulation of communication systems can be performed very efficiently using synchronous dataflow [6]. In dataflow, a system is represented as a directed graph, in which the nodes represent computations and the edges represent FIFO channels. These channels queue data values, encapsulated in objects called tokens, which are passed from the output of one computation to the input of another. In the dataflow model, no notion of time is present. For our code generation tool we use dataflow graphs as offered by [9] as an input specification. In the course of the code generation, the computational models contained in the dataflow nodes are mapped to RTL building blocks. Furthermore, a cycle based time scale is introduced for the data tokens transmitted over the graph’s edges. Therefore, detailed information is required about the timing of each RTL building block.

The building blocks are assumed to have complex timing patterns for their ports. The port timing pattern consists of a periodic sequence of timesteps (specified in multiples of clock cycles) at which data samples are consumed or produced at this port. Most data-path oriented signal processing components can be described in this manner. A detailed description of the component timing model will follow in Section 4.

The main task in system integration is the generation of additional RTL-components to glue the single components together to a working system. As depicted in Figure 1, this is done by introducing interface components and controllers which provide reset and stall signals to ensure proper building block activation. These components are now discussed with respect to their tasks.

- **Stall signal generation:**
  In synchronous dataflow, each component consumes and produces a fixed number of samples at its ports each time it is activated. This number is referred to as the port’s data rate. In a block diagram with multiple rates, each component has to be activated a certain number of times within a global processing period in order to achieve rate-consistency. When the system is rate-consistent and free of deadlocks, it can go through an infinite number of iterations with finite memory requirements at the interfaces between its components [6]. If the number of clock cycles the building-blocks require for one system iteration is not constant for all building-blocks in the system, it becomes necessary to pause the faster ones. In the target architecture, this is realized by applying a stall signal to the corresponding blocks or by using gated clock signals.

- **FIFO buffers:**
  On certain signals in the RTL target architecture, FIFO buffers have to be inserted to adapt non-matching timing patterns of the connected building block ports or to balance the latency of merging paths. The registers required for this latency balancing are also called shimming registers [4].

- **Initial value generation:**
  In the system’s initialization phase, some blocks have to start processing with determined initial values in order to avoid dead-locks in feedback loops. These values are produced by an initial value generator in the interface preceding those blocks.

- **Delayed reset signals:**
  After the system reset signal is deactivated, a component can start processing only after the preceding block starts to deliver valid data samples, i.e. after the writing block has finished its initialization. This delayed activation is implemented by disabling the reset signals for each component at a pre-determined point in time. Therefore, the target architecture contains a reset generator which delivers an independent reset-signal for each building-block.

In order to build this required additional hardware, the following data has to be determined:

- Number of pause cycles per building block
- Mapping of pause cycles to block schedules
- Reset deactivation delay of each building block
- Data token transfer delays

In section 5, we will present the algorithms to determine this data.

4 System Input Specification

The HDL code generation presented here is based on a data-flow representation of the system and a clock cycle true timing specification of the RTL building blocks. In the following, a formal model is introduced for this data.

4.1 Data-Flow Graph Terms

The system to be integrated is described by a directed synchronous data flow graph [6] $G_{\text{SDF}} = (N_{\text{SDF}}, E_{\text{SDF}})$, where $N_{\text{SDF}}$ represents a set of nodes (blocks) and $E_{\text{SDF}}$ stands for a set of edges. Figure 2 depicts two nodes $N_i^{\text{SDF}}$ and $N_j^{\text{SDF}}$ which are connected via edge $E_{\text{SDF}}$.

$$
\begin{align*}
N_i^{\text{SDF}} &\xrightarrow{P_{i}^{\text{in}}} E_{\text{SDF}} & P_{i}^{\text{out}}(N_j^{\text{SDF}}) \\
\eta_i(E) &\xrightarrow{\sigma_i(E)} & \eta_j(E)
\end{align*}
$$

**Figure 2. Definition of graph terms**

Each node $N_i^{\text{SDF}}$ has a number of input ports $P_{i}^{\text{in}}(N_i^{\text{SDF}})$ and output ports $P_{i}^{\text{out}}(N_i^{\text{SDF}})$. So, edge $E_{\text{SDF}}$ in figure 2 can be represented by an ordered pair of ports $E_{\text{SDF}} = (P_{i}^{\text{in}}(N_i^{\text{SDF}}), P_{j}^{\text{out}}(N_j^{\text{SDF}}))$, where the input port of this edge is given by $\sigma_i(E_{\text{SDF}}) = P_{i}^{\text{out}}(N_i^{\text{SDF}})$ and the output port...
by \( \sigma_{\text{out}}(E_{\text{out}}) = P_{\text{min}}^n[N_{\text{out}}] \). To denote the nodes with respect to a connected edge we introduce the input node of edge \( E \) as \( \eta_{\text{in}}(E_{\text{in}}) = N_{\text{in}}^\text{SDF} \) and its output node as \( \eta_{\text{out}}(E_{\text{out}}) = N_{\text{out}}^\text{SDF} \).

The number of samples which are consumed or produced on a port \( P(N_{\text{in}}) \) within one activation is given by its data-rate \( r(P(N_{\text{in}})) \). The number of initial values \( \zeta(E_{\text{out}}) \) on edge \( E_{\text{out}} \) denotes the number of data tokens that are written to the edge output port \( \sigma_{\text{out}}(E_{\text{out}}) \) before the first data token is produced by the edge input port \( \sigma_{\text{in}}(E_{\text{in}}) \). In addition to these typical synchronous data-flow properties, the user can assign a minimum number of registers \( \rho(E_{\text{out}}) \) to an edge that will be placed on the corresponding signal in the RTL level system architecture.

### 4.2 Building Block Interface Specification

As mentioned above, the building blocks are assumed to perform their calculations periodically after initialization. The duration of one processing period of node \( N_{\text{in}}^\text{SDF} \) in number of clock cycles is called iteration-period and will be denoted as \( I_{n,\text{ode}}(N_{\text{in}}^\text{SDF}) \). During one iteration-period, the RTL building block consumes and writes the same number of data items at its ports as specified by the data-rates of the corresponding synchronous data flow model. In order to map these data items to the clock cycle true schedule of the RTL model, we introduce a port time mapping vector \( \mu(P) \) for every data port \( P \) to specify in which clock cycle within the iteration period it is accessed. If port \( P \) is an output port, the \( j \)-th element of \( \mu(P) \) represents the clock cycle index of the first valid appearance of the \( j \)-th sample in the iteration period of port \( P \) in the periodic processing phase. If \( P \) is an input port, the \( i \)-th element of \( \mu(P) \) contains the cycle in which the \( i \)-th data sample is read from the port. Since the first element of the port time mapping vector denotes the cycle index of the first access to this port, it can be seen as the latency of this port.

Some building blocks require a certain number of clock cycles between the deactivation of the reset signal and the start of the periodic I/O schedule to initialize their internal states. This duration is called initialization-time \( \tau_{\text{init}}(N_{\text{in}}^\text{SDF}) \).

![Figure 3. Waveforms of DFIR Filter](image)

Following the deactivation of the reset signal the block requires \( \tau_{\text{init}} = 1 \) clock cycle to initialize its internal states. Then the block starts processing periodically with an iteration interval if \( I = 7 \) during which \( r(\text{INPUT}) = 4 \) data samples are consumed from the input port and \( r(\text{OUTPUT}) = 1 \) data sample is written to the output port.

As depicted in figure 3, the port time mapping vectors are given by

\[
\begin{align*}
\mu(\text{INPUT}) &= (0 4 5 6)^T \\
\mu(\text{OUTPUT}) &= (4) 
\end{align*}
\]

The periodicity model holds especially well for data-flow oriented components with resource sharing as they are generated by high level synthesis tools [1]. As we will see in section 5.2 it is also important to specify whether an output port is registered or whether there exists a combinational path from an input port. This property is crucial for the construction of the paused timing pattern.

### 5 Algorithms

#### 5.1 Rate and Periodicity Consistency

The first step in synthesizing an operable RTL architecture from the synchronous data flow system is to determine the minimum system iteration period and the number of clock cycles each building block has to be paused in this period. The system iteration period \( I_{\text{sys}}^{\text{min}} \) describes the minimum number of clock cycles the RTL system requires for one system iteration. Therefore, it is necessary to calculate the number of iteration periods each block goes through during one system iteration. This information can be extracted from the synchronous data flow representation. Let \( q_{N_{\text{in}}^\text{SDF}} \) denote the number of times node \( N_{\text{in}}^\text{SDF} \) is activated per system iteration period, then we have to determine values for \( q_{N_{\text{out}}^\text{SDF}} \) which fulfill the following balance equations for all edges \( \forall e \in E_{\text{out}} \)

\[
q_{\eta_{\text{in}}(e)(E_{\text{in}})}(\sigma_{\text{in}}(E_{\text{in}})) = q_{\eta_{\text{out}}(e)(E_{\text{out}})}(\sigma_{\text{out}}(E_{\text{out}})) 
\]

Equation 2 expresses the fact that within one system iteration period the same number of samples is produced and consumed by the ports connected to edge \( E_{\text{out}} \). We can construct a topology matrix \( \Gamma \) that contains the integer \( r(P_{\text{min}}^n(N_{\text{in}}^\text{SDF})) \) in position \( (j, i) \) if node \( N_{\text{out}}^\text{SDF} \) produces \( r(P_{\text{out}}^n(N_{\text{in}}^\text{SDF})) \) samples from output port \( n \) on the edge \( E_{\text{out}} \).

If it contains the integer \( -r(P_{\text{min}}^n(N_{\text{in}}^\text{SDF})) \) in position \( (j, i) \), node \( N_{\text{out}}^\text{SDF} \) consumes \( r(P_{\text{in}}^n(N_{\text{in}}^\text{SDF})) \) samples on input port \( n \) from edge \( E_{\text{in}} \). Then, the system of equations to be solved can be written as

\[
\Gamma \vec{q} = \vec{0} 
\]

where \( \vec{0} \) is a vector full of zeros, and \( \vec{q} \) is the repetition vector. Now we are able to calculate the minimum system iteration period

\[
I_{\text{sys}}^{\text{min}} = \max_{\forall e \in E_{\text{out}}} \left(I_{n,\text{ode}}(N_{\text{out}}^\text{SDF}) q_i\right) 
\]

Please note that any system iteration period can be chosen depending on the system throughput requirements as long as the minimum system iteration interval is not violated. If the number of cycles a node \( N_{\text{in}}^\text{SDF} \) requires for \( q_i \) activations is smaller than the system iteration period, the component has to be paused for

\[
p(N_{\text{in}}^\text{SDF}) = I_{\text{sys}} - I(N_{\text{in}}^\text{SDF}) q_i 
\]

cycles in each system iteration. This measure has to be taken in order to achieve a global periodicity consistency.

#### 5.2 System Timing Adjustment by Retiming

In this paper, we will use a minimum-area retiming approach to map the pause cycles \( p(N_{\text{in}}^\text{SDF}) \) to the building blocks I/O schedule and determine the block reset deactivation cycles, while minimizing the delay of all data token transfers in the system. This approach leads to a tremendous reduction in interface register area compared to the
The directed retiming graph \( G_{\text{RET}} \) is a representation of the building block I/O schedules and the data transfers that take place in one system iteration. In general, the I/O schedule of a building block can be partitioned into multiple phases where each phase contains exactly one clock cycle in which ports are accessed. A retiming node \( N_{\text{RET}}(n_{\text{dfir}}) \) represents the first phase in the I/O schedule of data flow block \( n_{\text{dfir}} \). Every retiming node has a reference cycle \( c(N_{\text{RET}}(n_{\text{dfir}})) \) that contains the index of the common cycle of the periodic I/O schedule in the I/O schedule of block \( n_{\text{dfir}} \). The retiming edges \( E_{\text{RET}} \) are represented by ordered pairs of retiming nodes \( E_{\text{RET}} = (N_{\text{RET}}, N_{\text{RET}}) \). The set of Retiming edges can be divided into two basic types \( E_{\text{RET}} = E_{\text{RET}}(N_{\text{dfir}}) \cup E_{\text{RET}}(N_{\text{dfir}}) \). A schedule edge \( E_{\text{RET}}(N_{\text{dfir}}) \) stands for the direct sequential dependency between two phases in the I/O schedule of building block \( n_{\text{dfir}} \). A data transfer edge \( E_{\text{RET}}(N_{\text{dfir}}) \) represents the transfer of the i-th data token over data flow edge \( E_{\text{dfir}} \) within one system iteration. Here, the feeding retiming node represents the I/O schedule phase in which the data token is written, while the consuming retiming node stands for the I/O schedule phase in which the token is read. Every retiming edge \( E_{\text{RET}} \) has a delay \( D_{\text{RET}} \) and a width \( w_{\text{RET}} \) assigned to it.

As an example, the I/O schedule of the downsampling FIR filter in figure 3 shall be used, assuming that it has a delay \( D_{\text{dfir}} \) and a width \( w_{\text{dfir}} \) for all \( n_{\text{dfir}} \). This block can be paused by \( p(n_{\text{dfir}}) \) and only activated once each system iteration \( q_{\text{dfir}} = 1 \). This block can be paused by deactivating the load-enable signal of all internal registers. We partition the I/O schedule of this block into five phases where the first three phases span the first three cycles of its periodic I/O schedule. Pasing the component in any cycle of this phase will result in the same pattern stretching. The remaining four clock cycles of the I/O schedule are partitioned into four distinct phases since the pattern is stretched differently depending on which of these cycles is chosen for pausing. In the retiming graph, the five I/O phases are represented by retiming nodes \( N_{\text{RET}}(n_{\text{dfir}}) \) that are connected cyclically by retiming schedule edges \( E_{\text{RET}}(n_{\text{dfir}}) \) as depicted in figure 4. These edges represent the cyclic I/O schedule of the building block.

Figure 4 presents the algorithm that determines the set of retiming nodes \( N_{\text{RET}} \), their reference cycles \( c(N_{\text{RET}}) \) and the number of retiming nodes \( \#N_{\text{RET}}(N_{\text{dfir}}) \) for every building block. The algorithm iterates through all building blocks and generates a retiming graph for each phase in the periodic I/O schedule of a block that contains a clock cycle in which a port access takes place. However, if a building block does not have to be paused \( p(n_{\text{dfir}}) = 0 \), only one retiming node will be created. The function is_port_access \( \text{is_port_access}(N_{\text{dfir}}, i) \) returns true if a port access takes place in the i-th cycle of the periodic I/O schedule of block \( n_{\text{dfir}} \). In the case of registered output ports, the function returns true if a data item is written in cycle \( i + 1 \) in order to take the register delay into account.

The retiming nodes are cyclically connected by

\[
\begin{align*}
E_{\text{RET}}(N_{\text{dfir}}) &= \{ (c_0, c_1), (c_1, c_2), (c_2, c_3), (c_3, c_4), (c_4, c_0) \} \\
E_{\text{RET}}(N_{\text{dfir}}) &= \{ (c_0, c_1), (c_1, c_2), (c_2, c_3), (c_3, c_4), (c_4, c_0) \}
\end{align*}
\]

Figure 5. Determination of retiming nodes

\[
\#N_{\text{RET}}(N_{\text{dfir}}) \text{ schedule edges, where } E_{\text{RET}}(N_{\text{dfir}}) \text{ always marks the schedule edge between the last } N_{\text{RET}}(N_{\text{dfir}}) \text{ and the first } N_{\text{RET}}(N_{\text{dfir}}) \text{ retiming node. The pause cycles that are introduced in the I/O schedule of a building block are represented by the retiming edge delays } D(E_{\text{RET}}(N_{\text{dfir}})) \text{ assigned to schedule edge } E_{\text{RET}}(N_{\text{dfir}}) \text{ of this block. As depicted in figure 4 the } p(n_{\text{dfir}}) \text{ pause cycles are initially placed on schedule edge } E_{\text{RET}}(N_{\text{dfir}}). \text{ So we can write}
\]

\[
D(E_{\text{RET}}(N_{\text{dfir}})) = \begin{cases} 
0 & : i \neq 0 \\
1 & : i = 0
\end{cases}
\]

In the retiming transformation, these pause cycles are distributed over the I/O schedule to minimize data transfer delay. As we will in section 5.2.2, the width of all schedule retiming edges is

\[
w(E_{\text{RET}}(N_{\text{dfir}})) = 0 \quad \forall E_{\text{RET}}(N_{\text{dfir}}) \in E_{\text{RET}}(N_{\text{dfir}})
\]

This means that pausing the I/O schedule of a building block does not cause any register cost.

Now the set of data transfer retiming edges \( E_{\text{RET}}(E_{\text{dfir}}) \) is constructed. The number of data tokens transferred over a data flow edge \( E_{\text{dfir}} \) in one system iteration is given by the interface rate

\[
r_{\text{int}}(E_{\text{dfir}}) = \sigma_0(E_{\text{dfir}}) r_1(E_{\text{dfir}})
\]
In the retiming graph, each data token transfer over a data flow edge $E^{SDF}$ is represented by a data transfer edge $E^{RET}(E^{SDF})$. So the number of retiming edges per data flow edge is also determined by $r_{int}(E^{SDF})$. In order to determine the input and output retiming nodes of each data transfer edge, we need knowledge about the I/O schedule phase in which a data token is produced or consumed. This information is given in the retiming variable mapping vector $\phi(P(N^{SDF}))$ that can easily be determined once the phase partitioning was performed. The $i$-th element $\phi_i(P(N^{SDF}))$ of this vector contains the I/O phase index in which the $i$-th data token is read or written on port $P(N^{SDF})$. For the down-sampling FIR filter example in figure 3, these vectors are

$\phi^T(INPUT(dif)) = [0 \ 2 \ 3 \ 4\ ]^T, \ \ \phi^T(OUTPUT(dif)) = [1 \ 2 \ 3 \ 4\ ]^T$

Figure 6 contains the algorithm that determines the set of data transfer edges $E^{SDF}(E^{SDF})$, the data transfer delays $D(E^{RET}(E^{SDF}))$ and their width $w(E^{RET}(E^{SDF}))$. The algorithm generates a retiming data transfer edge for each data

\begin{align}
1) & E^{SDF}(E^{SDF}) \leftarrow \emptyset \\
2) & \text{for all } E^{SDF} \in E^{SDF} \\
3) & \text{for } j = 1 \text{ to } r_{int}(E^{SDF}) - 1 \\
4) & k \leftarrow (j + \zeta(E^{SDF})) \mod r_{int}(E^{SDF}) \\
5) & E^{RET}(E^{SDF}) \leftarrow \left\{ \eta^{RET}(\sigma^i(E^{SDF}))(\eta^{RET}(E^{SDF})) \right\} \\
6) & E^{SDF}(E^{RET}) \leftarrow E^{SDF}(E^{RET}) \cup \left\{ \zeta(E^{RET}) \right\} \\
7) & D(E^{RET}(E^{SDF})) \leftarrow d_{SDF}^i - d_{SDF}^j + \tau_{ret}(E^{SDF}) + \tau_{ret}(E^{SDF}) \\
8) & w(E^{RET}(E^{SDF})) \leftarrow w(E^{SDF}) \\
9) & \text{end} \\
10) & \text{end}
\end{align}

Figure 6. Determination of data transfer edges

5.2.2 Retiming Formulation

Based on the retiming graph constructed in the previous section, the standard min-area retiming problem can be formulated. A retiming is a labeling of the retiming graph vertices $\pi : N^{RET} \rightarrow Z$, where $Z$ is the set of integers. The delay of retiming graph edge after retiming is denoted by $D_{\pi}(E^{RET})$ and given by

$$D_{\pi}(E^{RET}) = \pi(\eta^{RET}(E^{RET})) + D(E^{RET}) - \pi(\eta^{RET}(E^{RET}))$$

(9)

The retiming label $\pi(N^{RET})$ of a retiming node $N^{RET}$ represents the number of delays moved from its outgoing retiming edges to its incoming retiming edges. Moving a delay to a data transfer edge $E^{RET}(E^{SDF})$ means that the delay between the production and consumption of the $j$-th data token via data flow edge $E^{SDF}$ is enlarged. Moving a delay to a schedule edge $E^{RET}(E^{SDF})$ introduces a pause cycle between the I/O schedule phases embodied by the input and output retiming node of this edge.

It is our objective to minimize the data transfer delay weighted with the width of the transferred data tokens, in order to reduce interface register cost. A retiming $\pi(N^{RET}) = 1$ on retiming node $N^{RET}$ contributes an increment of $\Delta C(N^{RET})$ to the cost function

$$\Delta C(N^{RET}) = \sum w(E^{RET}) - \sum w(E^{SDF})$$

(10)

Here, $E^{SDF}(N^{RET})$ represents the set of incoming retiming edges, while $E^{RET}(N^{RET})$ stands for the set of outgoing retiming edges with respect to retiming node $N^{RET}$. Please note that all schedule edges are zero width as shown in equation 7, since no cost is caused by pausing a building block. When formulating the retiming as an ILP problem, the objective function to be minimized is given by

$$\min \sum \Delta C(N^{RET})\pi(N^{RET})$$

(11)

In order to achieve a valid retiming, it has to be ensured that the retiming graph delay after retiming does not become negative. A negative delay on a data transfer edge implies that a data token is read from a data flow edge before it has been produced, while the negative delay on a schedule edge means that the block schedule is shortened. For the I/O schedule edges this non-negativity constraint can written as

$$\pi(\eta^{RET}(E^{RET})) - \pi(\eta^{RET}(E^{RET})) \leq D(E^{RET})$$

$$\forall E^{RET} \in E^{RET}(N^{SIDE})$$

(12)

where $\eta^{RET}(E^{RET})$ denotes the start retiming node of edge $E^{RET}$ and $\eta^{RET}(E^{RET})$ marks the end retiming node of this edge. When constructing the constraints for the data transfer edges $E^{RET}(E^{SIDE})$, we have to consider that the delay must never become smaller than the minimum delay $\rho(E^{SDF})$ that the user assigned to the corresponding data flow edge $E^{SDF}$. This leads to

$$\pi(\eta^{RET}(E^{RET})) - \pi(\eta^{RET}(E^{RET})) \leq D(E^{RET}) - \rho(E^{SDF})$$

$$\forall E^{RET} \in E^{RET}(E^{SIDE})$$

(13)

The ILP problem given by equations 11, 12 and 13, resembles the unconstrained min-area retiming problem as formulated in [7]. The dual of this ILP problem is a min-cost flow network problem which can be solved efficiently in polynomial time [5].
In order to ensure feasibility of the retiming ILP problem, it has to be guaranteed that the sum of delays along all data-transfer cycles in the retiming graph is smaller or equal zero. If this is the case, the network optimization algorithm will always find a valid retiming solution.

5.2.3 Retiming Results

After solving the retiming problem formulated in section 5.2.2, we are able to determine all data that is necessary to generate the required additional hardware. The reset deactivation time $t_{R}(N_{REG})$ denotes the delay in clock cycles between system reset deactivation and the reset deactivation of block $N_{REG}$. This value is given by the retiming labeling of the first retiming node of the corresponding block

$$t_{R}(N_{REG}) = \pi(N_{REG})$$  \hspace{1cm} (14)

In the RTL target architecture, the reset generator will provide an independent reset signal for each block $N_{REG}$, which is delayed by $t_{R}(N_{REG})$ clock cycles.

The periodic pausing pattern that the pause signal generator has to provide for each building block $N_{REG}$ can be determined from the schedule edge delays after retiming $D_{E}(\pi(N_{REG}))$ as given by equation 9 and the reference cycles of the retiming nodes $c(N_{REG})$. A schedule edge delay of $D_{E}(\pi(N_{REG}))$ after retiming means that the periodic schedule of block $N_{REG}$ has to be paused for $D_{E}(\pi(N_{REG}))$ cycles in the $c(N_{REG})$-th clock cycle of its periodic processing phase.

The number of registers required to implement the FIFO buffer on a data flow edge is given by the maximum number of overlapping token transfers. By minimizing the delays of these transfers in the retiming transformation, the maximum overlap is tremendously reduced compared to existing approaches.

6 Case Study

The carrier synchronization unit [8] depicted in figure 7 is used as an example to compare the register cost produced in [2]. The RTL architecture that corresponds to the functional models of the data flow models have non-matching port I/O patterns and different processing periodicities, requiring the RTL-HDL code generation to produce additional interfacing and controlling hardware in order to synthesize all blocks into an operable system. In table 1 the number of interfacing registers inserted between the building block ports are compared for two code generation approaches. The variable $N_{REG}$ represents the number of word level registers, while $N_{REG}$ denotes the register cost in one-bit wide registers. In [2], periodicity adjustment is performed by arbitrarily inserting the pause cycles into the building block I/O schedules, while the reset deactivation time and shimming register cost is determined by using a shortest path algorithm on a timed graph. As we can see, interfacing registers have to be inserted on the signals that correspond to data flow edges $E_{1}$, $E_{2}$ and $E_{5}$, leading to a total register cost of 88 single-bit registers. Approach II represents the optimized code generation algorithm presented in this paper. By using a retiming transformation, to map the pause cycles to the building blocks I/O schedules and to determine the block reset deactivation times, register cost is decreased tremendously. Only one 5-bit wide register is required on edge $E_{4}$, resulting in a interfacing register cost reduction by a factor of 17. Please note that the complexity of the generated controller does not significantly increase when changing the pause cycle mapping or the delayed reset deactivation pattern.

7 Summary

In this paper, we presented an HDL code generation approach from synchronous data flow graphs that enables the seamless building block based design of data-flow oriented systems. By employing a min-area retiming technique to minimize data transfer delays, the interface register cost is significantly reduced compared to existing approaches.

In future work, we will focus on the support of data-dependent communication and on providing an interface to top level control dominated systems.

References


<table>
<thead>
<tr>
<th></th>
<th>Approach I</th>
<th>Approach II</th>
</tr>
</thead>
<tbody>
<tr>
<td>$N_{REG}$</td>
<td>$N_{REG}$</td>
<td>$N_{REG}$</td>
</tr>
<tr>
<td>$E_{1}$</td>
<td>3</td>
<td>12</td>
</tr>
<tr>
<td>$E_{2}$</td>
<td>2</td>
<td>8</td>
</tr>
<tr>
<td>$E_{3}$</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>$E_{4}$</td>
<td>0</td>
<td>5</td>
</tr>
<tr>
<td>$E_{5}$</td>
<td>4</td>
<td>68</td>
</tr>
<tr>
<td>Total</td>
<td>9</td>
<td>88</td>
</tr>
</tbody>
</table>

Table 1. Register Cost of Carrier Synchronization