A Hardware/Software Codesign Method for Pipelined Instruction Set Processor Using Adaptive Database

Nguyen Ngoc Binh, Masaharu Imai, Akichika Shiomi, and Nobuyuki Hikichi*

Department of Information and Computer Sciences
Toyohashi University of Technology
Toyohashi, 441 Japan
Tel: +81-532-47-0111
Fax: +81-532-48-9079

* Software Technology Department
Software Research Associates, Inc.
Tokyo, 170 Japan
Tel: +81-3-3942-4405
Fax: +81-3-3942-4416

e-mail: peas@imasun.tutics.tut.ac.jp

Abstract—This paper proposes a new method to design an optimal pipelined instruction set processor using a formal HW/SW codesign methodology. First, a HW/SW partitioning algorithm for selecting an optimal pipelined architecture is introduced briefly. Then, an adaptive database approach is presented that enables to enhance the optimality of the design through very accurate estimation of the performance of a pipelined ASIC in HW/SW partitioning. The experimental results show that the proposed methods are effective and efficient.

I. INTRODUCTION

One of the key issues in the ASIC (Application Specific Integrated Processor) design is the optimization of the instruction set and CPU architectures of ASIPs. Because the performance of an ASIC is heavily affected by the choice of these architectures, it is essential to select the optimum instruction set and CPU architectures with the maximum performance under the constraints of chip area and power consumption for example. These designs used to be done by computer architects taking advantage of their experience and intuition, but in the ideal ASIC design environment, they should be automatically performed by the system.

In order to realize the ideal ASIC design environment, an integrated design system for ASIPs named PEAS-I (Practical Environment for ASIC development - type I) was proposed and its prototype has been developed [1]. The PEAS-I system accepts a set of application programs written in C language and their associated data and some design constraints. The system profiles the application programs and generates the CPU core design as well as the application program development environment, such as compiler and simulator. The PEAS-I system employs a formal method to synthesize an optimal instruction set processor by solving Instruction Set Implementation Method Selection Problems (IMSP) [2]. IMSP-2 (IMSP type 2) [2] and IMSP-2P (for Pipeline) [3] are resource-constrained design with the consideration of resource sharing.

The HW/SW partitioning addressed in the previous PEAS-I is based on the fixed execution cycles of software implemented operations, which have been applied to any application program so far. In practice, however, their variation depends on input data of the operation. Therefore, it is difficult to estimate the performance accurately. As a result, the generated design might not be optimal because of the missection of Functional Units (FUs). In this paper we propose an approach to enhance the optimality of selected architecture with an accurate performance estimation. This is one of the most distinguished features of PEAS-I compared to other HW/SW codesign systems such as ASIA [4] and CASTLE [5].

The architecture of an ASIC synthesized by the PEAS-I system is based on the GNU C Compiler (GCC) abstract machine model [6]. The PEAS-I CPU is pipelined architecture with four stages: IF (Instruction Fetch & decode), EX (Execution), MEM (Memory access) and WR (Write back to Register file), respectively. Please refer to Ref. [7] (or [3]) for more detail. The essence in IMSP algorithms is selection of FUs to be added to minimum HW components called ‘Kernel’ to achieve the design goal for given design constraints (in particular, the highest performance of a designed ASIC for gate count and power consumption constraints).

The GCC Register-Transfer Language (RTL) operations are divided into primitive and basic operations. The primitive operations contain the minimum operations that can be included in the ASIC chip so that it can execute any C program. The primitive operations should be implemented in HW ‘Kernel’, which consists of an ALU, a one-bit shifter, and a register file. The basic operations contain other C operators that are not included in the primitive operations. A basic operation can be implemented using some HW choice (such as fast or slow hardware modules) or using a software subroutine (run-time
II. HW/SW Partitioning Algorithm

IMSP-2 does not consider pipeline characteristics such as pipelined FUs and pipeline hazards. Hence, we cannot use it to design an optimal pipelined ASIC. In particular, it cannot deal with the pipeline execution cycles of the ASIC accurately. In this section, we describe a method to evaluate pipeline hazards accurately when all the basic operations are implemented in HW.

A. Definitions and Notations

The HW/SW partitioning problem in the current version of PEAS-4 is defined as follows [2]:

For a whole set of all candidate instructions representing a given application domain, select a set of implementation methods which maximize the performance of the CPU under the constraints of chip area and power consumption, taking into account the functional module sharing relation among instructions.

In order to formalize IMSP-2P we need the following definitions and notations:

1. "n" denotes the total number of basic operations to be considered.

2. "x_i" denotes an implementation method that realizes operation #i, where x_i may be HW choice or SW, 0 ≤ i ≤ n. Then X = (x_0, x_1, ..., x_n) is a combination of implementation methods to be considered. (x_0 denotes HW ‘Kernel’ to implement all primitive operations and SW modules).

3. "a(x_i)" and "p(x_i)" denote the area and power consumption required for implementation method x_i respectively, where 0 ≤ i ≤ n.

4. "A_max" and "P_max" denote the available chip area and the maximum power consumption allowable for the computing modules in the ASIC chip.

5. "N" denotes the total number of basic blocks in the application program’s GCC RTL code.

6. "t(B_j, X)" denotes the execution cycles needed to execute basic block B_j using a combination of implementation methods X, where 1 ≤ j ≤ N.

7. "F_j" denotes the execution frequency count of basic block B_j in the given set of application programs, where 1 ≤ j ≤ N.

8. "c_j" denotes clock cycles needed to define control (e.g., branch delay) from block B_j to another one, where 1 ≤ j ≤ N. Here, it is assumed that all branches are taken and delay slot scheduling is not performed.

9. "b" denotes execution cycles reduced by un-taken branches in execution of the given application program.

B. IMSP-2P Formalization

Find a solution vector

\[ X = (x_0, x_1, \ldots, x_n) \]

which minimizes the objective function:

\[ T(X) = \sum_{j=1}^{N} (F_j \times (t(B_j, X) + c_j)) - b \]

subject to the constraints

\[ \sum_{x_i \in S} a(x_i) \leq A_{max}, \]

\[ \sum_{x_i \in S} p(x_i) \leq P_{max}, \]

where

\[ S = \bigcup_{i=0}^{n} \{x_i\} \]

C. Outline of the IMSP-2P Solver

The key point in computing Eq.(1) is how to get t(B_j, X). We have developed a HW/SW partitioning-oriented pipeline scheduling algorithm [8] to estimate t(B_j, X) for basic block B_j under configuration X. The pipeline control hazards are addressed in introducing the coefficients c_j. Note that the number of clock cycles due to control hazards is equal to \( \sum_{j=1}^{N} (F_j \times c_j) - b \). The pipeline scheduling algorithm detects and resolves all types of data hazards and structural hazards by ensuring that no more than one instruction can be issued or completed at each control step.

The IMSP-2P can also be solved using the branch-and-bound method as IMSP-2 can. One of the most important issues in solving problems efficiently by this method is to find a tight lower-bound function to prune as many non-optimum solutions as early as possible. For more detail please refer to Refs. [9] and [7]. Note that coefficient b was not introduced in the previous IMSP-2P formalization in Ref. [9]. Please note that the module sharing capability and heuristic reordering are the same as in the IMSP-2 solver.

The input to the IMSP-2P solver includes information from the Application Program Analyzer (APA), in the
previous section, and a module information database of HW/SW modules for implementing basic operations [7]. The output of the IMSP-2P solver includes the optimum implementation method of each basic operation and pipelined schedules of basic blocks. The instruction set of the designed ASIP will include the primitive operations as default and those basic operations that are selected to be implemented in HW. The algorithm automatically integrates the functional module sharing basic operations into one HW module whenever possible.

III. ADAPTIVE DATABASE APPROACH

So far it was assumed that the execution cycles of a SW implemented operation is fixed to the expected execution cycles for any application program. Estimation errors for designs with SW implementations range from 15% to 30% for the IMSP solvers as shown in the next section. As a result, the selected architecture as well as the instruction set may not be optimal. In this section we propose a method to generate an adaptive database of expected execution cycles of SW implemented operations for a given application program with associated input data.

A. Getting the Input Data of Each Basic Operation

Using the application program analyzer (APA) of the PEAS-I system to profile the given application program we get a sequence of all input data of each basic operation through running the program. We denote \( G_i \) as a sequence of all input data of basic operation \( \#i \) (1 \( \leq \#i \leq n \)). Let \( G_i \) be

\[
G_i = (g_i^{(1)}, g_i^{(2)}, \ldots, g_i^{(f_i)})
\]

where \( f_i \) is the number of elements in \( G_i \) as the same execution frequency count of basic operation \( \#i \) in execution of the application program. Note that \( g_i^{(j)} \) can be a tuple depending on the number of operands of basic operation \( \#i \). Then, the number of cycles needed to execute SW implemented basic operation \( \#i \) with input data \( g_i^{(j)} \) is denoted as \( \tau_i(g_i^{(j)}) \).

B. Evaluation of Execution Cycles

We can use the run-time subroutine for each basic operation \( \#i \) and run it with each \( g_i^{(j)} \) to get \( \tau_i(g_i^{(j)}) \) by using the PEAS-I simulator, as performed in Ref. [9]. However, the time needed for getting all \( \tau_i(g_i^{(j)}) \) is about as same as the time for running the APA with all operations to be implemented in SW. This method is very time consuming. In order to reduce the computation time, we introduce formulae to estimate execution cycles of SW modules for basic operations with given input data.

Let \( A \) and \( B \) be operands in the instruction of the form ‘\( OP \ C, A, B \)’ with the meaning \( C \leftarrow A \ OP B \), where \( OP \) is an operation. We define

\[
s(A) = \begin{cases} 
1, & \text{if } A < 0, \\
0, & \text{otherwise},
\end{cases}
\]

and \( s(B) \) is defined similarly. \( \text{left}(B) \) denotes the position of the leftmost ‘1’ in the binary presentation of \( B \) (counting from the right on the left). \( \text{one}(B) \) denotes the number of 1’s in the binary presentation of \( B \). Investigating assembly codes of SW modules in the current PEAS-I system, we have found the following formulae. Please note that these formulae depend on the algorithms used in run-time routines for basic operations.

\[
\tau_{\text{mul}}(A, B) = 26 + s(A) + s(B)
\]

\[
+ 7 \times \text{max}\{0, \text{left}(|B|) - 1\}
\]

(5)

\[
\tau_{\text{nmul}}(A, B) = 13 + 7 \times \text{max}\{0, \text{left}(B) - 1\}
\]

(6)

\[
\tau_{\text{div}}(A, B) = 278 + s(A) + s(B) + \text{one}(|A|/|B|)
\]

(7)

\[
\tau_{\text{adiv}}(A, B) = 263 + \text{one}(A/B)
\]

(8)

\[
\tau_{\text{mod}}(A, B) = 278 + s(A) + s(B)
\]

(9)

\[
\tau_{\text{nmul}}(A, B) = 263
\]

(10)

\[
\tau_{\text{ashr}}(A, B) = \tau_{\text{ashl}}(A, B) = \tau_{\text{shl}}(A, B)
\]

(11)

\[
= \tau_{\text{lsb}}(A, B) = 20 + \sum_{j, \text{BIT}_j(B)=1} (2^{j+1} - 1)
\]

\[
\tau_{\text{extendl}}(A) = 10 + \text{BIT}_{15}(A)
\]

(12)

\[
\tau_{\text{extendq}}(A) = 8 + 3 \times \text{BIT}_{15}(A)
\]

(13)

\[
\tau_{z,\text{extendl}}(A) = 2
\]

(14)

\[
\tau_{z,\text{extendq}}(A) = 1
\]

(15)

where \( \text{BIT}_j(B) \) is the value of bit \( j \) in the binary presentation of \( B \) (bit 0 is the rightmost bit).
C. Evaluation of Average Execution Cycles

The average execution cycles of SW implemented basic operation \( \#i \) for the given application program are defined as follows:

\[
\tau(\#i) = \frac{\sum_{j=1}^{f_i} r_i(g^{(j)}_i)}{f_i}
\]  

(16)

These values replace the old values in the database to run the IMSP solvers. The revised database is adaptive to the application program.

In order to reduce the computation time, we define \( U_i \) as a set of different (unique) elements in \( G_i \) as follows:

\[
U_i = \bigcup_{j=1}^{f_i} \{ g^{(j)}_i \}
\]  

(17)

and define \( k_i(u) \) as frequency count of \( u \) in \( G_i \). Then, Eq.(16) can be rewritten as follows:

\[
\tau(\#i) = \frac{\sum_{u \in U_i} k_i(u) \times \tau_i(u)}{f_i}
\]  

(18)

Note that \( \sum_{u \in U_i} k_i(u) = f_i \) and \( |U_i| \) is much smaller than \( f_i \) as shown later in Table 2.

IV. EXPERIMENTS AND RESULTS

The IMSP-2P algorithm with the adaptive database approach has been implemented in C and examined on a workstation. A set of sample programs has been performed to evaluate the effectiveness and efficiency of the algorithm.

A. Sample Programs

The sample programs used in the experiments are as follows:
1. ESS : Equation System Solver program, which solves a system of two linear equations using Cramer’s rule.
2. IMC : Inverse Matrix Calculator program that computes the inverse of a non-singular 3 x 3 matrix using Cramer’s rule.
3. diffeq : A program for solving a second order differential equation from Ref. [10].

These sample programs were fed to APA of the PEAS-I system. The code optimization was performed by the GCC [6].

B. Module Library

We use a module library with both non-pipelined FUs and pipelined FUs such as multipliers and dividers generated using a high-level synthesis system called PARTHENON [11] and cell library VSC470.lib (0.8μmCMOS) from VLSI Technology, Inc. A 16 MHz clock was assumed in the design of HW modules. The database contains 14 basic operations, each of them has different implementation methods ranging from 2 to 11. The number of leaf nodes in the search tree is of \( 11^2 \times 7^4 \times 2^8 \approx 74,373,376 \). The whole search tree ranges from 8.2 \( \times 10^9 \) to 1.5 \( \times 10^{10} \) nodes depending on the order of variables \( (x_i’s) \) to be examined. Therefore, it is necessary to have an efficient strategy to explore the search space to get an optimal solution in a reasonable time.

Part of the HW module information database used in the experiments is described in Ref.’s. [3] and [7]. Another part of the module database contains the expected numbers of execution cycles for each basic operation to be implemented by software subroutine. This part is shown in Table 1, where the values were obtained as the average of execution cycles in running the corresponding subroutine with random input data. So far, these values are considered as fixed (therefore, non-adaptive) to any application program in the HW/SW partitioning process.

C. Effectiveness

We demonstrate the IMSP-2P using the adaptive database approach for the IMC, ESS and diffeq programs.

The IMSP-2P selected the optimum partitioning for different values of area constraint. The power consumption was ignored to simplify the experimental cases.

FUs needed to implement basic operations for these sample programs are a multiplier, a divider, a barrel-arithmetic shifter, an extender and so on, where the multiplier and the divider can be pipelined or non-pipelined.

Figures 1 and 2 show estimation errors by IMSP-2 and IMSP-2P, respectively, using the fixed (non-adaptive) database. Note that IMSP-2P can estimate the execution cycles much more accurately than IMSP-2. That is, estimation errors are below 1.3% for designs with gate count constraints exceeding 22 Kcates, where all operations can be implemented in HW. On the other hand, the estimation errors by IMSP-2 range from 5% to 20% because of

<table>
<thead>
<tr>
<th>Basic Operation</th>
<th># Cycles</th>
<th>Basic Operation</th>
<th># Cycles</th>
</tr>
</thead>
<tbody>
<tr>
<td>div</td>
<td>216</td>
<td>div</td>
<td>96</td>
</tr>
<tr>
<td>add</td>
<td>302</td>
<td>add</td>
<td>91</td>
</tr>
<tr>
<td>sub</td>
<td>214</td>
<td>sub</td>
<td>2</td>
</tr>
<tr>
<td>sub</td>
<td>201</td>
<td>sub</td>
<td>1</td>
</tr>
<tr>
<td>add</td>
<td>31</td>
<td>add</td>
<td>10</td>
</tr>
<tr>
<td>add</td>
<td>31</td>
<td>add</td>
<td>9</td>
</tr>
<tr>
<td>bsh</td>
<td>31</td>
<td>bsh</td>
<td>2</td>
</tr>
<tr>
<td>bsh</td>
<td>31</td>
<td>bsh</td>
<td>1</td>
</tr>
</tbody>
</table>

TABLE I

Expected execution cycles of SW implementations
TABLE II
GENERATION OF ADAPTIVE DATABASE

<table>
<thead>
<tr>
<th>i</th>
<th>Basic Operation</th>
<th>IMC program</th>
<th>ESS program</th>
<th>diffeq program</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>$f_i$, $</td>
<td>U_i</td>
<td>$, $\tau(#i)$</td>
</tr>
<tr>
<td>1</td>
<td>mul</td>
<td>720, 562, 52</td>
<td>180, 153, 52</td>
<td>868, 852, 48</td>
</tr>
<tr>
<td>2</td>
<td>div</td>
<td>653, 314, 279</td>
<td>377, 130, 280</td>
<td>147, 117, 283</td>
</tr>
<tr>
<td>3</td>
<td>mod</td>
<td>383, 46, 278</td>
<td>317, 70, 278</td>
<td>147, 117, 278</td>
</tr>
<tr>
<td>4</td>
<td>ashi</td>
<td>1648, 48, 21</td>
<td>1226, 42, 21</td>
<td>829, 85, 21</td>
</tr>
<tr>
<td>5</td>
<td>extendqi</td>
<td>1080, 36, 8</td>
<td>1800, 42, 8</td>
<td>409, 66, 8</td>
</tr>
<tr>
<td>6</td>
<td>z_extendqi</td>
<td>1193, 37, 1</td>
<td>1825, 48, 1</td>
<td>512, 72, 1</td>
</tr>
</tbody>
</table>

Fig. 1. Estimation errors by IMSP-2

Fig. 2. Estimation Errors by IMSP-2P

D. Efficiency

The proposed method is quite efficient. Table 3 shows the SPARCstation 10 (SS-10) CPU time in seconds for performing each experiment. The CPU time needed for APA to profiles an application program is below 20s. Then, an adaptive database can be generated within 0.1s. An optimal ASIP architecture is decided by the so-called Architecture Information Generator (AIG) with the IMSP

<table>
<thead>
<tr>
<th>CPU time in seconds</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>ESS</strong></td>
</tr>
<tr>
<td>APA</td>
</tr>
<tr>
<td>Adaptive DB</td>
</tr>
<tr>
<td>AIG (IMSP-2P)</td>
</tr>
<tr>
<td><strong>Total</strong></td>
</tr>
</tbody>
</table>

*SPARCstation 10

not taking into account the pipeline characteristics. In addition, for designs with SW implementations, the execution cycle estimation reported by IMSP-2P will contain some error up to 32\% as in the case of IMSP-2 because the execution cycles of SW modules depend on the application program.

The analyzed results as well as statistics of the experiment for generating adaptive database are shown in Table 2, where there are 6 basic operation types met in these application programs. Note that the number of unique input data (i.e., $|U_i|$) is much smaller than the frequency count (i.e., $f_i$) for basic operations #3 – #6.

Using the generated adaptive database the IMSP-2P solver selected the first optimal solutions exactly, whereas IMSP-2 may not. Note that the performance estimation errors by the IMSP-2P solver are below 1.1\%, 2.7\%, 1.3\% for any area constraint as shown in Figure 3 for IMC, ESS, diffeq, respectively.
solvers. For a given area constraint, AIG with the IMSP-2P solver has taken CPU time of 2.0s, 2.3s, and 2.1s, which is average for IMC, ESS, and diffeq, respectively.

Thus, PEAS-I accepts as input a C written application program and design constraints. Using the proposed method, we can get an optimal pipelined ASIC architecture within a few seconds (below 25s) as shown in Table 3 for the total time.

V. CONCLUSION AND FUTURE WORK

We have proposed an efficient method to design an optimal pipelined instruction set processor in the PEAS-I system. The method with IMSP-2P is introduced as a HW/SW codesign problem formalization and as an extension of IMSP-2. The IMSP-2P selects the implementation method of the operations that implement a pipelined ASIC instruction set so that the performance goal is maximized under the given gate count and power consumption constraints. Then, we proposed an adaptive database approach to reduce the execution cycle estimation error. The effectiveness of the IMSP-2P algorithm was demonstrated through design examples. The method estimates the performance of a designed pipelined ASIC accurately for most designs. The algorithm is so efficient that the optimal solution was obtained within a few seconds on a conventional workstation.

Using the adaptive database and introducing the new formula for execution cycle estimation in IMSP-2P we enhance the optimality of the solutions and reduce performance estimation error remarkably. Experimental results show that IMSP-2P gave the first optimal solution for any gate count constraint with performance estimation errors of below a few %. The primary goal of selecting the first optimal architecture with low estimation error has been achieved in the PEAS-I HW/SW Codesign system.

Our future work includes the development of a HW/SW partitioning algorithm for pipelined ASIC design with the least gate count under a given execution cycle and power consumption constraints, as well as for the design with the lowest power consumption under gate count and execution cycle constraints.

Acknowledgments

Authors would like to express their thanks to NTT Communication Science Laboratories, VLSI Technology, Inc., Science Create, Co. Ltd., Japan, for their kind assistance. This research is supported in part by Grant-in-Aid for Scientific Research No’s. 07558038 and 07680353 from the Ministry of Education, Science and Culture, Japan.

References


