# Self-timed 1-D ICT Processor

Johnson T.C. Pang

Oliver C.S. Choy

C.F. Chan

W.K. Cham

Department of Electronic Engineering The Chinese University of Hong Kong Shatin, N.T., Hong Kong

| Tel: +852-2609-8272   | Tel: +852-2609-8280   | Tel: +852-2609-8268   | Tel: +852-2609-8281   |
|-----------------------|-----------------------|-----------------------|-----------------------|
| Fax: +852-2609-5558   | Fax: +852-2609-5558   | Fax: +852-2609-5558   | Fax: +852-2609-5558   |
| tcpang@ee.cuhk.edu.hk | cschoy@ee.cuhk.edu.hk | cfchan@ee.cuhk.edu.hk | wkcham@ee.cuhk.edu.hk |

Abstract - This paper describes an LSI implementation of a 1-D order-8 Integer Cosine Transform (ICT) that can compute either the forward or the inverse transform. It is a standard-cell design in a 0.7 $\mu$m CMOS SLP DLM process. Performance is maximized by combining a fast computation algorithm with self-timed circuit techniques. The chip consists of eight parallel self-timed pipelines, and each self-timed block is based on a 2-phase handshaking protocol and a variable-delay concept. The die measures 5.7 x 4.1 mm and contains about 76k transistors. The chip supports 16-bit I/O data at data rates of up to 60 MHz.

### I. INTRODUCTION

The Discrete Cosine Transform (DCT) is the most widely used transform for image compression. An integer transform denoted ICT(10,9,6,2,9,3) has been shown to be a promising alternative to the DCT because of its implementation simplicity, its performance close to that of the DCT and its compatibility with the DCT.

Asynchronous digital system design has been discussed extensively over the past few decades because of its advantages over conventional synchronous systems. It removes the limitations that clock skew and clock distribution impose on globally clocked synchronous systems, and it offers a higher degree of modularity, lower power consumption and average-case rather than worst-case delay.

The advantage of this asynchronous ICT chip is that its calculation time may be shorter than that of a synchronous design, because the computation time of each cycle in each self-timed block can differ and depends on the operation being performed (e.g. multiplication or addition). In a synchronous system, by contrast, the time allotted to every operation is fixed. Moreover, this chip demonstrates a delay selection mechanism and the performance of a micropipeline system with 2-phase handshake control.

### II. ORDER-8 INTEGER COSINE TRANSFORM (ICT)

The Integer Cosine Transform was derived from the DCT by the concept of dyadic symmetry[1]. The (i,j)th kernel component of the order-N DCT (here N = 8) is:

$$\begin{split} t_{c}(i,j) &= \sqrt{\frac{2}{N}} \cos\left\{\frac{i(2j+1)\pi}{2N}\right\} & \text{for } i \neq 0, j \in [0, N-1] \\ &= \frac{1}{\sqrt{N}} & \text{for } i = 0, j \in [0, N-1] \end{split}$$
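The kernel definition above can be checked numerically. The following is a minimal Python sketch, not part of the original design; the names `dct_kernel` and `dot` are illustrative. It builds the order-8 kernel from the formula and confirms that the basis vectors are orthonormal and that the odd rows are antisymmetric, which is the structure the ICT kernel later reuses.

```python
import math

def dct_kernel(n=8):
    """Return the order-n DCT kernel t_c(i, j) as a list of rows."""
    rows = []
    for i in range(n):
        if i == 0:
            # i = 0: constant basis vector 1/sqrt(N)
            rows.append([1.0 / math.sqrt(n)] * n)
        else:
            # i != 0: sqrt(2/N) * cos(i(2j+1)pi / 2N)
            rows.append([
                math.sqrt(2.0 / n) * math.cos(i * (2 * j + 1) * math.pi / (2 * n))
                for j in range(n)
            ])
    return rows

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

T = dct_kernel(8)

# Largest deviation of <row_i, row_j> from the identity matrix.
max_err = max(
    abs(dot(T[i], T[j]) - (1.0 if i == j else 0.0))
    for i in range(8) for j in range(8)
)

# Row 1 is antisymmetric about its centre: t(1, j) = -t(1, 7 - j).
row1_antisymmetric = all(abs(T[1][j] + T[1][7 - j]) < 1e-12 for j in range(8))
```

Within each row, components of equal magnitude differ only in sign, which is what allows them to be replaced by shared variables in the next step.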

By representing kernel components of the same magnitude with the same variable, the DCT kernel can be expressed as a matrix [T] whose (i,j)th component is $t_c(i,j)$.

Let [T] = [K][J] as shown in Fig. 1(left), where [K] is a diagonal matrix whose (i,i)th element equals $k_i$ and [J] contains the components a, b, c, d, e, f and g. $k_i$ is the scaling factor that makes the ith basis vector of unit magnitude.

$$
[T] = \left[\begin{array}{l rrrrrrrr l}
k_0\,( & g & g & g & g & g & g & g & g & ) \\
k_1\,( & a & b & c & d & -d & -c & -b & -a & ) \\
k_2\,( & e & f & -f & -e & -e & -f & f & e & ) \\
k_3\,( & b & -d & -a & -c & c & a & d & -b & ) \\
k_4\,( & g & -g & -g & g & g & -g & -g & g & ) \\
k_5\,( & c & -a & d & b & -b & -d & a & -c & ) \\
k_6\,( & f & -e & e & -f & -f & e & -e & f & ) \\
k_7\,( & d & -c & b & -a & a & -b & c & -d & )
\end{array}\right]
$$

Fig. 1 ICT kernel [T] (left), and ICT fast computational algorithm (right)

It can be shown that [T] is orthogonal if:

$$ab = ac + bd + cd \tag{1}$$

Transforms [T] are called Integer Cosine Transforms, or ICT(a,b,c,d,e,f), if they also satisfy the following conditions:

$$a \ge b \ge c \ge d \quad \text{and} \quad e \ge f \tag{2}$$

and

$$a, b, c, d, e \text{ and } f \text{ are integers.} \tag{3}$$
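Conditions (1) to (3) are easy to check for a candidate parameter set. The sketch below is illustrative only: the helper names are hypothetical, and the value g = 1 is an assumption, since the text does not give g (orthogonality of the rows does not depend on it). It builds [J] in the layout of Fig. 1(left), tests the three conditions for ICT(10,9,6,2,9,3), and derives the scaling factors $k_i$ as the reciprocal of each row's magnitude.

```python
import math

def ict_kernel(a, b, c, d, e, f, g=1):
    """Unscaled kernel [J] in the layout of Fig. 1 (left). g = 1 is an assumption."""
    return [
        [g,  g,  g,  g,  g,  g,  g,  g],
        [a,  b,  c,  d, -d, -c, -b, -a],
        [e,  f, -f, -e, -e, -f,  f,  e],
        [b, -d, -a, -c,  c,  a,  d, -b],
        [g, -g, -g,  g,  g, -g, -g,  g],
        [c, -a,  d,  b, -b, -d,  a, -c],
        [f, -e,  e, -f, -f,  e, -e,  f],
        [d, -c,  b, -a,  a, -b,  c, -d],
    ]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def is_ict(a, b, c, d, e, f):
    """Check conditions (1), (2) and (3) from the text."""
    cond1 = a * b == a * c + b * d + c * d          # orthogonality, eq. (1)
    cond2 = a >= b >= c >= d and e >= f             # ordering, eq. (2)
    cond3 = all(isinstance(v, int) for v in (a, b, c, d, e, f))  # integers, eq. (3)
    return cond1 and cond2 and cond3

J = ict_kernel(10, 9, 6, 2, 9, 3)

# Every pair of distinct rows has a zero inner product.
rows_orthogonal = all(
    dot(J[i], J[j]) == 0 for i in range(8) for j in range(8) if i != j
)

# k_i scales row i of [J] to unit magnitude.
k = [1.0 / math.sqrt(dot(row, row)) for row in J]
```

For ICT(10,9,6,2,9,3), condition (1) reads 10·9 = 10·6 + 9·2 + 6·2, i.e. 90 = 90.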

Condition (2) ensures that the basis vectors of ICT(a,b,c,d,e,f) resemble those of the DCT, and condition (3) ensures that the transform components can be represented by a finite number of bits. Many ICTs are possible. ICT(10,9,6,2,9,3) has been shown to be a promising alternative to the DCT because its kernel components require only 4 bits each and its performance is very close to that of the DCT in both transform efficiency and mean-square error[2]. The order-8 ICT can be computed using the fast computational algorithm shown in Fig. 1(right), which requires only integer multiplication and addition.
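To illustrate how the symmetry of the kernel reduces the arithmetic, the sketch below computes the unscaled product [J]x in two ways: directly, and through a sum/difference (butterfly) stage followed by small integer products. This is a generic decomposition that follows from the even rows being symmetric and the odd rows antisymmetric; it is not claimed to be the exact flow graph of Fig. 1(right). The function names are hypothetical, g = 1 is an assumption, and the scaling by $k_i$ is omitted.

```python
# ICT(10,9,6,2,9,3); G = 1 is an assumption, the text does not give g.
A, B, C, D, E, F, G = 10, 9, 6, 2, 9, 3, 1

J = [
    [G,  G,  G,  G,  G,  G,  G,  G],
    [A,  B,  C,  D, -D, -C, -B, -A],
    [E,  F, -F, -E, -E, -F,  F,  E],
    [B, -D, -A, -C,  C,  A,  D, -B],
    [G, -G, -G,  G,  G, -G, -G,  G],
    [C, -A,  D,  B, -B, -D,  A, -C],
    [F, -E,  E, -F, -F,  E, -E,  F],
    [D, -C,  B, -A,  A, -B,  C, -D],
]

def ict_direct(x):
    """Reference: full 8x8 integer matrix-vector product."""
    return [sum(c * v for c, v in zip(row, x)) for row in J]

def ict_fast(x):
    """Same result using a butterfly stage that exploits row symmetry."""
    s = [x[j] + x[7 - j] for j in range(4)]   # sums feed the even rows
    t = [x[j] - x[7 - j] for j in range(4)]   # differences feed the odd rows
    p, q = s[0] + s[3], s[1] + s[2]
    u, v = s[0] - s[3], s[1] - s[2]
    return [
        G * (p + q),
        A * t[0] + B * t[1] + C * t[2] + D * t[3],
        E * u + F * v,
        B * t[0] - D * t[1] - A * t[2] - C * t[3],
        G * (p - q),
        C * t[0] - A * t[1] + D * t[2] + B * t[3],
        F * u - E * v,
        D * t[0] - C * t[1] + B * t[2] - A * t[3],
    ]

sample = [12, -7, 3, 100, 0, -55, 8, 1]
same = ict_fast(sample) == ict_direct(sample)
```

The direct product needs 64 multiplications, whereas this decomposition needs 22 (16 for the odd rows, 4 for rows 2 and 6, and 2 by g), all on integers.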

### III. SELF-TIMED LOGIC DESIGN

There are many ways to implement a self-timed logic system. Our previous investigation of self-timed logic design found the micropipeline structure very suitable for this ICT chip because of its simple hardware structure and higher efficiency.

The micropipeline structure used in the processor is based on the one introduced by Ivan Sutherland[3]. It consists of a handshake control circuit, a completion signal generation circuit and data-path circuitry. It uses a bounded-delay approach, which is similar to traditional synchronous digital logic: a delay element generates the completion signal, and its delay value can be estimated by simulation or by calculating the worst-case delay of the critical path in the computation. Since the completion signal is not extracted from encoded data, a computation block can be implemented in single-rail logic and the data do not have to return to zero in each cycle. The computation block can therefore be exactly the same as one used in a synchronous digital system. Fig. 2(left) shows a block diagram of the micropipeline.

Fig. 2(right) is the timing diagram of 2-phase handshake signaling. It is essentially the same as 4-phase signaling except that there are only two transitions in each data transfer cycle.
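The 2-phase protocol can be modelled in a few lines. The sketch below is a behavioural illustration only, not the chip's handshake circuit; the class and method names are hypothetical. Any transition on the request wire, rising or falling, signals that data is valid, and any transition on the acknowledge wire signals that it has been taken, so each transfer costs exactly two transitions.

```python
class TwoPhaseChannel:
    """Behavioural model of a bundled-data channel with 2-phase signaling."""

    def __init__(self):
        self.req = 0          # request wire level
        self.ack = 0          # acknowledge wire level
        self.data = None      # single-rail data bundle
        self.transitions = 0  # total signal transitions on req and ack

    def can_send(self):
        # The channel is idle when both wires are at the same level.
        return self.req == self.ack

    def send(self, value):
        if not self.can_send():
            raise RuntimeError("previous transfer not yet acknowledged")
        self.data = value     # data must be valid before the request event
        self.req ^= 1         # one transition = one request event
        self.transitions += 1

    def can_receive(self):
        # A pending request shows up as differing wire levels.
        return self.req != self.ack

    def receive(self):
        if not self.can_receive():
            raise RuntimeError("no pending request")
        value = self.data
        self.ack ^= 1         # one transition = one acknowledge event
        self.transitions += 1
        return value


channel = TwoPhaseChannel()
received = []
for word in [3, 1, 4]:
    channel.send(word)
    received.append(channel.receive())
```

After three transfers the model has counted six transitions; a 4-phase (return-to-zero) protocol would need twelve for the same traffic.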



Fig. 2 Micropipeline structure (left); 2-phase handshake protocol (right).



Fig. 3 Voltage Controlled Delay (left); Delay selection circuit (right).

In some cases, the value of the delay element can be selected or varied according to operating conditions or input data patterns. Muscato and Albicki proposed a locally clocked microprocessor that uses different delay values for different operations in an ALU[4]. In our ICT processor, delay value selection is applied to reduce the overall computation time; for example, an addition instruction needs less delay than a multiplication. One obvious method is to vary the control voltage of an adjustable delay element, as shown in Fig. 3(left). Another is to use two or more fixed-value delay elements with multiplexers, as shown in Fig. 3(right).
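The benefit of delay selection can be illustrated with a small timing model in the spirit of Fig. 3(right), where a multiplexer picks one of several fixed delay elements per instruction. The delay figures and the instruction sequence below are hypothetical and chosen only for illustration; the paper does not report per-operation delays.

```python
# Hypothetical matched delays in nanoseconds (not taken from the paper).
DELAY_NS = {"shift": 4, "add": 8, "mul": 14}

def fixed_delay_time(ops):
    """Every operation waits for the single worst-case delay element."""
    worst = max(DELAY_NS.values())
    return worst * len(ops)

def selected_delay_time(ops):
    """Each operation selects the delay element matched to its own type."""
    return sum(DELAY_NS[op] for op in ops)

# Hypothetical instruction sequence for one self-timed block.
program = ["add", "add", "mul", "shift", "add", "mul"]

t_fixed = fixed_delay_time(program)        # 6 operations x 14 ns
t_selected = selected_delay_time(program)  # 8 + 8 + 14 + 4 + 8 + 14 ns
```

With these assumed numbers the sequence takes 84 ns with a single worst-case delay and 56 ns with delay selection; the actual saving depends on the instruction mix and the real delays.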

### IV. IMPLEMENTATION OF THE SELF-TIMED 1-D ICT CHIP

Constraints such as die size, pin count and wiring complexity were considered in this chip design. A modular architecture is employed to allow data pipelining and parallel processing. Our ICT chip can calculate the 1-D forward and inverse ICT described in Section II. It consists of eight parallel self-timed computation blocks, as shown in Fig. 4. Each self-timed block is responsible for calculating one element of the transformed vector X (i.e. X1, X2, ...). Each block has its own internal clocking and handshake signals, and also generates and receives external handshake signals to ensure correct data flow and operation. For example, once intermediate data from another self-timed block is ready, the block receives a request signal. If the block is ready and accepts the data, it returns an acknowledge signal to the requester. All eight self-timed blocks can therefore operate concurrently and asynchronously. Data flow and operation sequence depend on the connections of the handshake control signals and the microcode stored in each block.



Fig. 4 Self-timed 1-D ICT core processor: Block diagram (left); Layout (right)



Fig. 5 Block diagram of: Self-timed Computational block (left); Integer Execution Unit (right)

Fig. 5(left) is the block diagram of each self-timed computation block. The handshake control circuit (HCC) manages the request and acknowledge signals from other self-timed blocks. The HCC also generates the appropriate internal clock for the pipeline in the integer execution unit (IEU). Moreover, the instruction decoder generates instruction control signals that control the operation of the IEU and the delay selection unit.

Fig. 5(right) is the block diagram of the IEU. The input router selects the appropriate data from the data buses. Since the kernel components of the transform matrix [T] are all integers of at most four bits, the multiplier is very simple and needs only to calculate ×1/2, ×2, ×3, ×4 and ×5. If a shift or a single addition is required, a smaller delay value is selected, so the average time to calculate the transform may be reduced.
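A multiplier limited to ×1/2, ×2, ×3, ×4 and ×5 can be built from shifts and at most one addition. The sketch below shows one way the kernel constants of ICT(10,9,6,2,9,3) could be reached by chaining such steps. The chaining table `KERNEL_STEPS` is an assumption made for illustration; the paper does not state how the IEU sequences its multiplications.

```python
# The five multiplier operations named in the text, as shift/add primitives.
PRIMITIVES = {
    "x1/2": lambda x: x >> 1,        # arithmetic shift right (floors the result)
    "x2":   lambda x: x << 1,        # shift only
    "x3":   lambda x: (x << 1) + x,  # shift plus one addition
    "x4":   lambda x: x << 2,        # shift only
    "x5":   lambda x: (x << 2) + x,  # shift plus one addition
}

# Hypothetical decomposition of each kernel constant into primitive steps.
KERNEL_STEPS = {
    2:  ["x2"],
    3:  ["x3"],
    6:  ["x3", "x2"],   # 6 = 3 * 2
    9:  ["x3", "x3"],   # 9 = 3 * 3
    10: ["x5", "x2"],   # 10 = 5 * 2
}

def times(constant, x):
    """Multiply x by a kernel constant using only the primitive steps."""
    for step in KERNEL_STEPS[constant]:
        x = PRIMITIVES[step](x)
    return x

# Exhaustive check over a small signed range.
all_match = all(
    times(c, x) == c * x for c in KERNEL_STEPS for x in range(-128, 128)
)
```

Steps such as ×2 and ×4 are pure shifts, which is why a shorter delay can be selected for them than for steps that include an addition.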

The layout of the chip, shown in Fig. 4(right), was designed using the Cadence Cell-Ensemble tool. Most of the CMOS logic gates are provided by the ES2 library, while the C-element and all delay elements are custom designs. The layout is partitioned into six groups: an input buffer, an output buffer and four groups of two self-timed computation blocks (4x2). The chip has 68 I/O pins, including some for monitoring internal signals during testing.

### V. CHARACTERISTICS AND SIMULATION RESULTS

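| Characteristic              | Value                                                   |
|-----------------------------|---------------------------------------------------------|
| CMOS process / Foundry      | 0.7 µm SLP DLM / ES2                                    |
| I/O data rate               | 50 MHz (forward transform), 60 MHz (inverse transform)  |
| Die size / Transistor count | 5.7 x 4.1 mm / about 76k                                |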



Fig. 6 Simulation result of Self-timed 1-D ICT processor

Fig. 6 shows the timing simulation result of the 0.7 µm CMOS self-timed ICT chip from the Verilog-XL® simulator. The timing diagram shows the operation of the chip, including the handshake signals and the input and output data. Do0 to Do7 are the internal data buses; they carry the intermediate and final results, which are calculated asynchronously.

### VI. CONCLUSION

A 1-D ICT processor based on a fast computational algorithm and a self-timed 2-phase handshake micropipeline structure has been described. It supports 16-bit data and operates in either forward or inverse transform mode. The self-timed modular design enables the designer to trade off speed against complexity simply by adding or removing self-timed blocks. Moreover, by using the variable delay-value method and a parallel structure, this self-timed ICT has the potential to operate at a higher average speed than a synchronous design.

### REFERENCES

- [1] W.K. Cham and F.S. Wu, "On Compatibility of Order-8 Integer Cosine Transforms and the Discrete Cosine Transform", IEEE Region 10 Conference on Computer and Communication Systems, Sept. 1990.
- [2] W.K. Cham, "Development of Integer Cosine Transforms by the Principle of Dyadic Symmetry", IEE Proceedings, Vol. 136, Pt. I, No. 4, August 1989.
- [3] I.E. Sutherland, "Micropipelines", Communications of the ACM, June 1989.
- [4] Stephen J. Muscato and Alexander Albicki, "Locally Clocked Microprocessor", IEEE Proceedings, 1993.