# CELL-BROADBAND-ENGINE-BASED REALTIME WAVELET DECOMPOSITION FOR HDTV VIDEO IMAGES AND BEYOND

Akihiro Asahara, Munehiro Doi, Yumi Mori, Hiroki Nishiyama, and Hiroki Nakano

IBM Japan Ltd.

### ABSTRACT

The Cell Broadband Engine (CBE) is a novel multi-core microprocessor designed to provide compact, high-performance processing capabilities for a wide range of applications. Real-time image processing applications that exploit parallelism over large amounts of data are good examples for demonstrating the unique capabilities of the CBE. In this paper, we evaluate the performance of image processing using wavelet transforms on the CBE. Our results show that the CBE is extremely efficient at this processing compared with commercially available processors, and we therefore conclude that the CBE is well suited to next-generation large pixel formats, such as 4K/2K Digital Cinema.

### 1. INTRODUCTION

A number of technologies for digital images have been developed, providing digital playback and display of feature films with the same quality as 35-mm film. High-resolution film scanners, digital image compression, high-speed data networking and storage, and advanced digital projection are the key components of these technologies, collectively known as "Digital Cinema" [1]. In September 2005, the world witnessed the first real-time distribution of a 4K digital video, with a resolution of about 8 million pixels, four times that of HDTV [2]. This event is expected to spur the development of software and hardware for higher-resolution image processing beyond HDTV.

To handle such high-resolution images, the wavelet transform is well known as one of the essential methods for compressing and decompressing them. Unfortunately, the processing time increases rapidly as the amount of input data or the number of filter taps increases. The image size for 4K Digital Cinema, 8 million pixels per frame, is too large for current commercial processors to process in real time. In this paper, the Cell Broadband Engine (CBE) is examined for real-time processing of the wavelet transform. The CBE is capable of intensive floating-point processing and is optimized for compute-intensive workloads and broadband rich-media applications. We describe how to design an efficient algorithm for multiprocessors and present the results of a performance evaluation of wavelet transforms on the CBE.

### 2. CELL ARCHITECTURE

The CBE used in this study is a multi-core chip consisting of two different types of processor elements: the PowerPC Processor Element (PPE) and multiple Synergistic Processor Elements (SPEs). The PPE is a general-purpose processor with a traditional virtual memory subsystem. It runs the operating system, handles the CBE's general workload, and manages special workloads for the SPEs [3]. Each SPE consists of a synergistic processing unit (SPU) and a 256-KB local store (LS). An SPE uses a special single-instruction multiple-data (SIMD) instruction set and relies on asynchronous direct memory access (DMA) to move data between main memory and the LS. The PPE and SPEs communicate coherently with each other, main storage, and I/O through the Element Interconnect Bus (EIB). Programming the CBE differs from programming a single-core architecture. The programmer must partition the application appropriately between the PPE and the SPEs, synchronize the processor elements (the PPE and its SPEs), and arrange code and data to fit in the LS.
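To give a feel for the local-store constraint, the following minimal sketch estimates how many image rows fit in one 256-KB LS. It assumes 4-byte single-precision pixels; the function name `rows_per_buffer` and the parameters `reserved_bytes` (space set aside for code and stack) and `buffers` (for example, 2 for double buffering of DMA transfers) are hypothetical and are not taken from the paper.

```python
LOCAL_STORE_BYTES = 256 * 1024  # size of one SPE local store, as stated in the text


def rows_per_buffer(width, bytes_per_pixel=4, reserved_bytes=0, buffers=1):
    """Estimate how many full image rows fit in each local-store buffer.

    reserved_bytes and buffers are illustrative knobs, not values from the paper.
    """
    usable = LOCAL_STORE_BYTES - reserved_bytes
    if usable <= 0:
        raise ValueError("no local store left after the reservation")
    return usable // (buffers * width * bytes_per_pixel)


# A 3,840-pixel row of 4-byte values occupies 15,360 bytes.
print(rows_per_buffer(3840))
# Hypothetical case: 64 KB reserved for code, two buffers for double buffering.
print(rows_per_buffer(3840, reserved_bytes=64 * 1024, buffers=2))
```

Under these assumptions only a handful of rows of the largest test image can be resident at once, which is why the image has to be streamed through the LS in small pieces.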



Fig. 1: Architecture of CBE

### 3. WAVELET TRANSFORM

Orthogonal wavelet transforms are widely used in various compression algorithms such as MPEG4 and JPEG2000. Therefore we implemented an orthogonal wavelet transform algorithm on the CBE processor to evaluate its performance. The following equations are used for the 2D wavelet decomposition:

$$s_{m,n}^{(j+1)} = \sum_{l} \sum_{k} \overline{p_{k-2m}} \overline{p_{l-2n}} s_{k,l}^{(j)}$$

$$w_{m,n}^{(j+1,h)} = \sum_{l} \sum_{k} \overline{p_{k-2m}} \overline{q_{l-2n}} s_{k,l}^{(j)}$$

$$w_{m,n}^{(j+1,v)} = \sum_{l} \sum_{k} \overline{q_{k-2m}} \overline{p_{l-2n}} s_{k,l}^{(j)}$$

$$w_{m,n}^{(j+1,d)} = \sum_{l} \sum_{k} \overline{q_{k-2m}} \overline{q_{l-2n}} s_{k,l}^{(j)}$$

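where $s^{(j)}$ is the level-$j$ scaling coefficient, $w^{(j)}$ is the level-$j$ wavelet coefficient (the superscripts $h$, $v$, and $d$ denote the horizontal, vertical, and diagonal detail components), $p$ is the scaling filter, $q$ is the wavelet filter, and the overline denotes complex conjugation [4]. The complete reconstruction is formulated as follows: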

$$\begin{split} s_{m,n}^{(j)} &= \sum_{k} \sum_{l} \Big[ p_{m-2k} p_{n-2l} \, s_{k,l}^{(j+1)} + p_{m-2k} q_{n-2l} \, w_{k,l}^{(j+1,h)} \\ &+ q_{m-2k} p_{n-2l} \, w_{k,l}^{(j+1,v)} + q_{m-2k} q_{n-2l} \, w_{k,l}^{(j+1,d)} \Big] \end{split}$$

In this paper, Daubechies wavelet filters with N = 2, 4, and 8 (that is, 4, 8, and 16 taps, respectively) are used for the performance evaluation.
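As an illustration of the decomposition and reconstruction formulas, the following sketch applies their 1D counterparts to a short signal using the 4-tap Daubechies filter (N = 2). It is a minimal sketch under stated assumptions, not the authors' implementation: the periodic boundary extension, the sign convention $q_k = (-1)^k p_{3-k}$ for the wavelet filter, and the function names `decompose_1d` and `reconstruct_1d` are all choices made here for illustration.

```python
import math

SQRT3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)

# Daubechies N=2 scaling filter p (4 taps), normalized so that sum(p_k^2) == 1.
P = [(1 + SQRT3) / NORM, (3 + SQRT3) / NORM, (3 - SQRT3) / NORM, (1 - SQRT3) / NORM]
# Wavelet filter q from the relation q_k = (-1)^k * p_{3-k} (assumed convention).
Q = [(-1) ** k * P[len(P) - 1 - k] for k in range(len(P))]


def decompose_1d(s):
    """One level of 1D decomposition with periodic extension (assumed boundary rule)."""
    n = len(s)
    if n % 2 != 0 or n < len(P):
        raise ValueError("signal length must be even and at least the filter length")
    half = n // 2
    # s'[m] = sum_k p[k - 2m] s[k]  and  w'[m] = sum_k q[k - 2m] s[k]
    approx = [sum(P[i] * s[(2 * m + i) % n] for i in range(len(P))) for m in range(half)]
    detail = [sum(Q[i] * s[(2 * m + i) % n] for i in range(len(Q))) for m in range(half)]
    return approx, detail


def reconstruct_1d(approx, detail):
    """Inverse step: s[m] = sum_k (p[m - 2k] s'[k] + q[m - 2k] w'[k])."""
    n = 2 * len(approx)
    out = [0.0] * n
    for k in range(len(approx)):
        for i in range(len(P)):
            out[(2 * k + i) % n] += P[i] * approx[k] + Q[i] * detail[k]
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
approx, detail = decompose_1d(signal)
restored = reconstruct_1d(approx, detail)
print(max(abs(a - b) for a, b in zip(signal, restored)))  # reconstruction error
```

Because the filter is orthogonal, the reconstruction recovers the input up to floating-point rounding, and a constant signal produces detail coefficients that are zero up to rounding.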

### 4. EXPERIMENTAL METHODS

#### 4.1 Equipment for the performance evaluation

The performance of the CBE was compared with that of commercially available processors. Table 1 shows the equipment used for the performance evaluation.

|           | Commercially available processor | Cell BE                                          |
|-----------|----------------------------------|--------------------------------------------------|
| System    | Desktop PC                       | Blade Server                                     |
| Processor | Xeon 3.2 GHz                     | CBE 2.4 GHz (PPE x 1, SPE x 8)                   |
| Memory    | 3 GB                             | Main memory: 512 MB x 1; Local store: 256 KB x 8 |
| OS        | Linux (kernel 2.6.11)            | Linux (kernel 2.6.14)                            |

Table 1: Equipment for performance evaluation

A Xeon operating at 3.2 GHz was used as the commercially available processor, and a wavelet transform program written in C in a conventional style was executed on it. For the CBE, a parallel version of the wavelet transform algorithm was implemented as a SIMD program and executed on a blade server with a CBE operating at 2.4 GHz.

The well-known fast wavelet transform (FWT) algorithm was used for the performance evaluation. Fig. 2 shows a schematic diagram of the 2D wavelet transform. A 1D wavelet transform can be extended to a 2D algorithm by performing the 1D algorithm separately in each dimension, along the x and y coordinates. To simplify the calculation, the image produced by the first FWT is transposed before the second FWT.
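The separable scheme described above (row-wise 1D FWT, transpose, row-wise 1D FWT again) can be sketched as follows. For brevity this sketch swaps in the 2-tap Haar filter rather than the Daubechies filters with N = 2, 4, and 8 used in the evaluation, and it assumes periodic extension. The final transpose back to the original orientation and the function names `fwt_1d`, `transpose`, and `fwt_2d` are illustrative choices, not details from the paper.

```python
import math

H = 1.0 / math.sqrt(2.0)
P = [H, H]    # Haar scaling filter (stand-in for the paper's Daubechies filters)
Q = [H, -H]   # Haar wavelet filter


def fwt_1d(row):
    """One decomposition level: low-pass half followed by high-pass half."""
    n = len(row)
    half = n // 2
    low = [sum(P[i] * row[(2 * m + i) % n] for i in range(len(P))) for m in range(half)]
    high = [sum(Q[i] * row[(2 * m + i) % n] for i in range(len(Q))) for m in range(half)]
    return low + high


def transpose(image):
    """Swap rows and columns of a list-of-lists image."""
    return [list(col) for col in zip(*image)]


def fwt_2d(image):
    """2D FWT built from two row-wise 1D passes with a transpose in between."""
    pass1 = [fwt_1d(row) for row in image]              # first FWT along x
    pass2 = [fwt_1d(row) for row in transpose(pass1)]   # second FWT along y
    return transpose(pass2)                             # back to the original orientation


flat = [[1.0] * 4 for _ in range(4)]
for row in fwt_2d(flat):
    print([round(v, 6) for v in row])
```

Transposing between the passes means the inner loop always walks along contiguous rows, so the same 1D routine serves both dimensions. In the output, the top-left quadrant holds the scaling coefficients and the other three quadrants hold the detail coefficients.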



Fig. 2. Schematic diagram of 2D FWT for commercially available processors with Lena image

#### 4.2 Wavelet transform algorithm for CBE

This section describes the modified 2D wavelet transform algorithm for the CBE. Fig. 3 shows a schematic diagram of the processing method for one 1D pass (along the y coordinate).





The PPE reads the image file, divides it into 8 pieces, and sends each of the 8 SPEs context information about its piece, such as its size and its address in main memory. Each SPE receives the context and fetches its piece of the image from main memory. The 1D FWT computation in each SPE uses SIMD instructions to exploit data parallelism. To this end, each partial image assigned to an SPE is rearranged, by transposing its elements, into a matrix layout suited to the SIMD instructions, so that the image data can be processed 4 pixels at a time. After the 1D FWT is done, each SPE stores its data in main memory, as shown in Fig. 3. Once all of the SPEs have finished, the whole image decomposed along the y coordinate is ready. The same process is then repeated on the decomposed image in the other dimension (along the x coordinate), which completes the 2D wavelet decomposition.
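The partitioning scheme described above can be sketched structurally in Python. This is only an analogy under stated assumptions, not the CBE implementation: a thread pool stands in for the 8 SPEs, list slicing stands in for DMA transfers, a group of four rows stands in for the four SIMD lanes, and the Haar filter is swapped in for brevity. The names `split_rows`, `fwt_rows_4wide`, and `parallel_fwt_rows` are hypothetical, and the row width is assumed to be even.

```python
import math
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 8  # one strip per SPE in the paper's setup
H = 1.0 / math.sqrt(2.0)


def split_rows(height, parts=NUM_WORKERS):
    """Return (start, stop) row ranges that cover 0..height as evenly as possible."""
    base, extra = divmod(height, parts)
    ranges, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def fwt_rows_4wide(strip):
    """Haar 1D FWT of each row, stepping through the strip four rows at a time."""
    out = []
    for r in range(0, len(strip), 4):
        group = strip[r:r + 4]            # up to four rows, one per emulated lane
        n = len(group[0])
        half = n // 2
        result = [[0.0] * n for _ in group]
        for m in range(half):
            # The same filter step is applied to every lane of the group.
            for lane, row in enumerate(group):
                a, b = row[2 * m], row[2 * m + 1]
                result[lane][m] = H * (a + b)
                result[lane][half + m] = H * (a - b)
        out.extend(result)
    return out


def parallel_fwt_rows(image):
    """Split the image into strips, transform each strip in a worker, then reassemble."""
    strips = [image[a:b] for a, b in split_rows(len(image))]
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        done = list(pool.map(fwt_rows_4wide, strips))
    return [row for strip in done for row in strip]


image = [[float(r * 8 + c) for c in range(8)] for r in range(16)]
print(parallel_fwt_rows(image) == fwt_rows_4wide(image))
```

The strips are independent, so no worker needs data owned by another during a pass; synchronization is needed only between the two 1D passes, when every strip must be back in main memory before the second pass starts. Python threads do not provide the hardware parallelism of the SPEs, so this sketch shows the structure only and says nothing about speed.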

### 5. EXPERIMENTAL RESULTS AND DISCUSSION

Three sizes of gray-scale images, with resolutions of 1,024 x 1,024 (1.0 megapixels), 1,920 x 1,920 (3.7 megapixels), and 3,840 x 3,840 (14.7 megapixels), were used to evaluate the performance of the CBE against the conventional CPU (the PC) listed in Table 1. The sizes of the HDTV, 4K, and 2K formats are 1,920 x 1,080, 4,096 x 2,160, and 2,048 x 1,080 pixels, respectively. Thus, the image sizes used in this experiment are large enough to evaluate the performance for next-generation image formats. Table 2 compares the execution speed of the 2D FWT process. The execution times were measured using three filters with different tap lengths. The "performance index of execution speed" is the pixel processing rate per unit time, normalized to the rate measured on the PC for the 3,840 x 3,840 image with 16 taps (the entry with index 1.0 in Table 2). The execution speed data for the conventional CPU are converted from its clock speed of 3.2 GHz to 2.4 GHz to match the clock speed of the CBE. Fig. 4 shows the results separated by the number of taps.
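The performance index can be written out as a small calculation. The sketch below is one plausible reading of the description, assuming that the clock conversion is a linear scaling of the measured rate by 2.4/3.2; the paper does not state the exact conversion rule. The function names and all timings in the example are hypothetical, since the actual execution times are not disclosed.

```python
def pixel_rate(width, height, seconds, clock_ghz=None, target_ghz=None):
    """Pixels processed per second, optionally rescaled linearly to another clock speed."""
    rate = width * height / seconds
    if clock_ghz is not None and target_ghz is not None:
        rate *= target_ghz / clock_ghz  # assumed linear clock scaling
    return rate


def performance_index(rate, reference_rate):
    """Rate relative to the reference case, which has index 1.0 by definition."""
    return rate / reference_rate


# Hypothetical timings, for illustration only (not measurements from the paper).
reference = pixel_rate(3840, 3840, seconds=4.0, clock_ghz=3.2, target_ghz=2.4)
faster_run = pixel_rate(3840, 3840, seconds=0.4)
print(performance_index(reference, reference))
print(performance_index(faster_run, reference))
```

Because the index is a ratio of pixel rates, it allows images of different sizes to be compared on one scale.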

| Image     | 1.0 megapixels (1,024 x 1,024) |      | 3.7 megapixels (1,920 x 1,920) |      | 14.7 megapixels (3,840 x 3,840) |      |
|-----------|------|------|------|------|------|------|
| Processor | PC   | CBE  | PC   | CBE  | PC   | CBE  |
| 4 taps    | 3.3  | 76.2 | 3.2  | 78.7 | 2.8  | 73.2 |
| 8 taps    | 2.0  | 54.0 | 2.0  | 53.8 | 1.7  | 52.2 |
| 16 taps   | 1.6  | 32.8 | 1.1  | 33.1 | 1.0  | 32.5 |

Table 2: Performance index of execution speed, normalized to the PC result for the 3,840 x 3,840 image with 16 taps

Table 2 and Fig. 4 reveal the following: 1) The performance indexes on both the conventional CPU and the CBE decrease as the number of filter taps increases; 2) The performance index on the conventional CPU decreases as the image size increases, whereas on the CBE it is nearly independent of the image size; and 3) Regardless of the image size and the number of filter taps, the CBE performs about 30 times better than the conventional CPU (between roughly 20 and 33 times across the entries of Table 2).
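The speedup for each entry can be checked directly from the values in Table 2. The short sketch below divides each CBE index by the corresponding PC index; the dictionary layout and the function name `speedups` are illustrative.

```python
# Performance indexes copied from Table 2: {image size: {taps: (PC, CBE)}}.
TABLE2 = {
    "1,024 x 1,024": {4: (3.3, 76.2), 8: (2.0, 54.0), 16: (1.6, 32.8)},
    "1,920 x 1,920": {4: (3.2, 78.7), 8: (2.0, 53.8), 16: (1.1, 33.1)},
    "3,840 x 3,840": {4: (2.8, 73.2), 8: (1.7, 52.2), 16: (1.0, 32.5)},
}


def speedups(table):
    """Map (image size, taps) to the ratio of the CBE index to the PC index."""
    return {
        (size, taps): cbe / pc
        for size, rows in table.items()
        for taps, (pc, cbe) in rows.items()
    }


ratios = speedups(TABLE2)
for (size, taps), ratio in sorted(ratios.items()):
    print(f"{size}, {taps:2d} taps: {ratio:.1f}x")
print(f"min {min(ratios.values()):.1f}x, max {max(ratios.values()):.1f}x")
```

The ratios computed this way are smallest for the smallest image and largest for the largest image with 16 taps, because the PC index falls as the image grows while the CBE index stays nearly flat.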

For single-precision operations, eight floating-point calculations can be performed in each cycle using the SPE SIMD instruction set. In addition, the eight SPEs run in parallel. Thus, a maximum of 64 floating-point calculations can be done in each cycle by one CBE processor. However, the CBE incurs some DMA latency when fetching image data into the LS. This is why the performance gain was limited to about 30 times that of the conventional processor.
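The peak-throughput arithmetic above can be spelled out as follows. The breakdown of the eight calculations per cycle into four SIMD lanes, each performing a fused multiply-add counted as two operations, is an assumption added here for illustration, and the resulting peak rate at 2.4 GHz is derived from the figures in the text rather than quoted from the paper.

```python
SIMD_LANES = 4        # single-precision values per 128-bit SIMD register (assumed breakdown)
FLOPS_PER_LANE = 2    # one fused multiply-add counted as two operations (assumed breakdown)
NUM_SPES = 8          # SPEs running in parallel, as in Table 1
CLOCK_HZ = 2_400_000_000  # 2.4 GHz, as in Table 1


def peak_flops_per_cycle(lanes=SIMD_LANES, per_lane=FLOPS_PER_LANE, cores=NUM_SPES):
    """Upper bound on floating-point calculations per clock cycle across all cores."""
    return lanes * per_lane * cores


per_spe = peak_flops_per_cycle(cores=1)
per_chip = peak_flops_per_cycle()
print(per_spe, per_chip)                  # per-SPE and per-chip calculations per cycle
print(per_chip * CLOCK_HZ / 1e9, "GFLOPS peak")
```

This is an upper bound that assumes every SPE issues a full-width multiply-add on every cycle; time spent waiting for DMA transfers into the LS reduces the rate achieved in practice.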

Although we cannot disclose the precise processing times, we can note that the wavelet processing on the CBE finished in real time, even for the largest images of 3,840 x 3,840 pixels.







Fig. 4: Performance comparison

### 6. CONCLUSION

The results of the experiment clearly show that the SIMD parallel algorithm for the 2D FWT can be implemented effectively on the CBE, and we observed remarkable performance improvements over the conventional CPU for large images and large numbers of filter taps. Using the CBE, we can achieve real-time processing of higher-resolution images, well beyond HDTV.

Future work with the CBE will be conducted to study the processing of continuous streams of large images obtained by line CCD cameras or moving images with high frame rates, such as 4K Digital Cinema. Other image processing algorithms will also be implemented.

### 7. REFERENCES

- [1] Digital Cinema Initiatives, LLC, Member Representatives Committee, "Digital Cinema System Specification v1.0", 2005
- [2] http://ntt.co.jp/news/news05e/0509/050927.htm
- [3] Sony, Toshiba, IBM, "Cell Broadband Engine Architecture Version 1.0", 2005
- [4] H. Nakano, S. Yamamoto, and Y. Yoshida, "Signal and image processing by using wavelet analysis," Kyoritsu Publishing, 1999