# PAC DSP CORE AND APPLICATION PROCESSORS

David Chih-Wei Chang, I-Tao Liao, Jenq-Kuen Lee, Wen-Feng Chen, Shau-Yin Tseng, Chein-Wei Jen

SoC Technology Center, Industrial Technology Research Institute, Hsinchu, Taiwan 310, R.O.C. <u>cwchang@itri.org.tw</u>

# ABSTRACT

This paper provides an overview of the Parallel Architecture Core (PAC) project led by the SoC Technology Center of the Industrial Technology Research Institute (STC/ITRI) in Taiwan. It presents the background of the PAC project, a brief introduction to the PAC core technologies, the PAC SoC development suite, PAC benchmarks, and applications. The main objective of the PAC development plan is to strengthen industrial competitiveness in the core technologies behind key components, especially for portable multimedia applications.

### **1. INTRODUCTION**

In recent years, the markets for communication systems and consumer electronics have grown dramatically, which in turn drives the demand for digital signal processor (DSP) solutions. To meet increasing requirements for high-performance, multi-function, real-time multimedia processing, DSP solutions have been embedded in a wide variety of consumer electronics and home entertainment products, such as cellular phones, MP3 players, GPS receivers, digital cameras, DVD players, set-top boxes, and DTV consoles. DSP solutions can be implemented as ASIC chips or as DSP cores. Considering the flexibility needed for system design and leading-edge algorithm design, a programmable DSP core (DSP chip or DSP/MPU SoC) is the ideal choice for supporting the multiple applications, high bandwidth, and multiple communication standards required by emerging mobile multimedia devices.

Because the DSP core is regarded as the key component of modern communication and consumer electronics appliances, the PAC project was initiated in early 2004. It aims to develop a solution based on a 32-bit programmable DSP core that enables richer multimedia capabilities, reduces development effort, and shortens time to market. The highly integrated PAC SoC platform features a dual-core architecture that combines the command and control capabilities of a RISC MPU with a high-performance, low-power DSP core that has parallel processing capability. The PAC application processor is developed mainly for next-generation media-rich, multi-function portable devices, such as PMPs, PDAs, and smart phones.

#### **2. PAC CORE TECHNOLOGIES**

PAC DSP is a 32-bit fixed-point, low-power, high-performance DSP with a 5-way VLIW (Very Long Instruction Word) architecture targeted at mobile applications. It has one scalar unit and two data stream clusters. Each data stream cluster contains two functional units and a distinct, partitioned low-power register file structure. PAC DSP has a rich but optimized instruction set that supports 8-bit and 16-bit SIMD operations. It is targeted to run at a maximum frequency of 250-300 MHz.
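To illustrate what 8-bit and 16-bit SIMD operations on a 32-bit data path mean, the following is a minimal behavioral sketch in Python. It is not PAC DSP code: the function name `simd_add` is hypothetical, and wrap-around (non-saturating) arithmetic is an assumption, since the overflow behavior of the PAC instructions is not described here.

```python
MASK32 = 0xFFFFFFFF

def simd_add(a: int, b: int, lane_bits: int) -> int:
    """Add two 32-bit words lane by lane, with no carry between lanes.

    Assumption: each lane wraps around on overflow (no saturation).
    """
    if lane_bits not in (8, 16):
        raise ValueError("lane_bits must be 8 or 16")
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 32, lane_bits):
        lane_a = (a >> shift) & lane_mask
        lane_b = (b >> shift) & lane_mask
        # The carry out of each lane is discarded so lanes stay independent.
        result |= ((lane_a + lane_b) & lane_mask) << shift
    return result & MASK32

# Two 16-bit lanes: high lane 0x0001 + 0x0001, low lane 0xFFFF + 0x0001 (wraps to 0).
print(hex(simd_add(0x0001FFFF, 0x00010001, 16)))  # 0x20000
# Four 8-bit lanes added in one operation.
print(hex(simd_add(0x01020304, 0x10203040, 8)))   # 0x11223344
```

One 32-bit operation thus processes two 16-bit or four 8-bit samples at once, which is where the throughput gain for audio and pixel data comes from.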

The PAC DSP core can be used as a co-processor in a dual-core processor platform (e.g. the PAC SoC Platform) or as a standalone unit in a single-processor DSP platform. Along with the PAC DSP processor, a complete tool chain of compiler, assembler, and linker has been developed. A high-performance assembly code library will also be provided for multimedia applications. PAC DSP targets, but is not limited to, the following application domains:

- Video and Image processing (H.264, MPEG-4, JPEG, Color space transform, etc.)
- Audio and Speech processing (MP3, AAC, and GSM speech processing, etc.)
- Voice processing and enhancement (digital hearing aid, voice-controlled gadgets, VoIP Telephony, etc.)

## 2.1. PAC DSP Core

PAC DSP is a silicon-proven IP core developed by STC. It employs a VLIW architecture and a SIMD instruction set to provide highly parallel computing capability. The PAC DSP kernel contains the instruction pipeline and is the computation engine of PAC DSP. An application-specific Customized Function Unit (CFU) is used to enhance the computational power of the PAC DSP kernel; one example of such a CFU is a motion-estimation engine for video encoding. The CFU executes in parallel with the PAC DSP kernel and interfaces with the kernel through either the PAC DSP data memory or the CFU interface. Fig. 1 shows the PAC DSP core architecture, and Fig. 2 shows a block diagram of the PAC DSP kernel. The kernel has three main components: a program sequence control unit, a scalar unit, and two clusters of VLIW data path.



Fig. 1 PAC DSP Core architecture



Fig. 2 The Architecture of PAC DSP Kernel

The program sequence control unit dispatches instructions to the scalar unit and the VLIW data path. It also handles interrupt and exception events. The scalar unit executes scalar instructions and has 8 local registers; most program sequence control instructions are defined in this unit.

The VLIW data path is composed of two clusters that execute the data operations in the program. The number of clusters in the VLIW data path can be scaled up or down based on the target application's performance requirements. Each cluster contains a load/store unit (L/S) and an arithmetic unit (AU). Both units can execute instructions concurrently, so two instruction slots in the instruction packet are allocated to each cluster.
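The slot allocation described above (one scalar slot plus an L/S slot and an AU slot per cluster) can be sketched as a small data structure. This is only an illustrative model: the class names and the instruction mnemonics are hypothetical and do not reflect the actual PAC DSP encoding.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ClusterSlots:
    load_store: Optional[str] = None   # slot for the cluster's L/S unit
    arithmetic: Optional[str] = None   # slot for the cluster's AU

@dataclass
class InstructionPacket:
    scalar: Optional[str] = None       # slot for the scalar unit
    clusters: List[ClusterSlots] = field(
        default_factory=lambda: [ClusterSlots(), ClusterSlots()]
    )

    def slots(self) -> List[Optional[str]]:
        """Flatten the packet into its issue slots, scalar slot first."""
        flat = [self.scalar]
        for cluster in self.clusters:
            flat += [cluster.load_store, cluster.arithmetic]
        return flat

    def issue_width(self) -> int:
        return len(self.slots())

    def active_slots(self) -> int:
        """Number of slots that carry an instruction in this packet."""
        return sum(slot is not None for slot in self.slots())

# Hypothetical mnemonics, used only to fill the slots.
packet = InstructionPacket(
    scalar="branch loop",
    clusters=[ClusterSlots("load a0", "mac ac0"), ClusterSlots(None, "add ac1")],
)
print(packet.issue_width(), packet.active_slots())  # 5 4
```

With two clusters the model yields the 5-way issue width quoted for PAC DSP; adding or removing a `ClusterSlots` entry mirrors scaling the number of clusters.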

Each cluster has its own register file structure. The L/S unit and the arithmetic unit each have a private register file: the address register file for the L/S unit and the AC register file for the arithmetic unit. The two units communicate through the ping-pong register file, whose specific access operations reduce power consumption. Data communication between clusters is achieved using explicit "data broadcast" and "receive" instructions.
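The idea behind a ping-pong register file can be sketched with a generic two-bank model. The details below are assumptions made for illustration (two banks, 8 registers per bank, an explicit `swap` step); the paper does not specify how the PAC DSP ping-pong register file is organized or switched.

```python
class PingPongRegisterFile:
    """Generic two-bank model: each unit touches only one bank at a time."""

    def __init__(self, regs_per_bank: int = 8):  # bank size is an assumption
        self.banks = [[0] * regs_per_bank, [0] * regs_per_bank]
        self.ls_bank = 0  # bank currently owned by the L/S unit

    @property
    def au_bank(self) -> int:
        # The arithmetic unit always owns the other bank.
        return 1 - self.ls_bank

    def ls_write(self, index: int, value: int) -> None:
        self.banks[self.ls_bank][index] = value

    def au_read(self, index: int) -> int:
        return self.banks[self.au_bank][index]

    def swap(self) -> None:
        """Exchange bank ownership so the AU sees what the L/S unit loaded."""
        self.ls_bank = self.au_bank

rf = PingPongRegisterFile()
rf.ls_write(0, 42)        # L/S unit fills its bank
print(rf.au_read(0))      # 0: the AU is still working on the other bank
rf.swap()
print(rf.au_read(0))      # 42: the loaded value is now visible to the AU
```

Because each bank is accessed by only one unit at any time, each bank needs fewer read and write ports than a single shared register file would.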

The well-organized register file structure ensures efficient data communication among the register files. Area and power consumption are greatly reduced because the register file partitioning scheme and the ping-pong register file structure reduce the number of register file ports.

Because instructions are scheduled statically at compile time, a VLIW architecture consumes less power than a superscalar architecture, which schedules instructions dynamically in hardware. This suits the low-power requirements of portable applications.

Both dynamic and static power management methodologies are defined in PAC DSP. Static power management provides a control register for turning off sub-blocks of PAC DSP. Dynamic power management turns off unused processing elements in the data path dynamically.
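The interaction between the two schemes can be sketched as bit-mask logic. The one-bit-per-block layout and all names below are hypothetical; the actual PAC DSP control register format is not described in this paper.

```python
# Hypothetical layout: one enable bit per processing element.
SCALAR, C0_LS, C0_AU, C1_LS, C1_AU = (1 << i for i in range(5))
ALL_BLOCKS = SCALAR | C0_LS | C0_AU | C1_LS | C1_AU

def powered_blocks(static_enable: int, used_this_cycle: int) -> int:
    """Return the mask of blocks that are active in one cycle.

    static_enable models the software-written control register (static scheme);
    used_this_cycle models which elements the current instruction packet
    actually needs (dynamic scheme).
    """
    return static_enable & used_this_cycle & ALL_BLOCKS

# Static: software switches off cluster 1 entirely.
static_enable = ALL_BLOCKS & ~(C1_LS | C1_AU)
# Dynamic: the current packet only uses the scalar unit and two AUs.
used = SCALAR | C0_AU | C1_AU
active = powered_blocks(static_enable, used)
print(bin(active))  # 0b101: only SCALAR and C0_AU are active
```

A block is active only when it is both statically enabled and needed in the current cycle, so the two schemes compose as a simple AND.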

In addition, PAC DSP uses variable instruction/packet lengths to address the low code density problem. The built-in hierarchical encoding/decoding technique eliminates the impact of complex instruction dispatch.

Embedded systems require sufficient performance at minimal power consumption. To meet the requirements of different applications in multi-function portable devices, the computing power of PAC DSP can be configured at design time, and a well-designed power management methodology reduces power consumption.

## 2.2. PAC DSP Software Development Suite

The PAC DSP Software Development Suite offers common user interfaces in a Linux environment, which makes it easy to learn and to develop across platforms. From PC to PAC-based platforms, this cross-platform functionality lets developers reuse previously developed applications and gives them a head start in PAC-based product development. The suite provides ever-expanding support for the features of the latest PAC DSP processors, including dual-core technology, cluster-wise ping-pong architecture technology, and joint VLIW-SIMD ISA technology.

The PAC DSP Software Development Suite includes a C compiler, assembler/linker, debugger, libraries, and other supporting utilities, which help system developers deliver applications with good code quality. For example, the PAC DSP C Compiler, which is ported from the ORD compiler, ensures that PAC DSP applications can be developed in a programmer-friendly environment, thus reducing time to market and development cost for the end products.

## 2.3. PAC SoC Platform

The PAC SoC Platform is designed to provide an application processor SoC for next-generation mobile devices such as PMPs, smart phones, and PDAs. It features a dual-core architecture that combines the command and control capabilities of the MPU with the high-performance and low-power capabilities of the DSP core. The dual-core architecture utilizes both RISC MPU and VLIW DSP technologies.

Fig. 3 shows the basic SoC Platform architecture, which can be scaled up or down to meet the performance requirements of different applications. The basic PAC SoC Platform consists of a dual-core processor (MPU + DSP), a memory subsystem, a system DMA, I/O peripherals, and an on-chip system bus network, through which the components communicate.

The PAC Platform uses an ESL (Electronic System Level) design methodology. The ESL platform provides co-verification of hardware and software designs. For hardware RTL design, compared with traditional verification, the ESL platform can supply real data such as H.264 stream data; for software design, it provides a verification environment for compilers and debuggers, including step-by-step debugging tools and memory and register analysis.



Fig. 3 PAC SoC Platform Architecture

In addition, the PAC Platform uses DVFS (Dynamic Voltage and Frequency Scaling) to address the power gap. The power gap is one of the major challenges of IC design, and multiple Vdd (mVdd, i.e. voltage scaling) is one of the most important and effective low-power design methodologies. PAC uses mVdd and power-aware management technology, and can thus save 5-70% of the original power.
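The reason voltage scaling is so effective follows from the standard first-order model of dynamic power in CMOS logic, where $\alpha$ is the switching activity, $C$ the switched capacitance, $V_{dd}$ the supply voltage, and $f$ the clock frequency. The relation below is the general textbook model, not a measurement of PAC:

```latex
P_{\mathrm{dyn}} = \alpha \, C \, V_{dd}^{2} \, f
\qquad\Longrightarrow\qquad
\frac{P'_{\mathrm{dyn}}}{P_{\mathrm{dyn}}}
  = \left(\frac{V'_{dd}}{V_{dd}}\right)^{2} \frac{f'}{f}
```

As an illustrative example with assumed numbers, scaling both the supply voltage and the clock frequency to 80% of their nominal values gives $0.8^{2} \times 0.8 = 0.512$, i.e. roughly a 49% reduction in dynamic power, whereas scaling the frequency alone would save only 20%.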

## 2.4. PAC SoC Embedded Software

The PAC platform provides an embedded Linux software solution. Compared with the standard kernel, many features have been added to meet the requirements of consumer electronics products, including fast boot, XIP (execute in place), hard real-time support, and power management. The Inter-Processor Communication (IPC) software framework simplifies communication between the two cores. With embedded Linux technology, PAC will be a stable, flexible, and extensible platform for developers of dual-core architectures. Fig. 4 shows the reference embedded software structure for a PMP. The embedded software for the PAC platform includes the following components: HAL library and boot monitor, embedded Linux, middleware, codec engine and applications, and DSP microkernel.



## 2.5. PAC Benchmarks

PAC DSP achieves a strong power-to-performance ratio, as shown in Fig. 5.

| Property / Vendor           | ITRI/STC              | StarCore        | StarCore               | CEVA                  | LSI                    | 3DSP                   |
|-----------------------------|-----------------------|-----------------|------------------------|-----------------------|------------------------|------------------------|
|                             | PAC DSP v2.0          | SC2000 (SC2400) | SC1000 (SC1400)        | CEVA-X1620            | ZSP500                 | SP5                    |
| Architecture                | 5-way VLIW            | 6-way VLIW      | 6-way VLIW             | 8-way VLIW            | 4-issue superscalar    | 2-way superscalar      |
| Frequency (MHz)             | 250                   | 250~350         | 305                    | 450                   | 400                    | 320                    |
| Process                     | 0.13µm                | 0.13µm ~ 90nm   | 0.13µm                 | 0.13µm                | 0.13µm~?               | 0.13µm                 |
| Performance (MIPS)          | 1250                  | 1500~           | 1830                   | 3600                  | 1600                   | 640                    |
| Power Consumption (mW/MIPS) | 0.08 (without memory) | -               | 0.098 (without memory) | 0.08 (without memory) | 0.107 (without memory) | 0.125 (without memory) |
| Area                        | 1.2mm<sup>2</sup>     | -               |                        | 1.6mm<sup>2</sup>     | -                      | 0.16mm<sup>2</sup>     |
| Power Management            | Yes                   | Yes             | Yes                    | Yes                   | Yes                    | Yes                    |

Fig. 5 Benchmarks of DSP Cores (1)
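The MIPS figures in the Fig. 5 comparison appear to be peak values, equal to the issue width multiplied by the clock frequency, and multiplying them by the mW/MIPS figure gives an estimate of core power at peak. The short Python sketch below checks this arithmetic against the table; the interpretation as peak MIPS is an inference from the numbers, not a statement made by the vendors, and the function names are illustrative.

```python
# (issue width, frequency in MHz, listed MIPS, listed mW/MIPS), copied from Fig. 5.
CORES = {
    "PAC DSP v2.0":    (5, 250, 1250, 0.08),
    "SC1000 (SC1400)": (6, 305, 1830, 0.098),
    "CEVA-X1620":      (8, 450, 3600, 0.08),
    "ZSP500":          (4, 400, 1600, 0.107),
    "SP5":             (2, 320, 640, 0.125),
}

def peak_mips(issue_width: int, freq_mhz: int) -> int:
    """Peak MIPS if every issue slot is filled in every cycle."""
    return issue_width * freq_mhz

def core_power_mw(mips: float, mw_per_mips: float) -> float:
    """Estimated core power (without memory) at the given MIPS rate."""
    return mips * mw_per_mips

for name, (width, freq, listed_mips, efficiency) in CORES.items():
    mips = peak_mips(width, freq)
    matches = "matches" if mips == listed_mips else "differs from"
    power = core_power_mw(mips, efficiency)
    print(f"{name}: {mips} MIPS ({matches} table), about {power:.1f} mW at peak")
```

Under this reading, the 0.08 mW/MIPS listed for PAC DSP corresponds to about 100 mW for the core at 1250 MIPS, excluding memory.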

The signal processing performance of PAC DSP was pre-evaluated using a suite of DSP benchmarks developed by Berkeley Design Technology, Inc. (BDTI). Fig. 6 shows the execution cycle counts of each kernel for PAC DSP and its competitors. With the same MAC resources, about 30% of PAC DSP's benchmark results are better than its competitors', mainly because of its optimized ISA and special architecture. Fig. 7 compares the PAC SoC processor with well-known low-power application processors offered by TI, Freescale, and Intel.

| DSP Platform             | PAC DSP<br>250 MHz            | CEVA-X1620<br>450 MHz | CEVA-X1640<br>340 MHz | StarCore SC1200<br>305 MHz | StarCore SC1400<br>300 MHz | TI C6414<br>1000 MHz |
|--------------------------|-------------------------------|-----------------------|-----------------------|----------------------------|----------------------------|----------------------|
| Architecture             | 4-way VLIW + Scalar<br>2 MACs | 8-way VLIW<br>2 MACs  | 8-way VLIW<br>4 MACs  | 4-way VLIW<br>2 MACs       | 6-way VLIW<br>4 MACs       | 8-way VLIW           |
| Vector Add               | 21                            | 33                    | 18                    | 19                         | 19                         | 27                   |
| Vector Dot               | 23                            | 26                    | 19                    | 25                         | 16                         | 25                   |
| Vector Max               | 43                            | 29                    | 22                    | 44                         | 27                         | 36                   |
| Control                  | 444                           | 639                   | 639                   | 425                        | 425                        | 475                  |
| Bit unpack               | 146                           | 106                   | 61                    | 164                        | 124                        | 97                   |
| Real-valued Block FIR    | 317                           | 351                   | 182                   | 354                        | 185                        | 194                  |
| Complex-valued Block FIR | 993                           | 1330                  | 690                   | 1333                       | 675                        | 674                  |
| SS FIR                   | 18                            | 21                    | 19                    | 16                         | 14                         | 26                   |
| IIR                      | 19                            | 9                     | 8                     | 10                         | 9                          | 16                   |
| LMS                      | 34                            | 29                    | 24                    | 26                         | 19                         | 37                   |
| Viterbi                  | 3505                          | 2304                  | 1925                  | 2880                       | 1935                       | 1740                 |
| FFT                      | 1684                          | 2207                  | 1248                  | 3230                       | 1631                       | 1246                 |

Fig. 6 Benchmarks of DSP Cores (2)

Note: PAC DSP was submitted to BDTI for official certification.
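Cycle counts are independent of clock rate, so comparing wall-clock time requires dividing each count by the core's frequency. The sketch below does this for the FFT row of Fig. 6, assuming each core runs at the frequency given in its column header; the function name is illustrative.

```python
# FFT cycle counts and clock frequencies (MHz), copied from Fig. 6.
FFT = {
    "PAC DSP":         (1684, 250),
    "CEVA-X1620":      (2207, 450),
    "CEVA-X1640":      (1248, 340),
    "StarCore SC1200": (3230, 305),
    "StarCore SC1400": (1631, 300),
    "TI C6414":        (1246, 1000),
}

def exec_time_us(cycles: int, freq_mhz: float) -> float:
    """Execution time in microseconds: cycles divided by cycles per microsecond."""
    return cycles / freq_mhz

for name, (cycles, freq) in FFT.items():
    print(f"{name}: {cycles} cycles at {freq} MHz = {exec_time_us(cycles, freq):.3f} us")
```

For example, the PAC DSP FFT takes 1684 / 250 = 6.736 µs. A lower cycle count therefore reflects architectural efficiency per clock, while execution time also depends on the achievable clock frequency.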

|                       | PAC             | TI OMAP<br>2410/20          | TI OMAP1610<br>(1611/1612) | Freescale<br>MXC275-30 | Intel PXA800F                  |
|-----------------------|-----------------|-----------------------------|----------------------------|------------------------|--------------------------------|
| Processor<br>Core I   | ARM9/<br>S+Core | ARM1136JF-S                 | ARM926EJ-S                 | ARM1136JF-S            | XScale                         |
| Freq (MHz)            | 244             | 330                         | 204                        | 532                    | 312                            |
| Processor<br>Core II  | PAC DSP 2.0     | TMS320C55x                  | TMS320C55x                 | StarCore<br>SC140e DSP | MSA DSP (Frio)                 |
| Freq (MHz)            | 300             | 220                         | 204                        | 208                    | 104                            |
| Accelerator<br>(s)    | Custom Core     | 2D/3D<br>Graphics,<br>Video | Video,<br>Security         | Security<br>(HW/SW)    | 16-bit SIMD,<br>Viterbi, Voice |
| Power<br>(mW@MHz)     | 450@300         | 650@330                     | 240@204                    | 650@532                | 350@312                        |
| IC Process            | 0.13µm          | 0.09µm                      | 0.13µm                     | 0.09µm                 | 0.13µm                         |
| Core<br>Voltage       | 1.2V            | N/A                         | 1.1~1.5V                   | Not Open               | 1.2V                           |
| Peripheral<br>Voltage | 2.5/3.3V        | N/A                         | 1.8V/3.0V                  | 1.8V~3.3V              | 1.8V~3.3V                      |
| Package               | 288 BGA         | 289 BGA                     | 289 BGA                    | Not Open               | 294 TPBGA                      |

Fig. 7 Benchmarks of Low-Power Application Processors

#### **3. PAC APPLICATION PROCESSORS**

STC cooperates with several fabless IC design companies in Taiwan to develop applications based on the PAC design. These applications primarily target low-power, high-performance portable multimedia devices that need to process enormous amounts of digital audio and video streams, such as PDAs, smart phones, PMPs, DSCs, and DVRs, as well as VoIP handsets/gateways that require real-time signal processing.

PMP and PDA/smart phone are the two key potential implementations. With PAC as the system foundation, the PMP will support multiple multimedia functions, such as MP3/AAC audio encoding/decoding, MPEG-4 D1-resolution encoding/decoding, H.264 D1 decoding/QCIF encoding, and signal equalization/amplification control. It also provides different kinds of peripheral controls, including monitor, audio/video I/O, and external memory, to meet the hardware requirements of next-generation PMPs. The PDA/smart phone is regarded as the biggest market segment for PAC applications. The PAC Media Processor comprises a multimedia application processor (connected externally to a baseband processor) and standard peripherals for high-performance PDAs and smart phones. The PAC Media Processor will be introduced and promoted to Taiwan-based companies in the initial phase. The goal is to enter the market currently dominated by foreign media processor providers.

### 4. CONCLUSION

The PAC SoC Platform consists of a 32-bit PAC DSP core and an MPU, a memory subsystem, DMA, I/O peripherals, and an on-chip system bus network. In addition, low-power methodology, performance evaluation, and hardware/software co-verification techniques were developed during the design process. The complete software tools and hardware development environment further reduce development risks and shorten time to market. Featuring high-performance operation at optimized low power consumption, the dual-core PAC platform provides an ideal application processor solution for implementing more robust SoC designs for next-generation multimedia mobile devices.

## ACKNOWLEDGEMENT

We wish to thank the experts who offered valuable advice on the PAC project, especially Dr. HT Kung, William H. Gates Professor of Computer Science and Electrical Engineering at Harvard University, and Dr. Paul Lin, General Director of the Information and Communications Research Laboratories of ITRI. We are also very grateful to all the PAC team members who made this work possible. We would also like to express our appreciation for the assistance given by Alan Kang and Winnie Chu, planners in the Planning & Promotion Division of STC/ITRI, in compiling information for this paper.

#### REFERENCES

- [1] V. K. Madisetti, "VLSI Digital Signal Processors: An Introduction to Rapid Prototyping", IEEE Press, 1995.
- [2] Keshab K. Parhi, "VLSI Digital Signal Processing Systems", John Wiley and Sons, Inc., 1999.
- [3] "DSP56800E 16-Bit DSP Core Reference Manual", Freescale, Inc.
- [4] "TMS320C55x DSP Function Overview", Texas Instruments, Inc.
- [5] "TMS320C6000 Technical Brief", Texas Instruments, Inc.
- [6] Yung-Chia Lin, Chung-Lin Tang, Chung-Ju Wu, Ming-Yu Hung, Yi-Ping You, Ya-Chiao Moo, Sheng-Yuan Chen and Jenq Kuen Lee, "Compiler Supports and Optimizations for PAC VLIW DSP Processors", *LCPC 2005*, USA, Oct. 2005 (also to appear in LNCS).
- [7] Chien-Yuan Lai, Jin-Hon Lin, Yaw-Feng Wang, "DVFS SoC Architecture and Implementation", SoC Technology Journal, vol. 3, pp. 84-91, Nov. 2005.
- [8] CE Linux Forum (CELF) Kernel XIP Specification, <u>http://tree.celinuxforum.org/CelfPubWiki/KernelXIPSpecification</u>