Automatic Design and Optimization of Application Specific Processors

by Jelena Trajkovic

The stringent performance and power requirements of modern applications have fueled the need for application-specific computing platforms. Conventional general-purpose embedded processors do not provide the throughput needed to meet these applications' performance and power requirements. Manual hardware design, on the other hand, is too expensive and results in inflexible implementations, and short time-to-market demands make it prohibitive. There is therefore a need for automatic design methods and tools that produce a high-performance, programmable and energy-efficient implementation starting from standard application descriptions such as C reference code. High-level synthesis and application-specific processor tools are a step in this direction, but they suffer from scalability, quality and controllability issues. In this dissertation, we present techniques for automatic processor design from C code that scale to thousands of lines of C code, are controllable at each step of the design process and provide quality of results comparable to manual design.

Our contributions enable a system-level design methodology for application-specific processors that starts with the reference C code of the application. An initial processor data path is constructed from a database of available components such as functional units, register files and buses. The initial data path is based on the types of operations and the available concurrency in the application. The data path is then iteratively refined until an efficient architecture is derived. The key optimization goal is to keep performance within given boundaries while maximizing resource utilization. We further optimize the processor design using a novel algorithm for automatic custom pipelining based on the C code. The pipelining optimization likewise targets both resource utilization and performance. Our experimental results with large applications such as DCT and an MP3 decoder show that the automatically generated architectures are comparable to manual designs but can be obtained in a matter of seconds, leading to significant productivity gains.
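The refinement loop described above can be illustrated with a minimal sketch. This is not the dissertation's algorithm: the resource-bound cycle estimate, the greedy "remove the least-utilized unit" rule, and all names (`estimate_cycles`, `refine_datapath`, the operation counts) are assumptions made for illustration only.

```python
from math import ceil


def estimate_cycles(op_counts, units):
    """Assumed cost model: each unit type shares its operations evenly
    across its instances, and the slowest type bounds the schedule length."""
    return max(ceil(op_counts[t] / units[t]) for t in op_counts)


def utilization(op_counts, units, t):
    """Fraction of available unit-cycles of type t that do useful work."""
    cycles = estimate_cycles(op_counts, units)
    return op_counts[t] / (units[t] * cycles)


def refine_datapath(op_counts, initial_units, max_cycles):
    """Greedily remove units while the estimated cycle count stays
    within max_cycles, starting with the least-utilized unit type."""
    units = dict(initial_units)
    while True:
        # A unit type is a candidate if one instance can be removed
        # without violating the performance bound.
        candidates = []
        for t in units:
            if units[t] > 1:
                trial = dict(units)
                trial[t] -= 1
                if estimate_cycles(op_counts, trial) <= max_cycles:
                    candidates.append(t)
        if not candidates:
            return units
        worst = min(candidates, key=lambda t: utilization(op_counts, units, t))
        units[worst] -= 1


if __name__ == "__main__":
    # Hypothetical operation profile and initial allocation.
    ops = {"alu": 120, "mul": 40, "mem": 60}
    start = {"alu": 4, "mul": 4, "mem": 2}
    print(refine_datapath(ops, start, max_cycles=60))
    # prints {'alu': 2, 'mul': 1, 'mem': 1}
```

In this toy model the initial allocation mirrors the available concurrency, and the loop trades surplus units for higher utilization until any further removal would exceed the cycle budget.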

Our final contribution to design methods for embedded systems deals with low-power memory design. DRAM (dynamic random-access memory) energy consumption in low-power embedded systems can be very high, exceeding that of the data cache or even that of the processor. We present and evaluate a scheme for reducing the energy consumption of SDRAM (synchronous DRAM) memory accesses through a combination of techniques that take advantage of SDRAM energy efficiencies in bank and row access. The scheme uses small, cache-like structures in the memory controller to prefetch one or more additional cache blocks on SDRAM reads and to combine block writes to the same SDRAM row. The results quantify the SDRAM energy consumption of MiBench applications and demonstrate significant savings: SDRAM energy consumption is reduced by 23% on average and the energy-delay product by 44% on average. The approach also improves performance, reducing the CPI by 26% on average.
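The benefit of combining block writes to the same SDRAM row can be sketched with a toy model that counts row activations. It covers only the write-combining half of the scheme, not read prefetching, and it is not the evaluated design: the row and block sizes, the oldest-row-first flush policy, the buffer capacity, and all names are assumptions for illustration.

```python
from collections import OrderedDict

ROW_BYTES = 2048    # assumed SDRAM row size
BLOCK_BYTES = 32    # assumed cache block size


def row_of(addr):
    return addr // ROW_BYTES


def activations_uncombined(write_addrs):
    """Each write is issued immediately; a write to a row other than
    the currently open one costs a new row activation."""
    open_row = None
    acts = 0
    for addr in write_addrs:
        row = row_of(addr)
        if row != open_row:
            acts += 1
            open_row = row
    return acts


def activations_combined(write_addrs, max_rows=2):
    """Writes are held in a small buffer grouped by row; all pending
    blocks of a row are written back together with one activation."""
    pending = OrderedDict()  # row -> set of pending block indices
    acts = 0
    for addr in write_addrs:
        row = row_of(addr)
        if row not in pending and len(pending) == max_rows:
            pending.popitem(last=False)  # flush the oldest buffered row
            acts += 1
        pending.setdefault(row, set()).add(addr // BLOCK_BYTES)
    return acts + len(pending)  # final flush of the remaining rows


if __name__ == "__main__":
    # Hypothetical write stream alternating between two rows.
    writes = [0, 4096, 32, 4128, 64, 4160]
    print(activations_uncombined(writes))  # prints 6
    print(activations_combined(writes))    # prints 2
```

Since each row activation costs energy, reducing the number of activations for the same set of writes is the kind of saving the scheme targets; the actual savings reported above come from the dissertation's measurements, not from this model.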