Acceleration beyond Memory Barriers in IS-extensible Processors

by Partha Biswas

Extensible processors are a class of processors that can be customized for a given application by adding designer-defined functional units. One commonly extended feature is the Instruction Set (IS) of the processor; we call such a processor an IS-extensible processor. In an IS-extensible processor, applications can be accelerated by extending the processor architecture with Application-specific Functional Units (AFUs), which execute Instruction Set Extensions (ISEs) in the critical segments of the applications. Automatic identification of such ISEs is the key step in automating processor customization.

A high-quality ISE identification approach must produce results close to those achieved by experienced designers, particularly for complex applications that exhibit regularity: expert designers can manually exploit such regularity in the data flow graphs to generate high-quality ISEs. This thesis develops an ISE identification algorithm (ISEGEN) that identifies high-quality ISEs by iterative improvement, following the basic principles of the well-known Kernighan-Lin min-cut heuristic. Traditionally, synchronization issues have discouraged the inclusion of memory operations in ISEs during the ISE identification step, creating a “memory barrier”. We show in this thesis that it is possible to go beyond these memory barriers and, by eliminating costly memory operations, accelerate the application further than traditional ISE identification approaches do. We present the first ISE identification technique that can automatically and comprehensively identify state-holding AFUs. Finally, we develop an interface-aware processor customization methodology and use it to realize an IS-extensible-processor-based system on a real platform.
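To make the iterative-improvement idea concrete, the following is a minimal sketch of a Kernighan-Lin-style search for a single ISE in a data flow graph. It is not the ISEGEN implementation: the function name `kl_style_ise`, the merit function, the one-cycle AFU latency, the penalty weight, and the input/output port limits are all assumptions chosen for illustration. What it does borrow from Kernighan-Lin is the pass structure: in each pass every node is toggled exactly once, the best toggle is taken even when it lowers the merit, and the best legal cut seen so far is kept as the starting point of the next pass.

```python
def kl_style_ise(preds, sw_lat, max_in, max_out, max_passes=5, penalty=10):
    """Pick a set of DFG nodes (a cut) to move into one hypothetical AFU.

    preds maps each node to the tuple of values it reads; names that are
    not keys of preds are treated as primary inputs. sw_lat gives each
    node's software latency in cycles. All of this is an illustrative
    model, not the one used in the thesis.
    """
    nodes = sorted(preds)
    succs = {n: [m for m in nodes if n in preds[m]] for n in nodes}

    def evaluate(cut):
        """Return (merit, legal) for a candidate cut."""
        if not cut:
            return 0, True
        # Distinct values read from outside the cut.
        n_in = len({p for n in cut for p in preds[n] if p not in cut})
        # A cut node is an output if it is read outside the cut,
        # or if nothing reads it (assumed live-out).
        n_out = sum(1 for n in cut
                    if not succs[n] or any(s not in cut for s in succs[n]))
        # Convexity: no path may leave the cut and re-enter it.
        seen, stack = set(), list(cut)
        while stack:
            for s in succs[stack.pop()]:
                if s not in seen:
                    seen.add(s)
                    stack.append(s)
        outside = seen - cut
        convex = not any(p in outside for n in cut for p in preds[n])
        excess = max(0, n_in - max_in) + max(0, n_out - max_out)
        saved = sum(sw_lat[n] for n in cut) - 1  # assumed one-cycle AFU
        return saved - penalty * (excess + (not convex)), convex and excess == 0

    best_cut, best_merit = frozenset(), 0
    for _ in range(max_passes):
        cut, unmarked, improved = set(best_cut), set(nodes), False
        while unmarked:
            # Toggle the unmarked node with the best resulting merit,
            # even if that merit is worse than the current one.
            node = max(sorted(unmarked), key=lambda n: evaluate(cut ^ {n})[0])
            cut ^= {node}
            unmarked.discard(node)
            merit, legal = evaluate(cut)
            if legal and merit > best_merit:
                best_cut, best_merit, improved = frozenset(cut), merit, True
        if not improved:
            break
    return best_cut, best_merit


# Hypothetical four-operation DFG: xor(shl(add(mul(a, b), c)), a)
example_preds = {"mul": ("a", "b"), "add": ("mul", "c"),
                 "shl": ("add",), "xor": ("shl", "a")}
example_lat = {"mul": 3, "add": 1, "shl": 1, "xor": 1}
wide_cut, wide_merit = kl_style_ise(example_preds, example_lat, max_in=3, max_out=1)
narrow_cut, narrow_merit = kl_style_ise(example_preds, example_lat, max_in=2, max_out=1)
```

With three input ports the sketch selects all four operations; with only two it falls back to the multiplication alone, because every larger convex cut needs a third operand. Accepting merit-lowering toggles within a pass is what lets this style of search pass through temporarily illegal cuts instead of stopping at the first local optimum.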

Experimental results on a number of MediaBench, EEMBC, and cryptographic applications demonstrate that our ISEGEN approach matches the quality of the optimal solution obtained by exhaustive search. We show that ISEGEN is on average 20X faster than a genetic formulation that generates equivalent solutions. We also demonstrate the effectiveness of our approach by incorporating the generated ISEs in the cycle-accurate SimpleScalar simulator. The average speedup on the selected benchmarks increases from 1.4X to 2.8X when the ISE identification process goes beyond memory barriers. Furthermore, we show a concomitant reduction in energy consumption as a result of reduced cache pollution. In addition to analytical and simulation-based evaluation of the identified ISEs, we implement our approach in a realistic hardware-software system, which allows us to study comprehensively the effect of ISEs on performance, energy, power, and code size using our interface-aware processor customization methodology. Our experiments on a Xilinx Virtex-II board, which employ the MicroBlaze soft-core as the extensible processor, show that incorporating the AFU in the MicroBlaze soft-core simultaneously increases performance and reduces code size, energy consumption, and power consumption.