Professor Nicolau’s Research

Parallelizing Compilers, High-Performance Java, Power-aware Computing, Reconfigurable Computing.

Visit Web Site

Selected Projects

FAST Matrix Multiplication

We designed a simple and efficient methodology for the development, tuning, and installation of matrix algorithms such as the hybrid Strassen’s and Winograd’s fast matrix multiply or their combination with the 3M algorithm for complex matrices (i.e., hybrid: a recursive algorithm as Strassen’s until a highly tuned BLAS matrix multiplication allows performance advantages). We investigate how modern Symmetric Multiprocessor (SMP) architectures present old and new challenges that can be addressed by the combination of an algorithm design with careful and natural parallelism exploitation at the function level (optimizations) such as function-call parallelism, function percolation, and function software pipelining.

We have three contributions: first, we present a performance overview for double- and double-complex-precision matrices for state-of-the-art SMP systems; second, we introduce new algorithm implementations: a variant of the 3M algorithm and two new different schedules of Winograd’s matrix multiplication (achieving up to 20% speedup with respect to the fastest prior known regular matrix multiplication to date.

Equally important is precision, which is an other aspect of this project. Reproducibility of an experiment is a commonly used metric to determine its validity. Within scientific computing, this can become difficult due to the accumulation of floating point rounding errors in the numerical computation, greatly reducing the accuracy of the computation. Matrix multiplication is particularly susceptible to these rounding errors which is why there exist so many solutions, ranging from simulating extra precision to compensated summation algorithms. These solutions however all suffer from the same problem, abysmal performance when compared against the performance of the original algorithm. Graphics cards are particularly susceptible due to a lack of double precision on all but the most recent generation graphics cards, therefore increasing the accuracy of the precision that is offered becomes paramount. By using our method of selectively applying compensated summation algorithms, we are able to return a whole digit of accuracy on current generation graphics cards and two digits of accuracy on the newly released “fermi” architecture. This is all possible with only a 2% drop in performance compared to the
current state-of-the-art.

Communicating JVMs for Heterogeneous or Distributed Memory Embedded Systems

The proliferation of multicore mobile devices and the growing complexity of mobile apps call for a more efficient, high-level inter-process communication (IPC). The IPC also needs to be conscious
of the safety of users, programs, and systems. This project is developing two such mechanisms on the Android platform and has demonstrated significantly faster and more memory efficient IPC than the mechanisms currently available via Android’s Java API – while maintaining the same level of security. Speedups on both communication alone (5x to 21x) and whole application execution (1.5x on realistic streaming plus a simple gaming engine) were achieved on Droid phones.
The applicaiton speedup was clearly observable to anyone watching the screen enhancing the user experience.

Expedition into Software for Efficient Computing in the Age of Nanoscale Devices

The research team seeks to develop computing systems (integrated hardware-software packages) that will be able to sense the nature and extent of variations in their hardware circuits, and expose these variations to software, i.e., compilers, operating systems, and applications, so as to drive adaptations in the software to overcome the variability present. A major goal of this project is to break down the rigid interfaces between software and hardware, so that software can proactively address variability in the underlying hardware. The ability of software to detect, monitor, adapt and predict variations in the underlying hardware will be critical to the success of this project, and our team at Irvine will investigate compile-time and run-time software strategies for overcoming hardware variations.

Cache-Aware Synchronization and Scheduling of Data-Parallel Programs for Multi-Core Processors

Multi-core (parallel) processors are becoming ubiquitous. The use of such systems is key to science, engineering, finance, and other major areas of the economy. However, increased applications performance on such systems can only be achieved with advances in mapping such applications to multi-core machines. This task is made more difficult by the presence of complex memory organizations which is perhaps the key bottleneck to efficient execution, and which was not previously addressed effectively. This research involves making the mapping of the program to the machine aware of the complexities of the memory-hierarchy in all phases of the compilation process. This will ensure a good fit between the application code and the actual machine and thereby guarantee much more effective utilization of the hardware (and thus efficient/fast execution) than was previously possible.

Modern processors (multi-cores) employ increasingly complex memory hierarchies. Management of such hierarchies is becoming critical to the overall success of the compilation process since effective utilization of the memory hierarchy dominates overall performance. This research develops a new cache-hierarchy-aware compilation and runtime system (i.e., including compilation, scheduling, and static/dynamic processor mapping of parallel programs). These tasks have one thing in common: they all need accurate estimates of data element (iteration, task) computation and memory access times which are currently beyond the (cache-oblivious) state-of-the-art. This research thus develops new techniques for iteration space partitioning, scheduling, and synchronization which capture the variability due to cache, memory, and conditional statement behavior and their interaction. This research will have a broad impact on the computer industry as it will allow the ubiquitous multi-core systems of the future to be efficiently exploited by parallel programs.

Please visit Professor Nicolau’s Web Site

Julius C: Compiler Optimizations for Divide and Conquer Applications

Julius C is a C compiler that determines a model for divide and conquer algorithms. The model is an extension of the call graph and it conceives information about the dynamic behavior of the algorithm. The use of the model is twofold: at compile time, the model drives code generation and code optimizations; at run time, the model offers a support driving code execution, code adaptation and hardware adaptation.

EXPRESS Retargetable Compiler

Web Site:

EXPRESS is an optimizing, memory-aware, Instruction Level Parallelizing (ILP) compiler. EXPRESS uses the EXPRESSION ADL to retarget itself to a wide class of processor architectures and memory systems. The inputs to EXPRESS are the application specified in C, and the processor architecture specified in EXPRESSION. The front-end is GCC-based and performs some of conventional optimizations. The core transformations in EXPRESS include RDLP — a loop pipelining technique, TiPS : Trailblazing Percolation Scheduling — a speculative code motion technique, Instruction Selection, Register Allocation and If-Conversion — a technique for architectures with predicated Instruction Sets. The back-end generates assembly code for the processor ISA.

Please visit the ACES Web Site 

FORGE: A Framework for Optimization of Distributed Embedded Systems Software

Web Site:

The FORGE project is a framework for optimization of distributed embedded systems software, which integrates the middleware abstraction layer with the hardware/OS abstraction layer and studies mechanisms for capturing resources/architectures at these two levels and allowing interactions between the levels. A resource description language (RDL) specifying the composition of the system as well as resource constraints can be used by a compiler for automatically generating the necessary services and middleware configuration and their deployment across the platform.

Please visit the FORGE Web Site 

SPARK: High-Level Synthesis using Parallelizing Compiler Techniques

Web Site:

SPARK is a C-to-VHDL high-level synthesis framework that employs a set of innovative compiler, parallelizing compiler, and synthesis transformations to improve the quality of high-level synthesis results. The compiler transformations have been re-instrumented for synthesis by incorporating ideas of mutual exclusivity of operations, resource sharing and hardware cost models.

Please visit the SPARK Web Site 

CoReComp: A Compiler Framework for Mapping Applications on Mesh-Based Coarse-Grain Reconfigurable Architectures

Web Site:

Coarse-grain reconfigurable architectures trade-off some of the configuration flexibility of fine-grain FPGAs in return for smaller delay, area and configuration time. They provide massive parallelism, high computational capability and their behavior can be configured dynamically, thus making them a better alternative to ASICs and fine-grain FPGAs in many aspects. Mapping applications to such architectures is a complex task that is a combination of the traditional operation scheduling, operation to PE binding (or mapping), and routing problems. The focus of this research is to build a compiler framework which should be capable of mapping applications on reconfigurable architectures by taking into account different architecture and application parameters.

Please visit the CoReComp Web Site

Past Projects

  • Annotation-Aware Java Virtual Machines
  • AMRM: Adaptive Memory Reconfiguration and Management
  • COPPER: Compiler-Controlled Continuous Power-Performance Management
  • Cache Power Optimization via Run-time Hardware Prediction
  • EVE Mutation Scheduling Compiler