Professor Veidenbaum’s Research

Research Interests

  • Computer Architecture: high-performance processors, memory hierarchy, low-power processors, multiprocessor systems
  • Compilers: optimization and restructuring techniques, compiler-assisted memory management, Java for embedded systems
  • Embedded Systems: system architecture, software, and low-power design

Visit Web Site


Communicating JVMs for Heterogeneous or Distributed Memory Embedded Systems

The proliferation of multicore mobile devices and the growing complexity of mobile apps call for more efficient, high-level inter-process communication (IPC). The IPC also needs to preserve
the safety of users, programs, and systems. This project is developing two such mechanisms on the Android platform and has demonstrated significantly faster and more memory-efficient IPC than the mechanisms currently available via Android’s Java API – while maintaining the same level of security. Speedups on both communication alone (5x to 21x) and whole application execution (1.5x on realistic streaming plus a simple gaming engine) were achieved on Droid phones.
The application speedup was clearly visible to anyone watching the screen, enhancing the user experience.
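The gap between copy-based and shared-memory IPC can be illustrated with a small sketch. This is not the project's Android mechanism; it is an illustrative Python analogy in which a copy-based path serializes and deserializes the payload, while a shared-memory path writes the payload once into a buffer both endpoints can map.

```python
import pickle
from multiprocessing import shared_memory

payload = bytes(range(256)) * 16  # a 4 KiB message

# Copy-based IPC (analogous to marshaling through a high-level API):
# the sender serializes, the transport copies, the receiver deserializes.
wire = pickle.dumps(payload)
received_copy = pickle.loads(wire)

# Shared-memory IPC: both endpoints map the same buffer, so the payload
# is written once and read in place, avoiding the extra copies above.
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[:len(payload)] = payload
received_shared = bytes(shm.buf[:len(payload)])
shm.close()
shm.unlink()

assert received_copy == received_shared == payload
```

Both paths deliver the same bytes; the difference is how many times those bytes are copied, which is where the reported 5x to 21x communication speedups come from.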

Cache-Aware Synchronization and Scheduling of Data-Parallel Programs for Multi-Core Processors

Multi-core (parallel) processors are becoming ubiquitous. The use of such systems is key to science, engineering, finance, and other major areas of the economy. However, increased application performance on such systems can only be achieved with advances in mapping such applications to multi-core machines. This task is made more difficult by the presence of complex memory organizations, which are perhaps the key bottleneck to efficient execution and which were not previously addressed effectively. This research involves making the mapping of the program to the machine aware of the complexities of the memory hierarchy in all phases of the compilation process. This will ensure a good fit between the application code and the actual machine and thereby guarantee much more effective utilization of the hardware (and thus efficient, fast execution) than was previously possible.

Modern processors (multi-cores) employ increasingly complex memory hierarchies. Management of such hierarchies is becoming critical to the overall success of the compilation process since effective utilization of the memory hierarchy dominates overall performance. This research develops a new cache-hierarchy-aware compilation and runtime system (i.e., including compilation, scheduling, and static/dynamic processor mapping of parallel programs). These tasks have one thing in common: they all need accurate estimates of the computation and memory access times of data elements (iterations, tasks), which are currently beyond the (cache-oblivious) state of the art. This research thus develops new techniques for iteration space partitioning, scheduling, and synchronization which capture the variability due to cache, memory, and conditional statement behavior and their interaction. This research will have a broad impact on the computer industry as it will allow the ubiquitous multi-core systems of the future to be efficiently exploited by parallel programs.
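A classic instance of cache-aware iteration-space partitioning is loop tiling, sketched below. The tile size here is an assumed, illustrative stand-in for the cache-capacity estimates the research describes; the point is that the iteration space is visited tile by tile so each tile's data can be reused from cache.

```python
# Minimal sketch of cache-aware iteration-space partitioning via tiling.
# TILE is chosen (hypothetically) so one tile's working set fits in cache.
TILE = 8

def tiled_iteration(n):
    """Yield (i, j) points of an n-by-n iteration space, tile by tile."""
    for ti in range(0, n, TILE):
        for tj in range(0, n, TILE):
            for i in range(ti, min(ti + TILE, n)):
                for j in range(tj, min(tj + TILE, n)):
                    yield (i, j)

# The tiled order covers the same iteration space exactly once; only the
# visit order (and hence cache reuse) changes.
n = 20
visited = list(tiled_iteration(n))
assert len(visited) == n * n
assert sorted(visited) == [(i, j) for i in range(n) for j in range(n)]
```

A cache-aware compiler would additionally pick `TILE` per loop nest from its estimates of data footprint and access times, and partition tiles across cores so that synchronization falls on tile boundaries.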

Please visit Professor Veidenbaum’s Web Site

Reducing Power Consumption in Embedded and High-Performance Processors

Power dissipation is a major issue in designing new processors. In particular, CMOS technology scaling has significantly increased the leakage power dissipation, so that it accounts for an increasingly large share of processor power dissipation. One of the main issues is how to achieve power savings without loss of performance.

Much of our work in this area has focused on cache power dissipation. We addressed issues in L1 I- and D-cache dynamic as well as static power consumption. This included way caching to save static and dynamic power in high-associativity caches (as an alternative to way prediction), a cached load-store queue as a low-cost alternative to an L0 cache, and the use of branch prediction information to save power in instruction caches. We also addressed L2 power consumption, in particular leakage power in L2 peripheral circuits. The results of this research are applicable in both embedded and high-performance processors.
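The way-caching idea can be sketched as a small simulation. This is an illustrative model, not the published design: a lookup table remembers which way a block was last found in, so a repeat access probes a single way instead of all of them, saving dynamic power on the untouched ways.

```python
WAYS = 8  # assumed associativity, for illustration

class WayCachedSet:
    """One set of an 8-way cache with a way-cache lookup table."""

    def __init__(self):
        self.tags = [None] * WAYS   # tag stored in each way
        self.way_cache = {}         # tag -> way recorded on an earlier access
        self.next_victim = 0        # simplistic round-robin replacement

    def access(self, tag):
        """Return the number of ways probed for this access."""
        way = self.way_cache.get(tag)
        if way is not None and self.tags[way] == tag:
            return 1                # way-cache hit: probe a single way
        if tag not in self.tags:    # cache miss: fill a way round-robin
            self.tags[self.next_victim] = tag
            self.next_victim = (self.next_victim + 1) % WAYS
        self.way_cache[tag] = self.tags.index(tag)
        return WAYS                 # way-cache miss: probe every way

s = WayCachedSet()
first = s.access(0x40)   # cold access probes all ways
second = s.access(0x40)  # repeat access probes just one
assert (first, second) == (8, 1)
```

Unlike way prediction, a way-cache entry that hits is always correct, so no recovery probe is needed; a real design would bound the table size and handle invalidations.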

Another aspect of this research is low-power instruction queue design for out-of-order processors. CAM-based instruction queues are not scalable and consume a significant amount of power due to wide issue and a CAM search on each cycle. One approach we proposed used a banked queue, dividing the CAM into smaller banks with faster search; a pointer table indicates which bank an instruction belongs to. A more complex approach disposed of the CAM-based queue altogether, using instruction dependence pointers and a RAM-based queue for “direct” wakeup. It also solved the problem of achieving fast branch misprediction recovery when using dependence pointers.
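The contrast between CAM broadcast and pointer-based “direct” wakeup can be sketched in a few lines. In this illustrative model (structure and field names are invented, not the published design), each producer entry holds pointers to its dependents, so completing an instruction wakes consumers with indexed RAM writes rather than a tag comparison against every queue entry.

```python
class Entry:
    def __init__(self, name, n_sources):
        self.name = name
        self.waiting = n_sources   # unresolved source operands
        self.dependents = []       # pointers (queue indices) to consumers

# A tiny dependence chain: load -> add -> store.
queue = [Entry("load", 0), Entry("add", 1), Entry("store", 1)]
queue[0].dependents.append(1)      # the add consumes the load's result
queue[1].dependents.append(2)      # the store consumes the add's result

ready = [i for i, e in enumerate(queue) if e.waiting == 0]
issued = []
while ready:
    i = ready.pop()
    issued.append(queue[i].name)
    for d in queue[i].dependents:  # direct wakeup: indexed write, no broadcast
        queue[d].waiting -= 1
        if queue[d].waiting == 0:
            ready.append(d)

assert issued == ["load", "add", "store"]
```

The RAM-based queue needs no per-entry comparators; the hard part, as noted above, is repairing the pointer state quickly after a branch misprediction.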

Finally, we investigated the problem of power consumption in the register file. A content-aware register file utilized knowledge of instruction operand and effective-address width to reduce the number of bits read from the RF and to speed up TLB access using an “L0 TLB”. This type of register file was also shown to enable a new type of clustered processor with improved performance and reduced power.
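The width test at the heart of the content-aware idea is simple: if an operand's upper bits carry no information, only the low-order bits need to be read. The thresholds below (16-bit narrow mode, 64-bit full mode) are illustrative assumptions, not the parameters of the actual design.

```python
def significant_width(value):
    """Bits needed to represent a non-negative value (at least 1)."""
    return max(value.bit_length(), 1)

def bits_read(value, narrow=16, full=64):
    """Model a register file offering a narrow and a full read mode."""
    return narrow if significant_width(value) <= narrow else full

assert bits_read(0x00FF) == 16      # narrow operand: 16-bit RF read suffices
assert bits_read(0xDEADBEEF) == 64  # wide operand: full-width read
```

In hardware the width would be recorded when the value is written, so the read port can be gated without examining the value first; the same narrow-value observation motivates the “L0 TLB” for short effective addresses.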


Speeding up Mobile Code Execution on Resource-Constrained Embedded Processors

Embedded platforms are increasingly connected to the Web and execute mobile code. These platforms are a resource-constrained environment in which interpreted execution of mobile code is the norm, and highly optimized or dynamic compilation systems are not a suitable choice, primarily due to their high memory requirements. At the same time, the performance of the executed code is of critical importance and is often the limiting factor in both the capabilities of the system and user perception.

The goals of this research project were to significantly improve interpreter performance for mobile code on embedded platforms without increasing its resource requirements, and to design a resource-constrained basic-block dynamic compilation system to be used with an interpreter for adaptive optimization at “low cost”.

The framework proposed to achieve both of these goals is based on “superoperators” and code “annotations”. The former are groups of instructions that can be executed as a unit and optimized together. The latter are a mechanism for passing information from a compiler producing mobile code to the interpreter running on a client system. The proposed approach shifts as much of the work of identifying, compiling, and optimizing superoperators as possible to the compiler, and thus both simplifies and speeds up interpreted execution. This is possible because the annotations can carry additional information (otherwise unavailable in the mobile code) between the compiler and the interpreter. Annotations can also reduce delays and allow small applets to be optimized with little or no overhead when used with adaptive dynamic optimization; currently, such optimization requires dynamic profiling and incurs the associated overhead.
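The superoperator idea can be sketched with a toy bytecode interpreter. The opcodes and encoding below are invented for illustration: a compiler-identified sequence (PUSH x; PUSH y; ADD) is emitted as one fused opcode, so the interpreter pays a single dispatch instead of three and skips the intermediate stack traffic.

```python
def interpret(code):
    """A toy stack-machine interpreter with one superoperator."""
    stack, pc = [], 0
    while pc < len(code):
        op = code[pc]
        if op == "PUSH":
            stack.append(code[pc + 1]); pc += 2
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b); pc += 1
        elif op == "PUSH_PUSH_ADD":            # superoperator: one dispatch,
            stack.append(code[pc + 1] + code[pc + 2])  # no intermediate pushes
            pc += 3
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return stack

plain = interpret(["PUSH", 2, "PUSH", 3, "ADD"])  # three dispatches
fused = interpret(["PUSH_PUSH_ADD", 2, 3])        # one dispatch
assert plain == fused == [5]
```

In the proposed framework the compiler would identify such sequences offline and pass them to the interpreter via annotations, so the client never pays for profiling or pattern discovery at run time.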
