profiles = new Array(
    "HCW Keynote: Aspects of Heterogeneous Computing in the Open MPI Environment|HCW Keynote: Aspects of Heterogeneous Computing in the Open MPI Environment Richard L. Graham There are several aspects to heterogeneous computing, with this talk focusing on several of these, as they relate to execution of a single parallel job. This includes processor and network heterogeneity, as well as the support for making this heterogeneity transparent in the run-time system. Design and implementation choices made by the Open MPI/Open RTE collaboration will be discussed, with an emphasis placed on the effort to make these choices transparent to the end users - both for MPI and non-MPI parallel jobs. The main standard libraries used for scientific simulation over the last decade and a half are the Message Passing Interface (MPI) and the Parallel Virtual Machine (PVM) libraries. Early implementations of PVM supported processor and network heterogeneity, allowing a single application to run in a hybrid environment. However, this came with a significant performance penalty. Implementors of the MPI standard, the de-facto communications standard for scientific computing, have tended to focus most of their efforts on highly optimized, single system implementations, largely ignoring the challenges posed with the goal of implementing an efficient MPI for a heterogeneous environment. In addition, implementations that enable running an application in a heterogeneous environment have tended to expose these details at the application level, requiring the applications to deal explicitly with these environment, limiting the extent to which heterogeneous computing has caught on. Building on experience gained with the LA-MPI, LAM/MPI, FT-MPI, and PACX-MPI, the Open MPI project is to enabling applications to run effectively on available hardware. This effort is aimed at removing the practical requirement restricting single application runs to a single type of hardware, and increasing the ability to utilize available hardware. Special attention is given to hiding these details from the end-user, so that, from an applications perspective, running on a homogeneous system is essentially the same as running on a homogenous system. Support is included for high performance communications in a hybrid environment, as well as run-time support for heterogeneous environments. This talk will describe the component architecture used by the Open MPI project which forms the foundation for providing instance specific implementations of a particular functionality, such as point-to-point communications between two specific end-points. The design choice to provide fine level control over the algorithms deployed within a single job provides a key architectural feature enabling optimal use of system resources. This talk focuses on point-to-point communications and the run-time (Open RTE) environment used to create, monitor, and terminate parallel job execution - both MPI and non-MPI jobs. The point-to-point communications discussion will focus on the architectural features enabling pair-wise communications tailored to the requirements posed only by the specific pair, as well as those that enable the simultaneous use of different network types between a given pair of communication end points. Performance data will also be presented. Aspects of Open RTE that enable distributed process monitoring and control in a heterogeneous environment will be discussed. \\\\~\\\\ % % \\noindent\\textbf Speaker biography : Richard Graham is the Computer Systems and Software Environment (ASC) Program manager, and the Advanced Computing Laboratory acting group leader at the Los Alamos National Laboratory. He joined LANL's Advanced Computing Laboratory (ACL) as a technical staff member in 1999. As team leader for the Resilient Technologies Team he started the LA-MPI project, and is one of the founders of the Open MPI collaboration. Prior to joining the ACL, he spent seven years working at Cray Research and SGI. Rich obtained his PhD in Theoretical Chemistry from Texas A\\&M University in 1990 and did post-doctoral work at the James Franck Institute of the University of Chicago. His BS in chemistry was from Seattle Pacific University.||||||||||||./pdfs/001-HCW-paper-1.pdf",
    "HIPS Keynote: Towards a Sophisticated Grid Workflow Development and Computi|HIPS Keynote: Towards a Sophisticated Grid Workflow Development and Computing Environment Thomas Fahringer While Grid infrastructures can provide massive compute and data storage power, it is still an art to effectively harness the power of Grid computing. Current application development for Grid commonly requires the programmer to deal with many low level and complex details such as selecting software components on specific Grid computers, mapping applications onto the Grid, explicitly specify data transfer operations, etc. In this talk we will present the ASKALON environment whose goal is to create an invisible Grid for both Grid users and application developers. ASKALON is centered around a set of high-level services for transparent and effective Grid access, including a Scheduler for optimized mapping of workflows onto the Grid, an Enactment Engine for reliable application execution, a Resource Manager covering both computers and application components, and a Performance Prediction and Analysis service based on a traing phase, analytical models and dynamic measurements. A sophisticated XML-based programming interface that shields the user form the Grid middleware details, allows the high-level composition of workflow applications. ASKALON is used to develop and port scientific applications as workflows in the Austrian Grid. Experimental results using several real-world scientific applications to demonstrate the effectiveness of ASKALON will be demonstrated.||||||||||||./pdfs/001-HIPS-paper-1.pdf",
    "A Nature-inspired Algorithm for the Disjoint Paths Problem |A Nature-inspired Algorithm for the Disjoint Paths Problem Maria J. Blesa Christian Blum One of the basic operations in communication networks consists in establishing routes for \\emph connection requests between physically separated network nodes. In many situations, either due to technical constraints or to quality-of-service and survivability requirements, it is required that no two routes interfere with each other. These requirements apply in particular to routing and admission control in large-scale, high-speed and optical networks. The same requirements also arise in a multitude of other applications such as real-time communications, \\textsc vlsi design, scheduling, bin packing, and load balancing. This problem can be modeled as a combinatorial optimization problem as follows. Given a graph $G$ representing a network topology, and a collection $T=\\ (s_1,t_1)\\ldots(s_k,t_k)\\ $ of pairs of vertices in $G$ representing connection request, the maximum \\emph edge-disjoint paths problem is an \\textsf NP -hard problem that consists in determining the maximum number of pairs in $T$ that can be routed in $G$ by mutually edge-disjoint $s_i-t_i$ paths. We propose an \\emph ant colony optimization (\\textsc aco ) algorithm to solve this problem. \\textsc aco algorithms are approximate algorithms that are inspired by the foraging behavior of real ants. The decentralized nature of these algorithms makes them suitable for the application to problems arising in large-scale environments. First, we propose a basic version of our algorithm in order to outline its main features. In a subsequent step we propose several extensions of the basic algorithm and we conduct an extensive parameter tuning in order to show the usefulness of those extensions. In comparison to a multi-start greedy approach, our algorithm generates in general solutions of higher quality in a shorter amount of time. In particular the run-time behaviour of our algorithm is one of its important advantages.||||||||||||./pdfs/001-NIDISC-paper-1.pdf",
    "On-chip and On-line Self-Reconfigurable Adaptable Platform: the Non-Unifor|On-chip and On-line Self-Reconfigurable Adaptable Platform: the Non-Uniform Cellular Automata Case Andres Upegui Eduardo Sanchez In spite of the high parallelism exhibited by cellular automata architectures, most implementations are usually run in software. For increasing execution parallelism, hardware implementations on FPGAs have been proposed, under the cost of being un-flexible, and inefficient in terms of resource utilization. In this paper we present a platform for evolving CA by exploiting the partial re-configurability of current commercial FPGAs. Our implementation includes an on-chip soft-processor that generates a partial bitstream, reconfigures the FPGA, and computes the fitness. After finding a good individual, the evolved CA can be used as a peripheral for performing useful computation. As case study we present CA co-evolution for a random number generator and for the firefly synchronization problem.||||||||||||./pdfs/001-RAW-paper-1.pdf",
    "SMTPS Keynote: Research and Technology Advances in Systems Software for Lar|SMTPS Keynote: Research and Technology Advances in Systems Software for Large Scale Computing Systems Frederica Darema The talk will address research and technology advances for optimized and dependable execution in large scale computing environments. Applications in nearly all sectors, scientific, engineering, and commercial, are becoming more encompassing in including the behaviors of the systems of the systems they represent, and becoming at the same time more powerful but also more complex. At the same time, driven by application requirements and enabled by hardware technology advances, computational platforms are becoming as well increasingly more powerful but also more complex. Efficient and effective development of applications, optimized use of the computational resources, and guaranteeing quality of service and dependability at all layers of the computational system, requires systems software advances, such as in programming environments, application composition systems, optimized application mapping and dynamic runtime technologies, debugging and check-pointing methods, and performance-engineered hardware and software capabilities at all layers. An overarching consideration, and thesis of this talk, is that these advances need to be made in a synergistic and integrated manner, taking a systems-view in developing these enabling technologies, rather than advancing each of the individual technologies in an isolated manner.||||||||||||./pdfs/001-SMTPS-paper-1.pdf",
    "HCW Panel: Programming heterogeneous systems - Less pain! Better performanc|HCW Panel: Programming heterogeneous systems - Less pain! Better performance! Jos&eacute; Fortes \\noindent\\textbf Chair: Jos\\' e Fortes\\\\~\\\\ % \\noindent\\textbf Panelists: \\\\ Richard Graham (Los Alamos National Laboratory)\\\\ Alexey Lastovetsky, (University College Dublin)\\\\ Samuel Midkiff (Purdue University)\\\\ ~\\\\~\\\\ \\noindent\\textbf Abstract: Heterogeneity in computing systems has been driven by, among others, one or more of the following factors: need for better performance (e.g. in multi-core chips), applications' requirements (e.g. in digital processing systems), timing and logistics of computer facility development (e.g. clusters that are extended and upgraded over time) and emergent systems of systems (e.g. for Grid-computing). The premise that these heterogeneous computing systems (HCS's) offer cost and performance benefits is true only if they can be efficiently programmed. The perspectives and questions on programming HCS's considered by this panel include the following: \\begin enumerate \\item Should programmers be exposed to heterogeneity so that they can squeeze all the necessary performance by taking advantage of the best resources for the jobs that need them? How can we handle the glut of programmers wishing to program such complex and extensive systems? \\item Should we design compilers that schedule and optimize programs written for homogeneous systems for execution on heterogeneous systems? Will the resulting programs run with better performance than they could achieve in a homogeneous system? How should ?better? be defined in this context? \\item Are programming languages irrelevant in the sense that one can always connect components and/or services to accomplish any task? How generally applicable is this programming ?model?? \\item Are HCS's inherently distributed memory entities where shared-memory programming models cannot succeed? Can shared- and distributed-memory models coexist? How do currently available languages fare in this regard? \\end enumerate||||||||||||./pdfs/002-HCW-paper-1.pdf",
    "iWarp Protocol Kernel Space Software Implementation |iWarp Protocol Kernel Space Software Implementation Dennis Dalessandro Ananth Devulapalli Pete Wyckoff Zero-copy, RDMA, and protocol offload are three very important characteristics of high performance interconnects. Previous networks that made use of these techniques were built upon proprietary, and often expensive, hardware. With the introduction of iWarp, it is now possible to achieve all three over existing low-cost TCP/IP networks. iWarp is a step in the right direction, but currently requires an expensive RNIC to enable zero-copy, RDMA, and protocol offload. While the hardware is expensive at present, given that iWarp is based on a commodity interconnect, prices will surely fall. In the meantime only the most critical of servers will likely make use of iWarp, but in order to take advantage of the RNIC both sides must be so equipped. It is for this reason that we have implemented the iWarp protocol in software. This allows a server equipped with an RNIC to exploit its advantages even if the client does not have an RNIC. While throughput and latency do not improve by doing this, the server with the RNIC does experience a dramatic reduction in system load. This means that the server is much more scalable, and can handle many more clients than would otherwise be possible with the usual sockets/TCP/IP protocol stack.||||||||||||./pdfs/003-CAC-paper-1.pdf",
    "Increasing Analog Programmability in SoCs |Increasing Analog Programmability in SoCs Erik Sch&uuml;ler Luigi Carro The use of programmability in Systems-on-Chip (SoC) brings as the main advantage the possibility of reducing the time-to-market and the cost of design, specially when different systems and functions must cover different markets, going from low-power and low-frequency instrumentation to high frequency communication. This paper presents a technique that can be used to increase the analog programmability in a SoC, also allowing one to integrate more analog functions, while guaranteeing the use of the analog part in a larger range of applications. Practical results are presented showing that the proposed technique can be used from DC to RF applications.||||||||||||./pdfs/003-RAW-paper-1.pdf",
    "Fast Barrier Synchronization for InfiniBand |Fast Barrier Synchronization for InfiniBand Torsten Hoefler Torsten Mehlan Frank Mietke Wolfgang Rehm The MPI\\_Barrier() call can be crucial for several applications and has been target of different optimizations since several decades. The best solution to the barrier problem scales with $O(log_2N)$ and uses the dissemination principle. A new method using an enhanced dissemination principle and inherent network parallelism will be demonstrated in this paper. The new approach was able to speedup the barrier performance by 40\\% in relation to the best published algorithm. It is shown that it is possible to leverage the inherent hardware parallelism inside the InfiniBand\\texttrademark network to lower the latency of the MPI\\_Barrier() operation without additional costs. The principle of sending multiple messages in (pseudo-) parallel can be implemented into a well known algorithm to decrease the number of rounds and speed the overall operation up.||||||||||||./pdfs/004-CAC-paper-1.pdf",
    "Ant Stigmergy on the Grid: Optimizing the Cooling Process in Continuous Ste|Ant Stigmergy on the Grid: Optimizing the Cooling Process in Continuous Steel Casting Peter Korosec Jurij Silc Bogdan Filipic Erkki Laitinen Most of the world steel production is nowadays based on continuous casting. This is a complex metallurgical process in which liquid steel is cooled and shaped into semi-manufactures. To achieve proper quality of cast steel, it is essential to control the metal flow and heat transfer during the casting process. They depend on numerous parameters, such as the casting temperature, casting speed and coolant flows. The paper presents a new distributed metaheuristic algorithm in an optimal control problem related to the cooling process in the continuous casting of steel. The optimization task is to tune 18 coolant flows in the caster secondary cooling system to achieve the target surface temperatures along the slab. Sequential search algorithms are proved inefficient for this problem because they take too much time to compute an appropriate solution. For this reason a new distributed search algorithm based on stigmergy perceived in ant colony was developed. The algorithm was run on the Grid that allows us to solve this optimization problem in much shorter time. As a matter of fact, the computation time can be decreased from half a day to a few hours without any decrease in the solution quality.||||||||||||./pdfs/004-NIDISC-paper-1.pdf",
    "Towards MPI progression layer elimination with TCP and SCTP|Towards MPI progression layer elimination with TCP and SCTP Brad Penoff Alan Wagner MPI middleware glues together the components necessary for execution. Almost all implementations have a communication component also called a message progression layer that progresses outstanding messages and maintains their state. The goal of this work is to thin or eliminate this communication component by pushing the functionality down onto the standard IP stack in order to take advantage of potential advances in commodity networking. We introduce a TCP-based design that successfully eliminates the communication component. We discuss how this eliminated TCP-based design doesn't scale and show a more scalable design based on the Stream Control Transmission Protocol (SCTP) that has a thinned communication component. We compare the designs showing why SCTP one-to-many sockets in their current form can only thin the communication component. We show what additional features would be required of SCTP to enable a practical design with a fully eliminated communication component.||||||||||||./pdfs/005-HIPS-paper-1.pdf",
    "Distributed Workflow Coordination: Molecules and Reactions |Distributed Workflow Coordination: Molecules and Reactions Zsolt Nemeth Christian Perez Thierry Priol Workflow execution on large-scale heterogeneous distributed computing systems, such as Grids, requires a complex coordination. Activities of complex workflow patterns must be matched with entities of the computing system that possesses highly dynamic properties. We pinpoint the key concept of such workflow coordination as actions according to actual and local conditions -- analogously to chemical reactions. Modeling workflow enactment as molecules and reactions, formalized in the nature inspired $\\gamma$-calculus, yielded an autonomously evolving, distributed, decentralized coordination model that can adapt to a dynamically changing environment.||||||||||||./pdfs/005-NIDISC-paper-1.pdf",
    "A Metaheurisitc Based on Fusion and Fission for Partitioning Problems |A Metaheurisitc Based on Fusion and Fission for Partitioning Problems Charles-edmond Bichot Metaheuristics are very useful methods because they can find (approximate) solutions of a great variety of problems. One of them, which interests us, is graph partitioning. We present a new metaheuristic based on nuclear fusion and fission of atoms. This metaheuristic, called fusion fission, is compared to other classical algorithms. First, we present spectral and multilevel algorithms which are used to solve partitioning problems. Secondly, we present two metaheuristics applied to partitioning problems : simulated annealing and ant colony algorithms. We will show that fusion fission gives good results, compared to the other algorithms. We demonstrate on a problem of Air Traffic Control that metaheuristics methods can give better results than specific methods.||||||||||||./pdfs/006-NIDISC-paper-1.pdf",
    "Seekable Sockets: A Mechanism to Reduce Copy Overheads in TCP-based Messagi|Seekable Sockets: A Mechanism to Reduce Copy Overheads in TCP-based Messaging Chase Douglas Vijay S. Pai This paper extends the traditional socket interface to TCP/IP communication with the ability to seek rather than simply receive data in order. Seeking on a TCP socket allows a user program to receive data without first receiving all previous data on the connection. Through repeated use of seeking, a messaging application or library can treat a TCP socket as a list of messages with the potential to receive and remove data from any arbitrary point rather than simply the head of the socket buffer. Seeking facilitates copy-avoidance between a messaging library and user code by eliminating the need to first copy unwanted data into a library buffer before receiving desired data that appears later in the socket buffer. The seekable sockets interface is implemented in the Linux 2.6.13 kernel. Experimental results are gathered using a simple microbenchmark that receives data out-of-order from a given socket, yielding up to a 40\\% reduction in processing time. The code for seekable sockets is now available for patching into existing Linux kernels and for further development into messaging libraries.||||||||||||./pdfs/007-CAC-paper-1.pdf",
    "A Nonself Space Approach to Network Anomaly Detection |A Nonself Space Approach to Network Anomaly Detection Marek Ostaszewski Franciszek Seredynski Pascal Bouvry The paper presents an approach for the anomaly detection problem based on principles of immune systems. Flexibility and efficiency of the anomaly detection system are achieved by building a model of network behavior based on self-nonself space paradigm. Covering both self and nonself spaces by hyperrectangular structures is proposed. Structures corresponding to self-space are built using a training set from this space. Hyperrectangular detectors covering nonself space are created using niching genetic algorithm. Coevolutionary algorithm is proposed to enhance this process. Results of conducted experiments show a high quality of intrusion detection which outperforms the quality of recently proposed approach based on hypersphere representation of self-space.||||||||||||./pdfs/007-NIDISC-paper-1.pdf",
    "A Preliminary Analysis of the InfiniPath and XD1 Network Interfaces |A Preliminary Analysis of the InfiniPath and XD1 Network Interfaces Ron Brightwell Doug Doerfler Keith D. Underwood Two recently delivered systems have begun a new trend in cluster interconnects. Both the InfiniPath network from PathScale, Inc., and the RapidArray fabric in the XD1 system from Cray, Inc., leverage commodity network fabrics while customizing the network interface in an attempt to add value specifically for the high performance computing (HPC) cluster market. Both network interfaces are compatible with standard InfiniBand (IB) switches, but neither use the traditional programming interfaces to support MPI. Another fundamental difference between these networks and other modern network adapters is that much of the processing needed for the network protocol stack is performed on the host processor(s) rather than by the network interface itself. This approach stands in stark contrast to the current direction of most high-performance networking activities, which is to offload as much protocol processing as possible to the network interface. In this paper, we provide an initial performance comparison of the two partially custom networks (PathScale's InfiniPath and Cray's XD1) with a more commodity network (standard IB) and a more custom network (Quadrics Elan4). Our evaluation includes several micro-benchmark results as well as some initial application performance data.||||||||||||./pdfs/008-CAC-paper-1.pdf",
    "Iterators in Chapel |Iterators in Chapel Mackale Joyner Bradford L. Chamberlain Steven J. Deitz A long-held tenet of software engineering is that algorithms and data structures should be specified orthogonally in order to minimize the impact that changes to one will have on the other. Unfortunately, this principle is often not well-supported in scientific and parallel codes due to the lack of abstractions for factoring iteration away from computation in traditional scientific languages. The result is a fragile situation in which complex loop nests are used to express parallelism and maximize performance, yet must be maintained individually as the algorithm and data structures evolve. In this paper, we introduce the iterator concept in the Chapel parallel programming language, designed to address this problem and provide a means for factoring iteration away from computation. The paper illustrates iterators using several examples, compares our approach with those taken in other languages, and describes our implementation in the Chapel compiler.||||||||||||./pdfs/008-HIPS-paper-1.pdf",
    "Parallel Implementation of Evolutionary Strategies on Heterogeneous Cluster|Parallel Implementation of Evolutionary Strategies on Heterogeneous Clusters with Load Balancing Juan Francisco Garamendi Jose Luis Bosque This paper presents a load balancing algorithm for a parallel implementation of an evolutionary strategy on heterogeneous clusters. Evolutionary strategies can efficiency solve a diverse set of optimization problems. Due to cluster heterogeneity and in order to improve the speedup of the parallel implementation a load balancing algorithm has been implemented. This load balancing algorithm takes into account cluster heterogeneity and it is based on an optimal intial distribution. This initial distribution is determined based on the cluster nodes' computational powers, that are dinamically measured in each slave node by an ad hoc load-bechmark. The implementation presents very satisfactory parallelization results, both in performance and scalability and Super-linear speedup is reached for several tests configurations. Experimental results show excellent perfomence, increasing the improvements with the load balancing algorithm.||||||||||||./pdfs/008-NIDISC-paper-1.pdf",
    "Communication Patterns |Communication Patterns Rolf Riesen Parallel applications have message-passing patterns that are important to understand. Network topology, routing decisions, and connection and buffer management need to match the communication patterns of an application for it to run efficiently and scale well. These patterns are not easily discerned from the source code of an application, and even when the data is available it is not easy to categorize it appropriately such that meaningful knowledge emerges. We describe a novel system to gather the information we need to discover an application's communication pattern. We create five categories that help us analyze that data and explain how information from each category can be useful in the design of networking hardware and software. We use the NAS parallel benchmarks as examples on how to apply our techniques.||||||||||||./pdfs/009-CAC-paper-1.pdf",
    "Placement and Routing of Boolean Functions in constrained FPGAs using a Dis|Placement and Routing of Boolean Functions in constrained FPGAs using a Distributed Genetic Algorithm and Local Search. Manuel Rubio Del Solar Juan Manuel S&aacute;nchez P&eacute;rez Juan Antonio G&oacute;mez Pulido Miguel &Aacute;ngel Vega Rodr&iacute;guez In this work we present a system for implementing the placement and routing stages in the FPGA cycle of design, into the physical design stage. We start with the ISCAS benchmarks, on EDIF format, of Boolean functions to be implemented. They are processed by a parser in order to obtain an internal representation which is able to be processed by a Genetic Algorithm (GA) tool. This tool develops the Placement and Routing tasks, considering possible restricted area into the FPGA. In order to help to the GA to make the Routing stage we have added a local search procedure. That local search gets a path between two points without considering neither their placement nor the restricted areas among them. The GA is fully customizable, featuring the ability to work with one or several islands. The experiments have verified that using distributing execution improves the costs and speeds up the convergence towards better results in smaller slots of time.||||||||||||./pdfs/009-NIDISC-paper-1.pdf",
    "Efficient RDMA-based Multi-port Collectives on Multi-rail QsNetII Clusters |Efficient RDMA-based Multi-port Collectives on Multi-rail QsNetII Clusters Ying Qian Ahmad Afsahi Many scientific applications use MPI collective communications intensively. Therefore, efficient and scalable implementation of collective operations is critical to the performance of such applications running on clusters. Quadrics QsNetII is a high-performance interconnect for clusters that implements some collectives at the Elan level. These collectives are directly used by their corresponding MPI collectives. Quadrics software supports point-to-point striping over multi-rail QsNetII networks. However, multi-rail collectives have not been supported. In this work, we propose a number of RDMA-based multi-port collectives over multi-rail QsNetII clusters directly at the Elan level. Our performance results indicate that the proposed multi-port gather gains an improvement of up to 6.35 for 1MB message over the native elan\\_gather. The proposed multi-port all-to-all performs better than the native elan\\_alltoall by a factor of 2.19 for 16KB message. Moreover, we have also proposed two algorithms for the scatter operation.||||||||||||./pdfs/010-CAC-paper-1.pdf",
    "Evaluating Parallel Simulated Evolution Strategies for VLSI Cell Placement |Evaluating Parallel Simulated Evolution Strategies for VLSI Cell Placement Sadiq M. Sait Mustafa Imran Ali Ali Mustafa Zaidi Simulated Evolution (SimE) is an evolutionary metaheuristic that has produced results comparable to well established stochastic heuristics such as SA, TS and GA, with shorter runtimes. However, for problems with a very large set of elements to optimize, such as in VLSI placement and routing, runtimes can still be very large and parallelization is an attractive option. Compared to other metaheuristics, parallelization of SimE has not been extensively explored. This paper presents a comprehensive set of parallelization approaches for SimE when applied to multiobjective VLSI cell placement problem. Each of these approaches are evaluated with respect to SimE characteristics and the constraints imposed by the problem instance. Conclusions drawn can be extended to parallelization of other SimE based optimization problems.||||||||||||./pdfs/010-NIDISC-paper-1.pdf",
    "A Configurable Framework for Stream Programming Exploration in Baseband App|A Configurable Framework for Stream Programming Exploration in Baseband Applications Jerker Bengtsson Bertil Svensson This paper presents a configurable framework to be used for rapid prototyping of stream based languages. The framework is based on a set of design patterns defining the elementary structure of a domain specific language for high-performance signal processing. A stream language prototype for baseband processing has been implemented using the framework. We introduce language constructs to efficiently handle dynamic reconfiguration of distributed processing parameters. It is also demonstrated how new language specific primitive data types and operators can be used to efficiently and machine independently express computations on bit-fields and data-parallel vectors. These types and operators yield code that is readable, compact and amenable to a stricter type checking than is common practice. They make it possible for a programmer to explicitly express parallelism to be exploited by a compiler. In short, they provide a programming style that is less error prone and has the potential to lead to more efficient implementations.||||||||||||./pdfs/011-HIPS-paper-1.pdf",
    "Efficient SMP-Aware MPI-Level Broadcast over InfiniBand |Efficient SMP-Aware MPI-Level Broadcast over InfiniBand Amith Rajith Mamidala Lei Chai Hyun-wook Jin Dhabaleswar K Panda Most of the high-end computing clusters found today feature multi-way SMP nodes interconnected b an ultra-low latency and high bandwidth network. InfiniBand is emerging as a high-speed network for such systems. InfiniBand provides a scalable and efficient hardware multicast primitive to efficiently implement many MPI collective operations. However, employing hardware multicast as the communication method may not perform well in all cases. This is true especially when more than one process is running per node. In this context, shared memory channel becomes the desired communication medium within the node as it delivers latencies which are of an order of magnitude lower than the inter-node message latencies. Thus, to deliver optimal collective performance, coupling hardware multicast with shared memory channel becomes necessary. In this paper we propose mechanisms to address this issue. On a 16-node 2-way SMP cluster, the Leader-based scheme proposed in this paper improves the performance of the MPI\\_Bcast operation by a factor of as much as 2.3 and 1.8 when compared to the point-to-point and original solution employing only hardware multicast. We have also evaluated our designs on NUMA based system and obtained a performance improvement of 1.7 using our designs on 2-node 4-way system. We also propose a Dynamic Attach Policy as an enhancement to this scheme to mitigate the impact of process skew on the performance of the collective operation.||||||||||||./pdfs/012-CAC-paper-1.pdf",
    "Automatic Code Generation for Distributed Memory Architectures in the Polyt|Automatic Code Generation for Distributed Memory Architectures in the Polytope Model Michael Cla&szlig;en Martin Griebl The polytope model has been used successfully as a tool for program analysis and transformation in the field of automatic loop parallelization. However, for the final step of automatic code generation, the generated code is either only usable on shared memory architectures or severely restricts the parallelization methods that can be applied. In this paper, we present a fully automated method for generating efficient target code, which is executable on clusters that are based on a distributed memory architecture. We also provide speedup results of experiments on a local cluster.||||||||||||./pdfs/012-HIPS-paper-1.pdf",
    "Advances in Applying Genetic Programming to Machine Learning, Focussing on |Advances in Applying Genetic Programming to Machine Learning, Focussing on Classification Problems Stephan Winkler Michael Affenzeller Stefan Wagner A Genetic Programming based approach for solving classification problems is presented in this paper. Classification is understood as the act of placing an object into a set of categories, based on the object's properties; classification algorithms are designed to learn a function which maps a vector of object features into one of several classes. This is done by analyzing a set of input-output examples (``training samples'') of the function. Here we present a method based on the theory of Genetic Algorithms and Genetic Programming that interprets classification problems as optimization problems: Each presented instance of the classification problem is interpreted as an instance of an optimization problem, and a solution is found by a heuristic optimization algorithm. The major new aspects presented in this paper are suitable genetic operators for this problem class (mainly the creation of new hypotheses by merging already existing ones and their detailed evaluation) we have designed and implemented. The experimental part of the paper documents the results produced using new hybrid variants of genetic algorithms as well as investigated parameter settings.||||||||||||./pdfs/012-NIDISC-paper-1.pdf",
    "Asynchronous Zero-copy Communication for Synchronous Sockets in the Sockets|Asynchronous Zero-copy Communication for Synchronous Sockets in the Sockets Direct Protocol (SDP) over InfiniBand P. Balaji S. Bhagvat H. W. Jin D. K. Panda Sockets Direct Protocol (SDP) is an industry standard pseudo sockets-like implementation to allow existing sockets applications to directly and transparently take advantage of the advanced features of current generation networks such as InfiniBand. The SDP standard supports two kinds of sockets semantics, viz., Synchronous sockets (e.g., used by Linux, BSD, Windows) and Asynchronous sockets (e.g., used by Windows, upcoming support in Linux). Due to the inherent benefits of asynchronous sockets, the SDP standard allows several intelligent approaches such as \\em source-avail and sink-avail based zero-copy for these sockets. Unfortunately, most of these approaches are not beneficial for the synchronous sockets interface. Further, due to its portability, ease of use and support on a wider set of platforms, the synchronous sockets interface is the one used by most sockets applications today. Thus, a mechanism by which the approaches proposed for asynchronous sockets can be used for synchronous sockets is highly desirable. In this paper, we propose one such mechanism, termed as \\em AZ-SDP (Asynchronous Zero-Copy SDP) , where we memory-protect application buffers and carry out communication asynchronously while maintaining the synchronous sockets semantics. We present our detailed design in this paper and evaluate the stack with an extensive set of benchmarks. The experimental results demonstrate that our approach can provide an improvement of close to 35\\% for medium-message uni-directional throughput and up to a factor of 2 benefit for computation-communication overlap tests and multi-connection benchmarks.||||||||||||./pdfs/013-CAC-paper-1.pdf",
    "A Parallel Exact Hybrid Approach for Solving Multi-Objective Problems on th|A Parallel Exact Hybrid Approach for Solving Multi-Objective Problems on the Computational Grid Mohand Mezmaz Nouredine Melab El-ghazali Talbi This paper presents a parallel hybrid exact multi-objective approach which combines two metaheuristics - a genetic algorithm (GA) and a memetic algorithm (MA), with an exact method - a branch and bound (B\\&B) algorithm. Such approach profits from both the exploration power of the GA, the intensification capability of the MA and the ability of the B\\&B to provide optimal solutions with proof of optimality. To fully exploit the resources of a computational grid, the hybrid method is parallelized according to three well-known parallel models - the island model for the GA, the multi-start model for the MA and the parallel tree exploration model for the B\\&B. The obtained method has been experimented and validated on a bi-objective flow-shop scheduling problem. The approach allowed to solve exactly for the first time an instance of the problem - 50 jobs on 5 machines. More than 400 processors belonging to 4 administrative domains have contributed to the resolution process during more than 6 days.||||||||||||./pdfs/013-NIDISC-paper-1.pdf",
    "A Look at Application Performance Sensitivity to the Bandwidth and Latency |A Look at Application Performance Sensitivity to the Bandwidth and Latency of Infiniband Networks. Darren J. Kerbyson This work explores the expected performance of three applications on a High Performance Computing cluster interconnected using Infiniband. In particular, the expected performance across a range of configurations is analyzed notably Infiniband 4x, 8x and 12x representing link-speeds of 10Gb/s, 20Gb/s, and 30Gb/s respectively as well as near-neighbor MPI message latencies of 4$\\mu$s and 1.5$\\mu$s. In addition we also consider the impact of node size, from one to eight processors that share a single network connection. The performance analysis is based on the use of detailed performance models of the three applications developed at Los Alamos. The results of the analysis show that the application performance can range by as much as 60\\% from best to worst. The relative importance of bandwidth, latency and node size differs between the applications.||||||||||||./pdfs/014-CAC-paper-1.pdf",
    "Workforce Planning with Parallel Algorithms |Workforce Planning with Parallel Algorithms Enrique Alba Gabriel Luque Francisco Luna Workforce planning is an important activity that enables organizations to determine the workforce needed for continued success. A workforce planning problem is a very complex task that requires modern techniques to be solved adequately. In this work, we describe the development of two parallel metaheuristic methods, a parallel genetic algorithm and a parallel scatter search, which can find high-quality solutions to 20 different problem instances. Our experiments show that parallel versions do not only allow to reduce the execution time but they also improve the solution quality.||||||||||||./pdfs/015-NIDISC-paper-1.pdf",
    "Self-Organized Task Allocation for Computing Systems with Reconfigurable Co|Self-Organized Task Allocation for Computing Systems with Reconfigurable Components Daniel Merkle Martin Middendorf Alexander Scheidler A self-organized allocation scheme for service tasks in computing systems is proposed in this paper. Usually components of a computing system need some service from time to time in order perform their work efficiently. In adaptive computing systems the components and the necessary tasks adapt to the needs of users or the environment. Since in such cases the type of service tasks will often change it is attractive to use reconfigurable hardware to perform the service tasks. The studied system consists of normal worker components and helper components which have reconfigurable hardware and can perform different service tasks. The speed with which a service tasks is executed by a helper depends on its actual configuration. Different strategies for the helpers to decide about service task acceptance and reconfiguration are proposed. These strategies are inspired by stimulus-threshold models that are used to explain task allocation in social insects.||||||||||||./pdfs/016-NIDISC-paper-1.pdf",
    "Techniques and Tools for Dynamic Optimization |Techniques and Tools for Dynamic Optimization Jason D. Hiser Naveen Kumar Min Zhao Shukang Zhou Bruce R. Childers Jack W. Davidson Mary Lou Soffa Traditional code optimizers have produced significant performance improvements over the past forty years. While promising avenues of research still exist, traditional static and profiling techniques have reached the point of diminishingreturns. The main problem is that these approaches have only a limited view of the program and have difficulty taking advantage of the actual run-time behavior of a program. We are addressing this problem through the development of a dynamic optimization system suited for aggressive optimization?using the full power of the most beneficial optimizations. We have designed our optimizer to operate using a software dynamic translation (SDT) execution system. Difficult challenges in this research include reducing SDT overhead and determining what optimizations to apply and where in the code to apply them. Another challenge is having the necessary tools to ensure the reliability of software that is dynamically optimized. In this paper, we describe our efforts in reducing overhead in SDT and efficient techniques for instrumenting the application code. We also describe our approach to determine what and where an optimization should be applied. We discuss other fundamental issues in developing a dynamic optimizer and finally present a basic debugger for SDT systems.||||||||||||./pdfs/02-NSFNGS-paper-1.pdf",
    "Implementation of a Reconfigurable Hard Real-Time Control System for Mechat|Implementation of a Reconfigurable Hard Real-Time Control System for Mechatronic and Automotive Applications Steffen Toscher Roland Kasper Thomas Reinemann Control algorithms implemented directly in hardware take advantage of parallel signal processing. Furthermore, implementing controller functionality in reconfigurable hardware facilitates modification of controller structure and parameters during run-time. In this paper, we introduce an implemented and tested reconfigurable hard real-time control system based on an FPGA device. It supports dynamic partial reconfiguration of controller functionality by self-reconfiguration mechanisms. Self-reconfiguration is performed using an internal configuration interface. We also present sophisticated on-chip and off-chip communication solutions. Specification of controller functionality involves Finite State Machines (FSMs) and comprises parts of the distributed communication and reconfiguration solution.||||||||||||./pdfs/027-RAW-paper-1.pdf",
    "A High-level Target-precise Model for Designing Reconfigurable HW Tasks |A High-level Target-precise Model for Designing Reconfigurable HW Tasks Maik Boden Steffen Ruelke Juergen Becker The increasing complexity of embedded digital HW/SW systems, rising chip development and fabrication costs, and a shortened time-to-market require system-level design methods and the use of reconfigurable architectures. Our design method concerns the modelling of a system and its HW tasks at a high abstraction level. Using design patterns and macros, our library-based approach provides a consistent flow from an executable specification to its implementation. These templates ease the efficient application of partially run-time reconfigurable architectures. A case study depicts the high-level modelling of a HW task and its implementation in detail.||||||||||||./pdfs/029-RAW-paper-1.pdf",
    "Program Phase Detection and Exploitation |Program Phase Detection and Exploitation Chen Ding Sandhya Dwarkadas Michael C. Huang Kai Shen John B. Carter Studies of application behavior reveal the nested repetition of large and small program phases, with significant variation among phases in such characteristics as memory reference patterns, memory and energy usage, I/O activity, and occupancy of micro-architectural resources. In this project, we study theories and techniques for reliably predicting and exploiting phased behavior, so an advanced execution environment may allocate resources in a way that better matches program needs, or to transform programs so that their needs better match the available resources. In this paper, we present the basic components of the study and report the progress in the past half year.||||||||||||./pdfs/03-NSFNGS-paper-1.pdf",
    "Run-Time Reconfiguration of Communication in SIMD Architectures |Run-Time Reconfiguration of Communication in SIMD Architectures Hamed Fatemi Bart Mesman Henk Corporaal Twan Basten Pieter Jonker SIMD processors are increasingly used in embedded systems for multi-media applications because of their advantages with regard to area- and energy-efficiency. Communication between the processing elements in an SIMD processor has remained a cause of inefficiency however, the SIMD concept prescribes that all processing elements communicate in the same clock cycle. Existing SIMD architectures solve this problem either by multi-hop communication (causing cycle overhead), or by a fully connected communication network (causing area overhead). In order to solve the communication bottleneck we have introduced a new SIMD architecture (RC-SIMD) with a set of delay-lines in the instruction bus, causing the accesses to the communication network to be distributed over time. We can (re-)configure the size and number of delay-lines, where a specific configuration represents a trade-off between number of clock cycles and size of a clock period. The reconfiguration process is simple, the reconfiguration time is typically around 30 clock cycles (which is far less than 1\\% of the typical execution time of algorithms), and the added configuration hardware is less than 2\\%. Furthermore, an experimental study shows that our reconfigurable architecture achieves (on average) more than 10\\% performance improvement over a non-reconfigurable architecture.||||||||||||./pdfs/030-RAW-paper-1.pdf",
    "Exploiting Processing Locality through Paging Configurations in Multitasked|Exploiting Processing Locality through Paging Configurations in Multitasked Reconfigurable Systems Mohamed Taher Tarek El-ghazawi FPGA chips in reconfigurable computer systems are used as malleable coprocessors where components of a hardware library of functions can be configured as needed. As the number of hardware functions to be configured typically exceeds the underlying chip area during the execution of an application, previous efforts have introduced configuration caching. Those efforts, however, can exploit either spatial or temporal processing locality. In this work, we propose a technique suitable for multitasking and for cases of single applications that can change the course of processing in a non-deterministic fashion based on data. In order to exploit both spatial and temporal processing locality, simultaneously, the proposed model groups hardware functions into hardware configuration blocks (pages) of fixed size, where multiple pages can be configured on a chip simultaneously. By grouping only related functions that are typically requested together, processing spatial locality can be exploited. Temporal locality is exploited through page replacement techniques. Data mining techniques were used to group related functions into pages. Standard, replacement algorithms as those found in caching were considered. Simulations, as well as emulation using the Cray XD1 reconfigurable high-performance computer were used in the experimental study. The results show a significant improvement in performance using the proposed paging technique.||||||||||||./pdfs/031-RAW-paper-1.pdf",
    "A Pattern Selection Algorithm for Multi-Pattern Scheduling |A Pattern Selection Algorithm for Multi-Pattern Scheduling Yuanqing Guo Cornelis Hoede Gerard J.m. Smit The multi-pattern scheduling algorithm is designed to schedule a graph onto a coarse-grained reconfigurable architecture, the result of which depends highly on the used patterns. This paper presents a method to select a near-optimal set of patterns. By using these patterns, the multi-pattern scheduling will result in a better schedule in the sense that the schedule will have fewer clock cycles.||||||||||||./pdfs/037-RAW-paper-1.pdf",
    "Rapid Development of High Performance Floating-Point Pipelines for Scientif|Rapid Development of High Performance Floating-Point Pipelines for Scientific Simulation Gerhard Lienhart Andreas Kugel Reinhard Maenner In the last years, FPGAs became capable of performing complex floating-point based calculations. For many applications, highly parallel calculation units can be implemented which deliver a better performance than general-purpose processors. This paper focuses on applications where the calculations can be done in a pipeline, as it is often the case for simulations. A framework for rapid design of such calculation pipelines is described. The central part is a Perl based code generator, which automatically assembles floating-point operators into synthesizable hardware description code where the generator is directed by a pipeline description file. The framework is supplemented by various floating-point operators and support modules, which allow generating ready-to-use pipelines. The code generator dramatically reduces development time and produces high-quality results. The performance of the framework is demonstrated by the implementation of pipelines for gravitational forces and hydrodynamics.||||||||||||./pdfs/039-RAW-paper-1.pdf",
    "An overview of the ECO Project |An overview of the ECO Project Jacqueline Chame Chun Chen Pedro Diniz Mary Hall Yoon-ju Lee Robert F. Lucas In this paper, we describe a compilation system that automates much of the process of performance tuning that is currently done manually by application programmers interested in high performance. Our approach combines compiler models and heuristics with guided empirical search to take advantage of their complementary strengths. The models and heuristics limit the search to a small number of candidate implementations, and the empirical results provide the most accurate information to the compiler to select among candidates and tune optimization parameter values. The overall approach can be employed to alleviate some of the performance problems that lead to inefficiencies in key applications today: register pressure, cache conflict misses, and the trade-off between synchronization, parallelism and locality in SMPs. The main focus of the paper is an algorithm for simultaneously optimizing across multiple levels of the memory hierarchy for dense-matrix computations. We have developed an initial compiler implementation, and present automatically-generated results on matrix multiply. Results on two architectures, SGI R10000 and Sun UltraSparc IIe, outperform the native compiler, and either outperform or achieve comparable performance as the ATLAS self-tuning library and the hand-tuned vendor BLAS library. This paper describes other components of the ECO system, including supporting tools and experiments with programmer-guided performance tuning. This approach has provided a foundation for a general framework for systematic optimization of domain-specific applications. Specifically, we are developing an optimization system for signal and image processing that exploits signal properities, and we are using machine learning and a knowledge-rich representation can be exploited to optimize molecular dynamics simulation.||||||||||||./pdfs/04-NSFNGS-paper-1.pdf",
    "Coupling of a Reconfigurable Architecture and a Multithreaded Processor Cor|Coupling of a Reconfigurable Architecture and a Multithreaded Processor Core with Integrated Real-Time Sascha Uhrig Stefan Maier Georgi Kuzmanov Theo Ungerer This paper defines a real-time capable interface between the simultaneous multithreaded CarCore processor and a MOLEN-based reconfigurable unit. CarCore is an IP core that enables simultaneous execution of one hard-real-time thread and further non-real-time threads. The coupling described in this paper extends CarCore by a reconfigurable hardware such that both can execute different threads simultaneously, while the real-time behavior of the hard-real-time thread is not harmed. The challenge is the design of a common memory interface for both, the CarCore and the reconfigurable hardware, such that memory operations fulfil hard-real-time constraints. Experimental results with an MJPEG benchmark show an overall application speedup of 2.75 which approaches the theoretically attainable maximum speedup of 2.78.||||||||||||./pdfs/040-RAW-paper-1.pdf",
    "Towards a Universal Client for Grid Monitoring Systems: Design and Implemen|Towards a Universal Client for Grid Monitoring Systems: Design and Implementation of the Ovid Browser Marios D. Dikaiakos Artemakis Artemiou George Tsouloupas In this paper, we present the design and implementation of Ovid, a browser for Grid-related information. The key goal of Ovid is to support the seamless navigation of users in the Grid information space. Key aspects of Ovid are: (i) A set of navigational primitives, which are designed to cope with problems such as network disorientation and information overloading; (ii) A small set of Ovid views, which present the enduser with high-level, visual abstractions of Grid information; these abstractions correspond to simple models that capture essential aspects of a Grid infrastructure. (iii) Support for embedding and implementing hyperlinks that connect related entities represented within different information views; (iv) A plug-in mechanism, which enables the seamless integration with Ovid of third-party software that retrieves and displays data from various Grid information sources, and (v) a modular software design, which allows the easy integration of different visualization algorithms that support the graphical representation of large amounts of Gridrelated information in the context of Ovid?s views.||||||||||||./pdfs/043-HIPS-paper-1.pdf",
    "Dedicated Module Access in Dynamically Reconfigurable Systems |Dedicated Module Access in Dynamically Reconfigurable Systems J. Hagemeyer B. Kettelhoit M. Porrmann Modern FPGAs, such as the Xilinx Virtex-II Series, offer the feature of partial and dynamic reconfiguration, allowing to load various hardware configurations (i.e., HW modules) during run-time. To enable communication with these modules and for controlling purposes, dedicated access to each module as well as dedicated signals to control the global communication are required. This paper discusses several ways of implementing dedicated signals and addresses the impact on dynamically reconfigurable systems. Two new approaches are introduced, which allow a permanent access to the modules and to the communication infrastructure even during reconfiguration.||||||||||||./pdfs/044-RAW-paper-1.pdf",
    "Dynamic Program Phase Detection in Distributed Shared-Memory Multiprocessor|Dynamic Program Phase Detection in Distributed Shared-Memory Multiprocessors Engin Ipek Jos&eacute; F. Mart&iacute;nez Bronis R. De Supinski Sally A. Mckee Martin Schulz We present a novel hardware mechanism for dynamic program phase detection in distributed shared memory (DSM) multiprocessors. We show that successful hardware mechanisms for phase detection in uniprocessors do not necessarily work well in DSM systems, since they lack the ability to incorporate the parallel application?s global execution information and memory access behavior based on data distribution. We then propose a hardware extension to a well-known uniprocessor mechanism that significantly improves phase detection in the context of DSM multiprocessors. The resulting mechanism is modest in size and complexity, and is transparent to the parallel application.||||||||||||./pdfs/05-NSFNGS-paper-1.pdf",
    "Phylogenetic Models of Rate Heterogeneity: A High Performance Computing Per|Phylogenetic Models of Rate Heterogeneity: A High Performance Computing Perspective Alexandros Stamatakis Inference of phylogenetic trees using the maximum likelihood (ML) method is NP-hard. Furthermore, the computation of the likelihood function for huge trees of more than 1,000 organisms is computationally intensive due to a large amount of floating point operations and high memory consumption. Within this context, the present paper compares two competing mathematical models that account for evolutionary rate heterogeneity: the $\\Gamma$ and CAT models. The intention of this paper is to show that---from a purely empirical point of view---CAT can be used instead of $\\Gamma$. The main advantage of CAT over $\\Gamma$ consists in significantly lower memory consumption and faster inference times. An experimental study using RAxML has been performed on 19 real-world datasets comprising 73 up to 1,663 DNA sequences. Results show that CAT is on average 5.5 times faster than $\\Gamma$ and---surprisingly enough---also yields trees with slightly superior \\bf \\boldmath$ \\Gamma$ likelihood values . The usage of the CAT model decreases the amount of average L2 and L3 cache misses by factor 8.55.||||||||||||./pdfs/052-HiCOMB-paper-1.pdf",
    "Implementation of a Programmable Array Processor Architecture for Approxima|Implementation of a Programmable Array Processor Architecture for Approximate String Matching Algorithms on FPGAs Panagiotis D. Michailidis Konstantinos G. Margaritis Approximate string matching problem is a common and often repeated task in information retrieval and bioinformatics. This paper proposes a generic design of a programmable array processor architecture for a wide variety of approximate string matching algorithms to gain high performance at low cost. Further, we describe the architecture of the array and the architecture of the cell in detail in order to efficiently implement for both the preprocessing and searching phases of most string matching algorithms. Further, the architecture performs approximate string matching for complex patterns that contain don't care, complement and classes symbols. We also implement and evaluate the proposed architecture on a field programmable gate array (FPGA) device using the JHDL tool for synthesis and the Xilinx Foundation tools for mapping, placement, and routing. Finally, our programmable implementation achieves about 9-340 times faster than a desktop computer with a Pentium 4 3.5 GHz for all algorithms when the length of the pattern is 1024.||||||||||||./pdfs/053-RAW-paper-1.pdf",
    "An Experimental Study of Optimizing Bioinformatics Applications |An Experimental Study of Optimizing Bioinformatics Applications Guangming Tan Lin Xu Shengzhong Feng Ninghui Sun As bioinformatics is an emerging application of high performance computing, this paper first evaluates the memory performance of several representative bioinformatics applications so that some appropriate optimization methods can be applied. Based on the computational behavior of these bioinformatics applications, we propose two optimized algorithms on high performance computer architectures. 1) For the data(I/O) intensive program, MegaBlast, we overlap computation with I/O to produce an improved high-throughput algorithm with reduced time and memory requirements. 2) For a CPU-intensive RNA secondary structure prediction algorithm, we propose a fine-grain parallel $O(N^3)$ algorithm based on reconfigurable arrays (FPGAs). In order to optimize the FPGA architecture, we evaluate the performance in different architectures using cycle-by-cycle simulator.||||||||||||./pdfs/054-HiCOMB-paper-1.pdf",
    "Application Re-Structuring and Data Management on a GRID Environment: a Cas|Application Re-Structuring and Data Management on a GRID Environment: a Case Study for Bioinformatics Giovanni Ciriello Matteo Comin Concettina Guerra This paper describes a distributed implementation of PROuST, a method for protein structure comparison, that involves a major restructuring of the application for an efficient grid immersion. PROuST consists of several components that perform different tasks at different stages. Given a target protein, an index-based search retrieves from a database a list of proteins that are good candidates for similarity, then a dynamic programming algorithm aligns the target protein with each candidate protein. The same geometric properties of secondary structure elements of proteins are used by different components of PROuST. Thus, an important issue of the distributed implementation is data transfer vs. data recomputation tradeoffs. Our implementation avoids recomputation by re-using the hash table data as much as possible, once they are accessed. The algorithmic changes to the application allow to reduce the number of data accesses to storage elements and consequently the execution time. In addition this paper discusses data replication strategies on a grid environment to optimize the data transfer time.||||||||||||./pdfs/055-HiCOMB-paper-1.pdf",
    "Hierarchically Tiled Arrays for Parallelism and Locality |Hierarchically Tiled Arrays for Parallelism and Locality Jia Guo Ganesh Bikshandi Daniel Hoeflinger Gheorghe Almasi Basilio Fraguela Mar&yacute;a Jes&uacute;s Garzar&aacute;n David Padua Christoph Von Praun Parallel programming is facilitated by constructs which, unlike the widely used SPMD paradigm, provide programmers with a global view of the code and data structures. These constructs could be compiler directives containing information about data and task distribution, language extensions specifically designed for parallel computation, or classes that encapsulate parallelism. In this paper, we describe a class developed at Illinois and its MATLAB implementation. This class can be used to conveniently express both parallelism and locality. A C++ implementation is now underway. Its characteristics will be reported in a future paper. We have implemented most of the NAS benchmarks using our HTA MATLAB extensions and found during that HTAs enable the fast prototyping of parallel algorithms and produce programs that are easy to understand and maintain.||||||||||||./pdfs/06-NSFNGS-paper-1.pdf",
    "Some Initial Results on Hardware BLAST Acceleration with a Reconfigurable A|Some Initial Results on Hardware BLAST Acceleration with a Reconfigurable Architecture Euripides Sotiriades Christos Kozanitis Apostolos Dollas The BLAST algorithm is the prevalent tool that is used by molecular biologists for DNA Sequence Matching and Database Search. In this work we demonstrate that with an appropriate reconfigurable architecture, BLAST performance can be improved with a single-chip solution 5 times over a specialized and optimized computer cluster, or 37 times over a single computer. These initial results account for I/O and are very encouraging for the development of a large scale,reconfigurable BLAST engine.||||||||||||./pdfs/061-HiCOMB-paper-1.pdf",
    "MT-ClustalW: Multithreading Multiple Sequence Alignment |MT-ClustalW: Multithreading Multiple Sequence Alignment Kridsadakorn Chaichoompu Surin Kittitornkun Sissades Tongsima ClustalW is the most widely used tool for aligning multiple protein or nucleotide sequences. The alignment is achieved via three stages: pairwise alignment, guide tree generation and progressive alignment. This paper analyzes and enhances a multithreaded implementation of ClustalW called ClustalW-SMP for higher throughput. Our goal is to multithread ClustalW maximize the degree of parallelism on multithreading ClustalW called MultiThreading-ClustalW (MT-ClustalW). As a result, bioinformatics laboratories are able to use this MT-ClustalW with much less energy consumption on multicore and SMP (Symmetric MultiProcessor) machines than that of PC clusters. The experiment results show that the MT-ClustalW framework can achieve a considerable speedup over the sequential ClustalW and original multithreaded ClustalW-SMP implementations.||||||||||||./pdfs/062-HiCOMB-paper-1.pdf",
    "FPGA based Architecture for DNA Sequence Comparison and Database Search|FPGA based Architecture for DNA Sequence Comparison and Database Search Euripides Sotiriades Christos Kozanitis Apostolos Dollas .||||||||||||./pdfs/062-RAW-paper-1.pdf",
    "Parallel Multiple Sequence Alignment with Local Phylogeny Search by Simulat|Parallel Multiple Sequence Alignment with Local Phylogeny Search by Simulated Annealing Jaroslaw Zola Denis Trystram Andrei Tchernykh Carlos Brizuela The problem of multiple sequence alignment is one of the most important problems in computational biology. In this paper we present a new method that simultaneously performs multiple sequence alignment and phylogenetic tree inference for large input data sets. We describe a parallel implementation of our method that utilises simulated annealing metaheuristic to find locally optimal phylogenetic trees in reasonable time. To validate the method, we perform a set of experiments with synthetic as well as real-life data.||||||||||||./pdfs/063-HiCOMB-paper-1.pdf",
    "An Automated Development Framework for a RISC Processor with Reconfigurable|An Automated Development Framework for a RISC Processor with Reconfigurable Instruction Set Extensions Nikolaos Vassiliadis George Theodoridis Spiridon Nikolaidis By coupling a reconfigurable hardware to a standard processor, high levels of flexibility and adaptability are achieved. However, this approach requires modifications to the compiler of the processor to take into account reconfigurable aspects. In this paper, a development framework for a RISC processor with reconfigurable instruction set extensions is presented. The framework is fully automated, hiding all reconfigurable related issues from the user and can be used for both program and fine-tune the architecture at design time. We demonstrate the above issues using a set of benchmarks. Experimental results show an x2.9 average speedup in addition to potential energy reduction.||||||||||||./pdfs/063-RAW-paper-1.pdf",
    "A Distributed Object System Approach for Dynamic Reconfiguration |A Distributed Object System Approach for Dynamic Reconfiguration Ronald Hecht Stephan Kubisch Harald Michelsen Elmar Zeeb Dirk Timmermann Managing reconfigurable hardware resources at runtime is expected to be a new task for future operating systems. But due to the mixture of parallel and sequential parts of dynamically reconfigurable applications, it is not entirely clear so far, how to use and to program such systems. A new interpretation of dynamically reconfigurable applications is presented. It will be shown, that the parallel computing concept of distributed object systems may be adapted for dynamically reconfigurable architectures. This approach answers many open questions concerning communication, interruption, and relocation of reconfigurable modules. It is explored by means of an extended Linux operating system in conjunction with a SystemC model of a dynamically reconfigurable FPGA.||||||||||||./pdfs/064-RAW-paper-1.pdf",
    "Phylospaces: Reconstructing Evolutionary Trees in Tuple Space |Phylospaces: Reconstructing Evolutionary Trees in Tuple Space Marc L. Smith Tiffani L. Williams Phylospaces is a novel framework for reconstructing evolutionary trees in tuple space, a distributed shared memory that permits processes to communicate and coordinate with each other. Our choice of tuple space as a concurrency model is somewhat unusual, given the prominence and success of pure message passing models, such as MPI. We use Phylospaces to devise Cooperative Rec-I-DCM3, a population-based strategy for navigating tree space. Cooperative Rec-I-DCM3 is based on Rec-I-DCM3, the fastest sequential algorithm under maximum parsimony. We compare the performance of the algorithms on two datasets consisting of 2,000 and 7,769 taxa, respectively. Our results demonstrate that Cooperative Rec-I-DCM3 outperforms its sequential counterpart by at least an order of magnitude.||||||||||||./pdfs/065-HiCOMB-paper-1.pdf",
    "An Optimal Architecture for a DDC |An Optimal Architecture for a DDC Tjerk Bijlsma Pascal T. Wolkotte Gerard J. M. Smit Digital Down Conversion (DDC) is an algorithm, used to lower the amount of samples per second by selecting a limited frequency band out of a stream of samples. A possible DDC algorithm consists of two simple Cascading Integrating Comb (CIC) filters and a Finite Input Response (FIR) filter preceded by a modulator that is controlled with a Numeric Controlled Oscillator (NCO). Implementations of the algorithm have been made for five architectures, two Application Specific Integrated Circuits (ASIC), a General Purpose Processor (GPP), a Field Programmable Gate Array (FPGA), and the Montium Tile Processor (TP). All architectures are functionally capable of performing the algorithm. The differences between the architectures are their performance, flexibility and energy consumption. In this paper we compared the energy consumption of the architectures when performing the DDC algorithm. The ASIC is the best solution if digital down conversion is constantly required. When digital down conversion is needed only parts of the time, the Altera Cyclone II is the best solution due to its smaller technology size. In the spare time the reconfigurable architectures can be reconfigured for other tasks of today's multimedia devices.||||||||||||./pdfs/065-RAW-paper-1.pdf",
    "High-Level Synthesis with Reconfigurable Datapath Components |High-Level Synthesis with Reconfigurable Datapath Components George Economakos High-level synthesis is becoming more popular as design densities keep increasing, especially in the ASIC design world. Although FPGA design follows ASIC design methodologies and FPGA densities are increasing too, programmable devices also offer the advantage of partial reconfiguration, which allows an algorithm to be partially mapped into a small and fixed FPGA device that can be reconfigured at run time, as the mapped application changes its requirements. This paper presents a novel resource constrained high-level synthesis scheduling heuristic, which utilizes reconfigurable datapath components. The resulting schedule can be shortened so as the gain in clock cycles can overcome the timing overhead of reconfiguration. The main advantage of the proposed methodology is that through run time reconfiguration, more complicated algorithms can be mapped into smaller devices without speed degradation.||||||||||||./pdfs/066-RAW-paper-1.pdf",
    "A Stochastic Multi-Objective Algorithm for the Design of High Performance R|A Stochastic Multi-Objective Algorithm for the Design of High Performance Reconfigurable Architectures Wing On Fung Tughrul Arslan The increasing demand for FPGAs and reconfigurable hardware targeting high performance low power applications has lead to an increasing requirement for new high performance reconfigurable embedded FPGA cores. This paper presents a multi-objective population based algorithm which given a library of basic blocks and a list of constraints, identifies an optimum reconfigurable embedded reconfigurable core suitable for the target application.||||||||||||./pdfs/068-RAW-paper-1.pdf",
    "Hierarchical Multithreading: Programming Model and System Software |Hierarchical Multithreading: Programming Model and System Software Guang R. Gao Thomas Sterling Rick Stevens Mark Hereld Weirong Zhu This paper addresses the underlying sources of performance degradation (e.g. latency, overhead, and starvation) and the difficulties of programmer productivity (e.g. explicit locality management and scheduling, performance tuning, fragmented memory, and synchronous global barriers) to dramatically enhance the broad effectiveness of parallel processing for high end computing. We are developing a hierarchical threaded virtual machine (HTVM) that defines a dynamic, multithreaded execution model and programming model, providing an architecture abstraction for HEC system software and tools development. We are working on a prototype language, LITL-X (pronounced ?little-X?) for Latency Intrinsic-Tolerant Language, which provides the application programmers with a powerful set of semantic constructs to organize parallel computations in a way that hides/manages latency and limits the effects of overhead. This is quite different from locality management, although the intent of both strategies is to minimize the effect of latency on the efficiency of computation. We will work on a dynamic compilation and runtime model to achieve efficient LITL-X program execution. Several adaptive optimizations will be studied. A methodology of incorporating domain-specific knowledge in program optimization will be studied. Finally, we plan to implement our method in an experimental testbed for a HEC architecture and perform a qualitative and quantitative evaluation on selected applications.||||||||||||./pdfs/07-NSFNGS-paper-1.pdf",
    "Reconfigurable Communications for Image Processing Applications |Reconfigurable Communications for Image Processing Applications Andr&eacute; Borin Soares Luigi Carro Altamiro Amadeu Susin This work tries to reuse programmable communication resources like a Network-on-Chip (NoC) in the acceleration of image applications. We show a mathematical model for the computation and communication pattern of two distributed motion estimation algorithms, Full Search Block Matching Algorithm and Multi-Resolution Block Matching Algorithm. Experimental results show that the use of the Multi-Resolution method reduces not only the computation time but also the traffic of messages on the NoC. This leads to a lower power consumption in the NoC during the processing time of each image. The studied examples show the importance of the link between algorithms and their mapping onto a programmable fabric, not only regarding computation, but facing communication as well.||||||||||||./pdfs/070-RAW-paper-1.pdf",
    "Reconfigurable Memory Based AES Co-Processor |Reconfigurable Memory Based AES Co-Processor Ricardo Chaves Georgi Kuzmanov Stamatis Vassiliadis Leonel Sousa We consider the AES encryption/decryption algorithm and propose a memory based hardware design to support it. The proposed implementation is mapped on the Xilinx Virtex II Pro technology. Both the byte substitution and the polynomial multiplication of the AES algorithm are implemented in a single dual port on-chip memory block (BRAM). Two AES encryption/decryption cores have been designed and implemented on a prototyping XC2VP20-7 FPGA: a completely unrolled loop structure capable of achieving a throughput above 34 Gbits/s, with an implementation cost of 3513 slices and 80 BRAMs; and a fully folded structure, requiring only 515 slices and 12 BRAMs, capable of a throughput above 2 Gbits/s. To evaluate the proposed AES design, its has been embedded in a polymorphic processor organization, as a reconfigurable co-processor. Comparisons to state-of-the-art AES cores indicate that the proposed unfolded core outperforms the most recent works by 34\\% in throughput and requires 68\\% less reconfigurable area. Experimental results of both folded and unfolded AES cores suggest over 560\\% improvement in the throughput/slice metric when compared to the recent AES related art.||||||||||||./pdfs/073-RAW-paper-1.pdf",
    "Recent Advances in Checkpoint/Recovery Systems |Recent Advances in Checkpoint/Recovery Systems Greg Bronevetsky Rohit Fernandes Daniel Marques Keshav Pingali Paul Stodghill Checkpoint and Recovery (CPR) systems have many uses in high-performance computing. Because of this, many developers have implemented it, by hand, into their applications. One of the uses of checkpointing is to help mitigate the effects of interruptions in computational service (both planned and unplanned) In fact, some supercomputing centers expect their users to use checkpointing as a matter of policy. And yet, few centers provide fully automatic checkpointing systems for their high-end production machines. The paper is a status report on our work on the family of $C^3$ systems for (almost) fully automatic checkpointing for scientific applications. To date, we have shown that our techniques can be used for checkpointing sequential, MPI and OpenMP applications written in C, Fortran, and several other languages. A novel aspect of our work is that we have not built a single checkpointing system, rather, we have developed a methodology and a set of techniques that have enabled us to develop a number of systems, each meeting different design goals and efficiency requirements.||||||||||||./pdfs/08-NSFNGS-paper-1.pdf",
    "Dynamic Aspects for Runtime Fault Determination and Recovery |Dynamic Aspects for Runtime Fault Determination and Recovery Jeremy Manson Jan Vitek Suresh Jagannathan One of the most promising applications of Aspect Oriented Programming (AOP) is the area of fault tolerance and recovery. In traditional programming languages, error handling code must be closely interwoven with program logic. AOP allows the programmer to take a more modular approach - error handling code can be woven into the code by expressing it as an aspect. One major impediment to handling error code in this way is that while errors are a dynamic, runtime property, most research on AOP has focused on static properties. In this paper, we propose a method for handling a variety of run-time faults as dynamic aspects. First, we separate fault handling into two different notions: fault determination, or the discovery of faults within a program, and and fault recovery, or the logic used to recover from a fault. Our position is that fault determination should be expressed as dynamic aspects. We propose a system, called Rescue, that exposes underlying features of the virtual machine in order to express faults as variety of run-time constraints. We show how our methodology can be used to address several of the flaws in state of the art fault fault handling techniques. This includes their limitations in handling parallel and distributed faults, their obfuscated nature and their overly simplistic notion of what a ?fault? actually may comprise.||||||||||||./pdfs/09-NSFNGS-paper-1.pdf",
    "Physically-aware Exploitation of Component Reuse in a Partially Reconfigura|Physically-aware Exploitation of Component Reuse in a Partially Reconfigurable Architecture Love Singhal Elaheh Bozorgzadeh The major drawback of partial dynamic reconfiguration is the reconfiguration delay overhead. To reduce the reconfiguration bits between two consecutive implementations, design components are reused. In this paper, we propose a floorplanner to support two-dimensional partial reconfiguration. Our floorplanner handles many features like mapping, selection and placement of the fixed components, and interconnect planning between the fixed and reconfigurable components. We implemented a sequence of dataflow graphs on Xilinx Virtex-4 devices. The component reuse results in more than 50\\% savings in reconfiguration bits. The results show a need to tune the physical design tools for minimizing runtime reconfiguration delay overhead.||||||||||||./pdfs/092-RAW-paper-1.pdf",
    "Practical Design of a Computation and Energy Efficient Hardware Task Schedu|Practical Design of a Computation and Energy Efficient Hardware Task Scheduler in Embedded Reconfigurable Computing Systems Tyrone Tai-on Kwok Yu-kwong Kwok By utilizing massively parallel circuit design in FPGAs, the overall system efficiency, in terms of computation efficiency and energy efficiency, can be greatly enhanced by offloading some computation-intensive tasks which are originally executed in the instruction set processor to the FPGA fabric. In essence, a hardware task scheduler is needed. However, most of the work in the literature considers scheduling algorithms which are unable or difficult to be implemented using the design flows in current development platform. Moreover, little of the work takes energy consumption into consideration. In this paper, we present the design of a hardware task scheduler which takes energy consumption into consideration, and can be readily implemented using current design flows.||||||||||||./pdfs/096-RAW-paper-1.pdf",
    "Reconfigurable Context-Free Grammar Based Data Processing Hardware with Err|Reconfigurable Context-Free Grammar Based Data Processing Hardware with Error Recovery James Moscola Young H. Cho John W. Lockwood This paper presents an architecture for context-free grammar (CFG) based data processing hardware for reconfigurable devices. Our system leverages on CFGs to tokenize and parse data streams into a sequence of words with corresponding semantics. Such a tokenizing and parsing engine is sufficient for processing grammatically correct input data. However, most pattern recognition applications must consider data sets that do not always conform to the predefined grammar. Therefore, we augment our system to detect and recover from grammatical errors while extracting useful information. Unlike the table look up method used in traditional CFG parsers, we map the structure of the grammar rules directly onto the Field Programmable Gate Array (FPGA). Since every part of the grammar is mapped onto independent logic, the resulting design is an efficient parallel data processing engine. To evaluate our design, we implement several XML parsers in an FPGA. Our XML parsers are able to process the full content of the packets up to 3.59 Gbps on Xilinx Virtex 4 devices.||||||||||||./pdfs/097-RAW-paper-1.pdf",
    "Scalable Resilience - The ReSIST Network of Excellence |Scalable Resilience - The ReSIST Network of Excellence Jean-claude Laprie ReSIST is a Network of Excellence that integrates leading researchers active in the multidisciplinary domains of Dependability, Security, and Human Factors, in order that Europe will have a well-focused coherent set of research activities aimed at ensuring that future ?ubiquitous computing systems? (the immense systems of ever-evolving networks of computers and mobile devices which are needed to support and provide Ambient Intelligence), have the necessary resilience and survivability, despite any residual development and physical faults, interaction mistakes, or malicious attacks and disruptions. At the heart of ReSIST is the Joint Programme of Research (JPR). Two main steps will take place, according to the structuring of the research activities:\\\\ 1) first according to the basic resilience building technologies for the survivability of information infrastructures, i.e., resilience design, resilience verification and resilience evaluation \\\\2) then according to the resilience scaling technologies: evolvability, assessability, usability and diversity. This move from resilience building technologies towards resilience scaling technologies will be accompanied and facilitated by the resilience integration technologies: a resilience knowledge base, and the development of a resilience-explicit computing approach. The Joint Programme of Excellence Spreading (JPES) contributes to integration via the production of documents incorporating results from the JPR, e.g., a) common courseware for training activities, and b) best practices for dissemination activities. The Joint Steering Programme \\\\a) guides integration in assigning and updating the activities of the JPR and the JPES, \\\\b) favours integration via the allocation of the resources of ReSIST, and \\\\c) assesses integration.||||||||||||./pdfs/1-DPDNS-paper-1.pdf",
    "An Extensible Global Address Space Framework with Decoupled Task and Data A|An Extensible Global Address Space Framework with Decoupled Task and Data Abstractions Sriram Krishnamoorthy Umit Catalyurek Jarek Nieplocha Atanas Rountev P. Sadayappan Although message passing using MPI is the dominant model for parallel programming today, the significant effort required to develop high-performance MPI applications has prompted the development of several parallel programming models that are more convenient. Programming models such as Co-Array Fortran, Global Arrays, Titanium, and UPC provide a more convenient global view of the data, but face significant challenges in delivering high performance over a range of applications. It is particularly challenging to achieve high performance using global-address-space languages for unstructured applications with irregular data structures. In this paper, we describe a global-address-space parallel programming framework with decoupled task and data abstractions. The framework centers around the use of task pools, where tasks specify operands in a distributed, globally addressable pool of data chunks. The data chunks can be addressed in a logical multidimensional ?tuple? space, and are distributed among the nodes of the system. Locality-aware load balancing of tasks in the task pool is achieved through judicious mapping via hyper-graph partitioning, as well as dynamic task/data migration. The framework implements a transparent interface for out-of-core data, so that explicit orchestration of movement of data between disks and memory is not required of the programmer. The use of the framework for implementation of parallel block-sparse tensor computations in the context of a quantum chemistry application is illustrated.||||||||||||./pdfs/10-NSFNGS-paper-1.pdf",
    "POHLL Keynote: New Parallel Programming Abstractions and the Role of Compil|POHLL Keynote: New Parallel Programming Abstractions and the Role of Compilers Laxmikant V. Kale Most of the parallel programming, especially in applications in Computational Science and Engineering (CSE), is done using MPI. OpenMP is used on some shared memory platforms. However, it is becoming increasingly evident that new higher level parallel programming abstractions are needed if we have to increase programming productivity further. Here, I present my views on what kinds of high level languages and abstractions one should look for, what research is needed to develop them, what obstacles I see in their development and adoption, and what role compilers can and should play in their development. In particular, I argue that adaptive run-time systems to separate the issues of resource management and abstractions for supporting global (but disciplined) view of data and global view of control are needed. Further, the role of compiler research needs to be directed to supporting such models, even though that requires a paradigm shift (toward simpler problems!) for the compiler research community.||||||||||||./pdfs/100-POHLL-paper-1.pdf",
    "Accelerating DTI Tractography using FPGAs |Accelerating DTI Tractography using FPGAs Aditya Kwatra Viktor Prasanna Manbir Simgh Diffusion Tensor Imaging (DTI) tractography in Magnetic Resonance Imaging (MRI) is a computationally intensive procedure, requiring on the order of tens of minutes to complete tractography of the entire brain. Tractography computations can be accelerated significantly by use of reconfigurable hardware, such as Field Programmable Gate Arrays (FPGAs). Such acceleration has the potential to lead to real-time tractography, which would greatly facilitate on-site diagnosis and acquisition of additional scans while the patient is still inside the scanner. In this paper we report the development of an FPGA based architecture to accelerate DTI tractography. We identify computationally intensive kernels and design pipelined implementations. Our performance analysis based on the developed architecture gives on the order of 100x speed-up over an optimized C-code based implementation of tractography on a state-of-the-art processor.||||||||||||./pdfs/100-RAW-paper-1.pdf",
    "Simulating a PR-Mesh on an LARPBS |Simulating a PR-Mesh on an LARPBS Mathura Gopalan Anu Goel Bourgeois Jos&aacute; Alberto Fern&aacute;ndez Zepeda The unidirectional nature of propagation and predictable delays are two characteristics of optically pipelined buses that have made them popular in recent years. Many models have been proposed that use reconfigurable optically pipelined buses. In this paper we establish a relationship between a one dimensional and a two dimensional model of this type. This simulation shows that the challenge is to map the processors so that those belonging to a two-dimensional bus segment are contiguous and in the same order on the simulating one-dimensional model. We focus on the Linear Array with a Reconfigurable Pipelined Bus System (LARPBS) and its two dimensional counterpart the Pipelined Reconfigurable Mesh (PR-Mesh).||||||||||||./pdfs/101-APDCM-paper-1.pdf",
    "Multisite Co-allocation Algorithms for Computational Grid |Multisite Co-allocation Algorithms for Computational Grid Weizhe Zhang Albert M.k.cheng Mingzeng Hu Efficient multisite job scheduling facilitates the cooperation of multi-domain massively parallel processor systems in a computing grid environment. However, co-allocation, heterogeneity, adaptability, and scalability emerge as tough challenges for the design of multisite job scheduling models and algorithms. This paper presents a new multisite job scheduling schema based on the multisite job scheduling model and the performance model for a heterogeneous grid environment. There are three key components: resource selection, reservation, and backfilling. The optimal and greedy-heuristic adaptive resource selection strategies are introduced. The conservative and easy backfilling are incorporated into the backfilling procedure. Experiments indicate that the scheduler and the algorithm are effective and perform better than a non-adaptive algorithm.||||||||||||./pdfs/101-HPGC-paper-1.pdf",
    "PDSEC Keynote: Facing the Challenges of Multicore Processor Technologies us|PDSEC Keynote: Facing the Challenges of Multicore Processor Technologies using Autonomic System Software Dimitris Nikolopoulos Multicore processor technologies, which appear to dominate the processor design landscape, require a shift of paradigm in the development of programming models and supporting environments for scientific and engineering applications. System software for multicore processors needs to exploit fine-grain concurrent execution capabilities and cope with deep, non- uniform memory hierarchies. Software adaptation to multicore technologies needs to happen even as hardware platforms change underneath the software. Last but not least, due to the extremely high compute density of chip multiprocessing components, system software needs to increase its energy-awareness and treat energy and temperature distribution as first-class optimization targets. Unfortunately, energy awareness is most often at odds with high performance. In the first part of this talk I will discuss some of the major challenges of software adaptation to multicore technologies and motivate the use of autonomic, self- optimizing system software, as a vehicle for both high performance portability and energy-efficient program execution. In the second part of the talk I will present ongoing research in runtime environments for dense parallel systems built from multicore and SMT components, and focus on two topics, polymorphic multithreading, and power-aware concurrency control with quality-of-service guarantees. In the same context, I will discuss enabling technologies for improved software autonomy via dynamic runtime optimization, including continuous hardware profilers, and online power-efficiency predictors.||||||||||||./pdfs/101-PDSEC-paper-1.pdf",
    "Performance Evaluation of Wormhole Routed Network Processor-Memory Intercon|Performance Evaluation of Wormhole Routed Network Processor-Memory Interconnects Taskin Kocak Jacob Engel Network line cards are experiencing ever increasing line rates, random data bursts, and limited space. Hence, they are more vulnerable than other processor-memory environments, to create data transfer bottlenecks and hot-spots. Solutions to the memory bandwidth bottleneck are limited by the area available on the line card and network processor I/O pins. As a result, we propose to explore more suitable off-chip interconnect and communication mechanisms that will replace the existing systems and that will provide extraordinary high throughput. We utilize our custom-designed, event-driven, interconnect simulator to evaluate the performance of wormhole routed packet-based off-chip \\it k -ary \\it n -cube interconnect architectures for line cards. Our performance results show that wormhole routed \\it k -ary \\it n -cube based interconnect topologies significantly outperform the existing line card interconnects and they are able to sustain higher traffic loads.||||||||||||./pdfs/101-PMEO-paper-1.pdf",
    "Automatically Translating a General Purpose C++ Image Processing Library fo|Automatically Translating a General Purpose C++ Image Processing Library for GPUs Jay L. T. Cornwall Olav Beckmann Paul H. J. Kelly This paper presents work-in-progress towards a C++ source-to-source translator that automatically seeks parallelisable code fragments and replaces them with code for a graphics co-processor. We report on our experience with accelerating an industrial image processing library. To increase the effectiveness of our approach, we exploit some domain-specific knowledge of the library's semantics. We outline the architecture of our translator and how it uses the ROSE source-to-source transformation library to overcome complexities in the C++ language. Techniques for parallel analysis and source transformation are presented in light of their uses in GPU code generation. We conclude with results from a performance evaluation of two examples, image blending and an erosion filter, hand-translated with our parallelisation techniques. We show that our approach has potential and explain some of the remaining challenges in building an effective tool.||||||||||||./pdfs/101-POHLL-paper-1.pdf",
    "Partitioned Scheduling of Periodic Real-Time Tasks onto Reconfigurable Hard|Partitioned Scheduling of Periodic Real-Time Tasks onto Reconfigurable Hardware Klaus Danne Marco Platzner Reconfigurable hardware devices, such as FPGAs, are increasingly used in embedded systems. To utilize these devices for real-time work loads, scheduling techniques are required that generate predictable task timings. In this paper, we present a partitioning-EDF (earliest deadline first) approach to find such schedules. The FPGA area is partitioned along one dimension into slots. The tasks are partitioned into groups. Then, each group is scheduled to exactly one slot using the EDF rule. We show that the problem of finding an optimal partitioning is related to the well-known 2-dimensional level bin-packing problem. We extend a previously reported ILP model to solve our partitioning problem to optimality. By a simulation study we demonstrate that the partitioning-EDF approach is able to find feasible schedules for most task sets with a system utilization of up to 70\\%. Additionally, we allow a task to be realized in alternative implementations. A simulation study reveals that the scheduling performance increases considerably if three instead of one task variants are considered. Finally, we model and study the impact of the device reconfiguration time on the scheduling performance.||||||||||||./pdfs/101-RAW-paper-1.pdf",
    "A Multiprocessor Architecture for the Massively Parallel Model GCA |A Multiprocessor Architecture for the Massively Parallel Model GCA Wolfgang Heenes Rolf Hoffmann Johannes Jendrsczok The GCA (Global Cellular Automata) model consists of a collection of cells which change their states synchronously depending on the states of their neighbors like in the classical CA model. In differentiation to the CA model the neighbors are not fixed and local, they are variable and global. The GCA model is applicable to a wide range of parallel algorithms. In this paper a multiprocessor architecture for the massively parallel GCA model is presented. In contrast to a special purpose implementation of a GCA algorithm the multiprocessor system allows the implementation in a flexible way through programming. The architecture mainly consists of a number of cell processors and a network. The cell processors are dedicated RISC processors, the network is a crossbar implemented with multiplexers. Only read-accesses through the network are necessary in the GCA model leading to a simplified structure. A system with 32 processors was implemented as a prototype on a FPGA. The analysis and implementation results have shown that the performance of the system scales very well with the number of processors.||||||||||||./pdfs/101-SMTPS-paper-1.pdf",
    "Memory Minimization for Tensor Contractions using Integer Linear Programmin|Memory Minimization for Tensor Contractions using Integer Linear Programming A. Allam J. Ramanujam G. Baumgartner P. Sadayappan This paper presents a technique for memory optimization for a class of computations that arises in the field of correlated electronic structure methods such as coupled cluster and configuration interaction methods in quantum chemistry. In this class of computations, loop computations perform a multi-dimensional sum of product of input arrays. There are many different ways to get the same final results that differ in the required number of arithmetic operations required. In addition, for a given number of arithmetic operations, different expressions of the loop have different memory requirements. Loop fusion is a plausible solution for reducing memory usage. By fusing loops between the producer and consumer loop nests, the required storage of intermediate array is reduced by the range of the fused loop. Because resultant loops have to be legal after fusion, some loops can not be fused at the same time. This paper develops a novel integer linear programming (ILP) formulation that is shown to be highly effective on a number of test cases producing the optimal solutions using very small execution times. The main idea in the ILP formulation is the encoding of legality rules for loop fusion of a special class of loops using logical constraints over binary decision variables and a highly effective approximation of memory usage.||||||||||||./pdfs/102-POHLL-paper-1.pdf",
    "Power Consumption Advantage of a Dynamic Optically Reconfigurable Gate Arra|Power Consumption Advantage of a Dynamic Optically Reconfigurable Gate Array Minoru Watanabe Fuminori Kobayashi Recently, various types of ORGAs have been developed. However, their gate counts were not satisfactory compared with those of FPGAs. Therefore, to improve the gate density of conventional ORGAs, a dynamic ORGA (DORGA) architecture that can remove static memory functions to store a configuration context has been proposed. However, the DORGA architecture offers not only the advantages of a high gate count, but also the advantage of low reconfiguration power consumption.This paper presents measurement results of the optical reconfiguration power consumption of a DORGA-VLSI chip and shows the power consumption advantages of the DORGA architecture through comparison with other ORGAs.||||||||||||./pdfs/102-RAW-paper-1.pdf",
    "Dynamic Performance Prediction of an Adaptive Mesh Application |Dynamic Performance Prediction of an Adaptive Mesh Application Mark M Mathis Darren J Kerbyson While it is possible to accurately predict the execution time of a given iteration of an adaptive application, it is not generally possible to predict the data-dependent adaptive behavior the application will take and therefore to predict the total execution time for a given execution. To remedy this situation we have developed an executable performance model that can be utilized dynamically at runtime directly from the application of interest. In this manner, the application itself can rapidly predict the expected execution time for its next iteration based on current information on the data layout and level of adaptivity. This enables the application itself to determine: if an optimum level of performance is being achieved (i.e. by comparing measured and predicted times); when to perform a checkpoint (if the next iteration will exceed a predefined time limit between checkpoints); or when to terminate (if the next iteration will exceed the application's system time allocation for instance). The dynamic model is shown to have high accuracy over a number of test cases, even in the presence of interference (system activities that are not a part of application activities).||||||||||||./pdfs/102-SMTPS-paper-1.pdf",
    "A Self-Stabilizing Minimal Dominating Set Algorithm with Safe Convergence |A Self-Stabilizing Minimal Dominating Set Algorithm with Safe Convergence Hirotsugu Kakugawa Toshimitsu Masuzawa A self-stabilizing distributed system is a fault-tolerant distributed system that tolerates any kind and any finite number of transient faults, such as message loss and memory corruption. In this paper, we formulate a concept of safe convergence in the framework of self-stabilization. An ordinary self-stabilizing algorithm has no safety guarantee while it is in converging from any initial configuration. The safe convergence property guarantees that a system quickly converges to a safe configuration, and then, it gracefully moves to an optimal configuration without breaking safety. Then, we propose a minimal independent dominating set algorithm with safe convergence property. Especially, the proposed algorithm computes the lexicographically first minimal independent dominating set according to the process identifier as a priority. The priority scheme can be arbitrarily changed such as stability, battery power and/or computation power of node.||||||||||||./pdfs/103-APDCM-paper-1.pdf",
    "Price-based User-optimal Job Allocation Scheme for Grid Systems |Price-based User-optimal Job Allocation Scheme for Grid Systems Satish Penmatsa Anthony T. Chronopoulos In this paper we propose a price-based user-optimal job allocation scheme for grid systems whose nodes are connected by a communication network. The job allocation problem is formulated as a noncooperative game among the users who try to minimize the expected cost of their own jobs. We use the concept of Nash equilibrium as the solution of our noncooperative game and derive a distributed algorithm for computing it. The prices that the grid users has to pay for using the computing resources owned by different resource owners are obtained using a pricing model based on a game theory framework. Finally, our scheme is compared with a system-optimal job allocation scheme under simulations with various system loads and configurations and conclusions are drawn.||||||||||||./pdfs/103-HPGC-paper-1.pdf",
    "A Simulator for Parallel Applications with Dynamically Varying Compute Node|A Simulator for Parallel Applications with Dynamically Varying Compute Node Allocation Basile Schaeli Sebastian Gerlach Roger D. Hersch Dynamically allocating computing nodes to parallel applications is a promising technique for improving the utilization of cluster resources. We introduce the concept of dynamic efficiency which expresses the resource utilization efficiency as a function of time. We propose a simulation framework which enables predicting the dynamic efficiency of a parallel application. It relies on the DPS parallelization framework to which we add direct execution simulation capabilities. The high level flow graph description of DPS applications enables the accurate simulation of parallel applications without needing to modify the application code. Thanks to partial direct execution, simulation times and memory requirements may be reduced. In simulations under partial direct execution, the application's parallel behavior is simulated thanks to direct execution, and the duration of individual operations is obtained from a performance prediction model or from prior measurements. We verify the accuracy of our simulator by comparing the effective running time, respectively the dynamic efficiency, of parallel program executions with the running time, respectively the dynamic efficiency, predicted by the simulator. These comparisons are performed for an LU factorization application under different parallelization and dynamic node allocation strategies.||||||||||||./pdfs/103-PMEO-paper-1.pdf",
    "Improving Locality of Nonserial Polyadic Dynamic Programming |Improving Locality of Nonserial Polyadic Dynamic Programming Guangming Tan Ninghui Sun Dongbo Bu Dynamic programming (DP) is a commonly used technique for solving a wide variety of discrete optimization problems, which have different variants of dynamic programming formulation. This paper investigated one important DP formulation, which called nonserial polyadic dynamic programming formulation and time complexity is $O(n^ 3 )$. We exploit the property of the algorithm to develop a high performance implementation using the combination of cache-oblivious and cache-conscious strategy. The efficiency in our improved algorithm comes from two sources: reducing the number of cache misses and TLB misses. Experiments on three modern computing platforms show a performance improvement of 2-10 times over a standard implementation of DP formulation.||||||||||||./pdfs/103-POHLL-paper-1.pdf",
    "An Approach to Locality-Conscious Load Balancing and Transparent Memory Hie|An Approach to Locality-Conscious Load Balancing and Transparent Memory Hierarchy Management with a Global-Address-Space Parallel Programming Model Sriram Krishnamoorthy Umit Catalyurek Jarek Nieplocha P. Sadayappan The development of efficient parallel out-of-core applications is often tedious, because of the need to explicitly manage the movement of data between files and data structures of the parallel program. Several large-scale applications require multiple passes of processing over data too large to fit in memory, where significant concurrency exists within each pass. This paper describes a global-address-space framework for the convenient specification and efficient execution of parallel out-of-core applications operating on block-sparse data. The programming model provides a global view of block-sparse matrices and a mechanism for the expression of parallel tasks that operate on block-sparse data. The tasks are automatically partitioned into phases that operate on memory-resident data, and mapped onto processors to optimize load balance and data locality. Experimental results are presented that demonstrate the utility of the approach.||||||||||||./pdfs/104-POHLL-paper-1.pdf",
    "A Strategyproof Mechanism for Scheduling Divisible Loads in Bus Networks wi|A Strategyproof Mechanism for Scheduling Divisible Loads in Bus Networks without Control Processors Thomas E. Carroll Daniel Grosu Divisible Load Theory (DLT) considers the scheduling of arbitrarily partitionable loads in distributed systems. The underlying assumption of DLT is that the processors are obedient (\\textit i.e. , they do not ``cheat'' the protocol), which is unrealistic when the processors are owned by autonomous, self-interested organizations that have no \\textit a priori motivation for cooperation and which strive to maximize their own welfare. In this scenario, they will manipulate the algorithm if it is beneficial to do so. In this paper we propose a strategyproof mechanism for scheduling divisible loads in bus networks \\emph without control processors. We augment DLT with incentives so that it is to the benefit of a processor to truthfully report its processing capacity and to process its assignment at full capacity. The mechanism provides incentives to processors for reporting deviants and issues fines to deviants, which results in abated willingness to deviate.||||||||||||./pdfs/105-APDCM-paper-1.pdf",
    "TPCC-UVa: An Open-Source TPC-C Implementation for Parallel and Distributed |TPCC-UVa: An Open-Source TPC-C Implementation for Parallel and Distributed Systems Diego R. Llanos Bel&eacute;n Palop This paper presents TPCC-UVa, an open-source implementation of the TPC-C benchmark intended to be used in parallel and distributed systems. TPCC-UVa is written entirely in C language and it uses the PostgreSQL database engine. This implementation includes all the functionalities described by the TPC-C standard specification for the measurement of both uni- and multiprocessor systems performance. The major characteristics of the TPC-C specification are discussed, together with a description of the TPCC-UVa implementation and architecture and real examples of performance measurements.||||||||||||./pdfs/105-PMEO-paper-1.pdf",
    "Support for Adaptivity in ARMCI Using Migratable Objects |Support for Adaptivity in ARMCI Using Migratable Objects Chao Huang Chee Wai Lee Laxmikant V. Kale Many new paradigms of parallel programming have emerged that compete with and complement the standard and well-established MPI model. Most notable, and successful, among these are models that support some form of global address space. At the same time, approaches based on migratable objects (also called virtualized processes) have shown that resource management concerns can be separated effectively from the overall parallel programming effort. For example, Charm++ supports dynamic load balancing via an intelligent adaptive runtime system. It is also becoming clear that a multi-paradigm approach that allows modules written in one or more paradigms to coexist and co-operate will be necessary to tame the parallel programming challenge. ARMCI is a remote memory copy library that serves as a foundation of many global address space languages and libraries. This paper presents our preliminary work on integrating and supporting ARMCI with the adaptive run-time system of Charm++ as a part of our overall effort in the multi-paradigm approach.||||||||||||./pdfs/105-POHLL-paper-1.pdf",
    "WPDRTS Keynote: Component-based Construction of Embedded Systems |WPDRTS Keynote: Component-based Construction of Embedded Systems Joseph Sifakis We present a framework for the component-based construction of embedded systems. The framework is based on a general semantic model, encompassing various models of computation for real-time systems. It is characterized by the combined use of models for behavior, interaction and dynamic priorities. Interaction models describe interactions between components by using connectors with synchronization types. Dynamic priorities are used to specify controllers and schedulers in particular. We also present a methodology for model-based composition of real-time systems using this semantic model. The methodology enables correct-by-construction development for properties such as deadlock-freedom and progress, as well as incremental construction and associativity of composition operators. We present two implementations of the framework in system modeling and validation tools developed at Verimag:\\\\ - A partial implementation in the state exploration platform of the IF tool suite dedicated to the validation of asynchronous system modeling languages such as UML and SDL;\\\\ - A more recent full implementation in a platform for the execution of both synchronous and asynchronous components. The methodology is illustrated by the use of these tools on case studies for real-time systems modeling and validation.||||||||||||./pdfs/105-WPDRTS-paper-1.pdf",
    "An Advanced Performance Analysis of Self-stabilizing Protocols : Stabilizat|An Advanced Performance Analysis of Self-stabilizing Protocols : Stabilization Time with Transient Faults during Convergence Yoshihiro Nakaminami Hirotsugu Kakugawa Toshimitsu Masuzawa A self-stabilizing protocol is a brilliant framework for fault tolerance. It can recover from any number and any type of transient faults and eventually converge to its intended behavior. Performance of a self-stabilizing protocol is usually measured by stabilization time: the time required to complete the convergence to its intended behavior under the assumption that no new fault occurs during the convergence. But a self-stabilizing protocol has no guarantee to complete the convergence if faults are frequently occurred. This paper brings new light to efficiency analysis of stabilization. The efficiency is evaluated with consideration for faults occurring during the convergence. We propose a new performance measure of self-stabilizing protocols, a stabilization time with $\\sharp F$ faults. It is the worst case time to converge to intended behaviors with consideration for $\\sharp F$ faults occurring during the convergence. To show the feasibility and effectiveness of the approach, this paper applies the approach to the maximal matching protocol proposed by Hsu and Huang and show that its stabilization time with $\\sharp F$ faults is $2m + n + 4\\Delta\\cdot \\sharp\\textit F $ where $m,n$ and $\\Delta$ are the number of links, the number of vertices and maximum degree respectively.||||||||||||./pdfs/106-APDCM-paper-1.pdf",
    "Loosely-coupled Loop Scheduling in Computational Grid |Loosely-coupled Loop Scheduling in Computational Grid Jos&eacute; Herrera Eduardo Huedo Rub&eacute;n Santiago Montero Ignacio Mart&iacute;n Llorente Loop distribution is one of the most useful techniques to reduce the execution time of parallel applications. Traditionally, loop scheduling algorithms are implemented based on parallel programming paradigms such as MPI. This approximation presents three main disadvantages when applied in a Grid environment, namely: (i) all resources must be simultaneously allocated to begin execution of the application; (ii) it is necessary to restart the whole application when a resource fails; (iii) it is not possible to add new resources to a currently running application. To overcome these limitations, we propose a new approach to implement loop distribution schemes in computational Grids. This approach is implemented using the Distributed Resource Management Application API (DRMAA) standard and the GridWay meta-scheduling framework. The efficiency of this approach to solve the Mandelbrot set problem is analyzed in a Globus-based research testbed.||||||||||||./pdfs/106-HPGC-paper-1.pdf",
    "A Decomposition Approach for Optimizing the Performance of MPI Libraries |A Decomposition Approach for Optimizing the Performance of MPI Libraries Olaf Hartmann Matthias K&uuml;hnemann Thomas Rauber Gudula R&uuml;nger MPI provides a portable message passing interface for many parallel execution platforms but may lead to inefficiencies for some platforms and applications. In this article we show that the performance of both, standard libraries and vendor-specific libraries, can be improved by an orthogonal organization of the processors in 2D or 3D meshes and by decomposing the collective communication operations into several phases. We describe an adaptive approach with a configuration phase to determine for a specific execution platform and a specific MPI library which decomposition leads to the best performance. This may also depend on the number of processors and the size of the messages to be transferred. The decomposition approach has been implemented in the form of a library extension which is called for each activation of a collective MPI operation. This has the advantage that neither the application programs nor the MPI library need to be changed while leading to significant performance improvements for many collective MPI operations.||||||||||||./pdfs/106-POHLL-paper-1.pdf",
    "Cache-Oblivious Simulation of Parallel Programs |Cache-Oblivious Simulation of Parallel Programs Andrea Pietracaprina Geppino Pucci Francesco Silvestri This paper explores the relation between the structured parallelism exposed by the Decomposable BSP (D-BSP) model through submachine locality and locality of reference in multi-level cache hierarchies. Specifically, an efficient cache-oblivious algorithm is developed to simulate D-BSP programs on the Ideal Cache Model (ICM). The effectiveness of the simulation is proved by showing that optimal cache-oblivious algorithms for prominent problems can be obtained from D-BSP algorithms. Finally, a tight relation between optimality in the D-BSP and ICM models is established.||||||||||||./pdfs/107-APDCM-paper-1.pdf",
    "Annotating User-Defined Abstractions for Optimization |Annotating User-Defined Abstractions for Optimization Dan Quinlan Markus Schordan Richard Vuduc Qing Yi Although conventional compilers implement a wide range of optimization techniques, they frequently miss opportunities to optimize the use of abstractions, largely because they are not designed to recognize and use the relevant semantic information about such abstractions. In this position paper, we propose a set of annotations to help communicate high-level semantic information about abstractions to the compiler, thereby enabling the large body of traditional compiler optimizations to be applied to the use of those abstractions. Our annotations explicitly describe properties of abstractions that are needed to guarantee the applicability and profitability of a broad variety of such optimizations, including memoization, reordering, data layout transformations, and inlining and specialization.||||||||||||./pdfs/107-POHLL-paper-1.pdf",
    "An Optically Differential Reconfigurable Gate Array with a Holographic Memo|An Optically Differential Reconfigurable Gate Array with a Holographic Memory Minoru Watanabe Mototsugu Miyano Fuminori Kobayashi cally Reconfigurable Gate Arrays (ORGAs) offer the possibility of providing a virtual gate count that is much larger than those of currently available VLSIs by exploiting the large storage capacity of holographic memory. We developed an Optically Differential Reconfigurable Gate Array (ODRGA-VLSI) with no overhead and fast reconfiguration capability. This paper presents the results of development of a perfect optical reconfigurable system with the ODRGA-VLSI chip and holographic memory. Experimental results of the reconfiguration procedure and circuit performance on a gate array are also presented.||||||||||||./pdfs/107-RAW-paper-1.pdf",
    "Decontamination of Chordal Rings and Tori |Decontamination of Chordal Rings and Tori Paola Flocchini Miaojun Huang Flaminia Luccio In this paper we consider the problem of decontaminating a network, i.e., protecting it from unwanted and dangerous intrusions. Initially all nodes are contaminated and a team of agents is deployed to clean the entire network. When an agent transits on a node, it can clean it, when the node is left unguarded, however, it will be recontaminated as soon as at least one of its neighbour is contaminated. We study the problem in asynchronous chordal ring networks with $n$ nodes and chord lengths $d_1=1, d_ 2 , ..., d_ k $, and in tori. We consider two variations of the model: one where an agent has only local knowledge, the other in which it has ``visibility, i.e., it can ``see the state of its neighbouring nodes. We first show that, when the largest chord $d_ k $ is not too large ($d_ k \\leq \\sqrt n$), the number of agents necessary to perform the task in chordal rings does not depend on the size of the network but only on the length of the longest chord. We also show a lower bound on the number of agents for the torus topology. We then propose tight strategies for decontamination. We analyse the number of moves and the time complexity of the decontamination algorithms showing that the visibility assumption allows us to decrease substantially both complexity measures. Another advantage of the ``visibility model'' is that agents move independently and autonomously without requiring any coordination.||||||||||||./pdfs/108-APDCM-paper-1.pdf",
    "A Systematic Multi-step Methodology for Performance Analysis of Communicati|A Systematic Multi-step Methodology for Performance Analysis of Communication Traces of Distributed Applications based on Hierarchical Clustering Gaby Aguilera Patricia J. Teller Michela Taufer Felix Wolf Often parallel scientific applications are instrumented and traces are collected and analyzed to identify processes with performance problems or operations that cause delays in program execution. The execution of instrumented codes may generate large amounts of performance data, and the collection, storage, and analysis of such traces are time and space demanding. To address this problem, this paper presents an efficient, systematic, multi-step methodology, based on hierarchical clustering, for analysis of communication traces of parallel scientific applications. The methodology is used to discover potential communication performance problems of three applications: TRACE, REMO, and SWEEP3D.||||||||||||./pdfs/108-PMEO-paper-1.pdf",
    "Effecting Parallel Graph Eigensolvers Through Library Composition |Effecting Parallel Graph Eigensolvers Through Library Composition Alex Breuer Peter Gottschling Douglas Gregor Andrew Lumsdaine Many interesting problems in graph theory can be reduced to solving an eigenproblem of the adjacency matrix or Laplacian of a graph. Given the availability of high-quality linear algebra and graph libraries, one might expect that one could merely use a graph data structure within a eigensolver. However, conventional libraries are rigidly constructed, requiring conversion to library-specific data structures or using heavyweight abstraction methods that prevent efficient composition. The Generic Programming methodology addresses the problems of reusability and composability by careful factorization of a domain into efficient library abstractions. We describe the composition process that makes the data structures from a library supporting one domain usable with the algorithms of another library for a disjoint domain without conversion or heavyweight abstractions. To illustrate the process, we compose two separately-developed libraries, one for solving eigenproblems sequentially and the other for solving graph problems in parallel, effecting an efficient, scalable parallel graph eigensolver.||||||||||||./pdfs/108-POHLL-paper-1.pdf",
    "Automatic Application-Specific Microarchitecture Reconfiguration |Automatic Application-Specific Microarchitecture Reconfiguration Shobana Padmanabhan Ron K. Cytron Roger D. Chamberlain John W. Lockwood Applications for constrained embedded systems are subject to strict time constraints and restrictive resource utilization. With soft core processors, application developers can customize the processor for their application, constrained by resources but aimed at high application performance. With such freedom in the design space of the processor, however, comes complexity. We present here an automatic optimization technique that helps the developers with the processor microarchitecture customization. A naive approach exploring all possible configurations is exponential with the number of parameters and hence is clearly infeasible, even with only tens of reconfigurable parameters. Instead, our approach runs in time that is linear with the number of parameter values, based on an assumption of parameter independence. This makes the approach feasible and scalable. For the dimensions that we customize, namely application runtime and hardware resources, we formulate their costs as a constrained binary integer nonlinear optimization program. Though the results are not guaranteed to be optimal, we find they are near-optimal in practice. Our technique itself is general and can be applied to other design-space exploration problems.||||||||||||./pdfs/108-RAW-paper-1.pdf",
    "Optimal Map Contruction of an Unknown Torus |Optimal Map Contruction of an Unknown Torus Hanane Becha Paola Flocchini In this paper we consider the map construction problem in the case of an \\em anonymous, unoriented torus of unknown size. An agent that can move from node to neighbouring node in the torus is initially placed in an arbitrary node and has to construct an edge-labeled map. In other words, it has to draw, in its local memory, an edge-labeled torus isomorphic to the one it is moving on. The agent has enough local memory to represent the torus and one or two tokens that can be dropped on and picked up from nodes. Efficiency is measured in terms of number of moves performed by the agent. When the agent has no token available, the problem is clearly unsolvable. In the paper we show that, when the agent has one token available there exists an optimal algorithm for constructing the map of the torus; the agent, in fact, performs $\\Theta(N)$ moves (where $N$ is the number of nodes of the torus). Before showing the optimal solution with the optimal number of tokens, we describe a simpler solution that works when two tokens are available, we then modify it to obtain the same bound when the agent has only one token available.||||||||||||./pdfs/109-APDCM-paper-1.pdf",
    "Execution and Composition of E-Science Applications using the WS-Resource C|Execution and Composition of E-Science Applications using the WS-Resource Construct Evangelos Floros Yannis Cotronis Service Oriented Architectures are emerging as the recommended paradigm for developing dispersed e-science environments. In this paper we analyze the characteristics and requirements of a common class of scientific applications, namely Computational Simulation Models, and define a generic service-oriented framework for their execution and composition. Finally we present the work done so far towards the implementation of such framework based on the WSRF set of specifications and the Globus Toolkit.||||||||||||./pdfs/109-HPGC-paper-1.pdf",
    "On the Impact of Data Input Sets on Statistical Compiler Tuning |On the Impact of Data Input Sets on Statistical Compiler Tuning Masayo Haneda Peter M. W. Knijnenburg Harry A. G. Wijshoff In recent years, several approaches have been proposed to use profile information in compiler optimization. This profile information can be used at the source level to guide loop transformations as well as in the backend to guide low level optimizations. At the same time, profile guided library generators have been proposed also, like Atlas, Spiral, or FFTW, that tune their routines for the underlying hardware. These approaches have led to excellent performance improvements. However, a possible drawback of these approaches is that applications are optimized using a single or a limited set of data inputs. It is well known that programs can exhibit vastly differing behaviors for different inputs. Therefore, it is not clear whether the performance numbers reported are still valid for other input than the input used to optimize the program. In this paper, we address this problem for a specific statistical compiler tuning method. We use three different platforms and several SPECint2000 benchmarks. We show that when we tune the compiler using train data, we obtain a compiler setting that still performs well for reference data. These results suggest that profile guided optimization may be more stable than is sometimes believed and that a limited number of train data sets is sufficient to obtain a well optimized program for all inputs.||||||||||||./pdfs/109-POHLL-paper-1.pdf",
    "Elementary Block Based 2-Dimensional Dynamic and Partial Reconfiguration fo|Elementary Block Based 2-Dimensional Dynamic and Partial Reconfiguration for Virtex-II FPGAs Michael H&uuml;bner Christian Schuck J&uuml;rgen Becker The development of Field Programmable Gate Arrays (FPGAs) had tremendous improvements in the last few years. They were extended from simple logic circuits to complex Systems-on-Chip which enable the integration of complete microcontroller systems and their peripheral devices. Virtex-II FPGAs from Xilinx provide the possibility of dynamic and partial reconfiguration. This can be taken advantage of to substitute inactive parts of a hardware system and adapt the complete chip to a different requirement of an application while run-time. Existing approaches allow reconfiguration of slot based systems while run-time. Unfortunately such systems suffer from the fact, that fixed sized reconfigurable slots are not completely utilized by all functional blocks. Therefore a new 2-dimensional approach is necessary to optimize the placement of functions on the reconfiguration area for the FPGA. Benefit is a reduced chip size which leads to a reduction of power dissipation. This paper describes the method and procedure to include a 2-dimensional placement of reconfigurable blocks and the integration to a run-time system.||||||||||||./pdfs/109-RAW-paper-1.pdf",
    "Benefits of High Speed Interconnects to Cluster File Systems: A Case Study |Benefits of High Speed Interconnects to Cluster File Systems: A Case Study with Lustre Weikuan Yu Ranjit Noronha Shuang Liang Dhabaleswar K. Panda Cluster file systems and Storage Area Networks (SAN) make use of network IO to achieve higher IO bandwidth. Effective integration of networking mechanisms is important to their performance. In this paper, we perform an evaluation of a popular cluster file system, Lustre, over two of the leading high speed cluster interconnects: InfiniBand and Quadrics. Our evaluation is performed with both sequential IO and parallel IO benchmarks in order to explore the capacity of Lustre under different communication characteristics. Experimental results show that direct implementations of Lustre over both interconnects can improve its performance, compared to an IP emulation over InfiniBand (IPoIB). The performance of Lustre over Quadrics is comparable to that of Lustre over InfiniBand with the platforms we have. Latest InfiniBand products can embrace latest technologies, such as PCI-Express and DDR, and provide higher capacity. Our results show that over a Lustre file system with two Object Storage Servers (OSSs), InfiniBand with PCI-Express technology can improve Lustre write performance by 24\\%. Furthermore, our experimental results indicate that Lustre meta-data operations do not scale with an increasing number of OSSs, in spite of using high performance interconnects.||||||||||||./pdfs/11-CAC-paper-1.pdf",
    "A Proposal of Metaheuristics Based in the Cooperation between Operators in |A Proposal of Metaheuristics Based in the Cooperation between Operators in Combinatorial Optimization Problems Alejandro Sancho-royo David Pelta Jos&eacute; L. Verdegay In the context of optimization problems, metaheuristics are tools that stand out by its excellent results and generality. A lot of metaheuristics are formed by a population of agents that operates in a search space. A frame of metaheuristics inspired in cooperation between unrelated individuals is proposed and three different methods of cooperation are suggested. The implementation of the cooperation between agents is made using Soft Computing techniques. A fuzzy rules system has been designed concretely to perform the cooperation. Details about the implementation of three methods of cooperation and the computation of the fuzzy rules are offered for the models considered. A framework of experimentation over the combinations of methods and models is proposed.||||||||||||./pdfs/11-NIDISC-paper-1.pdf",
    "Toward Reliable and Efficient Message Passing Software Through Formal Analy|Toward Reliable and Efficient Message Passing Software Through Formal Analysis Ganesh Gopalakrishnan Robert M. Kirby The quest for high performance drives parallel scientific computing software design. Well over 60\\% of the high-performance computing (HPC) community writes programs using the MPI library; to gain performance, they are known to perform many manual optimizations. Even tools that accept high level descriptions often generate MPI code, due to its eminent portability. However, since the overall performance of a program does not usually port (due to variations in the target architecture, cluster size, etc.), manual changes to the code are inevitable in today's approaches to MPI programming and optimization. This, together with the vastness and evolving nature of the MPI standard, and the innate complexity of concurrent programming introduces costly bugs. Our research addresses these challenges through specific efforts in the following broad areas: (i) high level expression of the parallel algorithm and compilation thereof into optimized MPI programs, (ii) optimizations of user-written detailed MPI programs through localized transformations such as barrier removal, (iii) formal modeling of complex communication standards, such as the MPI-2 standard and a facility for answering putative queries (this need arises when standard documents are impossibly difficult to manually study in order to answer questions that are not explicitly addressed in the standard), (iv) formal modeling of new (and hence relatively less well understood) features of communication libraries, such as the one-sided communication facility of MPI-2, and (v) formal modeling of intricate control algorithms in these libraries such as the progress engine for TCP and/or shared memory in MPICH2 (a formal model can explicate commonalities, help formally verify, as well as help create better future implementations). Our research gains focus through numerous collaborations.||||||||||||./pdfs/11-NSFNGS-paper-1.pdf",
    "Network Decontamination with Local Immunization |Network Decontamination with Local Immunization Fabrizio Luccio Linda Pagli Nicola Santoro We consider the problem of decontaminating a network infected by a mobile virus. The task is to be carried out by a team of antiviral system agents ( \\em cleaners ), able to disinfect visited sites, avoiding any recontamination of disinfected areas. The goal is to perform the task using as small a team of antiviral agents as possible and minimizing the amount of agents' movements across the network. In all the existing literature, it is assumed that a disinfected site, in absence of a cleaner, becomes recontaminated if just one of its neighbours is contaminated. In other words, it is assumed that the immunity level of a disinfected site is \\em nil . This assumption is quite strong and not necessarily realistic, e.g., in systems that employ local majority-based rules to enhance reliability and fault-tolerance. In this paper we consider the network decontamination problem under a new model of \\em immunity to recontamination: we consider the case when a disinfected vertex, after the cleaning agent has gone, will become recontaminated only if a weak majority of its neighbours are infected. We study the effects of this level of immunity on the nature of the problem, in particular on the number of antiviral agents necessary to decontaminate the entire network. We focus on tori and on trees, and establish lower-bounds on the team size; we also establish lower bounds on the number of moves performed by an optimal-size time of cleaners. We design and present strategies for disinfecting tori and trees; we prove that these strategies are optimal in terms of both team size and number of moves. In particular, the upper and lower bounds are are tight for tree networks and for synchronous tori; the bounds are within a constant factor of each other in the case of asynchronous tori.||||||||||||./pdfs/110-APDCM-paper-1.pdf",
    "A Job Monitoring System for the LCG Computing Grid |A Job Monitoring System for the LCG Computing Grid Ahmad Hammad Torsten Harenberg Dimitri Igdalov Peter M&auml;ttig David Meder Peer Ueberholz Experience with generating simulation data of high energy physics experiments has shown that a job monitoring system (JMS) is essential to understand failures of jobs within the Grid. Such a system can give information about the status of the user job as well as the worker node in parallel while a user job is running. It should support the user directly by allowing the user to interact with the running job and should be able to make an automatic error correction. Furthermore, such a system can be extended for an automatic classification of errors which can improve the stability and performance of the Grid environment. To increase the acceptance of the Grid, a graphical user interface (GUI) has been developed and integrated with the job monitoring system. Both components are currently integrated in the computing environment for generating data for the \\verb+DO+ Experiment. In this paper we want to describe the basic components of the job monitoring software.||||||||||||./pdfs/110-HPGC-paper-1.pdf",
    "A General Data Dependence Analysis to Nested Loop Using Integer Interval Th|A General Data Dependence Analysis to Nested Loop Using Integer Interval Theory Zhou Jing Zeng Guosun Many dependence tests have been proposed for loop parallelization in the case of arrays with linear subscripts, but little work has been done on the arrays with non-linear subscripts, which sometimes occur in parallel benchmarks and scientific and engineering applications. This paper focuses on array subscripts coupled integer power index variables. We attempt to use the integer interval theory to solve the above difficult dependence test problem. Some ?interval solution? rules for polynomial equations have been proposed in this paper. Furthermore, based on the proposed rules, we present a novel approach to loop dependence analysis, which is termed the Polynomial Variable Interval test or PVI-test, and also develop a related algorithm. Some case studies show that the PVI-test is effective and efficient. Compared to the VI test, the PVI-test makes significant improvement, and is therefore a more general scheme of dependence test.||||||||||||./pdfs/110-POHLL-paper-1.pdf",
    "VoC: A Reconfigurable Matrix for Stereo Vision Processing |VoC: A Reconfigurable Matrix for Stereo Vision Processing Ricardo Pezzuol Jacobi Renato Barreto Cardoso Geovany Borges This paper presents a reconfigurable matrix VoC that can be applied to stereo vision computation. VoC accelerates block pixel matching by providing a highly parallel implementation of the Sum of Absolute Differences metric. Reconfigurability allows VoC to deal with different block sizes, ranging from a single 7x7 SAD computation to 9 simultaneous 5x5 block computations. The pipelined version mapped to Xilinx FPGA could be simulated at 158 MHz, producing 1,42 billion matchings per second.||||||||||||./pdfs/110-RAW-paper-1.pdf",
    "The Robot Software Communications Architecture (RSCA): Embedded Middleware |The Robot Software Communications Architecture (RSCA): Embedded Middleware for Networked Service Robots Seongsoo Hong Jaesoo Lee Hyeonsang Eom Gwangil Jeon In this paper, we present a robot middleware technology named Robot Software Communications Architecture (RSCA) for its use in networked home service robots. The RSCA provides a standard operating environment for the robot applications together with a framework that expedites the development of such applications. The operating environment is comprised of a real-time operating system, a communication middleware, and a deployment middleware. Particularly, the deployment middleware supports the reconfiguration of component-based robot applications including installation, creation, start, stop, tear-down, and un-installation. In designing RSCA, we have adopted a middleware called SCA from the software defined radio domain and extend it since the original SCA lacks the real-time guarantees and appropriate event services. We have fully implemented RSCA and performed measurements to quantify its run-time performance. Our implementation clearly shows the viability of RSCA.||||||||||||./pdfs/110-WPDRTS-paper-1.pdf",
    "Reducing the Associativity and Size of Step Caches in CRCW Operation |Reducing the Associativity and Size of Step Caches in CRCW Operation Martti Forsell Step caches are caches in which data entered to an cache array is kept valid only until the end of ongoing step of execution. Together with an advanced pipelined multithreaded architecture they can be used to implement concurrent read concurrent write (CRCW) memory access in shared memory multiprocessor systems on chip (MP-SOC) without cache coherency problems. Unfortunately obvious step cache architectures assume full associativity, which can become expensive since the size and thus associativity of caches equal the number of threads per processor being at least the square root of the number of processors. In this paper, we describe a technique to radically reduce the associativity and even size of step caches in CRCW operation. We give a short performance evaluation of limited associativity step cache systems with different settings using simple parallel programs on a parametrical MP-SOC framework. According to the evaluation, the performance of limited associativity step cache systems comes very close to that of fully associative step cache systems, while decreasing the size of caches decreases the performance gradually.||||||||||||./pdfs/111-APDCM-paper-1.pdf",
    "More on JACE: New Functionalities, New Experiments |More on JACE: New Functionalities, New Experiments Jacques Mohcine Bahi Stephane Domas Kamel Mazouzi Java is often criticized for its poor performances compared to native codes. Nevertheless, this language provides lots of interesting functionalities to easily implement scientific applications on a widely distributed architecture (i.e. grid). The context of this paper is that of iterative algorithms. In order to increase the efficiency of the code, we suggest to use a special class of algorithms called AIACs (Asynchronous Iterations, Asynchronous Computations). This paper presents new results on our works to combine Java and asynchronism within a programming/execution environment called JACE. New functionalities have been added and interesting comparisons with C/MPI and on the impact of overlap techniques are given.||||||||||||./pdfs/111-JAVAPDC-paper-1.pdf",
    "A Configuration Memory Hierarchy for Fast Reconfiguration with Reduced Ener|A Configuration Memory Hierarchy for Fast Reconfiguration with Reduced Energy Consumption Overhead Elena Perez Ramo Javier Resano Daniel Mozos Francky Catthoor Currently run-time reconfigurable hardware offers really attractive features for embedded systems, such as flexibility, reusability, high performance and, in some cases, low-power consumption. However, the reconfiguration process often introduces significant overheads in performance and energy consumption. In our previous work we have developed a reconfiguration manager that minimizes the execution time overhead. Nevertheless, since the energy overhead is equally important, in this paper we propose a configuration memory hierarchy that provides fast reconfiguration while achieving energy savings. To take advantages of this hierarchy we have developed a configuration mapping algorithm and we have integrated it in our reconfiguration manager. In our experiments we have reduced the energy consumption 22.5\\% without introducing any performance degradation.||||||||||||./pdfs/111-RAW-paper-1.pdf",
    "A Real-Time PES Supporting Runtime State Restoration after Transient Hardwa|A Real-Time PES Supporting Runtime State Restoration after Transient Hardware-Faults Skambraks Controlling safety-critical real-time applications that cannot immediately be transferred to a safe state requires highly reliable Programmable Electronic Systems (PESs). This demand for fault-tolerance is usually satisfied by applying redundant processing structures inside each PES and, additionally, configuring multiple PES redundantly. Instead of minimising the failure probability of single PESs, it is also desirable to provide a redundant configuration of PESs with the capability to re-start single units at runtime. This requires copying a PES's internal state at runtime, since a re-started unit must equalise its internal state with that of its redundant counterparts before the redundant processing can be rejoined. As a result, redundancy attrition due to transient faults is prevented, since failed channels can be brought back on line. This article states the problems concerned with runtime state restoration of real-time systems, discusses the advantages and disadvantages of existing techniques and introduces a hardware-supported state restoration concept.||||||||||||./pdfs/111-WPDRTS-paper-1.pdf",
    "Broadcasting and Routing in Faulty Mesh Networks |Broadcasting and Routing in Faulty Mesh Networks Milos Stojmenovic Amiya Nayak Broadcasting is a data communication task in which one processor sends the same message to all other processors. Routing is a task where a source processor sends a message to a destination processor. A faulty node is in an error state and cannot participate in the activities or the communication in a given network. In this paper, we consider the family of mesh networks, which include the mesh connected computer (MCC), k-dimensional mesh, torus, and k-ary n-cube. Our goal is to design routing and broadcasting algorithms which will use local knowledge of faults, no additional resources, will work for an arbitrary number and structure of faults, will guarantee delivery to all nodes connected to the source, and will remain optimal in a fault free mesh. We did not find any solution in literature to satisfy these desirable properties. Our routing and broadcasting schemes for MCCs and tori, and our broadcasting algorithm for the all-port model on any faulty mesh network satisfy all of these properties. For routing and broadcasting in a one-port model in higher dimensions, a condition on fault structure needs to be met. We propose a new broadcasting algorithm which guarantees delivery to all processors connected to the source in the all-port model of faulty meshes. We then describe a routing algorithm that guarantees delivery in faulty MCCs and tori, the connectivity of the source and destination being the only obvious requirement. The algorithm can be extended to faulty k-D meshes and k-ary n-cubes, where the delivery will be guaranteed if healthy nodes in every 2-D submesh (sub-tori) remain connected. We then describe broadcasting algorithms for the one-port model, which again guarantee delivery to all connected processors in two-dimensional cases, and guarantee delivery in k-dimensional cases if healthy processors in every 2-D submesh (sub-tori) remain connected.||||||||||||./pdfs/112-APDCM-paper-1.pdf",
    "Anticipated Distributed Task Scheduling for Grid Environments |Anticipated Distributed Task Scheduling for Grid Environments Thomas Rauber Gudula R&uuml;nger Heterogeneous distributed environments or grid environments provide large computing resources for the execution of large scientific applications. The effective use of those platforms requires a suitable representation of the application algorithm which makes a distribution of parts of the application across the distributed environment possible. A representation of an application algorithm in form of interacting tasks has been shown to be a suitable programming model for those distributed environments, where tasks can be shipped to remote computing resources for execution. The efficient execution of an application also depends on the time for sending tasks and data to remote resoures, which adds an additional overhead to the distributed execution time. In this paper, we propose a method to overlap the execution of current tasks with the shipping time for tasks to be executed later. The efficient overlapping is achieved by an anticipated scheduling algorithm for the placement of future task executions.||||||||||||./pdfs/112-HPGC-paper-1.pdf",
    "An Adaptive System-on-Chip for Network Applications |An Adaptive System-on-Chip for Network Applications Roman Koch Thilo Pionteck Carsten Albrecht Erik Maehle This paper presents the hardware architecture of DynaCORE, a dynamically reconfigurable system-on-chip for network applications. DynaCORE is an application specific coprocessor for offloading computationally intensive tasks from a network processor. The system-on-chip architecture is based on an adaptable network-on-chip which allows the dynamic replacement of hardware modules as well as the adaptation of the on-chip communication structure. The coprocessor leverages the active partial reconfiguration feature of modern FPGAs in order to adapt to shifting demand patterns. An embedded general-purpose processor core within the coprocessor runs software which manages the configurations of the device. With reference to a prototypical implementation targeting a Xilinx Virtex-II Pro FPGA, this paper focuses on on-chip communication issues. Topics include the integration of PowerPC processor cores into the configurable logic as well as the mode of operation of the network-on-chip.||||||||||||./pdfs/112-RAW-paper-1.pdf",
    "A Portable Real-time Emulator for Testing Multi-Radio MANETs |A Portable Real-time Emulator for Testing Multi-Radio MANETs Weirong Jiang Chao Zhang In building a real-life mobile ad-hoc network (MANET), network emulation has been appraised as an efficient approach for testing the real implementations of routing algorithms and protocol stacks. Most existing MANET emulators can hardly support both real-time scene construction for proof-of-concept test and real-time traffic recording for performance evaluation simultaneously. They also lack the ability to emulate the multi-radio environment. This paper presents a flexible TCP/IP-based real-time MANET emulator that can be portably deployed to facilitate the development of real multi-radio MANET routing protocols. It friendly provides visual interaction of topology control and rich configuration of emulation conditions to enable a real-time and comprehensive examination of protocol implementations.||||||||||||./pdfs/112-WPDRTS-paper-1.pdf",
    "Enhancing the Performance of HLA-Based Simulation Systems via Software Dive|Enhancing the Performance of HLA-Based Simulation Systems via Software Diversity and Active Replication Francesco Quaglia In this paper we explore active replication based on software diversity for improving the responsiveness of simulation systems. Our proposal is framed by the High-Level-Architecture (HLA), namely the emerging standard for interoperability of simulation packages, and results in the design and implementation of an Active Replication Management Layer (ARML), which supports the execution of multiple software diversity-based replicas of a same simulator in a totally transparent manner. Beyond presenting the replication framework and the design/implementation of ARML, we also report the results of an experimental evaluation on a case study, quantifying the benefits from our proposal in terms of execution speed.||||||||||||./pdfs/113-APDCM-paper-1.pdf",
    "IMAGE: An approach to building standards-based enterprise Grids |IMAGE: An approach to building standards-based enterprise Grids Gabriel Mateescu Masha Sosonkina We describe a system for aggregating heterogeneous resources from distinct administrative domains into an enterprise-wide compute grid, such that the aggregated resource provides the services of reliable and flexible queuing, scheduling, execution, and monitoring of batch applications. The system provides scheduling across multiple cluster Grids, user account mapping across domains, and file staging, thereby enabling the consolidation of organization-wide distributed resources into a virtual resource, while preserving local control of resources. The concept of abstract queue, as the unit of aggregating heterogeneous resources, is introduced and instantiated for distributed resource scheduling. The proposed system is an open source, standards-based alternative to similar commercial systems.||||||||||||./pdfs/113-HPGC-paper-1.pdf",
    "Exploiting Dynamic Proxies in Middleware for Distributed, Parallel, and Mob|Exploiting Dynamic Proxies in Middleware for Distributed, Parallel, and Mobile Java Applications Willem Van Heiningen Tim Brecht Steve Macdonald Babylon v2.0 is a collection of tools and services that provide a 100\\% Java compatible environment for developing, running and managing parallel, distributed and mobile Java applications. It incorporates features like object migration, asynchronous method invocation and remote class loading while providing an easy-to-use interface. The implementation of Babylon v2.0 exploits \\textit dynamic proxies , a feature added to Java 1.3 that allows runtime creation of proxy objects. This paper shows how Babylon v2.0 exploits dynamic proxies to implement several key features without the need for special language or virtual machine extensions, preprocessors, or compilers. The resulting Babylon programs are portable across all Java virtual machines, and the development process is simplified by removing the extra steps needed to invoke external stub compilers and incorporate the generated code into an application. This simplification also allows remote objects to be created for any class that supports an interface to its methods, even if source code is not available.||||||||||||./pdfs/113-JAVAPDC-paper-1.pdf",
    "Towards Building a Highly-Available Cluster Based Model for High Performanc|Towards Building a Highly-Available Cluster Based Model for High Performance Computing Azzedine Boukerche Raed Al-shaikh Mirela Sechi In recent years, we have witnessed a growing interest in high performance computing (HPC) using a cluster of workstations. However, many challenges remain to be resolved before these systems become dependable. One of the challenges in a clustered environment is to keep system failure to the minimum level and while achieving the highest possible level of system availability. High-Availability (HA) computing attempts to avoid the problems of unexpected failures through active redundancy and preemptive measures. In this paper, we propose to build HA-clusters based model for high performance computing. Our model is based on combination of both HPC and HA concepts, we also propose to investigate further the hardware and the management layers of the HA-HPC cluster design, and the parallel-applications layer (i.e. FT-MPI implementations). In this work, we focus upon the latter layer. We discuss our model, and present our simulation experiments we have carried out to evaluate our proposed model.||||||||||||./pdfs/113-PMEO-paper-1.pdf",
    "VHDL to FPGA automatic IPCore generation: A case study on Xilinx design flo|VHDL to FPGA automatic IPCore generation: A case study on Xilinx design flow Fabrizio Ferrandi Giovanna Ferrara Roberto Palazzo Vincenzo Rana Marco Domenico Santambrogio This paper aims at introducing a methodology that allows an easy implementation of IP-Cores focusing only on their functionalities rather than their interfaces and their integration in a given architecture. The proposed approach implements all the communication infrastructure needed by a component, described in VHDL, to be finally inserted into a real architecture that can be implemented on FPGAs, reducing the time to market of the final implementation of the system. To validate the entire methodology, we have performed a comparison based on the CoreConnect communication infrastructure, between our results with the classical Xilinx design flow using EDK and ISE.||||||||||||./pdfs/113-RAW-paper-1.pdf",
    "Battery Aware Dynamic Scheduling for Periodic Task Graphs |Battery Aware Dynamic Scheduling for Periodic Task Graphs Venkat Rao Gaurav Singhal Nicolas Navet Anshul Kumar G.s Visweswaran Battery lifetime, a primary design constraint for mobile embedded systems, has been shown to depend heavily on the load current profile. This paper explores how scheduling guidelines from battery models can help in extending battery capacity. It then presents a 'Battery-Aware Scheduling' methodology for periodically arriving taskgraphs with real time deadlines and precedence constraints. Scheduling of even a single taskgraph while minimizing the weighted sum of a cost function has been shown to be NP-Hard. The presented methodology divides the problem in to two steps. First, a good DVS algorithms dynamically determines the minimum frequency of execution. Then, a greedy algorithm allows a near optimal priority function to choose the task which would maximize slack recovery. The methodology also ensures adherence of real time deadlines independent of the choice of the DVS algorithm and priority function used, while following battery guidelines to maximize battery lifetime. Battery simulations carried out on the profile generated by our methodology for a large set of taskgraphs show that battery life time is extended up to 23.3\\% as compared to existing dynamic scheduling schemes.||||||||||||./pdfs/113-WPDRTS-paper-1.pdf",
    "Self-Stabilizing Distributed Algorithms for Graph Alliances |Self-Stabilizing Distributed Algorithms for Graph Alliances Pradip Srimani Zhenyu Xu Graph alliances are recently developed global properties of any symmetric graph. Our purpose in the present paper is to design self-stabilizing fault tolerant distributed algorithms for the global offensive and the global defensive alliance in a given arbitrary graph. We also provide complete analysis of the convergence time of both the algorithms.||||||||||||./pdfs/114-APDCM-paper-1.pdf",
    "An Evaluation of Heuristics for SLA Based Parallel Job Scheduling |An Evaluation of Heuristics for SLA Based Parallel Job Scheduling Viktor Yarmolenko Rizos Sakellariou In the context of SLA based job scheduling for high performance grid computing, this paper investigates the behaviour of various scheduling heuristics to schedule SLA-bounded jobs onto a parallel computing resource. The key objective of this investigation is to evaluate the effectiveness of simple scheduling heuristics using as criteria the maximization of resource utilization (both in terms of time and SLAs serviced) and income. Our results suggest how each SLA constraint ought to be prioritized in order to improve the income.||||||||||||./pdfs/114-HPGC-paper-1.pdf",
    "Communication Concept for Adaptive Intelligent Run-Time Systems Supporting |Communication Concept for Adaptive Intelligent Run-Time Systems Supporting Distributed Reconfigurable Embedded Systems Michael Ullmann J&uuml;rgen Becker Reconfigurable computing systems have already shown their abilities to accelerate embedded hardware/ software systems. Since standard processor-based embedded applications have come to their limits we need new concepts for controlling and managing embedded, possibly distributed, reconfigurable hardware/ software computing systems. Succeeding to previous papers which dealt with management aspects of run-time reconfigurable systems and related AI-approaches this contribution describes an approach and proof of concept of a transparent communication mechanism between the application layer and its possibly distributed and reconfigurable hardware/ software sub-function modules.||||||||||||./pdfs/114-RAW-paper-1.pdf",
    "Realization of Virtual Networks in the DECOS Integrated Architecture |Realization of Virtual Networks in the DECOS Integrated Architecture Roman Obermaisser Philipp Peti Due to the better utilization of computational and communication resources and the improved coordination of application subsystems, designers of large distributed embedded systems (e.g., in the automotive domain) are eager to replace existing federated architectures with integrated ones. This paper focuses on the communication infrastructure of the DECOS integrated system architecture, which realizes for each application subsystem a so-called virtual network as an overlay network on top of a time-triggered communication protocol. Since all virtual networks share a single physical network, virtual networks promise massive cost savings through the reduction of physical networks and reliability improvements with respect to wiring and connectors. Furthermore, virtual networks support application subsystems that range from ultra-dependable control applications (e.g., an X-by-wire system) to non safetycritical applications such as comfort systems. For this reason, two classes (event-triggered and time-triggered) of virtual networks are realized. Event-triggered virtual networks provide high flexibility for non safetycritical application subsystems, while the predictability of the time-triggered paradigm is better suited for safety-critical application subsystems. Encapsulation mechanisms ensure that the temporal properties of each virtual network are known a priori and independent from the communication activities in other virtual networks. In order to ensure that the virtual network abstractions hold also in the case of software faults, each application subsystem possesses a dedicated virtual network with statically assigned resources at the underlying time-triggered communication service.||||||||||||./pdfs/114-WPDRTS-paper-1.pdf",
    "SmartNetSolve: High-Level Programming System for High Performance Grid Comp|SmartNetSolve: High-Level Programming System for High Performance Grid Computing Thomas Brady Eugene Konstantinov Alexey Lastovetsky The paper presents SmartNetSolve, an extension of NetSolve, the programming system for high performance Grid computing. The extension is aimed at higher performance of Grid applications by improving the mapping of remote tasks and allowing them to communicate directly. To achieve more optimal mapping SmartNetSolve allows a group of tasks to be scheduled collectively, meanwhile NetSolve only allows for individual and independent mapping of remote tasks. SmartNetSolve also extends the communication model of the application by allowing remote tasks to communicate directly. The paper presents the overall design of the SmartNetSolve programming system with particular focus on its motivation and the underlying execution and communication models.||||||||||||./pdfs/115-HPGC-paper-1.pdf",
    "Efficient Broadcasting of Safety Messages in Multihop Vehicular Networks |Efficient Broadcasting of Safety Messages in Multihop Vehicular Networks Carla-fabiana Chiasserini Rossano Gaeta Michele Garetto Marco Gribaudo Matteo Sereno We focus on a vehicular network supporting safety applications, and we present an application and a channel access mechanism for efficient multihop broadcasting. We study the performance of the proposed solution by developing an analytical framework, which provides several metrics relevant to message dissemination. Analytical results are compared with the performance obtained through \\em ns .||||||||||||./pdfs/115-PMEO-paper-1.pdf",
    "Fault Tolerance with Real-Time Java |Fault Tolerance with Real-Time Java Damien Masson Serge Midonnet After having drawn up a state of the art on the theoretical feasibility of a system of periodic tasks scheduled by a preemptive algorithm at fixed priorities, we show in this article that temporal faults can occur all the same within a theoretically feasible system, that these faults can lead to a failure of the system and that we can use the data calculated during control of admission to install detectors of faults and to define a factor of tolerance. We show then the results obtained on a system of periodic tasks coded with Java Real-Time and carried out with the virtual machine \\em jRate . These results show that the installation of the detectors and the tolerance to the faults makes an improvement of the behavior of the system in the presence of faults.||||||||||||./pdfs/115-WPDRTS-paper-1.pdf",
    "Speeding up NGB with Distributed File Streaming Framework |Speeding up NGB with Distributed File Streaming Framework Bingchen Li Kang Chen Zhiteng Huang Hrabri L. Rajic Robert H. Kuhn Grid computing provides a very rich environment for scientific calculations. In addition to the challenges it provides, it also offers new opportunities for optimization. In this paper we have utilized DFS (Distributed File Streaming) framework to speed up NAS Grid Benchmark workflows. By studying I/O patterns of NGB codes we have identified program locations where it is possible to overlap computation and data workflow phases. By integrating DFS into NGB, we demonstrate a useful method of improving overall workflow efficiency by streaming the output of the current process to make an input of the following stage, reducing a workflow to a series of distributed producer consumer stages. DFS framework eliminates file transfers and in the process makes process scheduling more efficient, leading to overall performance improvements in the turnaround time for HC (Helical Chain) data flow graph under Globus grid environment with the embedded DFS over the original version of the benchmark.||||||||||||./pdfs/116-HPGC-paper-1.pdf",
    "On the Performance Analysis of Recursive Data Replication Scheme for File S|On the Performance Analysis of Recursive Data Replication Scheme for File Sharing in Mobile Peer-to-Peer Devices Using the HyMIS Scheme Constandinos X. Mavromoustakis Helen D. Karatza Advances in wireless networks enable high rates interaction between mobile devices. Short-range wireless communication technologies such as wearable PCs demand low latency and reliability as the first thing for considering QoS. Mobile Peer-to-Peer devices as an autonomous system of mobile routers that are self-organized, self-configured and completely decentralized are characterized by bounded resource sharing reliability. Due to the uncertainty in available resources wireless networks could rarely host file sharing applications in a reliable manner. This paper examines the response of a gossip-based data replication scheme for reliable file sharing under specified patterns and conditions, using the Hybrid Mobile Infostation System (HyMIS). This scheme is based on the advantages of mobile Infostations. Combining the strengths of autonomic gossiping and the hybrid ?entirely mobile- Infostation concept, this scheme enables end to end reliability. Examination is performed for the response, the robustness and the offered reliability while examining the effectiveness of the proposed scheme for facing mobility limitations using the gossip-based ?selection? of users.||||||||||||./pdfs/116-PMEO-paper-1.pdf",
    "Scheduling of Tasks with Precedence Delays and Relative Deadlines - Framewo|Scheduling of Tasks with Precedence Delays and Relative Deadlines - Framework for Time-optimal Dynamic Reconfiguration of FPGAs Premysl Sucha Zdenek Hanzalek This paper is motivated by existing architectures of field programmable gate arrays (FPGAs). To facilitate the design process we present an optimal scheduling algorithm using a very universal framework, where tasks are constrained by precedence delays and relative deadlines. The precedence relations are given by an oriented graph, where tasks are represented by nodes. Edges in the graph are related either to the minimum time or to the maximum time elapsed between the start times of the tasks. This framework is used to model the runtime dynamic reconfiguration, synchronization with an on-chip processor and simultaneous availability of arithmetic units and SRAM memory. The NP-hard problem of finding an optimal schedule satisfying the timing and resource constraints while minimizing the makespan $C_ max $, is solved using two approaches. The first one is based on Integer Linear Programming and the second one is implemented as a Branch and Bound algorithm. Experimental results show the efficiency comparison of the ILP and Branch and Bound solutions.||||||||||||./pdfs/116-WPDRTS-paper-1.pdf",
    "HPGC Keynote: Major Grid Projects Around the World |HPGC Keynote: Major Grid Projects Around the World Wolfgang Gentzsch This talk will present and compare several major grid projects, with a focus on measuring and achieving performance of applications on grids. For these purposes we will investigate and compare the areas of applications, infrastructure, management, scheduling, load balancing, and benchmarking, for the Teragrid project and the North Carolina Statewide Grid effort in the US, Naregi in Japan, and EGEE and the German D-Grid project in Europe.||||||||||||./pdfs/117-HPGC-paper-1.pdf",
    "Comparison of MPI Benchmark Programs on an SGI Altix ccNUMA Shared Memory M|Comparison of MPI Benchmark Programs on an SGI Altix ccNUMA Shared Memory Machine Nor Asilah Wati Abdul Hamid Paul Coddington Francis Vaughan The results produced by five different MPI benchmark programs on an SGI Altix 3700 are analyzed and compared. There are significant differences in the results for some MPI operations. We investigate the reasons for these discrepancies, which are due to differences in the measurement techniques, implementation details and default configurations of the different benchmarks. The variation in results on the Altix are generally much greater than on a distributed memory machine, due primarily to the ccNUMA architecture and the importance of cache effects, as well as some implementation details of the SGI MPI libraries.||||||||||||./pdfs/117-PMEO-paper-1.pdf",
    "A Probabilistic Approach for Fault Tolerant Multiprocessor Real-time Schedu|A Probabilistic Approach for Fault Tolerant Multiprocessor Real-time Scheduling Vandy Berten Jo&euml;l Goossens Emmanuel Jeannot In this paper we tackle the problem of scheduling a periodic real-time system on identical multiprocessor platforms, moreover the tasks considered may fail with a given probability. For each task we compute its duplication rate in order to (1) given a maximum tolerated probability of failure, minimize the size of the platform such at least one replica of each job meets its deadline (and does not fail) using a variant of \\textsf EDF namely \\textsf EDF $^ (k) $ or (2) given the size of the platform, achieve the best possible reliability with the same constraints. Thanks to our probabilistic approach, no assumption is made on the number of failures which can occur. We propose several approaches to duplicate tasks and we show that we are able to find solutions always very close to the optimal one.||||||||||||./pdfs/117-WPDRTS-paper-1.pdf",
    "A Calculus of Functional BSP Programs with Projection |A Calculus of Functional BSP Programs with Projection Fr&eacute;d&eacute;ric Loulergue Bulk Synchronous Parallel ML (BSML) is an extension of the functional language Objective Caml to program Bulk Synchronous Parallel (BSP) algorithms. It is deterministic, deadlock free and performances are good and predictable. Parallelism is expressed with a set of 4 primitives on a parallel data structure called parallel vector. These primitives are pure functional ones: they have no side-effect. It is thus possible to prove the correctness of BSML programs using a proof assistant like Coq. The BS$\\lambda$-calculus is an extension of the $\\lambda$-calculus which models the core semantics of BSML. Nevertheless some principles of BSML are not well captured by this calculus. This paper presents a new calculus, with a projection primitive, which provides a better model of the core semantics of BSML.||||||||||||./pdfs/118-APDCM-paper-1.pdf",
    "Cost Evaluation from Specifications for BSP Programs |Cost Evaluation from Specifications for BSP Programs Virginia Niculescu BSP has shown that structured parallel programming is not only a performance win, but it is also a program construction win, especially if we add a formal method for designing. Maybe the most important advantage that BSP brings is the effective cost model that allows a good evaluation of the performance. The paper presents a technique for cost evaluation from specifications for BSP programs. We consider parameterized specifications and processes for BSP programs, and the parameters are the number of processes, the index of the local process, and the data distribution. The possibility of counting the number of communications from postconditions, allows us to make a cost evaluation even at the early stages of the design, and so it leads us to the right decisions.||||||||||||./pdfs/118-PMEO-paper-1.pdf",
    "A Hierarchical Scheduling Model for Component-Based Real-Time Systems |A Hierarchical Scheduling Model for Component-Based Real-Time Systems Jos&eacute; L. Lorente Giuseppe Lipari Enrico Bini In this paper, we propose a methodology for developing component-based real-time systems based on the concept of hierarchical scheduling. Recently, much work has been devoted to the schedulability analysis of hierarchical scheduling systems, in which real-time tasks are grouped into components, and it is possible to specify a different scheduling policy for each component. Until now, only independent components have been considered. In this paper, we extend this model to tasks that interact through remote procedure calls. We introduce the concept of abstract computing platform on which each component is executed. Then, we transform the system specification into a set of real-time transactions and present a schedulability analysis algorithm. Our analysis is a generalization of the holistic analysis to the case of abstract computing platforms. We demonstrate the use of our methodology on a simple example.||||||||||||./pdfs/118-WPDRTS-paper-1.pdf",
    "Ant-inspired Query Routing Performance in Dynamic Peer-to-Peer Networks |Ant-inspired Query Routing Performance in Dynamic Peer-to-Peer Networks Mojca Ciglaric Tone Vidmar P2P Networks are highly dynamic structures since their nodes ? peer users keep joining and leaving continuously. In the paper, we study the effects of network change rates on query routing efficiency. First, the problem background is described and abstract system model is defined. The system characteristics and behavior are analyzed and abstracted with a set of measurable metrics. The paper studies ant-inspired Mute query routing protocol and compares its behavior to previously suggested routing protocols. The chosen routing technique makes use of cached metadata from previous answer messages (analogy to ants laying feromone on their trail when searching for food). The paper also discusses mechanisms for broken path detection and metadata maintenance. Further, simulations in various dynamic network environments are presented and discussed: the degree of network dynamics varies from one node departure and node join per ten queries generated to five node departures and joins per one generated query. Several metrics are used to clarify the protocol behavior even with high rate of node departures, but it is shown that above a certain threshold it literally breaks down and exhibits considerable efficiency degradation.||||||||||||./pdfs/119-APDCM-paper-1.pdf",
    "Modelling Job Allocation where Service Duration is Unknown |Modelling Job Allocation where Service Duration is Unknown Nigel Thomas In this paper a novel job allocation scheme in distributed systems (TAG) is modelled using the Markovian process algebra PEPA. This scheme requires no prior knowledge of job size and has been shown to be more efficient than round robin and random allocation when the job size distribution is heavy tailed and the load is not high. In this paper the job size distribution is assumed to be of a phase-type and the queues are bounded. Numerical results are derived and compared with those derived from models employing random allocation and the shortest queue strategy. It is shown that TAG can perform well for a range of performance metrics.||||||||||||./pdfs/119-PMEO-paper-1.pdf",
    "Schedulability Analysis of Non-Preemptive Recurring Real-Time Tasks |Schedulability Analysis of Non-Preemptive Recurring Real-Time Tasks Sanjoy K. Baruah Samarjit Chakraborty The recurring real-time task model was recently proposed as a model for real-time processes that contain code with conditional branches. In this paper, we present a necessary and sufficient condition for uniprocessor non-preemptive schedulability analysis for this task model. We also derive a polynomial-time approximation algorithm for testing this condition. Preemptive schedulers usually have a larger schedulability region compared to their non-preemptive counterparts. Further, for most realistic task models, schedulability analysis for the non-preemptive version is computationally more complex compared to the corresponding preemptive version. Our results in this paper show that (surprisingly) the recurring real-time task model does not fall in line with these intuitive expectations, i.e. there exists polynomial-time approximation algorithms for both preemptive and non-preemptive versions of schedulability analysis. This has important implications on the applicability of this model, since fully preemptive scheduling algorithms often have significantly larger runtime overheads.||||||||||||./pdfs/119-WPDRTS-paper-1.pdf",
    "Compiler-Assisted Software Verification Using Plug-Ins |Compiler-Assisted Software Verification Using Plug-Ins Sean Callanan Radu Grosu Xiaowan Huang Scott A. Smolka Erez Zadok We present Protagoras, a new plug-in architecture for the GNU compiler collection that allows one to modify GCC's internal representation of the program under compilation. We illustrate the utility of Protagoras by presenting plug-ins for both compile-time and runtime software verification and monitoring. In the compile-time case, we have developed plug-ins that interpret the GIMPLE intermediate representation to verify properties statically. In the runtime case, we have developed plug-ins for GCC to perform memory leak detection, array bounds checking, and reference-count access monitoring.||||||||||||./pdfs/12-NSFNGS-paper-1.pdf",
    "A Framework for Developing Distributed Location Based Applications |A Framework for Developing Distributed Location Based Applications Andrej Krevl Mojca Ciglaric Location based services and applications are buzzwords nowadays, yet they have been around for quite some time in a variety of applications. However these applications are scarce because of the high costs associated with the positioning equipment. This paper presents different options for determining location of mobile devices such as mobile phones and Pocket PCs. It describes positioning possibilities using WiFi networks, GSM networks, Bluetooth beacons and the GPS system. Furthermore, it proposes a framework for developing distributed location based applications. The paper specifies which components comprise the framework, data structures that are used for spatial data interchange and Web Services that are used for communication between components. It also describes a location aware application prototype built on top of the proposed framework. It concludes that building applications on top of the proposed framework is feasible and discusses benefits and drawbacks of this approach.||||||||||||./pdfs/120-APDCM-paper-1.pdf",
    "Analytical Performance Modelling of Partially Adaptive Routing in Hypercube|Analytical Performance Modelling of Partially Adaptive Routing in Hypercubes Ahmad Patooghy Hamid Sarbazi-azad Although several analytical models have been proposed in the literature for different interconnection networks with different routing algorithms, there is only one work dealing with partially adaptive routing algorithms. This paper proposes an accurate analytical model to predict message latency in wormhole-routed hypercube based networks using the partially adaptive routing algorithm. The results obtained from simulation experiments confirm that the proposed model exhibits a good accuracy for various network sizes and under different operating conditions.||||||||||||./pdfs/120-PMEO-paper-1.pdf",
    "Towards an Analysis of Race Carrier Conditions in Real-time Java |Towards an Analysis of Race Carrier Conditions in Real-time Java M. T. Higuera-toledano The RTSJ memory model propose a mechanism based on a scope three containing all region-stacks in the system and a reference-counter collector. In order to avoid reference cycles among regions on the region-stack, RTSJ defines the single parent rule. The given algorithms to maintain the region-stack tructure are not compliant with the defined parentage relation. More over, the suggested algorithms to maintain the single parent rule introduces race carrier conditions on the application behaviour. This paper proposes alternative approaches in order to avoid this problem.||||||||||||./pdfs/120-WPDRTS-paper-1.pdf",
    "Efficient Hardware Algorithms for N Choose K Counters |Efficient Hardware Algorithms for N Choose K Counters Yasuaki Ito Koji Nakano Youhei Yamagishi An ``n choose k'' counter (C(n,k) counter for short) is a counter which lists all n-bit numbers with (n-k) 0's and $k$ 1's. The ``n choose k'' counter has applications to solving combinatorial optimization problems and image processing. The main contribution of this work is to present an efficient hardware implementation of the C(n,k) counter. In some applications, C(n,k) counters are used only for small k. The second contribution is to show more efficient implementations that support C(n,k) counters only for small k. We evaluate the performance of our new implementation and known implementations in terms of the number of used slices and the clock frequency for the Xilinx VirtexII family FPGA XC2V3000-4. Although the theoretical analysis shows that our implementation is not the best, it runs in higher clock frequency using fewer number of slices than the other implementations.||||||||||||./pdfs/121-APDCM-paper-1.pdf",
    "Performance Analysis of Stochastic Process Algebra Models using Stochastic |Performance Analysis of Stochastic Process Algebra Models using Stochastic Simulation Jeremy T. Bradley Stephen T. Gilmore Nigel Thomas We present a translation of a generic stochastic process algebra model into a form suitable for stochastic simulation. By systematically generating rate equations from a process description, we can use tools developed for chemical and biochemical reaction analysis to provide time-series output for models with state spaces of $O(10^ 10000 )$ and beyond. We apply these techniques to a significant case study: that of a secure electronic voting protocol.||||||||||||./pdfs/121-PMEO-paper-1.pdf",
    "Lossless Compression for Large Scale Cluster Logs |Lossless Compression for Large Scale Cluster Logs Raju Balakrishnan Ramendra K. Sahoo The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM?s Blue Gene/L which can accommodate as many as 128K processors. One of the biggest challenges these systems face, is to manage generated system logs while deploying in production environments. Large amount of log data is created over extended period of time, across thousands of processors. These logs generated can be voluminous because of the large temporal and spatial dimensions, and containing records which are repeatedly entered to the log archive. Storing and transferring such large amount of log data is a challenging problem. Commonly used generic compression utilities are not optimal for such large amount of data considering a number of performance requirements. In this paper we propose a compression algorithm which preprocesses these logs before trying out any standard compression utilities. The compression ratios and times for the combination shows 28.3\\% improvement in compression ratio and 43.4\\% improvement in compression time on average over different generic compression utilities. The test data used is log data produced by 64 racks, 65536 processor Blue Gene/L installation at Lawrence Livermore National Laboratory.||||||||||||./pdfs/121-SMTPS-paper-1.pdf",
    "Schedulability Analysis of AR-TP, a Ravenscar Compliant Communication Proto|Schedulability Analysis of AR-TP, a Ravenscar Compliant Communication Protocol for High-Integrity Distributed Systems Santiago Urue&ntilde;a Juan Zamorano Daniel Berj&oacute;n Jos&eacute; A. Pulido Juan A. De La Puente A new token-passing algorithm called AR-TP for avoiding the non-determinism of some networking technologies is presented. This protocol allows the schedulability analysis of the network, enabling the use of standard Ethernet hardware for Hard Real-Time behavior while adding congestion management. It is specially designed for High-Integrity Distributed Hard Real-Time Systems, being fully compliant with the Ravenscar Profile.||||||||||||./pdfs/121-WPDRTS-paper-1.pdf",
    "Scheduling Heuristics for Efficient Broadcast Operations on Grid Environmen|Scheduling Heuristics for Efficient Broadcast Operations on Grid Environments Luiz Angelo Barchet- Steffenel Gr&eacute;gory Mounie The popularity of large-scale parallel environments like computational grids has emphasised the influence of network heterogeneity on the performance of parallel applications. Collective communication operations are especially concerned by this problem, as heterogeneity interferes directly on the performance of the communication strategies. In this paper we focus on the development of scheduling techniques to minimise the total communication time (makespan) of a broadcast operation on a grid environment. We observed that most optimisation techniques present in the literature are unable to deal with the complexity of a large network environment. In our work we propose the use of hierarchical communication levels to reduce the optimisation complexity, while keeping high performance levels. Indeed, we propose three heuristics designed to meet the requirements of a hierarchically structured grid composed of tenths of clusters, a tendency for the next years.||||||||||||./pdfs/122-PMEO-paper-1.pdf",
    "On-the-Fly Kernel Updates for High-Performance Computing Clusters |On-the-Fly Kernel Updates for High-Performance Computing Clusters Kristis Makris Kyung Dong Ryu High-performance computing clusters running long-lived tasks currently cannot have kernel software updates applied to them without causing system downtime. These clusters miss opportunities for increased performance via specialized kernel support, cannot benefit from new kernel features, and continue to operate with kernel security holes unpatched, at least until the next scheduled maintenance date. We developed a system enabling dynamic kernel updates in parallel computing clusters to address these problems. Our system, DynAMOS, is founded on execution flow high-jacking through \\emph function cloning . It enables commodity operating systems popularly used in clusters gain adaptive and mutative capabilities. To demonstrate the efficacy of our system, we illustrate our experience in dynamically updating and extending a Linux cluster. We introduce adaptive memory paging for efficient gang-scheduling, extend the kernel's process scheduler to support unobtrusive fine-grain cycle stealing, apply public security fixes, and inject performance monitoring functionality to a selection of kernel functions. Our benchmarks show that the overhead imposed by DynAMOS is mostly in the range of 1-8\\% for common Linux kernel functions.||||||||||||./pdfs/122-SMTPS-paper-1.pdf",
    "An Optimal Approach to the Task Allocation Problem on Hierarchical Architec|An Optimal Approach to the Task Allocation Problem on Hierarchical Architectures Alexander Metzner Martin Fraenzle Christian Herde Ingo Stierand We present a SAT-based approach to the task and message allocation problem of distributed real-time systems with hierarchical architectures. In contrast to the heuristic approaches usually applied to this problem, our approach is guaranteed to find an optimal allocation for realistic task systems running on complex target architectures. Our method is based on the transformation of such scheduling problems into nonlinear integer optimization problems. The core of the numerical optimization procedure we use to discharge those problems is a solver for arbitrary Boolean combinations of integer constraints. Optimal solutions are obtained by imposing a binary search scheme on top of that solver. Experiments show the applicability of our approach to industrial-size task systems, which are mapped to heterogeneous hierarchical hardware architectures.||||||||||||./pdfs/122-WPDRTS-paper-1.pdf",
    "LogfP - A Model for small Messages in InfiniBand |LogfP - A Model for small Messages in InfiniBand Torsten Hoefler Torsten Mehlan Frank Mietke Wolfgang Rehm Accurate models of parallel computation are often crucial to optimize parallel algorithms for their running time. In general the easier the model's use and the smaller the number of parameters and interdependencies among them, the more inaccuarcies are introduced by simplification. On the other hand a too complex model is unusable. We show that it is possible to derive a relatively accurate and easy model for small message performance over the InfiniBand network. This model allows the developer to gain knowledge about the inherent parallelism of a specific InfiniBand hardware and encourages him to use this parallelism efficiently. Several well known models hide this feature and some of them even penalize the use of parallelism because the model designers were not aware of new emerging architectures like InfiniBand.||||||||||||./pdfs/123-PMEO-paper-1.pdf",
    "A Tool for Environment Deployment in Clusters and light Grids |A Tool for Environment Deployment in Clusters and light Grids Yiannis Georgiou Julien Leduc Brice Videau Johann Peyrard Olivier Richard Focused around the field of the exploitation and the administration of high performance large-scale parallel systems , this article describes the work carried out on the deployment of environment on high computing clusters and grids. We initally present the problems involved in the installation of an environment (OS, middleware, libraries, applications...) on a cluster or grid and how an effective deployment tool, \\emph Kadeploy2 , can become a new form of exploitation of this type of infrastructures. We present the tool's design choices, its architecture and we describe the various stages of the deployment method, introduced by \\emph Kadeploy2 . Moreover, we propose methods on the one hand, for the improvement of the deployment time of a new environment; and in addition, for the support of various operating systems. Finally, to validate our approach we present tests and evaluations realized on various clusters of the experimental grid \\emph Grid5000 .||||||||||||./pdfs/123-SMTPS-paper-1.pdf",
    "Timed Automata Based Analysis of Embedded System Architectures |Timed Automata Based Analysis of Embedded System Architectures Martijn Hendriks Marcel Verhoef We show that timed automata can be used to model and to analyze timeliness properties of embedded system architectures. Using a case study inspired by industrial practice, we present in detail how a suitable timed automata model is composed. Exact upper bounds on the timeliness properties can be found with the Uppaal model checker for a number of usage scenarios. We compare our results with a few other performance modeling techniques. This comparison shows that, if the state space of the model is tractable, Uppaal gives the most accurate results at similar cost. The proposed modeling strategy can be automated, which alleviates the difficulty and error-proneness of manually constructing timed automata models.||||||||||||./pdfs/123-WPDRTS-paper-1.pdf",
    "Performance Analysis of Java Concurrent Programming: A Case Study of Video |Performance Analysis of Java Concurrent Programming: A Case Study of Video Mining System Wenlong Li Eric Li Ran Meng Tao Wang Carole Dulong As multi/many core processors become prevalent, programming language is important in constructing efficient parallel applications. In this work, we build a multithreaded video mining application with Java, examine the thread profiling information and micro-architecture metrics to identify the factors limiting the scalability, and employ a number of ways to improve performance. Besides, we conduct some thread scheduling experiments. According to the experiments and detailed analysis, we conclude that for this video mining application: (1) Java is a good parallel language candidate for many core processors in terms of performance, scalability, and ease of programming; (2) Thread affinity mechanism is effective in improving data locality, but brings little benefit to multithreaded Java application due to its conservative memory model in JVM.||||||||||||./pdfs/124-JAVAPDC-paper-1.pdf",
    "Multiprocessor on Chip : Beating the Simulation Wall Through Multiobjective|Multiprocessor on Chip : Beating the Simulation Wall Through Multiobjective Design Space Exploration with Direct Execution Riad Ben Mouhoub Omar Hamami Design space exploration of multiprocessors on chip requires both automatic performance analysis techniques and efficient multiprocessors configuration performance evaluation. Prohibitive simulation time of single multiprocessor configuration makes large design space exploration impossible without massive use of computing resources and still implementation issues are not tackled. This paper proposes a new performance evaluation methodology for multiprocessors on chip which conduct a multiobjective design space exploration through emulation. The proposed approach is validated on a 4 way multiprocessor on chip design space exploration where a 6 order of magnitude improvement have been achieved over cycle accurate simulation.||||||||||||./pdfs/124-PMEO-paper-1.pdf",
    "Evaluating Cooperative Checkpointing for Supercomputer Systems |Evaluating Cooperative Checkpointing for Supercomputer Systems Adam J. Oliner Ramendra K. Sahoo Cooperative checkpointing, in which the system dynamically skips checkpoints requested by applications at runtime, can exploit system-level information to improve performance and reliability in the face of failures. We evaluate the applicability of cooperative checkpointing to large-scale systems through simulation studies considering real workloads, failure logs, and different network topologies. We consider two cooperative checkpointing algorithms: \\emph work-based cooperative checkpointing uses a heuristic based on the amount of unsaved work and \\emph risk-based cooperative checkpointing leverages failure event prediction. Our results demonstrate that, compared to periodic checkpointing, risk-based checkpointing with event prediction accuracy as low as 10\\% is able to significantly improve system utilization and reduce average bounded slowdown by a factor of $9$, without losing any additional work to failures. Similarly, work-based checkpointing conferred tremendous performance benefits in the face of large checkpoint overheads.||||||||||||./pdfs/124-SMTPS-paper-1.pdf",
    "Schedulability Analysis of AADL Models |Schedulability Analysis of AADL Models Oleg Sokolsky Insup Lee Duncan Clarke The paper discusses the use of formal methods for the analysis of architectural models expressed in the modeling language AADL. AADL describes the system as a collection of interacting components. The AADL standard prescribes semantics for the thread components and rules of interaction between threads and other components in the system. We present a semantics-preserving translation of AADL models into the real-time process algebra ACSR, allowing us to perform schedulability analysis of AADL models.||||||||||||./pdfs/124-WPDRTS-paper-1.pdf",
    "High-Level Execution and Communication Support for Parallel Grid Applicatio|High-Level Execution and Communication Support for Parallel Grid Applications in JGrid Szabolcs Pota Zoltan Juhasz This paper describes the high-level execution and communication support provided in JGrid, a service-oriented dynamic grid framework. One of its core services, the Compute Service, is the key component in creating dynamic computational grid systems that enable the execution of sequential and parallel interactive grid applications. A fundamental set of program execution modes supported by the service is described, then a programming model and its corresponding application programming interface is presented. The execution support of the service architecture is described in detail illustrating how remote evaluation and run-time task spawning are provided. The paper also shows in detail how task spawning and dynamic proxies can be used for a service-oriented communication mechanism for coarse-grain parallel grid applications.||||||||||||./pdfs/125-JAVAPDC-paper-1.pdf",
    "Interconnect Performance Evaluation of SGI Altix 3700 BX2, Cray X1, Cray Op|Interconnect Performance Evaluation of SGI Altix 3700 BX2, Cray X1, Cray Opteron Cluster, and Dell PowerEdge Rod Fatoohi Subhash Saini Robert Ciotti We study the performance of inter-process communication on four high-speed multiprocessor systems using a set of communication benchmarks. The goal is to identify certain limiting factors and bottlenecks with the interconnect of these systems as well as to compare these interconnects. We measured network bandwidth using different numbers of communicating processors and communication patterns - such as point-to-point communication, collective communication, and dense communication patterns. The four platforms are: a 512-processor SGI Altix 3700 shared-memory machine using Itanium-2 1.6 GHz processors and interconnected by SGI NUMAlink-4 switch with 3.2 GB/s bandwidth per node; a 64-processor (single-streaming) Cray X1 shared-memory machine using 800 MHz processor with 16 processors per node and 32 1.6 GB/s full duplex links; a 128-processor Cray Opteron cluster using 2 GHz AMD Opteron processors and interconnected by a Myrinet network; and a 1280-node Dell PowerEdge cluster with Intel Xeon 3.6 GHz processors interconnected by an InfiniBand network. Our results show the impact of the network bandwidth and topology on the overall performance of each interconnect.||||||||||||./pdfs/125-PMEO-paper-1.pdf",
    "Time Abstraction in Timed &#181; CRL &agrave; la Regions  |Time Abstraction in Timed &#181; CRL &agrave; la Regions Jan Friso Groote Michel A. Reniers Yaroslav S. Usenko In this paper we present an idea to combine the best parts of the real-time verification methods based on timed automata (the use of \\emph regions and \\emph zones ), and of the process-algebraic approach of the languages like LOTOS and $\\mu$CRL. $\\mu$CRL targets the specification of system behavior in a process-algebraic (ACP) style and deals with data elements in the form of abstract data types. In order to combine the two approaches we propose the following scheme. Both zones and regions, as well as the operations on them could be specified as the abstract data types in $\\mu$CRL, either as \\emph clock constraints or as \\emph difference-bound matrices . A timed automata specification is a parallel composition of timed automata. We use the existing results to translate it to a parallel composition of timed $\\mu$CRL processes. This translation uses a very simple sort \\emph Time to represent the real-time clock values. As the result we obtain a semantically equivalent specification in timed $\\mu$CRL. As the next step in our scheme, we aim at replacing all parameters of sort Time occurring in the resulting process equation by the parameters of sort Region or Zone. This can be done in a similar way as for timed automata. These data types are countable and because of the decidability results for timed automata only finitely many different values of these parameters will be reached. We could even go further, i.e. due to the fact that infinite state spaces can also be analyzed in $\\mu$CRL, we could go beyond timed automata verification. In the final step we transform the resulting timed process equation with regions to an untimed process equation with a finite underlying state space. This is achieved by applying \\emph time-free abstraction and \\emph relativization techniques. As a result, the existing untimed analysis tools in the $\\mu$CRL Toolset could become applicable to the analysis of real-time systems.||||||||||||./pdfs/125-WPDRTS-paper-1.pdf",
    "Fault Injection in Distributed Java Applications |Fault Injection in Distributed Java Applications William Hoarau S&eacute;bastien Tixeuil Fabien Vauchelles In a network consisting of several thousands computers, the occurrence of faults is unavoidable. Being able to test the behaviour of a distributed program in an environment where we can control the faults (such as the crash of a process) is an important feature that matters in the deployment of reliable programs. In this paper, we investigate the possibility of injecting software faults in distributed java applications. Our scheme is by extending the FAIL-FCI software. It does not require any modification of the source code of the application under test, while retaining the possibility to write high level fault scenarios. As a proof of concept, we use our tool to test FreePastry, an existing java implementation of a Distributed Hash Table (DHT), against node failures.||||||||||||./pdfs/126-JAVAPDC-paper-1.pdf",
    "Resource Management with Stateful Support for Analytic Applications |Resource Management with Stateful Support for Analytic Applications Liana L. Fong Catherine H. Crawford Hidayatullah Shaikh Analytic applications from various industrial sectors have specific attributes and requirements including relatively long processing time, parallelization, multiple interactive invocations, web services, and expected quality of service objectives. Current parallel resource management systems for batch-oriented jobs lack the effective support for multiple interactive invocations with consideration in quality of service objectives, while transaction processing systems do not support dynamic creation of parallel application instances. To better serve the analytic applications, a set of additional resource management services, defined as stateful support, introduces the concept of Service Instance and Service Instance Management. This set of stateful support services can be implemented as extension to existing parallel resource management to serve these analytic applications that rapidly increase in the demand of computing power.||||||||||||./pdfs/126-SMTPS-paper-1.pdf",
    "Performance Analysis of the Reactor Pattern in Network Services |Performance Analysis of the Reactor Pattern in Network Services Swapna Gokhale Aniruddha Gokhale Jeff Gray Paul Vandal Upsorn Praphamontripong The growing reliance on services provided by software applications places a high premium on the reliable and efficient operation of these applications. A number of these applications follow the event-driven software architecture style since this style fosters evolvability by separating event handling from event demultiplexing and dispatching functionality. The event demultiplexing capability, which appears repeatedly across a class of event-driven applications, can be codified into a reusable pattern, such as the Reactor pattern. In order to enable performance analysis of event-driven applications at design time, a model is needed that represents the event demultiplexing and handling functionality that lies at the heart of these applications. In this paper, we present a model of the Reactor pattern based on the well-established Stochastic Reward Net (SRN) modeling paradigm. We discuss how the model can be used to obtain several performance measures such as the throughput, loss probability and upper and lower bounds on the response time. We illustrate how the model can be used to obtain the performance metrics of a Virtual Private Network (VPN) service provided by a Virtual Router (VR). We validate the estimates of the performance measures obtained from the SRN model using simulation.||||||||||||./pdfs/127-PMEO-paper-1.pdf",
    "Improving Cluster Utilization through Intelligent Processor Sharing |Improving Cluster Utilization through Intelligent Processor Sharing Gary Stiehr Roger D. Chamberlain A dedicated cluster is often not fully utilized even when all of its processors are allocated to jobs. This occurs any time that a running job does not use 100\\% of each of the processors allocated to it. We increase the throughput and efficiency of the cluster by scheduling background jobs to run concurrently with the ?primary? jobs originally scheduled on the cluster. We do this while maintaining the quality of service provided to the primary jobs. Our results come from empirical measurements using production applications.||||||||||||./pdfs/127-SMTPS-paper-1.pdf",
    "Decentralized and Dynamic Bandwidth Allocation in Networked Control Systems|Decentralized and Dynamic Bandwidth Allocation in Networked Control Systems Ahmad T. Al-hammouri Michael S. Branicky Vincenzo Liberatore Stephen M. Phillips In this paper, we propose a bandwidth allocation scheme for networked control systems that have their control loops closed over a geographically distributed network. We first formulate the bandwidth allocation as a convex optimization problem. We then present an allocation scheme that solves this optimization problem in a fully distributed manner. In addition to being fully distributed, the proposed scheme is asynchronous, scalable, dynamic and flexible. We further discuss mechanisms to enhance the performance of the allocation scheme. We present analytical and simulation results.||||||||||||./pdfs/127-WPDRTS-paper-1.pdf",
    "Using Stochastic Petri Nets for Performance Modelling of Application Server|Using Stochastic Petri Nets for Performance Modelling of Application Servers F&aacute;bio N. Souza Roberto D. Arteiro Nelson S. Rosa Paulo R. M. Maciel Application servers have been widely adopted as distributed infrastructure (or middleware) for developing distributed systems. Current approaches for performance evaluation of application servers have mainly concentrated on the adoption of measurement techniques. This paper, however, focuses on the use of simulation techniques and presents an approach for performance modelling and evaluation of application servers using Petri nets. In order to illustrate how the proposed approach may be applied, Petri net models of JBoss application server are presented and their performance results are compared with ones that have been measured.||||||||||||./pdfs/128-PMEO-paper-1.pdf",
    "Easy and Reliable Cluster Management: The Self-management Experience of Fir|Easy and Reliable Cluster Management: The Self-management Experience of Fire Phoenix Zhang Zhi-hong Meng Dan Zhan Jian-feng Wang Lei Wu Lin-ping Huang Wei High-Performance clusters are rapidly becoming an important computing platform for both scientific and business applications. To fulfill the new demands and challenges, cluster system software is inevitably complex. Even for experienced administrators, the management of a cluster system is an exhausting job. This paper introduces Fire Phoenix, a scalable and self-managing cluster system software that supports both scientific and commercial applications. With the self-configuring and self-healing features, much of the machine configuration and error recovery can be done automatically. Our design has been proven effective in the operations of the Dawning 4000A supercomputer, which is the biggest cluster system in China.||||||||||||./pdfs/128-SMTPS-paper-1.pdf",
    "Honeybees: Combining Replication and Evasion for Mitigating Base-station Ja|Honeybees: Combining Replication and Evasion for Mitigating Base-station Jamming in Sensor Network Sherif Khattab Daniel Moss&eacute; Rami Melhem By violating MAC-layer protocols, the jamming attack aims at blocking successful communication among wireless nodes. Wireless sensor networks (WSNs) are highly vulnerable to jamming because of reliance on shared wireless medium, constrained per-sensor resources, and high risk of sensor compromise. Moreover, base stations of WSNs are single points of failure and, thus, attractive jamming targets. To tackle base-station jamming, replication of base stations as well as jamming evasion, by relocation to unjammed locations, have been proposed. In this paper, we propose Honeybees, an energy-aware defense framework against base-station jamming attack in WSNs. Honeybees efficiently combines replication and evasion to allow WSNs to continue delivering data for a long time during a jamming attack. We present three defense strategies: reactive, proactive, and hybrid, in the context of multi-hop WSN deployment. Through simulation, we show the interaction of these strategies with different attack tactics as well as the effect of system and attack parameters. We found that our honeybees framework struck an energy-efficient balance between replication and evasion that outperformed both separate mechanisms. Specifically, hybrid honeybees outperformed replication and evasion at low and intermediate number of attackers and gracefully degraded to high attack intensity.||||||||||||./pdfs/128-WPDRTS-paper-1.pdf",
    "A Framework to Develop Symbolic Performance Models of Parallel Applications|A Framework to Develop Symbolic Performance Models of Parallel Applications Sadaf R Alam Jeffrey S Vetter Performance and workload modeling has numerous uses at every stage of the high-end computing lifecycle: design, integration, procurement, installation and tuning. Despite the tremendous usefulness of performance models, their construction remains largely a manual, complex, and time-consuming exercise. We propose a new approach to the model construction, called modeling assertions (MA), which borrows advantages from both the empirical and analytical modeling techniques. This strategy has many advantages over traditional methods: incremental construction of realistic performance models, straightforward model validation against empirical data, and intuitive error bounding on individual model terms. We demonstrate this new technique on the NAS parallel CG and SP benchmarks by constructing high fidelity models for the floating-point operation cost, memory requirements, and MPI message volume. These models are driven by a small number of key input parameters thereby allowing efficient design space exploration of future problem sizes and architectures.||||||||||||./pdfs/129-PMEO-paper-1.pdf",
    "OVIS: A Tool for Intelligent, Real-time Monitoring of Computational Cluster|OVIS: A Tool for Intelligent, Real-time Monitoring of Computational Clusters J. M. Brandt A. C. Gentile D. J. Hale P. P. Pebay Traditional cluster monitoring approaches consider nodes in singleton, using manufacturer-specified extreme limits as thresholds for failure ``prediction''. We have developed a tool, OVIS, for monitoring and analysis of large computational platforms which, instead, uses a statistical approach to characterize single device behaviors from those of a large number of statistically similar devices. Baseline capabilities of OVIS include the visual display of deterministic information about state variables (\\emph e.g. , temperature, CPU utilization, fan speed) and their aggregate statistics. Visual consideration of the cluster as a comparative ensemble, rather than as singleton nodes, is an easy and useful method for tuning cluster configuration and determining effects of real-time changes. Additionally, OVIS incorporates a novel Bayesian inference scheme to dynamically infer models for the normal behavior of a system and to determine bounds on the probability of values evinced in the system. Individual node values that are unlikely given the current applicable model are flagged as aberrant. This can be a much earlier indicator of problems than waiting for the crossing of some threshold that is necessarily set high to preclude too many false alarms. We present OVIS and discuss its applications in cluster configuration and environmental tuning and to abnormality and problem discovery in our production clusters.||||||||||||./pdfs/129-SMTPS-paper-1.pdf",
    "Murphy Loves Potatoes: Experiences from a Pilot Sensor Network Deployment i|Murphy Loves Potatoes: Experiences from a Pilot Sensor Network Deployment in Precision Agriculture Koen Langendoen Aline Baggio Otto Visser We report on preliminary experiences with deploying a large-scale sensor network (about 100 nodes) for a pilot in precision agriculture. The pilot did not answer the initial research questions, but instead revealed many engineering problems typically overlooked by (computer) scientists evaluating their work by means of simulation. The deployment prompted us to rethink our development process and includes important lessons for the WSN research community as a whole.||||||||||||./pdfs/129-WPDRTS-paper-1.pdf",
    "Analysis of Checksum-Based Execution Schemes for Pipelined Processors |Analysis of Checksum-Based Execution Schemes for Pipelined Processors Bernhard Fechner The performance requirements for contemporary microprocessors are increasing as rapidly as their number of applications grows. By accelerating the clock, performance can be gained easily but only with high additional power consumption. The electrical potential between logic ?0? and ?1? is decreased as integration and clock rates grow, leading to a higher susceptibility for transient faults, caused e.g. by power fluctuations or Single Event Upsets (SEUs). We introduce a technique which is based on the well-known cyclic redundancy check codes (CRCs) to secure the pipelined execution of common microprocessors against transient faults. This is done by computing signatures over the control signals of each pipeline stage including dynamic out-of-order scheduling. To correctly compute the checksums, we resolve the timedependency of instructions in the pipeline. We will first discuss important physical properties of Single Event Upsets (SEUs). Then we present a model of a simple processor with the applied scheme as an example. The scheme is extended to support n-way simultaneous multithreaded systems, resulting in two basic schemes. A cost analysis of the proposed SEU-detection schemes leads to the conclusion that both schemes are applicable at reasonable costs for pipelines with 5 to 10 stages and maximal 4 hardware threads. A worst-case simulation using software fault-injection of transient faults in the processor model showed that errors can be detected with an average of 83\\% even at a fault rate of $10^ -2 $. Furthermore, the scheme is able to detect an error within an average of only 5.05 cycles.||||||||||||./pdfs/13-DPDNS-paper-1.pdf",
    "An Overview of the Jahob Analysis System Project Goals and Current Status |An Overview of the Jahob Analysis System Project Goals and Current Status Viktor Kuncak Martin Rinard We present an overview of the Jahob system for modular analysis of data structure properties. Jahob uses a subset of Java as the implementation language and annotations with formulas in a subset of Isabelle as the specification language. It uses monadic second-order logic over trees to reason about reachability in linked data structures, the Isabelle theorem prover and Nelson-Oppen style theorem provers to reason about high-level properties and arrays, and a new technique to combine reasoning about constraints on uninterpreted function symbols with other decision procedures. It also incorporates new decision procedures for reasoning about sets with cardinality constraints. The system can infer loop invariants using new symbolic shape analysis. Initial results in the use of our system are promising; we are continuing to develop and evaluate it.||||||||||||./pdfs/13-NSFNGS-paper-1.pdf",
    "Maximum Edge Matching for Reconfigurable Computing |Maximum Edge Matching for Reconfigurable Computing Markus Rullmann Renate Merker Reconfiguration of tasks implies considerable overhead on the amount of configuration data and time. Much overhead is caused by redundant configuration generated by the design tools which implement similar structures in the designs on different resources. In this paper we propose a new method to identify structural similarities in tasks. Based on this information, we are able to generate automatically constraints to ensure that the place and route tools use identical resources. Thus we ensure that less redundant configuration is produced. In this paper we give a formal description of the underlaying \\emph maximum edge matching problem and show a method to solve it optimally. We derive a truncation criteria to restrict the search space efficiently. We also propose an Ant Colony Optimization based solution with a problem specific local heuristic and show that it performs optimal as well in our examples, but with considerable lower computational effort.||||||||||||./pdfs/13-RAW-paper-1.pdf",
    "APDCM Keynote: Learning Computing Models from Cells and Tissues: P Systems |APDCM Keynote: Learning Computing Models from Cells and Tissues: P Systems Gheorghe Paun This is intended to be a quick introduction to membrane computing, a branch of natural computing inspired in the structure and functioning of living cells and in their organization in tissues. The corresponding models, called P systems, are parallel distributed computing devices, handling multisets of abstract objects in a compartmentalized architecture defined by cell-like or tissue-like membrane arrangements. Most classes of P systems are Turing universal and in certain cases can provide polynomial solutions to computationally hard problems (by means of a time-space trade-off). Several applications in biology/medicine, computer science, linguistics, economics were reported. The talk will present only basic ideas and (types of) results and of applications. Details can be found at the Web site \\tt http:\/\/psystems.disco.unimib.it .||||||||||||./pdfs/130-APDCM-paper-1.pdf",
    "Performance Evaluation of Supercomputers using HPCC and IMB Benchmarks |Performance Evaluation of Supercomputers using HPCC and IMB Benchmarks Subhash Saini Robert Ciotti Brian T. N. Gunney Thomas E. Spelce Alice Koniges Don Dossa Panagiotis Adamidis Rolf Rabenseifner Sunil R. Tiyyagura Matthias Mueller Rod Fatoohi The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem and interconnect fabric of five leading supercomputers - SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon cluster, and NEC SX-8. These five systems use five different networks (SGI NUMALINK4, Cray network, Myrinet, InfiniBand, and NEC IXS). The complete set of HPCC benchmarks are run on each of these systems. Additionally, we present Intel MPI Benchmarks (IMB) results to study the performance of 11 MPI communication functions on these systems.||||||||||||./pdfs/130-PMEO-paper-1.pdf",
    "A Study of MPI Performance Analysis Tools on Blue Gene/L |A Study of MPI Performance Analysis Tools on Blue Gene/L I-hsin Chung Robert E. Walkup Hui-fang Wen Hao Yu Applications on todays massively parallel supercomputers rely on performance analysis tools to guide them toward scalable performance on thousands of processors. However, conventional tools for parallel performance analysis have serious problems due to the large data volume that may be required. In this paper, we discuss the scalability issue for MPI performance analysis on Blue Gene/L, the worlds fastest supercomputing platform. We present an experimental study of existing MPI performance tools that were ported to BG/L from other platforms. These tools can be classified into two categories: profiling tools that collect timing summaries, and tracing tools that collect a sequence of time-stamped events. Profiling tools produce small data volumes and can scale well, but tracing tools tend to scale poorly. The experimental study discusses the advantages and disadvantages for the tools in the two categories and will be helpful in the future performance tools design.||||||||||||./pdfs/130-SMTPS-paper-1.pdf",
    "An Overview of Data Aggregation Architecture for Real-Time Tracking with Se|An Overview of Data Aggregation Architecture for Real-Time Tracking with Sensor Networks Tian He Lin Gu Liqian Luo Ting Yan John A. Stankovic Sang H. Son Since sensor nodes normally have limited resources in terms of energy, bandwidth and computation capability, efficiency is a key design goal in sensor network research. As one of techniques to achieve efficiency, data aggregation has been extensively investigated in recent literature. Previous research on data aggregation has demonstrated its effectiveness in reducing traffic, easing congestion and decreasing the energy consumption. However few are actually designed for a real-world application and implemented in a running system. This paper describes our design and implementation of a physical tracking system, using an aggressive data aggregation architecture as one of building blocks. This architecture can be generally applied to other sensor systems, where communication efficiency is a paramount concern and networking resources are limited.||||||||||||./pdfs/130-WPDRTS-paper-1.pdf",
    "An Entropy-Based Algorithm for Time-Driven Software Instrumentation in Para|An Entropy-Based Algorithm for Time-Driven Software Instrumentation in Parallel Systems Ahmet &Ouml;zmen While monitoring, instrumented long running parallel applications generate huge amount of instrumentation data. Processing and storing this data incurs overhead, and perturbs the execution. Techniques that eliminates unnecessary instrumentation data and lower the intrusion without loosing any performance information is valuable to tool developers. This paper presents a new algorithm for software instrumentation to measure the amount of information content of instrumentation data to be collected. The algorithm is based on entropy concept introduced in information theory, and it makes selective data collection for a time-driven software monitoring system possible.||||||||||||./pdfs/131-PMEO-paper-1.pdf",
    "A Database-centric Approach to System Management in the Blue Gene/L Superco|A Database-centric Approach to System Management in the Blue Gene/L Supercomputer Ralph Bellofatto Paul G. Crumley David Darrington Brant Knudson Mark Megerian Jose E. Moreira Alda S. Ohmacht John Orbeck Don Reed Greg Stewart In designing the management system for Blue Gene/L, we adopted a database-centric approach. All configuration and operational data for a particular Blue Gene/L system are stored in a relational database that is kept in the system?s service node. The database also serves as the communication bus for the various processes implementing the management system. This design offers many advantages, including the ability to use SQL commands to retrieve reliability, availability, and serviceability (RAS) information about the system. Information about machine partitioning and user jobs can be obtained the same way. Leveraging the database, we have developed a web interface for system management. This management system has been successfully implemented and deployed in all 19 Blue Gene/L installations at the time of this writing.||||||||||||./pdfs/131-SMTPS-paper-1.pdf",
    "Formal Modeling and Analysis of Wireless Sensor Network Algorithms in Real-|Formal Modeling and Analysis of Wireless Sensor Network Algorithms in Real-Time Maude Peter Csaba Olveczky Stian Thorvaldsen Advanced wireless sensor network algorithms pose challenges to their formal modeling and analysis, such as modeling probabilistic and real-time behaviors and novel forms of communication, and analyzing both correctness and performance. In this paper, we propose using Real-Time Maude to formally model, simulate, and further analyze such algorithms. The Real-Time Maude formalism is expressive yet intuitive, and the tool provides a spectrum of analysis methods, including simulation, reachability analysis, and temporal logic model checking. We have used Real-Time Maude to formally model and analyze the sophisticated OGDC algorithm. We could perform all the analyses performed by the OGDC developers using the simulation tool ns-2, as well as further analyses which are beyond the capabilities of simulation tools. To the best of our knowledge, this is the first time a formal tool has been applied to such a complex wireless sensor network algorithm.||||||||||||./pdfs/131-WPDRTS-paper-1.pdf",
    "A Design Environment for Mobile Applications |A Design Environment for Mobile Applications Stephen Gilmore Valentin Haenel Jane Hillston Jennifer Tenzer In this paper we show how high-level UML models of mobile computing applications can be analysed for classical performance measures such as throughput. The approach proceeds by compiling the UML model into a representation in the formally-defined modelling language of PEPA nets. The compilation process and subsequent performance analysis based on numerical solution of a Continuous-Time Markov Chain is supported by a software tool, the Choreographer design platform. Choreographer interoperates with popular UML tools by reading and writing UML models in the XML Metadata Interchange format (XMI). Specifically we extract a PEPA net model from a UML activity diagram, analyse the PEPA net and report the results back as a modified activity diagram. We present an example use of the Choreographer design platform to investigate the throughput of activities in a UML activity diagram. The example which we model represents both physical and logical mobility. The scenario is of a PDA user on board a moving train connecting to a remote Web site and loading pages of dynamically-generated HTML content. With little overhead the modelling language allows the modeller to precisely record the mobile and immobile components of the system and to distinguish location-changing events from changes of computational state. The extractor-workbench-reflector tool chain powers the performance analysis of high-level model descriptions, returning results in the language in which they were submitted.||||||||||||./pdfs/132-PMEO-paper-1.pdf",
    "GTS Allocation Analysis in IEEE 802.15.4 for Real-Time Wireless Sensor Netw|GTS Allocation Analysis in IEEE 802.15.4 for Real-Time Wireless Sensor Networks Anis Koubaa Mario Alves Eduardo Tovar The IEEE 802.15.4 protocol proposes a flexible communication solution for Low-Rate Wireless Personal Area Networks including sensor networks. It presents the advantage to fit different requirements of potential applications by adequately setting its parameters. When enabling its beacon mode, the protocol makes possible real-time guarantees by using its Guaranteed Time Slot (GTS) mechanism. This paper analyzes the performance of the GTS allocation mechanism in IEEE 802.15.4. The analysis gives a full understanding of the behavior of the GTS mechanism with regards to delay and throughput metrics. First, we propose two accurate models of service curves for a GTS allocation as a function of the IEEE 802.15.4 parameters. We then evaluate the delay bounds guaranteed by an allocation of a GTS using Network Calculus formalism. Finally, based on the analytic results, we analyze the impact of the IEEE 802.15.4 parameters on the throughput and delay bound guaranteed by a GTS allocation. The results of this work pave the way for an efficient dimensioning of an IEEE 802.15.4 cluster.||||||||||||./pdfs/132-WPDRTS-paper-1.pdf",
    "Performance Evaluation of Scheduling Applications with DAG Topologies on Mu|Performance Evaluation of Scheduling Applications with DAG Topologies on Multiclusters with Independent Local Schedulers Ligang He Stephen A. Jarvis Daniel P. Spooner Graham R. Nudd Before an application modelled as a Directed Acyclic Graph (DAG) is executed on a heterogeneous system, a DAG mapping policy is often enacted. After mapping, the tasks (in the DAG-based application) to be executed at each computational resource are determined. The tasks are then sent to the corresponding resources, where they are orchestrated in the pre-designed pattern to complete the work. Most DAG mapping policies in the literature assume that each computational resource is a processing node of a single processor, i.e. the tasks mapped to a resource are to be run in sequence. Our studies demonstrate that if the resource is actually a cluster with multiple processing nodes, this assumption will cause a misperception in the tasks? execution time and execution order. This will disturb the pre-designed cooperation among tasks so that the expected performance cannot be achieved. In this paper, a DAG mapping algorithm is presented for multicluster architectures. Each constituent cluster in the multicluster is shared by background workload (from other users) and has its own independent local scheduler. The multicluster DAG mapping policy is based on theoretical analysis and its performance is evaluated through extensive experimental studies. The results show that compared with conventional DAG mapping policies, the new scheme that we present can significantly improve the scheduling performance of a DAG-based application in terms of the schedule length.||||||||||||./pdfs/133-PMEO-paper-1.pdf",
    "Power-Aware Data Dissemination Protocols in Wireless Sensor Networks |Power-Aware Data Dissemination Protocols in Wireless Sensor Networks Sotiris Nikoletseas Recent rapid technological developments have led to the development of tiny, low-power, low-cost sensors. Such devices integrate sensing, limited data processing and communication capabilities.The effective distributed collaboration of large numbers of such devices can lead to the efficient accomplishment of large sensing tasks. This invited talk focuses on several aspects of energy efficiency. Two protocols for data propagation are studied: the first creates probabilistically optimized redundant data transmissions to combine energy efficiency with fault tolerance, while the second guarantees (in a probabilistic way) the same per sensor energy dissipation, towards balancing the energy load and prolong the lifetime of the network. A third protocol (in fact a power saving scheme) is also presented, that directly and adaptively affects power dissipation at each sensor. This ?lower level? scheme can be combined with data propagation protocols to further improve energy efficiency.||||||||||||./pdfs/133-WPDRTS-paper-1.pdf",
    "An Adaptive Dynamic Grid-based Approach to Data Distribution Management |An Adaptive Dynamic Grid-based Approach to Data Distribution Management Azzedine Boukerche Yunfeng Gu Gen Huey Chenregina Araujo This paper presents a novel Adaptive Dynamic Grid-based Data Distribution Management (DDM) scheme, which we refer to as ADGB. The main objective of our protocol is to optimize DDM time through matching probability (MP) and federates' performance. A Distribution Rate (DR) along with MP are used as part of the ADGB method to select, throughout the simulation, from different devised advertisement schemes, the best scheme to achieve maximum gain with acceptable network traffic overhead. As opposed to previous protocols, the novelty of our ADGB scheme is its focus on improving overall performance, an important goal for DDM strategy. In this paper, we present our scheme and highlight its performance analysis.||||||||||||./pdfs/134-PMEO-paper-1.pdf",
    "Algorithmic Models for Sensor Networks |Algorithmic Models for Sensor Networks Stefan Schmid Roger Wattenhofer Developing algorithms for sensor networks---and proving their correctness and performance---, requires simplifying but still realistic models. This paper surveys various models in use today and puts them into perspective. In addition, we propose interesting models which are not widely adopted by the community so far.||||||||||||./pdfs/134-WPDRTS-paper-1.pdf",
    "Approximated Tensor Sum Preconditioner for Stochastic Automata Networks |Approximated Tensor Sum Preconditioner for Stochastic Automata Networks Abderezak Touzene Some iterative and projection methods for SAN have been tested with a modest success. Several preconditioners for SAN have been developed to speedup the convergence rate. Recently Langville and Stewart proposed the Nearest Kronecker Product (NKP) preconditioner for SAN with a great success. Encouraged by their work, we propose a new preconditioning method, called Approximated Tensor Sum Preconditioner (ATSP), which uses tensor sum preconditioner rather than Kronecker product preconditioner. In ATSP, we take into account the effect of the synchronizations using an approximation technique. Our preconditioner outperforms the NKP preconditioner for the tested SAN Model.||||||||||||./pdfs/135-PMEO-paper-1.pdf",
    "Solving Generic Role Assignment Exactly |Solving Generic Role Assignment Exactly Christian Frank Kay R&ouml;mer Generic role assignment is a programming abstraction that supports the assignment of user-defined \\emph roles to sensor nodes such that certain conditions are met. Many common network configuration problems such as coverage (assign roles ON and OFF to sensor nodes such that ON nodes cover a physical area with their sensors), clustering, or in-network data aggregation can be formulated as role assignment problems. Building on our previous work in this area, we propose an extended role specification language that supports the minimization or maximization of the use of a given role. Moreover, we provide a mapping of this language to integer linear programs and implement this mapping. We show how the resulting tool can be used analyze aspects of role specifications such as feasibility and optimality.||||||||||||./pdfs/135-WPDRTS-paper-1.pdf",
    "Software-Based Fault-Tolerant Routing Algorithm in Multi-Demensional Networ|Software-Based Fault-Tolerant Routing Algorithm in Multi-Demensional Networks F. Safaei M. Rezazad A. Khonsari M. Fathy M. Ould-khaoua N. Alzeidi Massively parallel computing systems are being built with hundreds or thousands of components such as nodes, links, memories, and connectors. The failure of a component in such systems will not only reduce the computational power but also alter the network?s topology. The Software-Based fault-tolerant routing algorithm is a popular routing to achieve fault-tolerance capability in networks. This algorithm is initially proposed only for two dimensional networks. Since, higher dimensional networks have been widely employed in many contemporary massively parallel systems; this paper proposes an approach to extend this routing scheme to these indispensable higher dimensional networks. Deadlock and livelock freedom and the performance of presented algorithm, have been investigated for networks with different dimensionality and various fault regions. Furthermore, performance results have been presented through simulation experiments.||||||||||||./pdfs/136-PMEO-paper-1.pdf",
    "Similarity-Aware Query Processing in Sensor Networks |Similarity-Aware Query Processing in Sensor Networks Ping Xia Panos K. Chrysanthis Alexandros Labrinidis We assume a sensor network with data-centric storage, where sensor data is stored within the sensor network and ad hoc queries are disseminated and processed inside the network. In such an environment, there are often similarities among submitted queries. Using current solutions, similar queries may have to go through the same expensive query processing steps thus wasting energy. In this paper, we propose a similarity-aware query processing scheme (SAQP) that materializes previous query results within the sensor network and utilizes these materialized results to answer future similar queries. Through simulation, we demonstrate that our SAQP scheme reduces energy consumption on queries with negligible increase in response time, and without compromising the quality of data.||||||||||||./pdfs/136-WPDRTS-paper-1.pdf",
    "Saburo, a Tool for I/O and Concurrency Management in Servers |Saburo, a Tool for I/O and Concurrency Management in Servers Gautier Loyaut&eacute; R&eacute;mi Forax Gilles Roussel This paper presents a Java framework based on \\textbf separation of concerns and \\textbf code generation concepts that facilitates development of concurrency and I/O in servers. In this approach, the application is modeled by a graph whose vertices correspond to units of treatment connected by channels. It allows to build all kind of servers: multi-threaded, \\textbf S ingle-\\textbf P rocess \\textbf E vent-\\textbf D riven, \\textbf S taged \\textbf E vent \\textbf D riven \\textbf A rchitecture, etc. without modification of the functional part. This architecture also permits to extend very easily an application, adding vertices and edges to the graph. The aim of our development tool is to improve programmer productivity and portability, decreasing development time, and reducing bugs or deadlock problems.||||||||||||./pdfs/137-JAVAPDC-paper-1.pdf",
    "A Comparative Performance Analysis of n-Cubes and Star Graphs |A Comparative Performance Analysis of n-Cubes and Star Graphs Abbas Eslami Kiasari Hamid Sarbazi-azad Many theoretical-based comparison studies, relying on the graph theoretical viewpoints with using structural and algorithmic properties, have been conducted for the hypercube and the star graph. None of these studies, however, considered real working conditions and implementation limits. We have compared the performance of the star and hypercube networks for different message length and virtual channels and considered two implementation constraints, namely the constant bisection bandwidth and constant node pin-out. We use two accurate analytical models already proposed for the star graph and hypercube and implement the parameter changes imposed by technological implementation constraints. The comparison results reveal that the star graph has a better performance compared to the equivalent hypercube under light traffic loads while the opposite conclusion is reached for heavy traffic loads. The hypercube with more channels compared to its equivalent star graph saturates later showing that it can bear heavier traffic loads.||||||||||||./pdfs/137-PMEO-paper-1.pdf",
    "Schedulability analysis of flows scheduled with FIFO: Application to the Ex|Schedulability analysis of flows scheduled with FIFO: Application to the Expedited Forwarding class Steven Martin Pascale Minet In this paper, we are interested in real-time flows requiring quantitative and deterministic QoS (\\textit Quality of Service ) guarantees. We focus more particularly on two QoS parameters: the worst case end-to-end response time and jitter. We consider a \\textsc fifo (\\textit First In First Out ) scheduling of flows. The \\textsc fifo scheduling is the simplest one to implement and very used. We first establish a bound on the worst case end-to-end response time of any flow in the network, using the trajectory approach. We present an example illustrating our results. Finally, we show how to apply these results to the \\textsc ef (\\textit Expedited Forwarding ) class in a DiffServ (\\textit Differentiated Services ) architecture.||||||||||||./pdfs/137-WPDRTS-paper-1.pdf",
    "Chedar: Peer-to-Peer Middleware |Chedar: Peer-to-Peer Middleware Annemari Auvinen Mikko Vapa Matthieu Weber Niko Kotilainen Jarkko Vuori In this paper we present a new peer-to-peer (P2P) middleware called CHEap Distributed ARchitecture (Chedar). Chedar is totally decentralized and can be used as a basis for P2P applications. Chedar tries to continuously optimize its overlay network topology for maximum performance. Currently Chedar combines four different topology management algorithms and provides functionality to monitor how the peer-to-peer network is self-organizing. It also contains basic search algorithms for P2P resource discovery. Chedar has been used for building a data fusion prototype and a P2PDisCo distributed computing application, which provides an interface for distributing the computation of Java applications. To allow Chedar to be used in mobile devices, the Mobile Chedar middleware has also been developed.||||||||||||./pdfs/138-JAVAPDC-paper-1.pdf",
    "On the Probability Distribution of Busy Virtual Channels |On the Probability Distribution of Busy Virtual Channels Nasser Alzeidi Ahmed Khonsari Mohamed Ould-khaoua Lewis Mackenzie A major issue in modelling the performance merits of interconnection network is dealing with virtual channels. Some analytical models chose not to deal with this issue at all i.e. one virtual channel per physical channel. More sophisticated models, however, relayed on a method proposed by Dally to capture the effect of arranging the physical channel into many virtual channels. In this study, we investigate the accuracy of Dally?s method and propose an alternative approach to deal with virtual channels in analytical performance modelling. The new method is validated via simulation experiments and results reveal its accuracy under different traffic conditions.||||||||||||./pdfs/138-PMEO-paper-1.pdf",
    "Real-Time Systems for Multi-Processor Architectures |Real-Time Systems for Multi-Processor Architectures &Eacute;ric Piel Philippe Marquet Julien Soula Jean-luc Dekeyser The ARTiS system is a real-time extension of the GNU/Linux scheduler dedicated to SMP (Symmetric Multi-Processors) systems. It allows to mix High Performance Computing and Real-Time. ARTiS exploits the SMP architecture to guarantee the preemption of a processor when the system has to schedule a real-time task. The implementation is available as a modification of the Linux kernel. The basic idea of ARTiS is to assign a selected set of processors to real-time operations. A migration mechanism of non-preemptible tasks insures a latency level on these real-time processors. Furthermore, specific loadbalancing strategies permit ARTiS to benefit from the full power of the SMP systems: the real-time reservation, while guaranteed, is not exclusive and does not imply a waste of resources.||||||||||||./pdfs/138-WPDRTS-paper-1.pdf",
    "Performance Evaluation of an Enhanced Distributed Channel Access Protocol u|Performance Evaluation of an Enhanced Distributed Channel Access Protocol under Heterogeneous Traffic Mamun I. Abu-tair Geyong Min Recently there have been considerable interests focusing on the performance evaluation of IEEE 802.11e Medium Access Control (MAC) protocols, which were proposed for supporting Quality of Services (QoS) in Wireless Local Area Networks (WLANs). Different from most existing work, this study has conducted comprehensive performance evaluation and analysis of the IEEE 802.11e Enhanced Distributed Channel Access (EDCA) protocol in the presence of heterogeneous network traffic including non-bursty Poisson, bursty ON/OFF, and self-similar traffic generated by wireless multimedia applications. The performance results on throughput, access delay and medium utilization have demonstrated that the protocol is able to achieve satisfying QoS differentiation for heterogeneous multimedia traffic. On the other hand the results have showed that IEEE 802.11e EDCA suffering from the low medium utilization due to the overhead generated by transmission collisions and back-off processes.||||||||||||./pdfs/139-PMEO-paper-1.pdf",
    "QoS-based Management of Multiple Shared Resource in Dynamic Real-Time Syste|QoS-based Management of Multiple Shared Resource in Dynamic Real-Time Systems Klaus Ecker Frank Drews Jens Lichtenberg Dynamic real-time systems require adaptive resource management to accommodate varying processing needs. We address the problem of resource management with multiple shared resources for soft real-time systems consisting of tasks that have discrete QoS settings that correspond to varying resource usage and varying utility. Given an amount of available resource, the problem is to provide on-line control of the tasks' QoS settings so as to optimize the overall system utility. We propose several heuristic algorithms that will be shown to be compatible with the requirements imposed by our control theoretical resource management framework: (1) By only making incremental adjustments to QoS settings as available resources change, they provide low run-time complexity, making them suitable for use in on-line resource managers (2) Differences between actual utility and optimal utility do not accumulate over time, so there is no long-term degradation in performance. (3) The lower and upper bound on actual utility can be calculated dynamically based on current system conditions, and absolute bounds can be calculated statically in advance. (4) It is possible to respond to the actual resource possible, allowing all resources to be used and tolerating misspecification of task resource requirements.||||||||||||./pdfs/139-WPDRTS-paper-1.pdf",
    "A Combined Genetic-Neural Algorithm for Mobility Management |A Combined Genetic-Neural Algorithm for Mobility Management Javid Taheri Albert Y. Zomaya This work presents a new approach to solve the location management problem by using the location areas approach. A combination of a genetic algorithm and the Hopfield neural network is used to find the optimal configuration of location areas in a mobile network. Toward this end, the location areas configuration of the network is modeled so that the general condition of all the chromosomes of each population improves rapidly by the help of a Hopfield neural network. The Hopfield neural network is incorporated into the genetic algorithm optimization process, to expedite its convergence, since the generic genetic algorithm is not fast enough. Simulation results are very promising and they lead to network configurations that are unexpected but very efficient.||||||||||||./pdfs/14-NIDISC-paper-1.pdf",
    "Verification of Software via Integration of Design and Implementation |Verification of Software via Integration of Design and Implementation Andrew S. Miner Samik Basu Model checking is usually applied at the design phase to verify that preliminary high?level design specifications conform to their requirements. Source code analysis, on the other hand, is used to check for correctness of implementation once it is realized from the design specifications. However, the current practice of validating a design and its implementation in isolation makes it necessary to employ rigorous testing analysis to empirically ensure that the implementation satisfies the design specification. This article describes a formal framework that allows design models to contain embedded partial implementations as components; these models are then formally analyzed to ensure that global requirements are satisfied. This framework can be utilized to incrementally develop and ensure correctness of the design and the corresponding implementation. Realization of this framework requires consolidation and expansion of traditional formal verification techniques by integration of model checking, program analysis and constraint solving.||||||||||||./pdfs/14-NSFNGS-paper-1.pdf",
    "Analysis of a Reconfigurable Network Processor |Analysis of a Reconfigurable Network Processor Christoforos Kachris Stamatis Vassiliadis In this paper an analysis of a dynamically reconfigurable processor is presented. The network processor incorporates a processor and a number of co-processors that can be connected to the processor either directly or using a shared bus. The analysis investigates the configuration (in terms of co-processor distributions and interface), formulates the throughput that meets the network demands and the constraints of the platform (area, bus bandwidth, etc.) and takes into account the reconfiguration overhead. To find the configuration that meets the constraints, the platform is formulated into integer linear programming equations. Furthermore, the results of two case studies are presented, for a soft- and a hard- IP core processor, that uses three flows with different processing requirements (IP forward, encryption and media processing). In each case the number and the type of co-processors is shown in terms of the network distribution and the average packet size. Finally, the mapping of the framework in the Xilinx FPGA platform is discussed.||||||||||||./pdfs/14-RAW-paper-1.pdf",
    "PMEO Keynote: Remove the Memory Wall: From performance modeling to architec|PMEO Keynote: Remove the Memory Wall: From performance modeling to architecture optimization Xian-he Sun Data access is a known bottleneck of high performance computing (HPC). The prime sources of this bottleneck are the performance gap between the processor and memory storage and the large memory requirements of ever-hungry applications. Although advanced memory hierarchies and parallel file systems have been developed in recent years, they only provide high bandwidth for contiguous, well-formed data streams, performing poorly for accessing small, noncontiguous data. Unfortunately, many HPC applications make a large number of requests for small and noncontiguous pieces of data, as do high-level I/O libraries such as HDF-5. The problematic memory wall remains after years of study and, in fact, is becoming the most important issue of HPC. We propose a new I/O architecture for HPC. Unlike traditional I/O designs where data is stored and retrieved by request, our architecture is based on a novel ?Server-Push? model in which a data access server proactively pushes data from a file server to the compute node?s memory or to it?s cache directly based on the architecture design. Simulation results show that with the new approach the cache hit rates increase well above 90\\% for various benchmark applications that are notorious for poor cache performance. Performance evaluation is the driven force of the push-based model. Mechanisms of performance modeling, evaluation, and optimization are applied to data access pattern identification, prefetching algorithm design, data replacement strategy development, and architecture optimization to enable the ?Server-Push? model. Our current success illustrates the power and unique role of performance evaluation in computing.||||||||||||./pdfs/140-PMEO-paper-1.pdf",
    "Adaptability Management and Deterministic Scheduling of Media Flows on Para|Adaptability Management and Deterministic Scheduling of Media Flows on Parallel Storage Servers Costas Mourlas We study a new design strategy for the implementation of ParallelMedia Servers with a predictable behavior. This strategy makes the timing properties and the quality of presentation of a set of media streams predictable. The proposed strategy provides deterministic guarantees and service reliability for each stream that can?t be compromised by server contention. Our real-time scheduling approach exploits the performance of parallel environments and seems a promising method that brings the advantages of parallel computation in media servers. The proposed mechanism provides deterministic service for both Constant Bit Rate (CBR) and Variable Bit Rate (VBR) streams. We present an efficient placement strategy for data frames as well as an adaptability strategy that allows appropriate frames to be dropped without sacrificing the ability to present multimedia applications predictably in time. A prototype implementation of the proposed parallel media server illustrates the concepts of server allocation and scheduling of continuous media streams.||||||||||||./pdfs/140-WPDRTS-paper-1.pdf",
    "Workflow Fine-grained Concurrency with Automatic Continuation |Workflow Fine-grained Concurrency with Automatic Continuation Giancarlo Tretola Eugenio Zimeo Workflow enactment systems are becoming an effective solution to ease programming, deployment and execution of distributed applications in several domains such as telecommunication, manufacturing, e-business, e-government and grid computing. In some of these fields, efficiency and traffic optimization represent key aspects for a wide diffusion of workflow engines and modeling tools. This paper focuses on a technique that enables fine-grained concurrency in compute and data-intensive workflows and reduces the traffic on the network by limiting the number of interactions to the ones strictly needed to bring the data where they are really necessary for continuing the flow of computations. We implemented this technique by using the concepts of wait by necessity and automatic continuation and we integrated it in a flexible, Java workflow engine that through the new mechanisms is able to navigate a workflow anticipating the enactment of sequential activities.||||||||||||./pdfs/149-JAVAPDC-paper-1.pdf",
    "Unification of Verification and Validation Methods for Software Systems: Pr|Unification of Verification and Validation Methods for Software Systems: Progress Report and Initial Case Study Formulation James C. Browne Calvin Lin Kevin Kane Yoonsik Cheon Patricia Teller This paper presents initial research on unification of methods for verification and validation (V\\&V)of software systems. The synergism among methods for V\\&V are described. The requirements for a unification are defined. The initial steps of a case study of application of the unified approach to V\\&V is sketched including definition of the problem domain, the approach and some details of a property specification language. An undergraduate course introducing the unified approach to V\\&V is described. The relationship of this research to other efforts toward unification of V\\&V are discussed.||||||||||||./pdfs/15-NSFNGS-paper-1.pdf",
    "Ad-hoc Distributed Spatial Joins on Mobile Devices |Ad-hoc Distributed Spatial Joins on Mobile Devices Panos Kalnis Nikos Mamoulis Spiridon Bakiras Xiaochen Li PDAs, cellular phones and other mobile devices are now capable of supporting complex data manipulation operations. Here, we focus on ad-hoc spatial joins of datasets residing in multiple non-cooperative servers. Assuming that there is no mediator available, the spatial joins must be evaluated on the mobile device. Contrary to common applications that consider the cost at the server side, our main issue is the minimization of the transferred data, while meeting the resource constraints of the device. We show that existing methods, based on partitioning and pruning, are inadequate in many realistic situations. Then, we present novel algorithms that estimate the data distribution before deciding the physical operator independently for each partition. Our experiments with a prototype implementation on a WiFi-enabled PDA, suggest that the proposed methods outperform the competitors in terms of efficiency and applicability.||||||||||||./pdfs/1568970219-IPDPS-paper-1.pdf",
    "Exploiting Dataflow to Extract Java Instruction Level Parallelism on a Tag-|Exploiting Dataflow to Extract Java Instruction Level Parallelism on a Tag-based Multi-Issue Semi In-Order (TMSI) Processor Hai-chen Wang Chung-kwong Yuen To design a Java processor with traditional modern processor architecture, the Instruction Level Parallelism (ILP) is not readily exploitable due to stack operands dependencies. This paper presents a dataflow-based instruction tagging scheme. With instruction tagging, the independent bytecode instruction groups with stack dependences are identified. Because there is no stack dependence among the different bytecode instruction groups, they can be executed in parallel. With the instruction tagging scheme, we propose a tag-based multi-issue semi-in-order (TMSI) Java processor. The processor takes advantage of instruction-tagging and stack-folding to generate the tagged register-based instructions. When the tagged instructions are ready, they are bundled out-of-order depending on data availability to form VLIW-like instruction words and issued in-order. To achieve high performance, a VLIW engine is employed. We have done the experiments in our TMSI simulation environment using SPECjvm98 and Linpack workload. The results indicate that the proposed processor has the good performance gain.||||||||||||./pdfs/1568971390-IPDPS-paper-1.pdf",
    "Relationships between Communication Models in Networks using Atomic Registe|Relationships between Communication Models in Networks using Atomic Registers Lisa Higham Colette Johnen A common way to model a distributed system is with a graph where nodes represent processors and there is an edge between two processors if and only if they can communicate directly. In shared-registers versions of this general description, neighbouring processorscommunicate by reading or writing shared registers, where each read or write is one atomic step. This paper defined two models of shared registers determined by selecting the register locations (processors or links). In the \\emph atomic state model each processor has a register; in the \\emph atomic link model, each communication link has a register. We determine under what conditions and with what robustness and/or failure-tolerance guarantees it is possible to transform a solution under the \\emph atomic state model into a solution under \\emph atomic link model. The fault-tolerant models considered in this paper are wait-freedom and self-stabilization. These questions are addressed by first establishing a framework for defining correct transformations, which may be useful for similar studies of the relationship between various models of distributed computation.||||||||||||./pdfs/1568973443-IPDPS-paper-1.pdf",
    "A Proactive Fault-detection Mechanism in Large-scale Cluster Systems |A Proactive Fault-detection Mechanism in Large-scale Cluster Systems Wu Linping Meng Dan Gao Wen Zhan Jianfeng To improve the whole dependability of large-scale cluster systems, an online fault detection mechanism is proposed in this paper. This mechanism can detect the fault in time before node fails and enables the proactive fault management. The proposed mechanism is summarized as follows: First, the dynamic characteristics of cluster system running in normal activity are built using Time Series Analysis methods. Second, the fault detection process is implemented by comparing the current running state of cluster system with normal running model. The fault alarm decision is made immediately when the current running state deviates the normal running model. The experiment results show that this mechanism can detect the fault in cluster system in good time.||||||||||||./pdfs/1568974027-IPDPS-paper-1.pdf",
    "k-anycast Routing Schemes for Mobile Ad Hoc Networks |k-anycast Routing Schemes for Mobile Ad Hoc Networks Bing Wu Jie Wu Anycast is a communication paradigm that was first introduced to the suit of routing protocols in IPv6 networks. In anycast, a packet is intended to be delivered to one of the nearest group hosts. $k$-anycast, however, is proposed to deliver a packet to any threshold $k$ members of a set of hosts. In this paper, we propose three $k$-anycast routing schemes for mobile ad hoc networks. Our research work is motivated by the distributed key management services using threshold cryptography in mobile ad hoc networks in which the certification authority's functionality is distributed to any $k$ servers. However, security is not the main focus of this paper. Our goal is to reduce the routing control messages and network delay to reach any $k$ servers. The first scheme is called controlled flooding. The increase of flooding radius is based on the number of responses instead of increasing radius linearly or exponentially. The second scheme, called component-based scheme I, is to form multiple components such that each component has at least $k$ members. We can treat each component as a virtual server as in anycast, thus, we simplify the $k$-anycast routing problem into an anycast routing problem. For the highly dynamic network environment, we introduce the third scheme, called component-based scheme II, in which the membership a component maintains is relaxed to be less than $k$. The performances of the proposed schemes are evaluated through simulations.||||||||||||./pdfs/1568974150-IPDPS-paper-1.pdf",
    "A New Analytical Method for Parallel, Diffusion-type Load Balancing |A New Analytical Method for Parallel, Diffusion-type Load Balancing Petra Berenbrink Tom Friedetzky Zengjian Hu We propose a new proof technique which can be used to analyze many parallel load balancing algorithms. The technique is designed to handle concurrent load balancing actions, which are often the main obstacle in the analysis. We demonstrate the usefulness of the approach by analyzing various natural diffusion-type protocols. Our results are similar to, or better than, previously existing ones, while our proofs are much easier. The key idea is to first sequentialize the original, concurrent load transfers, analyze this new, sequential system, and then to bound the gap between both.||||||||||||./pdfs/1568974159-IPDPS-paper-1.pdf",
    "RAPID: An End-System Aware Protocol for Intelligent Data Transfer over Lamb|RAPID: An End-System Aware Protocol for Intelligent Data Transfer over Lambda Grids Amitabha Banerjee Wu-chun Feng Biswanath Mukherjee Dipak Ghosal Next-generation e-Science applications will require the ability to transfer information at high data rates between distributed computing centers and data repositories. To support such applications, lambda grid networks have been built to provide large, on-demand bandwidth between end-points that are interconnected via optical circuit-switched lambdas. It is extremely important to develop an efficient transport protocol over such high-capacity, dedicated circuits. Because lambdas provide dedicated bandwidth between endpoints, they obviate the need for network congestion control. Consequently, past research has demonstrated that rate-based transport protocols, such as RBUDP, are more effective than TCP in transferring data over lambdas. However, while lambdas eliminate congestion in the network, they ultimately push the congestion to the endpoints --- congestion that current rate-based transport protocols are ill-suited to handle. In this paper we introduce a ``\\underline R ate-\\underline A daptive \\underline P rotocol for \\underline I ntelligent \\underline D elivery (RAPID)'' of data that is lightweight and end-system performance-aware, so as to maximize end-to-end throughput while minimizing packet loss. Based on self monitoring of the dynamic task-priority at the receiving end-system, our protocol enables the receiver to proactively deliver feedback to the sender, so that the sender may adapt its sending rate to avoid congestion at the receiving end-system. This avoids large bursts of packet losses typically observed in current rate-based transport protocols. Over a 10-Gigabit link emulation of an optical circuit, RAPID reduces file-transfer time, and hence improves end-to-end throughput by as much as 25\\%.||||||||||||./pdfs/1568974249-IPDPS-paper-1.pdf",
    "On Consistency Maintenance In Service Discovery |On Consistency Maintenance In Service Discovery Vasughi Sundramoorthy Pieter Hartel Hans Scholten Communication and node failures degrade the ability of a service discovery protocol to ensure Users receive the correct service information when the service changes. We propose that service discovery protocols employ a set of recovery techniques to recover from failures and regain consistency. We use simulations to show that the type of recovery technique a protocol uses significantly impacts the performance. We benchmark the performance of our own service discovery protocol, FRODO against the performance of first generation service discovery protocols, Jini and UPnP during increasing communication and node failures. The results show that FRODO has the best overall consistency maintenance performance.||||||||||||./pdfs/1568974313-IPDPS-paper-1.pdf",
    "Collective Operations in NEC's High-performance MPI Libraries |Collective Operations in NEC's High-performance MPI Libraries Jesper Larsson Traff Hubert Ritzdorf We give an overview of the algorithms and implementations in the high-performance MPI libraries MPI/SX and MPI/ES of some of the most important collective operations of MPI (the \\emph Message Passing Interface ). The infrastructure of MPI/SX makes it easy to incorporate new algorithms and algorithms for common special cases (e.g.\\ a single SX node, or a single MPI process per SX node). Algorithms that are among the best known are employed, and special hardware features of the SX architecture and Internode Crossbar Switch (IXS) are exploited wherever possible. We discuss in more detail the implementation of \\texttt MPI\\_Barrier , \\texttt MPI\\_Bcast , the MPI reduction collectives, \\texttt MPI\\_Alltoall , and the gather/scatter collectives. Performance figures and comparisons to straightforward algorithms are given for a large SX-8 system, and for the \\emph Earth Simulator . The measurements show excellent absolute performance, and demonstrate the scalability of MPI/SX and MPI/ES to systems with large numbers of nodes.||||||||||||./pdfs/1568974319-IPDPS-paper-1.pdf",
    "SAMIE-LSQ: Set-Associative Multiple-Instruction Entry Load/Store Queue |SAMIE-LSQ: Set-Associative Multiple-Instruction Entry Load/Store Queue Jaume Abella Antonio Gonz&aacute;lez The load/store queue (LSQ) is one of the most complex parts of contemporary processors. Its latency is critical for the processor performance and it is usually one of the processor hotspots. This paper presents a highly banked, set-associative, multiple-instruction entry LSQ (SAMIE-LSQ) that achieves high performance with small energy requirements. Our approach relies on the fact that many in-flight memory instructions access the same cache lines. The SAMIE-LSQ groups those instructions accessing the same cache line in the same entry. This arrangement has a number of advantages. First, it significantly reduces the address comparison activity needed for memory disambiguation since there are less addresses to be compared. It also reduces the activity in the data TLB, the cache tag and cache data arrays by caching the cache line location and address translation in the corresponding SAMIE-LSQ entry. Hence, instructions in the same entry can reuse the translation, avoid the tag check and obtain the data accessing only the right cache way. Besides, the delay of the proposed scheme is lower than that required by a conventional LSQ. We show that the SAMIE-LSQ saves 82\\% dynamic energy for the load/store queue, 42\\% for the L1 data cache and 73\\% for the data TLB, with a negligible impact on performance (0.6\\%).||||||||||||./pdfs/1568974321-IPDPS-paper-1.pdf",
    "Coterminous Locality and Coterminous Group Data Prefetching on Chip-Multipr|Coterminous Locality and Coterminous Group Data Prefetching on Chip-Multiprocessors Xudong Shi Zhen Yang Jih-kwon Peir Lu Peng Yen-kuang Chen Victor Lee Bob Liang Due to shared cache contentions and interconnect delays, data prefetching is more critical in alleviating penalties from increasing memory latencies and demands on Chip-Multiprocessors (CMPs). Through deep analysis of SPEC2000 applications, we find that a part of the nearby data memory references often exhibit highly-repeated patterns with long, but equal block reuse distance. These references can form a coterminous group (CG). Coterminous locality is introduced as that when a member in a CG is referenced, the remaining members will likely be referenced in the near future. Based on the coterminous locality behavior, we implement a novel CG data prefetcher on CMPs. Performance evaluations show that the proposed prefetcher can accurately cover up to 40-50\\% of the total misses, and result in 50-60\\% of potential performance improvement for several selected workload mixes.||||||||||||./pdfs/1568974329-IPDPS-paper-1.pdf",
    "An Efficient and Scalable Parallel Algorithm for Out-of-Core Isosurface Ext|An Efficient and Scalable Parallel Algorithm for Out-of-Core Isosurface Extraction and Rendering Qin Wang Joseph Jaja Amitabh Varshney We consider the problem of isosurface extraction and rendering for large scale time varying data. Such datasets have been appearing at an increasing rate especially from physics-based simulations, and can range in size from hundreds of gigabytes to tens of terabytes. We develop a new simple indexing scheme, which makes use of the concepts of the interval tree and the span space data structures. The new scheme enables isosurface extraction and rendering in I/O optimal time, using more compact indexing structure and more effective bulk data movement than the previous schemes. Moreover, our indexing scheme can be easily extended to a multiprocessor environment in which each processor has access to its own local disk. The resulting parallel algorithm is provably efficient and scalable. That is, it achieves load balancing across the processors independent of the isovalue, with almost no overhead in the total amount of work relative to the sequential algorithm. We conduct a large number of experimental tests on the University of Maryland Visualization Cluster using the Richtmyer-Meshkov instability dataset, and obtain results that consistently validate the efficiency and the scalability of our algorithm.||||||||||||./pdfs/1568974332-IPDPS-paper-1.pdf",
    "Executing MPI Programs on Virtual Machines in an Internet Sharing System |Executing MPI Programs on Virtual Machines in an Internet Sharing System Zhelong Pan Xiaojuan Ren Rudolf Eigenmann Dongyan Xu Internet sharing systems aim at federating and utilizing distributed computing resources across the Internet. This paper presents a user-level virtual machine (VM) approach to MPI program execution in an Internet sharing framework. In this approach, the resource consumer has its own operating system running on top of, and isolated from, the operating system of the resource provider. We propose an efficient socket virtualization technique to optimize VM network performance. Socket virtualization achieves the same network bandwidth as the physical network. In our LAN environment, it reduces the latency overhead from 172\\% (using existing TUN/TAP technique) to 35.6\\%. Performance results on MPI benchmarks show that our virtualization technique incurs small overhead compared with the physical host platform, while gaining in return a higher degree of guest isolation and customization. We also describe the key mechanisms that allow the employment of VMs in an existing Internet sharing system.||||||||||||./pdfs/1568974340-IPDPS-paper-1.pdf",
    "Wire-Speed Total Order |Wire-Speed Total Order Tal Anker Greegory Greenman Danny Dolev Ilya Shnayderman Many distributed systems may be limited in their performance by the number of transactions they are able to support per unit of time. In order to achieve fault tolerance and to boost a system's performance, active state machine replication is frequently used. It employs total ordering service to keep the state of replicas synchronized. In this paper, we present an architecture that enables a drastic increase in the number of ordered transactions in a cluster, using off-the-shelf network equipment. Performance supporting nearly one million ordered transactions per second has been achieved, which substantiates our claim.||||||||||||./pdfs/1568974379-IPDPS-paper-1.pdf",
    "On the Packing of Selfish Items |On the Packing of Selfish Items Vittorio Bil&oacute; In the non cooperative version of the classical Minimum Bin Packing problem, an item is charged a cost according to the percentage of the used bin space it requires. We study the game induced by the selfish behavior of the items which are interested in being packed in one of the bins so as to minimize their cost. We prove that such a game always converges to a pure Nash equilibrium starting from any initial packing of the items, estimate the number of steps needed to reach one such equilibrium, prove the hardness of computing good equilibria and give an upper and a lower bound for the price of anarchy of the game. Then, we consider a multidimensional extension of the problem in which each item can require to be packed in more than just one bin. Unfortunately, we show that in such a case the induced game may not admit a pure Nash equilibrium even under particular restrictions. The study of these games finds applications in the analysis of the bandwidth cost sharing problem in non cooperative networks.||||||||||||./pdfs/1568974382-IPDPS-paper-1.pdf",
    "Parallelization and Performance Characterization of Protein 3D Structure Pr|Parallelization and Performance Characterization of Protein 3D Structure Prediction of Rosetta Wenlong Li Tao Wang Eric Li David Baker Li Jin Steven Ge Yurong Chen Yimin Zhang The prediction of protein 3D structure has become a hot research area in the post-genome era, through which people can understand a protein?s function in health and disease, explore ways to control its actions and assist drug design. Many protein structure prediction approaches have been proposed in past decades. Among them, Rosetta is one of the best systems. However, the huge time complexity of Rosetta, e.g. a few days to predict a protein, limits its wide use in practice.To accelerate the prediction of protein 3D structure in Rosetta, this paper presents three different approaches, i.e., non-interactive, periodic interactive and asynchronous dynamic interactive scheme, to parallelize Rosetta. The asynchronous interactive scheme, with the adaptation of dynamic solution interaction, outperforms the other two, delivering much faster convergence speed and better solution quality. Detailed measurements and performance analysis also indicate that parallel Rosetta with asynchronous dynamic interactive scheme scales well.||||||||||||./pdfs/1568974398-IPDPS-paper-1.pdf",
    "Efficient Client-to-Server Assignments for Distributed Virtual Environments|Efficient Client-to-Server Assignments for Distributed Virtual Environments Duong Nguyen Binh Ta Suiping Zhou Distributed Virtual Environments (DVEs) are distributed systems that allow multiple geographically distributed clients (users) to interact simultaneously in a computer-generated, shared virtual world. Applications of DVEs can be seen in many areas nowadays, such as online games, military simulations, collaborative designs, etc. To support large-scale DVEs with real-time interactions among thousands or more distributed clients, a geographically distributed server architecture (GDSA) is generally needed, and the virtual world can be partitioned into many distinct zones to distribute the load among the servers. Due to the geographic distributions of clients and servers in such architectures, it is essential to efficiently assign the participating clients to servers to enhance users' experience in interacting within the DVE. This problem is termed the client assignment problem. In this paper, we propose a two-phase approach, consisting of an initial assignment phase and a refined assignment phase to address this problem. Both phases are shown to be NP-hard, and several heuristic assignment algorithms are then devised based on this two-phase approach. Via extensive simulation studies with realistic settings, we evaluate these algorithms in terms of their performances in enhancing interactivity of the DVE.||||||||||||./pdfs/1568974412-IPDPS-paper-1.pdf",
    "Skewed Allocation of Non-Uniform Data for Broadcasting over Multiple Channe|Skewed Allocation of Non-Uniform Data for Broadcasting over Multiple Channels A.a. Bertossi C.m. Pinotti The problem of data broadcasting over multiple channels consists in partitioning data among channels, depending on data popularities, and then cyclically transmitting them over each channel so that the average waiting time of the clients is minimized. Such a problem is known to be polynomially time solvable for uniform length data items, while it is computationally intractable for non-uniform length data items. In this paper, two new heuristics are proposed which exploit a novel characterization of optimal solutions for the special case of two channels and data items of uniform lengths. Sub-optimal solutions for the most general case of an arbitrary number of channels and data items of non-uniform lengths are provided. The first heuristic, called Greedy+, combines the novel characterization with the known greedy approach, while the second heuristic, called Dlinear, combines the same characterization with the dynamic programming technique. Such heuristics have been tested on benchmarks whose popularities are characterized by Zipf distributions. The experimental tests reveal that Dlinear finds optimal solutions almost always, requiring good running times, while Greedy+ is faster and scales well when changes occur on the input parameters, but provides worse solutions than Dlinear.||||||||||||./pdfs/1568974423-IPDPS-paper-1.pdf",
    "DVoDP2P: Distributed P2P Assisted Multicast VoD Architecture |DVoDP2P: Distributed P2P Assisted Multicast VoD Architecture Xiaoyuan Yang Porfidio Hern&aacute;ndez Fernando Cores Leandro Souza Ana Ripoll Remo Suppi Emilio Luque For a high scalable VoD system, the distributed server architecture~(DVoD) with more than one server-node is a cost-effective design solution. However, such a design is highly vulnerable to workload variations because the service capacity is limited. In this paper, we propose a new and efficient VoD architecture that combines DVoD with a P2P system. The DVoD's server-nodes is able to offer a minimum required quality of service~(QoS) and the P2P system is able to provide the mechanism to increase the system service capacity according to client demands. Our P2P system is able to synchronize a group of clients in order to create multicast channels in local networks to replace server-nodes in the delivery process. Our client collaboration scheme is designed to take into account the P2P system's efficiency and the network overhead. We compared the new VoD architecture with DVoD architecture based on classic multicast and P2P delivery policies~(Patching and Chaining). The experimental results showed that our design is better than previous solutions in terms of server-node load, inter-connection network load, local-network overhead and scalability. Compared with the multicast-DVoD, our architecture reduced server-load by up to 37\\%.||||||||||||./pdfs/1568974431-IPDPS-paper-1.pdf",
    "A Dynamic Firing Speculation to Speedup Distributed Symbolic State-space Ge|A Dynamic Firing Speculation to Speedup Distributed Symbolic State-space Generation Ming-ying Chung Gianfranco Ciardo The \\emph saturation strategy for symbolic state-space generation is very effective for globally-asynchronous locally-synchronous discrete-state systems. Its inherently sequential nature, however, makes it difficult to parallelize on a NOW. An initial attempt that utilizes idle workstations to recognize event firing patterns and then speculatively compute firings conforming to these patterns is at times effective but can introduce large memory overheads. We suggest an implicit method to encode the firing history of decision diagram nodes, where patterns can be shared by nodes. By preserving the actual firing history efficiently and effectively, the speculation is more informed. Experiments show that our implicit encoding method not only reduces the memory requirements but also enables dynamic speculation schemes that further improve runtime.||||||||||||./pdfs/1568974442-IPDPS-paper-1.pdf",
    "A Dependable Infrastructure of the Electric Network for E-textiles |A Dependable Infrastructure of the Electric Network for E-textiles Nenggan Zheng Zhaohui Wu Man Lin Minde Zhao Electronic textiles, known as computational fabrics, offer an emerging method for constructing wearable and large area applications. Because e-textiles are battery-driven and fault-prone systems, there is a need for developing a dependable infrastructure of the electric networks for e-textiles. In this paper, a new infrastructure of the power networks for e-textiles, Flexible Power Network (FPN), is presented. Instead of drawing power from a fixed battery as in the conventional electric networks, the power consuming nodes in a FPN can obtain power energy from one of the choices of batteries available with the help of the battery selectors. We also introduce the over current protectors into the battery nodes (BN) to protect the batteries from wasting the charge when short-circuit faults occur. The electric features of battery selectors and over current protectors, the two types of important electric devices used in FPNs, are illustrated in the paper. We have performed simulation experiments and the results show that our FPNs are more dependable than some common electric networks published before in the cases of short- and open-circuit faults.||||||||||||./pdfs/1568974453-IPDPS-paper-1.pdf",
    "An Integrated Approach for Density Control and Routing in Wireless Sensor N|An Integrated Approach for Density Control and Routing in Wireless Sensor Networks Isabela G. Siqueira Carlos Maur&Iacute;cio S. Figueiredo Antonio Alfredo F. Loureiro Jos&eacute; Marcos Nogueira Linnyer Beatrys Ruiz Wireless Sensor Networks (WSNs) are characterized by having scarce resources. The usual way of designing network functions is to consider them isolatedly, a strategy which may not guarantee the correct and efficient operation of WSNs. For this reason, in this paper we propose an integrated design of network functions. We take two important WSN functions --- density control and routing --- as an example and present two approaches to integrate them. In particular, we present two solutions, named RDC-Sync and RDC-Integrated, which integrate a geographical density control algorithm with tree routing. The simulations experiments performed prove that the integrated design improves the network performance, especially when density control and routing are fully integrated.||||||||||||./pdfs/1568974461-IPDPS-paper-1.pdf",
    "Analytical Performance Modelling of Adaptive Wormhole Routing in the Star I|Analytical Performance Modelling of Adaptive Wormhole Routing in the Star Interconnection Network Abbas Eslami Kiasari Hamid Sarbazi-azad Mohamed Ould-khaoua The star graph was introduced as an attractive alternative to the well-known hypercube and its properties have been well studied in the past. Most of these studies have focused on topological properties and algorithmic aspects of this network. Although several analytical models have been proposed in the literature for different interconnection networks, none of them have dealt with star graphs. This paper proposes the first analytical model to predict message latency in wormhole-switched star interconnection networks with fully adaptive routing. The analysis focuses on a fully adaptive routing algorithm which has shown to be the most effective for star graphs. The results obtained from simulation experiments confirm that the proposed model exhibits a good accuracy under different operating conditions.||||||||||||./pdfs/1568974472-IPDPS-paper-1.pdf",
    "Non-cooperative, Semi-cooperative, and Cooperative Games-based Grid Resourc|Non-cooperative, Semi-cooperative, and Cooperative Games-based Grid Resource Allocation Samee Ullah Khan Ishfaq Ahmad In this paper we consider, compare and analyze three game theoretical Grid resource allocation mechanisms. Namely, 1) the non-cooperative sealed-bid method where tasks are auctioned off to the highest bidder, 2) the semi-cooperative n-round sealed-bid method in which each site delegate its work to others if it cannot perform the work itself, and 3) the cooperative method in which all of the sites deliberate with one another to execute all the tasks as efficiently as possible. To experimentally evaluate the above mentioned techniques, we perform extensive simulation studies that effectively encapsulate the task and machine heterogeneity. The tasks are assumed to be independent and bear multiple execution time deadlines. The simulation model is built around a hierarchical Grid infrastructure where machines are abstracted into larger computing centers labeled ``federations,'' each of which are responsible for managing their own resources independently. These federations are then linked together with a primary portal to which Grid tasks would be submitted. To measure the effectiveness of these game theoretical techniques, the recorded performance is evaluated against a conventional baseline method in which tasks are randomly assigned to the sites without any task execution guarantee.||||||||||||./pdfs/1568974535-IPDPS-paper-1.pdf",
    "Algorithm-Based Checkpoint-Free Fault Tolerance for Parallel Matrix Computa|Algorithm-Based Checkpoint-Free Fault Tolerance for Parallel Matrix Computations on Volatile Resources Zizhong Chen Jack Dongarra As the size of today's high performance computers increases from hundreds, to thousands, and even tens of thousands of processors, node failures in these computers are becoming frequent events. Although checkpoint/rollback-recovery is the typical technique to tolerate such failures, it often introduces a considerable overhead. Algorithm-based fault tolerance is a very cost-effective method to incorporate fault tolerance into matrix computations. However, previous algorithm-based fault tolerance methods for matrix computations are often derived using algorithms that are seldomly used in the practice of today's high performance matrix computations and have mostly focused on platforms where failed processors produce incorrect calculations. To fill this gap, this paper extends the existing algorithm-based fault tolerance to the volatile computing platform where the failied processor stops working and applies it to scalable high performance matrix computations with two dimensional block cyclic data distribution. We show the practicality of this technique by applying it to the ScaLAPACK/PBLAS matrix-matrix multiplication kernel. Experimental results demonstrate that the proposed approach is able to survive process failures with a very low performance overhead.||||||||||||./pdfs/1568974543-IPDPS-paper-1.pdf",
    "Dual-Layered File Cache On cc-NUMA System |Dual-Layered File Cache On cc-NUMA System Zhou Yingchao Meng Dan Ma Jie CC-NUMA is a widely adopted and deployed architecture of high performance computers. These machines are attractive for their transparent access to local and remote memory. However, the prohibitive latency gap between local and remote access deteriorates applications? performance seriously due to memory access stalls. File system cache, especially, being shared by all processes, inevitably triggers many remote accesses. To address this problem, we suggest and implement a mechanism that uses local memory to cache remote file cache, of which the main purpose is to improve data locality. Using realistic workload on a two-node cc-NUMA machine, we show that the cost of such a mechanism is as low as 0.5\\%, the performance can be increased 14.3\\% at most, and the local hit ratio can be improved as much as 40\\%.||||||||||||./pdfs/1568974567-IPDPS-paper-1.pdf",
    "A Distributed Method for Dynamic Resolution of BGP Oscillations |A Distributed Method for Dynamic Resolution of BGP Oscillations Ahronovitz Ehoud K&Ouml;nig Jean-claude Saad Cl&eacute;ment Autonomous Systems (AS) in the Internet use different protocols for internal and external routing. BGP is the only external protocol. It allows ASes to define their own routing policy independently. Many papers cited in reference deal with a divergence behavior due to this flexibility. In fact, when routing policies are not conflicting, BGP is self-stabilising, which means that whatever the network configuration, BGP converges to a stable solution. Unfortunately, as experienced on the Internet, AS routing policies may be uncoherent, thus generating oscillations. In this paper we propose a distributed dynamic method for detecting and solving oscillations of BGP. It respects private policy choices and requires only a few low level constraints in order to converge to a stable solution. Essentially, a router has to maintain only local path stateful information to detect instabilities. In this case, it generates and launches a token linked to a route. Each router makes the decision to forward or not the token according to local data and local policy. If the originating router receives back the token, then it marks the route as \\emph barred . Nevertheless, routes may furtherly be unmarked.||||||||||||./pdfs/1568974589-IPDPS-paper-1.pdf",
    "Distributed Algorithm for a Color Assignment on Asynchronous Rings |Distributed Algorithm for a Color Assignment on Asynchronous Rings Gianluca De Marco Mauro Leoncini Manuela Montangero We study a version of the $\\beta$-assignment problem (introduced by G. J. Chang and P. H. Ho in 1998) on asynchronous rings: consider a set of items and a set of $m$ colors, where each item is associated to one color. Consider also $n$ computational agents connected by an asynchronous ring. Each agent holds a subset of the items, where initially different agents might hold items associated to the same color. We analyze the problem of distributively assigning colors to agents in such a way that(a) each color is assigned to one agent and (b) the number of different colors assigned to each agent is minimum. Since any color assignment requires that the items be distributed according to it ( \\em e.g. all items of the same color are to be held by only one agent), we define the cost of a color assignment as the amount of items that need to be moved, given an initial allocation. We first show that any distributed algorithm for this problem on the ring requires a communication complexity of $\\Omega(n\\cdot m)$ and then we exhibit a polynomial time distributed algorithm with message complexity matching the bound, that determines a color assignment with cost at most $(2+\\epsilon)$ times the optimal cost, for any $0||||||||||||./pdfs/1568974594-IPDPS-paper-1.pdf",
    "Parallel Hypergraph Partitioning for Scientific Computing |Parallel Hypergraph Partitioning for Scientific Computing Karen D Devine Erik G Boman Robert T Heaphy Rob H Bisseling Umit V Catalyurek Graph partitioning is often used for load balancing in parallel computing, but it is known that hypergraph partitioning has several advantages. First, hypergraphs more accurately model communication volume, and second, they are more expressive and can better represent nonsymmetric problems. Hypergraph partitioning is particularly suited to parallel sparse matrix-vector multiplication, a common kernel in scientific computing. We present a parallel software package for hypergraph (and sparse matrix) partitioning developed at Sandia National Labs. The algorithm is a variation on multilevel partitioning. Our parallel implementation is novel in that it uses a two-dimensional data distribution among processors. We present empirical results that show our parallel implementation achieves good speedup on several large problems (up to 33 million nonzeros) with up to 64 processors on a Linux cluster.||||||||||||./pdfs/1568974597-IPDPS-paper-1.pdf",
    "DiST: Fully Decentralized Indexing for Querying Distributed Multidimensiona|DiST: Fully Decentralized Indexing for Querying Distributed Multidimensional Datasets Beomseok Nam Alan Sussman Grid computing and Peer-to-peer (P2P) systems are emerging as new paradigms for managing large scale distributed resources across wide area networks. While Grid computing focuses on managing heterogeneous resources and relies on centralized managers for resource and data discovery, P2P systems target scalable, decentralized methods for publishing and searching for data. In large distributed systems, a centralized resource manager is a potential performance bottleneck and decentralization can help avoid this bottleneck, as is done in P2P systems. However, the query functionality provided by most existing P2P systems is very rudimentary, and is not directly applicable to Grid resource management. In this paper, we propose a fully decentralized multidimensional indexing structure, called \\em DiST , that operates in a fully distributed environment with no centralized control. In DiST, each data server only acquires information about data on other servers from executing and routing queries. We describe the DiST algorithms for maintaining the decentralized network of data servers, including adding and deleting servers, the query routing algorithm, and failure recovery algorithms. We also evaluate the performance of the decentralized scheme against a more structured hierarchical indexing scheme that we have previously shown to perform well in distributed Grid environments.||||||||||||./pdfs/1568974608-IPDPS-paper-1.pdf",
    "WaveGrid: a Scalable Fast-turnaround Heterogeneous Peer-based Desktop Grid |WaveGrid: a Scalable Fast-turnaround Heterogeneous Peer-based Desktop Grid System Dayi Zhou Virginia Lo We propose a novel heterogeneous scalable desktop grid system, WaveGrid, which uses a peer-to-peer architecture and can satisfy the needs of applications with fast-turnaround requirements. The challenges for fast-turnaround scheduling in a large heterogeneous peer-based desktop grid system include how to quickly discover available hosts with low message overhead; how to achieve high utilization of the available cycles in this opportunistic scheduling environment; and how to adapt to the heterogeneous environment for efficient scheduling. WaveGrid answers these challenges by letting hosts self-organize into a timezone-aware overlay network, which supports straightforward, quick resource discovery. Scheduling methods in WaveGrid take heterogeneity into account in selecting scheduling and migration targets. WaveGrid then rides the wave of available cycles by migrating jobs to hosts located in idle night-time zones around the globe. We evaluate WaveGrid using a heterogeneous host CPU power profile based on empirical data collected from the global computing project BOINC. The simulation results show that WaveGrid performs consistently well with fast turnaround time and low migration overhead. It performs much better than other systems with respect to turnaround, stability and minimal impact on hosts.||||||||||||./pdfs/1568974611-IPDPS-paper-1.pdf",
    "Trust Overlay Networks for Global Reputation Aggregation in P2P Grid Comput|Trust Overlay Networks for Global Reputation Aggregation in P2P Grid Computing Runfang Zhou Kai Hwang This paper presents a new approach to trusted Grid computing in a Peer-to-Peer (P2P) setting. Trust and security are essential to establish lasting working relationships among the peers. A P2P reputation system collects peer trust scores and aggregates them to yield a global reputation. We use a new trust overlay network (TON) to model the trust relationships among the peers. After analyzing the eBay transaction trace data, we discover a power-law distribution in user feedbacks. We develop a new reputation system, PowerTrust, to leverage power-law feedback characteristics. The PowerTrust system is built with locality-preserving hash functions and a lookahead random walk strategy. Dynamic system reconfiguration is enabled by the use of power nodes with well-established reputations. Through P2P simulation experiments on distributed file sharing and Grid parameter-sweeping applications (PSA), we demonstrate the PowerTrust advantages in fast reputation convergence and accurate ranking of peer reputations. We report performance results with enhanced query success rate and shortened job makespan in scalable P2P Grid applications.||||||||||||./pdfs/1568974614-IPDPS-paper-1.pdf",
    "An Adaptive Stabilization Framework for Distributed Hash Tables |An Adaptive Stabilization Framework for Distributed Hash Tables Gabriel Ghinita Yong Meng Teo Distributed Hash Tables (DHT) algorithms obtain good lookup performance bounds by using deterministic rules to organize peer nodes into an overlay network. To preserve the invariants of the overlay network, DHTs use stabilization procedures that reorganize the topology graph when participating nodes join or fail. Most DHTs use periodic stabilization, in which peers perform stabilization at fixed intervals of time, disregarding the rate of change in overlay topology; this may lead to poor performance and large stabilization-induced communication overhead. We propose a novel adaptive stabilization framework that takes into consideration the continuous evolution in network conditions. Each peer collects statistical data about the network and dynamically adjusts its stabilization rate based on the analysis of the data. The objective of our scheme is to maintain nominal network performance and to minimize the communication overhead of stabilization.||||||||||||./pdfs/1568974618-IPDPS-paper-1.pdf",
    "Oblivious Parallel Probabilistic Channel Utilization without Control Channe|Oblivious Parallel Probabilistic Channel Utilization without Control Channels Christian Schindelhauer Kerstin Voss The research interest in sensor nets is still growing because they simplify data acquisition in many applications. If hardware resources are very sparse, routing algorithms cannot use data gathering. However, if a large number of channels can be used, then parallel transmission can compensate this drawback. If the senders and receivers are not known in advance, then a control channel poses a bottleneck for communication. We present an oblivious MAC protocol, called the Funnel protocol, where the channels are nearly optimally utilized in parallel. In this, senders and receivers choose for a polylogarithmic number of rounds (several sending attempts) a decreasing number of channels which are selected equiprobably. Then, we show that a previously presented approach using only one round and therefore one type of probability distribution is optimal up to some constant factor, and considerably worse than the Funnel protocol. The protocol works with few resources if an sufficient number of channels is available. The Funnel protocol is simple, elegant, and does not need to know the number of senders and receivers, thus being oblivious. On the bottom line we prove that small messages can be efficiently transmitted by the MAC layer in parallel without a control channel if more than one channel for communication can be used.||||||||||||./pdfs/1568974622-IPDPS-paper-1.pdf",
    "A Code Motion Technique for Accelerating General-Purpose Computation on the|A Code Motion Technique for Accelerating General-Purpose Computation on the GPU Takatoshi Ikeda Fumihiko Ino Kenichi Hagihara Recently, graphics processing units (GPUs) are providing increasingly higher performance with programmable internal processors, namely vertex processors (VPs) and fragment processors (FPs). Such newly added capabilities motivate us to perform general-purpose computation on GPUs (GPGPU) beyond graphics applications. Although VPs and FPs are connected in a pipeline, many GPGPU implementations utilize only FPs as a computational engine in the GPU. Therefore, such implementations may result in lower performance due to highly loaded FPs (as compared to VPs) being a performance bottleneck in the pipeline execution. The objective of our work is to improve the performance of GPGPU programs by eliminating this bottleneck. To achieve this, we present a code motion technique that is capable of reducing the FP workload by moving assembly instructions appropriately from the FP program to the VP program. We also present the definition of such movable instructions that do not change the I/O specification between the CPU and the GPU. The experimental results show that (1) our technique improves the performance of a Gaussian filter program with reducing execution time by approximately 40\\% and (2) it successfully reduces the FP workload in 10 out of 18 GPGPU programs.||||||||||||./pdfs/1568974626-IPDPS-paper-1.pdf",
    "Parallel Morphological Processing of Hyperspectral Image Data on Heterogene|Parallel Morphological Processing of Hyperspectral Image Data on Heterogeneous Networks of Computers Antonio J. Plaza Recent advances in space and computer technologies are revolutionizing the way remotely sensed data is collected, managed and interpreted. The development of efficient techniques for transforming the massive amount of collected data into scientific understanding is critical for space-based Earth science and planetary exploration. Although most currently available parallel processing strategies for hyperspectral image analysis assume homogeneity in the computing platform, heterogeneous networks of computers represent a very promising cost-effective solution expected to play a major role in the design of high-performance computing platforms for many on-going and planned remote sensing missions. This paper explores techniques for mapping morphological hyperspectral analysis algorithms, characterized by their scalability and sub-pixel accuracy, onto heterogeneous parallel computers. Important aspects in algorithm design are illustrated by using both homogeneous and heterogeneous parallel computing facilities available at NASA?s Goddard Space Flight Center and University of Maryland. Experiments reveal that heterogeneous networks of workstations represent a source of computational power that is both accessible and applicable in many remote sensing studies.||||||||||||./pdfs/1568974634-IPDPS-paper-1.pdf",
    "Design flow for Optimizing Performance in Processor Systems with on-chip Co|Design flow for Optimizing Performance in Processor Systems with on-chip Coarse-Grain Reconfigurable Logic Michalis D. Galanis Gregory Dimitoulakos Costas E. Goutis A design flow for processor platforms with on-chip coarse-grain reconfigurable logic is presented. The reconfigurable logic is realized by a 2-Dimensional Array of Processing Elements. Performance is improved by accelerating critical software loops, called kernels, on the Reconfigurable Array. Basic steps of the design flow have been automated. A procedure for detecting critical loops in the input C code was developed, while a mapping technique for Coarse Grain Reconfigurable Arrays, based on software pipelining, was also devised. Analytical results derived from mapping five real-life DSP applications on eight different instances of a generic system architecture are presented. Large values of Instructions Per Cycle were achieved on two Reconfigurable Arrays that resulted in high-performance kernel mapping. Additionally, by mapping critical code on the reconfigurable logic, speedups ranging from 1.27 to 3.18 relative to an all-processor execution were achieved.||||||||||||./pdfs/1568974642-IPDPS-paper-1.pdf",
    "Dynamic Multi Phase Scheduling for Heterogeneous Clusters |Dynamic Multi Phase Scheduling for Heterogeneous Clusters Florina Monica Ciorba Theodore Andronikos Ioannis Riakiotakis Anthony T. Chronopoulos George Papakonstantinou Distributed computing systems are a viable and less expensive alternative to parallel computers. However, concurrent programming methods in distributed systems have not been studied as extensively as for parallel computers. Some of the main research issues are how to deal with scheduling and load balancing of such a system, which may consist of heterogeneous computers. In the past, a variety of dynamic scheduling schemes suitable for parallel loops (with independent iterations) on heterogeneous computer clusters have been obtained and studied. However, no study of dynamic schemes for loops with iteration dependencies has been reported so far. In this work we study the problem of scheduling loops with iteration dependencies for heterogeneous (dedicated and non-dedicated) clusters. The presence of iteration dependencies incurs an extra degree of difficulty and makes the development of such schemes quite a challenge. We extend three well known dynamic schemes (CSS, TSS and DTSS) by introducing synchronization points at certain intervals so that processors compute in pipelined fashion. Our scheme is called dynamic multi-phase scheduling ($DMPS$) and we apply it to loops with iteration dependencies. We implemented our new scheme on a network of heterogeneous computers and studied its performance. Through extensive testing on two real-life applications (the heat equation and the Floyd-Steinberg algorithm), we show that the proposed method is efficient for parallelizing nested loops with dependencies on heterogeneous systems.||||||||||||./pdfs/1568974649-IPDPS-paper-1.pdf",
    "A Distributed Paging RAM Grid System for Wide-Area Memory Sharing |A Distributed Paging RAM Grid System for Wide-Area Memory Sharing Rui Chu Nong Xiao Yongzhen Zhuang Yunhao Liu Xicheng Lu Memory-intensive applications often suffer from the poor performance of disk swapping when memory is inadequate. Remote memory sharing schemes, which provide a remote memory that is faster than the local hard disk, are able to improve the performance of such applications. Due to the limitation of being applicable within single clusters only, however, most of the previous remote memory mechanisms, such as the network memory scheme, fail to be extendable into a large scale, distributed, heterogeneous, and dynamic environment. In this work, we propose a service-oriented grid memory sharing scheme, Distributed Paging RAM Grid (DPRG). We study the properties and criteria of large scale memory sharing, and then design major operations and optimizations to fit the usage of grid systems. We collect trace from our grid environment, and evaluate DPRG through comprehensive trace-driven simulations. Results show that DPRG significantly outperforms existing remote memory sharing schemes and supports grid computing applications effectively.||||||||||||./pdfs/1568974652-IPDPS-paper-1.pdf",
    "Incrementally Developing Parallel Applications with AspectJ |Incrementally Developing Parallel Applications with AspectJ Joao Luis Ferreira Sobral This paper presents a methodology to develop more modular parallel applications, based on aspect oriented programming. Traditional object oriented mechanisms implement application core functionality and parallelisation concerns are plugged by aspect oriented mechanisms. Parallelisation concerns are separated into four categories: functional or/and data partition, concurrency, distribution and optimisation. Modularising these categories into separate modules using aspect oriented programming enables (un)pluggability of parallelisation concerns. This approach leads to more incremental application development, easier debugging and increased reuse of core functionality and parallel code, when compared with traditional object oriented approaches. A detailed analysis of a simple parallel application - a prime number sieve - illustrates the methodology and shows how to accomplish these gains.||||||||||||./pdfs/1568974667-IPDPS-paper-1.pdf",
    "Fast Distributed Graph Partition and Application (Extended Abstract) |Fast Distributed Graph Partition and Application (Extended Abstract) Bilel Derbel Mohamed Mosbah Akka Zemmari This paper presents efficient deterministic and randomized distributed algorithms for decomposing a graph with $n$ nodes into a disjoint set of connected clusters with small radius and few intercluster edges. Our algorithms can be easily implemented in the distributed $\\mathcal CONGEST $ model of computation i.e., limited message size, improving the time complexity of previous algorithms from linear to sublinear. One important application of our algorithms is efficient construction of sparse graph spanners. In fact, given a parameter $k$, we show that there exists a sublinear deterministic distributed algorithm that constructs a graph spanner of stretch $2k-1$ with at most $\\mathcal O (n^ 1+1/k )$ edges in the $\\mathcal CONGEST $ model.||||||||||||./pdfs/1568974675-IPDPS-paper-1.pdf",
    "Enhancing L2 Organization for CMPs with a Center Cell |Enhancing L2 Organization for CMPs with a Center Cell Chun Liu Anand Sivasubramaniam Mahmut Kandemir Mary Jane Irwin Chip multiprocessors (CMPs) are becoming a popular way of exploiting ever-increasing number of on-chip transistors. At the same time, the location of data on the chip can play a critical role in the performance of these CMPs because of the growing on-chip storage capacities and the relative cost of wire delays. It is important to locate the data at the right place at the right time in the on-chip cache hierarchy. This paper presents a novel L2 cache organization for CMPs with these goals in mind. We first study the data sharing characteristics of a wide spectrum of multi-threaded applications and show that, while there are a considerable number of L2 accesses to shared data, the volume of this data is relatively low. Consequently, it is important to keep this shared data fairly close to all processor cores for both performance and power reasons. Motivated by this observation, we propose a small Center Cell cache residing in the middle of the processor cores which provides fast access to its contents. We demonstrate that this cache organization can considerably lower the number of block migrations between the L2 portions that are closer to each core, thus providing better performance and power.||||||||||||./pdfs/1568974680-IPDPS-paper-1.pdf",
    "Load Balancing in the Presence of Random Node Failure and Recovery |Load Balancing in the Presence of Random Node Failure and Recovery Sagar Dhakal Majeed M. Hayat Jorge E. Pezoa Chaouki T. Abdallah J. Doug Birdwell John Chiasson In many distributed computing systems that are prone to either induced or spontaneous node failures, the number of available computing resources is dynamically changing in a random fashion. A load-balancing (LB) policy for such systems should therefore be robust, in terms of workload re-allocation and effectiveness in task completion, with respect to the random absence and re-emergence of nodes as well as random delays in the transfer of workloads among nodes. In this paper two LB policies for such computing environments are presented: The first policy takes an initial LB action to preemptively counteract the consequences of random failure and recovery of nodes. The second policy compensates for the occurrence of node failure dynamically by transferring loads only at the actual failure instants. A probabilistic model, based on the concept of regenerative processes, is presented to assess the overall performance of the system under these policies. Optimal performance of both policies is evaluated using analytical, experimental and simulation-based results. The interplay between node-failure/recovery rates and the mean load-transfer delay are highlighted.||||||||||||./pdfs/1568974682-IPDPS-paper-1.pdf",
    "Supporting Self-Adaptation in Streaming Data Mining Applications |Supporting Self-Adaptation in Streaming Data Mining Applications Liang Chen Gagan Agrawal There are many application classes where the users are flexible with respect to the output quality. At the same time, there are other constraints, such as the need for real-time or interactive response, which are more crucial. This paper presents and evaluates a runtime algorithm for supporting adaptive execution for such applications. The particular domain we target is distributed data mining on streaming data. This work has been done in the context of a middleware system called GATES (Grid-based AdapTive Execution on Streams) that we have been developing. The self-adaptation algorithm we present and evaluate in this paper has the following characteristics. First, it carefully evaluates the long-term load at each processing stage. It considers different possibilities for the load at a processing stage and its next stages, and decides if the value of an adaptation parameter needs to be modified, and if so, in which direction. To find the ideal new value of an adaptation parameter, it performs a binary search on the specified range of the parameter. To evaluate the self-adaptation algorithm in our middleware, we have implemented two streaming data mining applications. The main observations from our experiments are as follows. First, our algorithm is able to quickly converge to stable values of the adaptation parameter, for different data arrival rates, and independent of the specified initial value. Second, in a dynamic environment, the algorithm is able to adapt the processing rapidly. Finally, in both static and dynamic environments, the algorithm clearly outperforms the algorithm described in our earlier work and an obvious alternative, which is based on linear-updates.||||||||||||./pdfs/1568974693-IPDPS-paper-1.pdf",
    "Distributed Antipole Clustering for Efficient Data Search and Management in|Distributed Antipole Clustering for Efficient Data Search and Management in Euclidean and Metric Spaces Alfredo Ferro Rosalba Giugno Misael Mongiov&iacute; Giuseppe Pigola Alfredo Pulvirenti In this paper a simple and efficient distributed version of the recently introduced Antipole Clustering algorithm for general metric spaces is proposed. This combines ideas from the M-Tree, the Multi-Vantage Point structure and the FQ-Tree to create a new structure in the ``bisector tree class, called the Antipole Tree. Bisection is based on the proximity to an ``Antipole pair of elements generated by a suitable linear randomized tournament. The final winners (A,B) of such a tournament are far enough apart to approximate the diameter of the splitting set. A simple linear algorithm computing Antipoles in Euclidean spaces with exponentially small approximation ratio is proposed. The Antipole Tree Clustering has been shown to be very effective in important applications such as range and k-nearest neighbor searching, mobile objects clustering in centralized wireless networks with movable base stations and multiple alignment of biological sequences. In many of such applications an efficient distributed clustering algorithm is needed. In the proposed distributed versions of Antipole Clustering the amount of data passed from one node to another is either constant or proportional to the number of nodes in the network. The Distributed Antipole Tree is equipped with additional information in order to perform efficient range search and dynamic clusters management. This is achieved by adding to the randomized tournaments technique, methodologies taken from established systems such as BFR and BIRCH*. Experiments show the good performance of the proposed algorithms on both real and synthetic data.||||||||||||./pdfs/1568974702-IPDPS-paper-1.pdf",
    "A Performance Model for Fine-Grain Accesses in UPC |A Performance Model for Fine-Grain Accesses in UPC Zhang Zhang Steven R. Seidel UPC's implicit communication and fine-grain programming style make application performance modeling a challenging task. The correspondence between remote references and communication events depends on the internals of the compiler and runtime system. This correspondence is often hidden from application developers. Aggressive optimizations allowed by the relaxed memory consistency model further blur this correspondence by transforming code structure. A modeling approach based on UPC platform benchmarking and code analysis is proposed. This approach abstracts a UPC platform according to its potential to apply a few common optimizations, then divides remote references in the application code into groups, based on a dependence analysis, that are amenable to each optimization. Each group is associated with a cost, obtained via benchmarking each potential optimization. The aggregated cost of these groups is the predicted cost of the application. Three simple UPC applications modeled using this approach usually yielded performance predictions within 15 percent of actual running times.||||||||||||./pdfs/1568974703-IPDPS-paper-1.pdf",
    "Pipelined Broadcast on Ethernet Switched Clusters |Pipelined Broadcast on Ethernet Switched Clusters Pitch Patarasuk Ahmad Faraj Xin Yuan We consider unicast-based pipelined broadcast schemes for clusters connected by multiple Ethernet switches. By splitting a large broadcast message into segments and broadcasting the segments in a pipelined fashion, pipelined broadcast may achieve very high performance. We develop algorithms for computing various contention-free broadcast trees on Ethernet switched clusters that are suitable for pipelined broadcast, and evaluate the schemes through experimentation. The conclusions drawn from our theoretical and experimental study include the following. First, pipelined broadcast can be more effective than other common broadcast schemes including the ones used in the latest versions of MPICH and LAM/MPI when the message size is sufficiently large. Second, contention-free broadcast trees are essential for pipelined broadcast to achieve high performance. Finally, while it is difficult to determine the optimal message segment size for pipelined broadcast, finding one size that gives good performance is relatively easy.||||||||||||./pdfs/1568974704-IPDPS-paper-1.pdf",
    "Exploring the Design Space of an Optimized Compiler Approach for Mesh-Like |Exploring the Design Space of an Optimized Compiler Approach for Mesh-Like Coarse-Grained Reconfigurable Architectures Gregory Dimitroulakos Michalis D. Galanis Costas E. Goutis In this paper we study the performance improvements and trade-offs derived from an optimized mapping approach applied on a parametric coarse grained reconfigurable array architecture. The processing elements? local register files and the processing elements? interconnection network is exploited for caching memory data values with data reuse opportunities. The data reused values are transferred through the processing elements? interconnection network hence, relieving the bus from the burden of transferring these values. A novel mapping algorithm is also proposed that uses a modulo scheduling technique. This algorithm targets on a flexible architecture template which permits experimental exploration over different architecture alternatives. The experimental results showed that the operation parallelism was significantly improved by our mapping approach. Additionally, we have outlined the relation that exists between the performance improvements and the memory access latency, the interconnection network and the processing elements? register file size.||||||||||||./pdfs/1568974707-IPDPS-paper-1.pdf",
    "Battery-Aware Router Scheduling in Wireless Mesh Networks |Battery-Aware Router Scheduling in Wireless Mesh Networks Chi Ma Zhenghao Zhang Yuanyuan Yang Wireless mesh networks recently emerge as a flexible, low-cost and multipurpose networking platform with wired infrastructure connected to the Internet. A critical issue in mesh networks is to maintain network activities for a long lifetime with high energy efficiency. As more and more outdoor applications require long-lasting, high energy efficient and continuously-working mesh networks with battery-powered mesh routers, it is important to optimize the performance of mesh networks from a battery-aware point of view. Recent study in battery technology reveals that discharging of a battery is nonlinear. Batteries tend to discharge more power than needed, and reimburse the over-discharged power later if they have sufficiently long recovery time. Intuitively, to optimize network performance, a mesh router should recover its battery periodically to prolong the lifetime. In this paper, we introduce a mathematical model on battery discharging duration and lifetime for wireless mesh networks. We also present a battery lifetime optimization scheduling algorithm (BLOS) to maximize the lifetime of battery-powered mesh routers. Based on the BLOS algorithm, we further consider the problem of using battery powered routers to monitor or cover a few hot spots in the network. We refer to this problem as the Spot Covering under BLOS Policy problem (SCBP). We prove that the SCBP problem is NP-hard and give an approximation algorithm called the Spanning Tree Scheduling (STS) to dynamically schedule mesh routers. The key idea of the STS algorithm is to construct a spanning tree according to the BLOS Policy in the mesh network. The time complexity of the STS algorithm is O(r) for a network with $r$ mesh routers. Our simulation results show that the STS algorithm can greatly improve the lifetime, data throughput and power consumption efficiency of a wireless mesh network.||||||||||||./pdfs/1568974715-IPDPS-paper-1.pdf",
    "Enhancing Downlink Performance in Wireless Networks by Simultaneous Multipl|Enhancing Downlink Performance in Wireless Networks by Simultaneous Multiple Packet Transmission Zhenghao Zhang Yuanyuan Yang In this paper we consider using simultaneous Multiple Packet Transmission (MPT) to improve the downlink performance of wireless networks. With MPT, the sender can send two compatible packets simultaneously to two distinct receivers and can double the throughput in the ideal case. We formalize the problem of finding a schedule to send out buffered packets in minimum time as finding a maximum matching problem in a graph. Since maximum matching algorithms are relatively complex and may not meet the timing requirements of real time applications, we give a fast approximation algorithm that is capable of finding a matching at least 3/4 of the size of a maximum matching in $O( E )$ time where $ E $ is the number of edges in the graph. We also give analytical bounds for maximum allowable arrival rate which measures the speedup of the downlink after enhanced with MPT and our results show that the maximum arrival rate increases significantly even with a very small compatibility probability. We also use an approximate analytical model and simulations to study the average packet delay and our results show that packet delay can be greatly reduced even with a very small compatibility probability.||||||||||||./pdfs/1568974718-IPDPS-paper-1.pdf",
    "Monitoring Remotely Executing Shared Memory Programs in Software DSMs |Monitoring Remotely Executing Shared Memory Programs in Software DSMs Long Fei Xing Fang Y. Charlie Hu Samuel P. Midkiff Peer-to-Peer (P2P) cycle sharing over the Internet has become increasingly popular as a way to share idle cycles. A fundamental problem faced by P2P cycle sharing systems is how to incrementally monitor and verify, with low overhead, the execution of jobs submitted to a remote untrusted hosting machine, or cluster of machines. In this paper, we present the design and implementation of GridCop DSM, a novel incremental execution monitoring and verification scheme for software distributed shared memory (SDSM) programs running on remote clusters. Our scheme maximally leverages the shared memory abstraction provided by the SDSM system by extending the shared memory abstraction to the monitoring process by replicating one of the processes running on the host cluster to verify intermediate results at runtime. Our GridCop DSM employs two monitoring schemes: (i) a full-scale monitoring scheme that completely replicates the computation of a process running on the cluster, and (ii) a decoy monitoring scheme that deceives the host cluster into believing that full-scale monitoring is being performed without it ever actually being done, thereby incurring negligible overhead. Experiments show that the combined use of full-scale and decoy monitoring ensures faithful execution with low performance impact, even over a wide area network.||||||||||||./pdfs/1568974727-IPDPS-paper-1.pdf",
    "A Design of Overlay Anonymous Multicast Protocol |A Design of Overlay Anonymous Multicast Protocol Li Xiao Xiaomei Liu Wenjun Gu Dong Xuan Yunhao Liu Multicast services are demanded by a variety of applications. Many applications require anonymity during their communication. However, there has been very little work on anonymous multicasting and such services are not available yet. Since there are fundamental differences between multicast and unicast, the solutions proposed for anonymity in unicast communications cannot be directly applied to multicast applications. In this paper we define the anonymous multicast system, and propose a mutual anonymous multicast (MAM) protocol including the design of a unicast mutual anonymity protocol and construction and optimization of an anonymous multicast tree. MAM is self organizing and completely distributed. We define the attack model in an anonymous multicast system and analyze the anonymity degree. We also evaluate the performance of MAM by simulations.||||||||||||./pdfs/1568974730-IPDPS-paper-1.pdf",
    "Detecting Phases in Parallel Applications on Shared Memory Architectures |Detecting Phases in Parallel Applications on Shared Memory Architectures Erez Perelman Marzia Polito Jean-yves Bouguet John Sampson Brad Calder Carole Dulong Most programs are repetitive, where similar behavior can be seen at different execution times. Algorithms have been proposed that automatically group similar portions of a program's execution into phases, where samples of execution in the same phase have homogeneous behavior and similar resource requirements. In this paper, we examine applying these phase analysis algorithms and how to adapt them to parallel applications running on shared memory processors. Our approach relies on a separate representation of each thread's activity. We first focus on showing its ability to identify similar intervals of execution across threads for a single run. We then show that it is effective at identifying similar behavior of a program when the number of threads is varied between runs. This can be used by developers to examine how different phases scale across different number of threads. Finally, we examine using the phase analysis to pick simulation points to guide multi-threaded simulation.||||||||||||./pdfs/1568974737-IPDPS-paper-1.pdf",
    "On the Effectiveness of Speculative and Selective Memory Fences |On the Effectiveness of Speculative and Selective Memory Fences Oliver Trachsel Christoph Von Praun Thomas R. Gross Memory fences inhibit the reordering of memory accesses in modern microprocessors; fences are useful to implement synchronization and strong shared memory semantics in multi-threaded programs. A naive implementation of memory fences can result in a significant performance penalty for processors with deep pipelines supporting multiple concurrent memory accesses. The paper compares three techniques to reduce the impact of memory fences: (1) Read-speculation allows reads that follow a fence to be issued while the fence is being processed; (2) Write-ahead additionally allows writes following a fence to proceed early; (3) Selective fences distinguish between memory accesses to thread-local and shared memory and enforce ordering only among accesses to shared memory. We evaluate and compare the effectiveness of these techniques with a simulator derived from the Pentium~4 architecture. We report data for a storage model that uses memory fences to enforce the memory semantics at monitor boundaries.||||||||||||./pdfs/1568974784-IPDPS-paper-1.pdf",
    "Application-Oriented Adaptive MPIBcast for Grids |Application-Oriented Adaptive MPIBcast for Grids Rakhi Gupta Sathish Vadhiyar Due to the importance of collective communications in scientific parallel applications, many strategies have been devised for optimizing collective communications for different kinds of parallel environments. Recently, there has been an increasing interest to evolve efficient broadcast algorithms for computational Grids. In this paper, we present application-oriented adaptive techniques that take into account recent resource characteristics as well as the application's usage of broadcasts for deriving efficient broadcast trees. In particular, we consider two broadcast parameters used in the application, namely, the broadcast message sizes and the time interval between the broadcasts. The results indicate that our adaptive strategies can provide 20\\% average improvement in performance over the popular MPICH-G2's MPI\\_Bcast implementation for loaded network conditions.||||||||||||./pdfs/1568974796-IPDPS-paper-1.pdf",
    "Instability in Parallel Job Scheduling Simulation: The Role of Workload Flu|Instability in Parallel Job Scheduling Simulation: The Role of Workload Flurries Dan Tsafrir Dror G. Feitelson The performance of computer systems depends, among other things, on the workload. This motivates the use of real workloads (as recorded in activity logs) to drive simulations of new designs. Unfortunately, real workloads may contain various anomalies that contaminate the data. A previously unrecognized type of anomaly is workload flurries: rare surges of activity with a repetitive nature, caused by a single user, that dominate the workload for a relatively short period. We find that long workloads often include at least one such event. We show that in the context of parallel job scheduling these events can have a significant effect on performance evaluation results, e.g.\\ a very small perturbation of the simulation conditions might lead to a large and disproportional change in the outcome. This instability is due to jobs in the flurry being effected in unison, a consequence of the flurry's repetitive nature. We therefore advocate that flurries be filtered out before the workload is used, in order to achieve stable and more reliable evaluation results (analogously to the removal of outliers in statistical analysis). At the same time, we note that more research is needed on the possible effects of flurries.||||||||||||./pdfs/1568974798-IPDPS-paper-1.pdf",
    "Bitmap Indexes for Large Scientific Data Sets: A Case Study |Bitmap Indexes for Large Scientific Data Sets: A Case Study Rishi Rakesh Sinha Soumyadeb Mitra Marianne Winslett The data used by today's scientific applications are often very high in dimensionality and staggering in size. These characteristics necessitate the use of a good multidimensional indexing strategy to provide efficient access to the data. Researchers have previously proposed the use of bitmap indexes for high-dimension scientific data as a way of overcoming the drawbacks of traditional multidimensional indexes such as R-trees and KD-trees, which are bulky and whose performance does not scale well as the number of dimensions increases. However, the techniques proposed in previous work on bitmap indexes are not sufficient to address all problems that arise in practice. In experiments with real datasets, we experienced problems with index size and query performance. To overcome these shortcomings, we propose the use of adaptive, multilevel, multi-resolution bitmap indexes, and evaluate their performance in two scientific domains. Our preliminary experiments with a parallel query processor and index creator also show that it is very easy to parallelize a bitmap index.||||||||||||./pdfs/1568974804-IPDPS-paper-1.pdf",
    "Segment-Based Routing: An Efficient Fault-Tolerant Routing Algorithm for Me|Segment-Based Routing: An Efficient Fault-Tolerant Routing Algorithm for Meshes and Tori A. Mejia J. Flich J. Duato Sven-arne Reinemo Tor Skeie Computers get faster every year, but the demand for computing resources seems to grow at an even faster rate. Depending on the problem domain, this demand for more power can be satisfied by either, massively parallel computers, or clusters of computers. Common for both approaches is the dependence on high performance interconnect networks such as Myrinet, Infiniband, or 10 Gigabit Ethernet. While high throughput and low latency are key features of interconnection networks, the issue of fault-tolerance is now becoming increasingly important. As the number of network components grows so does the probability for failure, thus it becomes important to also consider the fault-tolerance mechanism of interconnection networks. The main challenge then lies in combining performance and fault-tolerance, while still keeping cost and complexity low. This paper proposes a new deterministic routing methodology for tori and meshes, which achieves high performance without the use of virtual channels. Furthermore, it is topology agnostic in nature, meaning it can handle any topology derived from any combination of faults when combined with static reconfiguration. The algorithm, referred to as Segment-based Routing (SR), works by partitioning a topology into subnets, and subnets into segments. This allows us to place bidirectional turn restrictions \\emph locally within a segment. As segments are independent, we gain the freedom to place turn restrictions within a segment independently from other segments. This results in a larger degree of freedom when placing turn restrictions compared to other routing strategies. In this paper a way to compute segment-based routing tables is presented and applied to meshes and tori. Evaluation results show that SR increases performance by a factor of 1.8 over FX and up*/down* routing.||||||||||||./pdfs/1568974838-IPDPS-paper-1.pdf",
    "A Segment-Based DSM Supporting Large Shared Object Space |A Segment-Based DSM Supporting Large Shared Object Space Benny Wang-leung Cheung Cho-li Wang This paper introduces a software DSM that can extend its shared object space exceeding 4GB in a 32-bit commodity cluster environment. This is achieved through the dynamic memory mapping mechanism, with local hard disks as backing store. We introduce the new concept of segments with intelligent splitting to reduce network traffic, false sharing as well as adapt better to the shared memory access patterns. A priority-based swapping algorithm is designed to reduce disk accesses for efficient dynamic memory mapping, and maximize the use of disk space as shared object space. A new queue-based scheme is also devised for efficient and simple management of memory blocks. The proposed solutions were implemented in LOTS V.2, and it can outperform its previous version when running small applications, while the maximum shared object space is increased to one-third of the total free disk space available among all the nodes.||||||||||||./pdfs/1568974839-IPDPS-paper-1.pdf",
    "Free Network Measurement For Adaptive Virtualized Distributed Computing |Free Network Measurement For Adaptive Virtualized Distributed Computing Ashish Gupta Marcia Zangrilli Ananth I. Sundararaj Anne I. Huang Peter A. Dinda Bruce B. Lowekamp An execution environment consisting of virtual machines (VMs) interconnected with a virtual overlay network can use the naturally occurring traffic of an existing, unmodified application running in the VMs to measure the underlying physical network. Based on these characterizations, and characterizations of the application's own communication topology, the execution environment can optimize the execution of the application using application-independent means such as VM migration and overlay topology changes. In this paper we demonstrate the feasibility of such free automatic network measurement by fusing the Wren passive monitoring and analysis system with Virtuoso's virtual networking system. We explain how Wren has been extended to support online analysis, and we explain how Virtuoso's adaptation algorithms have been enhanced to use Wren's physical network level information to choose VM-to-host mappings, overlay topology, and forwarding rules.||||||||||||./pdfs/1568974846-IPDPS-paper-1.pdf",
    "Helper Thread Prefetching for Loosely-Coupled Multiprocessor Systems |Helper Thread Prefetching for Loosely-Coupled Multiprocessor Systems Changhee Jung Daeseob Lim Jaejin Lee Yan Solihin This paper presents a helper thread prefetching scheme that is designed to work on loosely-coupled processors, such as in a standard chip multiprocessor (CMP) system or an intelligent memory system. Loosely-coupled processors have an advantage in that fine-grain resources, such as processor and L1 cache resources, are not contended by the application and helper threads, hence preserving the speed of the application. However, interprocessor communication is expensive in such a system. We present techniques to alleviate this. Our approach exploits large loop-based code regions and is based on a new synchronization mechanism between the application and helper threads. This mechanism precisely controls how far ahead the execution of the helper thread can be with respect to the application thread. We found that this is important in ensuring prefetching timeliness and avoiding cache pollution. To demonstrate that prefetching in a loosely-coupled system can be done effectively, we evaluate our prefetching in a standard, unmodified CMP system, and in an intelligent memory system where a simple processor in memory executes the helper thread. Evaluating our scheme with nine memory-intensive applications with the memory processor in DRAM achieves an average speedup of 1.25. Moreover, our scheme works well in combination with a conventional processorside sequential L1 prefetcher, resulting in an average speedup of 1.31. In a standard CMP, the scheme achieves an average speedup of 1.33.||||||||||||./pdfs/1568974848-IPDPS-paper-1.pdf",
    "Algorithmic Skeletons for Stream Programming in Embedded Heterogeneous Para|Algorithmic Skeletons for Stream Programming in Embedded Heterogeneous Parallel Image Processing Applications Wouter Caarls Pieter Jonker Henk Corporaal Algorithmic skeletons can be used to write architecture independent programs, shielding application developers from the details of a parallel implementation. In this paper, we present a C-like skeleton implementation language, PEPCI, that uses term rewriting and partial evaluation to specify skeletons for parallel C dialects. By using skeletons to control the iteration of kernel functions, we provide a stream programming language that is better tailored to the user as well as the underlying architecture. Skeleton merging allows us to reduce the overheads usually associated with breaking an application into small kernels. We have implemented an example image processing application on a heterogeneous embedded prototype platform consisting of an SIMD and ILP processor, and show that a significant speedup can be achieved without requiring knowledge of data parallel processing.||||||||||||./pdfs/1568974861-IPDPS-paper-1.pdf",
    "Composite Abortable Locks |Composite Abortable Locks Virendra J. Marathe Mark Moir Nir Shavit The need to allow threads to abort an attempt to acquire a lock (sometimes called a timeout) is an interesting new requirement driven by state-of-the-art database applications with soft real-time constraints. This paper presents a new \\textit composite abortable lock (CAL), a combination of abortable queue-based (QL) and test-and-set based backoff (BL) lock mechanisms, which provides non-blocking aborts while ensuring low space requirements without need for a memory reclamation scheme. The key observation motivating our approach is that the fast lock hand-off achieved by QLs only requires the first few threads to be queued (not \\emph all waiting threads), and that the remaining threads can run as in a BL. We developed an algorithm that uses only a short fixed size structure for queueing, allowing most threads to back-off. This reduces worst-case space overhead dramatically, and improves performance by eliminating the need for expensive and complicated memory management mechanisms. Experimental results show that our new CAL algorithm not only saves on space, it actually outperforms Scott's state-of-the-art nonblocking abortable QL under contention, and even more so when there are more threads than processors. Moreover, as the rate of lock aborts increases, the CAL continues to perform well, while Scott's algorithm deteriorates rapidly.||||||||||||./pdfs/1568974863-IPDPS-paper-1.pdf",
    "Exploiting Locality: A Flexible DSM Approach |Exploiting Locality: A Flexible DSM Approach H&aring;kan Zeffer Zoran Radovic Erik Hagersten No single coherence strategy suits all applications well. Many promising adaptive protocols and coherence predictors, capable of dynamically modifying the coherence strategy, have been suggested over the years. While most dynamic detection schemes rely on plentiful of dedicated hardware, the customization technique suggested in this paper requires no extra hardware support for its per-application coherence strategy. Instead, each application is profiled using a low-overhead profiling tool. The appropriate coherence flag setting, suggested by the profiling, is specified when the application is launched. We have compared the performance of a hardware DSM (Sun WildFire) to a software DSM built with identical interconnect hardware and coherence strategy. With no support for flexibility, the software DSM runs on average 45 percent slower than the hardware DSM on the 12 studied applications, while the flexibility can get the software DSM within 11 percent. Our all-software system outperforms the hardware DSM on four applications.||||||||||||./pdfs/1568974866-IPDPS-paper-1.pdf",
    "Accelerating Shape Optimizing Load Balancing for Parallel FEM Simulations b|Accelerating Shape Optimizing Load Balancing for Parallel FEM Simulations by Algebraic Multigrid Henning Meyerhenke Burkhard Monien Stefan Schamberger We propose a load balancing heuristic for parallel adaptive finite element method (FEM) simulations. In contrast to most existing approaches, the heuristic focuses on good partition shapes rather than on minimizing the classical edge-cut metric. By applying Algebraic Multigrid (AMG), we are able to speed up the two most time consuming calculations of the approach while maintaining its large amount of natural parallelism.||||||||||||./pdfs/1568974890-IPDPS-paper-1.pdf",
    "Making Lockless Synchronization Fast: Performance Implications of Memory Re|Making Lockless Synchronization Fast: Performance Implications of Memory Reclamation Thomas E. Hart Paul E. McKenney Angela Demke Brown Achieving high performance for concurrent applications on modern multiprocessors remains challenging. Many programmers avoid locking to improve performance, while others replace locks with non-blocking synchronization to protect against deadlock, priority inversion, and convoying. In both cases, dynamic data structures that avoid locking, require a \\emph memory reclamation scheme that reclaims nodes once they are no longer in use. The performance of existing memory reclamation schemes has not been thoroughly evaluated. We conduct the first fair and comprehensive comparison of three recent schemes---\\emph quiescent-state-based reclamation , \\emph epoch-based reclamation , and \\emph hazard-pointer-based reclamation ---using a flexible microbenchmark. Our results show that there is no globally optimal scheme. When evaluating lockless synchronization, programmers and algorithm designers should thus carefully consider the data structure, the workload, and the execution environment, each of which can dramatically affect memory reclamation performance.||||||||||||./pdfs/1568974892-IPDPS-paper-1.pdf",
    "Leakage-Aware Multiprocessor Scheduling for Low Power |Leakage-Aware Multiprocessor Scheduling for Low Power Pepijn De Langen Ben Juurlink It is expected that (single chip) multiprocessors will increasingly be deployed to realize high-performance embedded systems. Because in current technologies the dynamic power consumption dominates the static power dissipation, an effective technique to reduce energy consumption is to employ as many processors as possible in order to finish the tasks as early as possible, and to use the remaining time before the deadline (the slack) to apply voltage scaling. We refer to this heuristic as Schedule and Stretch (S\\&S). However, since the static power consumption is expected to become more significant, this approach will no longer be efficient when leakage current is taken into account. In this paper, we first show for which combinations of leakage current, supply voltage, and clock frequency the static power consumption dominates the dynamic power dissipation. These results imply that, at a certain point, it is no longer advantageous from an energy perspective to employ as many processors as possible. Thereafter, a heuristic is presented to schedule the tasks on a number of processors that minimizes the total energy consumption. Experimental results obtained using a public task graph benchmark set show that our leakage-aware scheduling algorithm reduces the total energy consumption by up to 24\\% for tight deadlines ($1.5$x the critical path length) and by up to 67\\% for loose deadlines ($8$x the critical path length) compared to S\\&S.||||||||||||./pdfs/1568974896-IPDPS-paper-1.pdf",
    "A Strategy proof Mechanism for Scheduling Divisible Loads in Tree Networks |A Strategy proof Mechanism for Scheduling Divisible Loads in Tree Networks Thomas E. Carroll Daniel Grosu The underlying assumption of Divisible Load Scheduling is that the processors composing the network are obedient, \\textit i.e. , they do not ``cheat'' the algorithm. This assumption is unrealistic if the processors are owned by autonomous, self-interested organizations that have no \\emph a priori motivation for cooperation and they will manipulate the algorithm if it is beneficial to do so. In this paper we propose the strategy proof mechanism DLS-TL for scheduling divisible loads in tree networks. Our proposal augments Divisible Load Theory (DLT) with incentives such that it is beneficial for processors to report their true processing capacity and compute their assignments at full processing capacity. Additionally, incentives are provided for processors to report algorithm deviants. Deviants are penalized which abates the processors' willingness to deviate.||||||||||||./pdfs/1568974898-IPDPS-paper-1.pdf",
    "D1HT: A Distributed One Hop Hash Table |D1HT: A Distributed One Hop Hash Table Luiz Rodolpho Monnerat Claudio Luis De Amorim Distributed Hash Tables (DHTs) have been used in a variety of applications, but most DHTs so far have opted to solve lookups with multiple hops, which sacrifices performance in order to keep little routing information and minimize maintenance traffic. In this paper, we introduce D1HT, a novel single hop DHT that is able to maximize performance with reasonable maintenance traffic overhead even for huge and dynamic peer-to-peer (P2P) systems. We formally define the algorithm we propose to detect and notify any membership change in the system, prove its correctness and performance properties, and present a Quarantine-like mechanism to reduce the overhead caused by volatile peers. Our analyses show that D1HT has reasonable maintenance bandwidth requirements even for very large systems, while presenting at least twice less bandwidth overhead than previous single hop DHT.||||||||||||./pdfs/1568974909-IPDPS-paper-1.pdf",
    "Centralized Versus Distributed Schedulers for Multiple bag-of-task applicat|Centralized Versus Distributed Schedulers for Multiple bag-of-task applications Olivier Beaumont Larry Carter Jeanne Ferrante Arnaud Legrand Loris Marchal Yves Robert Multiple applications that execute concurrently on heterogeneous platforms compete for CPU and network resources. In this paper we consider the problem of scheduling applications to ensure fair and efficient execution on a distributed network of processors. We limit our study to the case where communication is restricted to a tree embedded in the network, and the applications consist of a large number of independent tasks that originate at the tree?s root. The tasks of a given application all have the same computation and communication requirements, but these requirements can vary for different applications. Each application is given a weight that quantifies its relative value. The goal of scheduling is to maximize throughput while executing tasks from each application in the same ratio as their weights. We can find the optimal asymptotic rates by solving a linear program that expresses all necessary problem constraints, and we show how to construct a periodic schedule. For single-level trees, the solution is characterized by processing tasks with larger communication-to-computation ratios at children with larger bandwidths. For multi-level trees, this approach requires global knowledge of all application and platform parameters. For large-scale platforms, such global coordination by a centralized scheduler may be unrealistic. Thus, we also investigate decentralized schedulers that use only local information at each participating resource. We assess their performance via simulation, and compare to a centralized solution obtained via linear programming. The best of our decentralized heuristics achieves the same performance on about two-thirds of our test cases, but is far worse in a few cases. While our results are based on simplistic assumptions and do not explore all parameters (such as buffer size), they provide insight into the important question of fairly and optimally co-scheduling heterogeneous applications on heterogeneous grids.||||||||||||./pdfs/1568974914-IPDPS-paper-1.pdf",
    "Selecting the Tile Shape to Reduce the Total Communication Volume |Selecting the Tile Shape to Reduce the Total Communication Volume Nikolaos Drosinos Georgios Goumas Nectarios Koziris In this paper we revisit the tile-shape selection problem, that has been extensively discussed in bibliography. An efficient approach is proposed for the selection of a suitable tile shape, based on the minimization of the process communication volume. We consider the large family of applications that arise from the discretization of partial differential equations (PDEs). Practical experience has shown that for such applications and distributed memory architectures, minimizing the total communication volume is more important than minimizing the total number of parallel execution steps. We formulate a new method to determine an appropriate communication-aware tile shape, i.e. the one that reduces the communication volume for a fixed number of processes. Our approach is equivalent to defining a proper Cartesian process grid with MPI\\_Cart\\_Create, which means that it can be incorporated in applications in a straightforward manner. Our experimental results illustrate that by selecting the tile shape with the proposed method, the total parallel execution time is significantly reduced due to the minimization of the communication volume, despite the fact that a few more parallel execution steps are required.||||||||||||./pdfs/1568974919-IPDPS-paper-1.pdf",
    "Effective Out-of-Core Parallel Delaunay Mesh Refinement using Off-the-Shelf|Effective Out-of-Core Parallel Delaunay Mesh Refinement using Off-the-Shelf Software Andriy Kot Andrey Chernikov Nikos Chrisochoides We present two cost-effective and high-performance out-of-core parallel mesh generation algorithms and their implementation on Cluster of Workstations (CoWs). The total wall-clock time including wait-in-queue delays for the out-of-core methods on a small cluster (16 processors) is three times shorter than the total wall-clock time for the in-core generation of the same size mesh (about a billion elements) using 121 processors. Our best out-of-core method, for mesh sizes that fit completely in the core of the CoWs, is about 5\\% slower than its in-core parallel counterpart method. This is a modest performance penalty for savings of many hours in response time. Both the in-core and out-of-core methods use the best publicly available off-the-shelf sequential in-core Delaunay mesh generator.||||||||||||./pdfs/1568974922-IPDPS-paper-1.pdf",
    "Parallel FPGA-based All-Pairs Shortest-Paths in a Directed Graph |Parallel FPGA-based All-Pairs Shortest-Paths in a Directed Graph Uday Bondhugula Ananth Devulapalli Joseph Fernando Pete Wyckoff P. Sadayappan With rapid advances in VLSI technology, Field Programmable Gate Arrays (FPGAs) are receiving the attention of the Parallel and High Performance Computing community. In this paper, we propose a highly parallel FPGA design for the Floyd-Warshall algorithm to solve the all-pairs shortest-paths problem in a directed graph. Our work is motivated by a computationally intensive bio-informatics application that employs this algorithm. The design we propose makes efficient and maximal utilization of the large amount of resources available on an FPGA to maximize parallelism in the presence of significant data dependences. Experimental results from a working FPGA implementation on the Cray XD1 show a speedup of 22 over execution on the XD1's processor.||||||||||||./pdfs/1568974923-IPDPS-paper-1.pdf",
    "Acceleration of a Content-Based Image-Retrieval Application on the RDISK Cl|Acceleration of a Content-Based Image-Retrieval Application on the RDISK Cluster Auguste Noumsi Steven Derrien Patrice Quinton Because of the growing use of multimedia content over Internet, Content-Based Image Retrieval (CBIR) has recently received a lot of interest. While accurate search techniques based on local image descriptors exist, they suffer from very long execution time. We propose to accelerate CBIR on the RDISK machine, a cluster of FPGA-enhanced hard-drives, that follows the philosophy of smart-disks. Our platform combines coarse and fine grain parallelism thanks to the concurrent use of the cluster nodes and of a programmable logic device. The implementation of the CBIR application on this mixed hardware/software platform follows a strict methodology, that was validated on realistic data-set (image database of more than 30,000 images). This methodology allows us to adapt the original algorithm to suit a hardware implementation, and to select the values of some key design parameters to maximize global performance. Our preliminary results indicate that speed-ups between 120 and 200 could be obtained for a cluster of 32 nodes compared with a software implementation running on a standard desktop PC.||||||||||||./pdfs/1568974924-IPDPS-paper-1.pdf",
    "Structural and Algorithmic Issues of Dynamic Protocol Update |Structural and Algorithmic Issues of Dynamic Protocol Update Olivier R&uuml;tti Pawel T. Wojciechowski Andr&eacute; Schiper In this paper, we study dynamic protocol update (DPU). Contrary to local code updates on-the-fly, DPU requires global coordination of local code replacements. We propose a novel solution to DPU. The key idea is to add a level of indirection between the service callers and the service provider. This indirection level facilitates an implementation of simple and efficient algorithms for DPU. For example, we describe an experimental implementation of adaptive group communication middleware. It can switch between different atomic broadcast protocols on-the-fly. All middleware protocols, including those that depend on the updated protocols, provide service correctly and with negligible delay while the global update takes places. The switching algorithm introduces very low overhead that we illustrate by showing example measurement results.||||||||||||./pdfs/1568974928-IPDPS-paper-1.pdf",
    "Empowering a Helper Cluster through Data-Width Aware Instruction Selection |Empowering a Helper Cluster through Data-Width Aware Instruction Selection Policies Osman S. Unsal Xavier Vera Antonio Gonz&aacute;lez Oguz Ergin Narrow values that can be represented by less number of bits than the full machine width occur very frequently in programs. On the other hand, clustering mechanisms enable cost- and performance-effective scaling of processor back-end features. Those attributes can be combined synergistically to design special clusters operating on narrow values (a.k.a. Helper Cluster), potentially providing performance benefits. We complement a 32-bit monolithic processor with a low-complexity 8-bit Helper Cluster. Then, in our main focus, we propose various ideas to select suitable instructions to execute in the data-width based clusters. We add data-width information as another instruction steering decision metric and introduce new data-width based selection algorithms which also consider dependency, inter-cluster communication and load imbalance. Utilizing those techniques, the performance of a wide range of workloads are substantially increased; Helper Cluster achieves an average speedup of 11\\% for a wide range of 412 apps. When focusing on integer applications, the speedup can be as high as 22\\% on average.||||||||||||./pdfs/1568974931-IPDPS-paper-1.pdf",
    "Compatible Phase Co-Scheduling on a CMP of Multi-Threaded Processors |Compatible Phase Co-Scheduling on a CMP of Multi-Threaded Processors Ali El-moursy Rajeev Garg David H. Albonesi Sandhya Dwarkadas The industry is rapidly moving towards the adoption of Chip Multi-Processors (CMPs) of Simultaneous Multi-Threaded (SMT) cores for general purpose systems. The most prominent use of such processors, at least in the near term, will be as job servers running multiple independent threads on the different contexts of the various SMT cores. In such an environment, the co-scheduling of phases from different threads plays a significant role in the overall throughput. Less throughput is achieved when phases from different threads that conflict for particular hardware resources are scheduled together, compared with the situation where compatible phases are co-scheduled on the same SMT core. Achieving the latter requires precise per-phase hardware statistics that the scheduler can use to rapidly identify possible incompatibilities among phases of different threads, thereby avoiding the potentially high performance cost of inter-thread contention. In this paper, we devise phase co-scheduling policies for a dual-core CMP of dual-threaded SMT processors. We explore a number of approaches and find that the use of ready and in-flight instruction metrics permits effective co-scheduling of compatible phases among the four contexts. This approach significantly outperforms the worst static grouping of threads, and very closely matches the best static grouping, even outperforming it by as much as 7\\%.||||||||||||./pdfs/1568974937-IPDPS-paper-1.pdf",
    "Enabling Efficient and Flexible Coupling of Parallel Scientific Application|Enabling Efficient and Flexible Coupling of Parallel Scientific Applications Li Zhang Manish Parashar Emerging scientific and engineering simulations are presenting challenging requirements for coupling between multiple physics models and associated parallel codes that execute independently and in a distributed manner. Realizing coupled simulations requires an efficient, flexible and scalable coupling framework and simple programming abstractions. This paper presents a coupling framework that addresses these requirements. The framework is based on the Seine geometry-based interaction model. It enables efficient computation of communication schedules, supports low-overheads processor-to-processor data streaming, and provides high-level abstraction for application developers. The design, CCA-based implementation, and experimental evaluation of the Seine based coupling framework are presented.||||||||||||./pdfs/1568974938-IPDPS-paper-1.pdf",
    "Comparative Study of Price-based Resource Allocation Algorithms for Ad Hoc |Comparative Study of Price-based Resource Allocation Algorithms for Ad Hoc Networks Marcel Luethi Simin Nadjm-tehrani Calin Curescu As mobile ad hoc networks provide a wide range of possibly critical services, providing quality of service guarantees becomes an essential element. Yet there is a limited understanding of the performance characteristics of different resource allocation algorithms. In particular, there is little work that comparatively studies different algorithms in the same traffic environment. Therefore we study two algorithms, adhoc-TARA and an algorithm based on the gradient projection method, for optimised bandwidth allocation in ad hoc networks under overload situations. The focus is on convergence properties and performance measured in terms of accumulated utility. The simulation results show that the gradient projection algorithm converges to an optimal solution even in large, dynamic networks, but that in such dynamic environments the convergence time can significantly influence the overall performance. In comparison, the near-optimal algorithm adhoc-TARA, which quickly adapts to changes in the state of the network, can exhibit superior performance. Further we illustrate how different parameter settings influence the performance of the algorithms. We conclude that finding an optimal allocation comes at a high price in the rapidly changing environments of ad hoc networks and that near-optimal allocation can be an ample alternative.||||||||||||./pdfs/1568974939-IPDPS-paper-1.pdf",
    "Grid Solutions for Biological and Physical Cross-site Simulations on the Te|Grid Solutions for Biological and Physical Cross-site Simulations on the TeraGrid S. Dong N.t. Karonis G.e. Karniadakis Computational grids and grid middleware offer unprecedented computational power and storage capacity, and thus, have opened the possibility of solving problems that were previously not possible on even the largest single computational resources. These opportunities notwithstanding, the development of grid applications that run efficiently remains a challenge due to the inherent heterogeneity of networks and system architectures inherent in such environments. We present grid solutions to two grand challenge problems in computational mechanics. To study the scalability of our solutions we implemented both as MPI applications and ran them on the TeraGrid using NEKTAR and MPICH-G2. We present the results of our study which demonstrate near linear scalability in both applications when run across multiple TeraGrid sites and at a scale of hundreds or processors.||||||||||||./pdfs/1568974943-IPDPS-paper-1.pdf",
    "Exploiting Programmable Network Interfaces for Parallel Query Execution in |Exploiting Programmable Network Interfaces for Parallel Query Execution in Workstation Clusters Santhosh Kumar M. J. Thazhuthaveetil R. Govindarajan Workstation clusters equipped with high performance interconnect having programmable network processors facilitate interesting opportunities to enhance the performance of parallel application run on them. In this paper, we propose schemes where certain application level processing in parallel database query execution is performed on the network processor. We evaluate the performance of TPC-H queries executing on a high end cluster where all tuple processing is done on the host processor using a timed Petri net model, and find that tuple processing costs on the host processor dominate the execution time. These results are validated using a small cluster. We therefore propose 4 schemes where certain tuple processing activity is offloaded to the network processor. The first 2 schemes offload the tuple splitting activity -- computation to identify the node on which to process the tuples, resulting in an execution time speedup of 1.09 relative to the base scheme, but with I/O bus becoming the bottleneck resource. In the 3rd scheme in addition to offloading tuple processing activity, the disk and network interface are combined to avoid the I/O bus bottleneck, which results in speedups upto 1.16, but with high host processor utilization. Our 4th scheme where the network processor also performs a part of join operation along with the host processor, gives a speedup of 1.47\\% along with balanced system resource utilizations. Further we observe that the proposed schemes perform equally well even in a scaled architecture i.e., when the number of processors is increased from 2 to 64.||||||||||||./pdfs/1568974947-IPDPS-paper-1.pdf",
    "Optimizing Bandwidth Limited Problems Using One-Sided Communication and Ove|Optimizing Bandwidth Limited Problems Using One-Sided Communication and Overlap Christian Bell Dan Bonachea Rajesh Nishtala Katherine Yelick Partitioned Global Address Space languages like Unified Parallel C (UPC) are typically valued for their expressiveness, especially for computations with fine-grained random accesses. In this paper we show that the one-sided communication model used in these languages also has a significant performance advantage for bandwidth-limited applications. We demonstrate this benefit through communication microbenchmarks and a case-study that compares UPC and MPI implementations of the NAS Fourier Transform (FT) benchmark. Our optimizations rely on aggressively overlapping communication with computation but spreading communication events throughout the course of the local computation. This alleviates the potential communication bottleneck that occurs when the communication is packed into a single phase (e.g., the large all-to-all in a multidimensional FFT). Even though the new algorithms require more messages for the same total volume of data, the resulting overlap leads to speedups of over $1.75\\times$ and $1.9\\times$ for the two-sided and one-sided implementations, respectively, when compared to the default NAS Fortran/MPI release. Our best one-sided implementations show an average improvement of $15\\%$ over our best two-sided implementations. We attribute this difference to the lower software overhead of one-sided communication, which is partly fundamental to the semantic difference between one-sided and two-sided communication. Our UPC results use the Berkeley UPC compiler with the GASNet communication system, and demonstrate the portability and scalability of that language and implementation, with performance approaching 0.5 TFlop/s on the FT benchmark running on 512 processors.||||||||||||./pdfs/1568974954-IPDPS-paper-1.pdf",
    "Necessary and Sufficient Conditions for 1-adaptivity |Necessary and Sufficient Conditions for 1-adaptivity Joffroy Beauquier Sylvie Delaet Sammy Haddad A 1-adaptive self-stabilizing system is a self-stabilizing system that can correct any memory corruption of a single process in one computation step. 1-adaptivity means that if in a legitimate state the memory of a single process is corrupted, then the next system transition will lead to a legitimate state and the system will recover a correct behavior. Thus 1-adaptive self-stabilizing algorithms guarantee the very strong property that a single fault is corrected immediately and consequently that it cannot be propagated. Our aim here is to study necessary and sufficient conditions to obtain that property in order to design such algorithms. In particular we show that this property can be obtained even under the distributed demon and that it can also be applied to probabilistic algorithms. We provide two self-stabilizing 1-adaptive algorithms that demonstrate how the conditions we present here can be used to design and prove 1-adaptive algorithms.||||||||||||./pdfs/1568974955-IPDPS-paper-1.pdf",
    "Improving Cache Locality for Thread-Level Speculation |Improving Cache Locality for Thread-Level Speculation Stanley L. C. Fung J. Gregory Steffan With the advent of chip-multiprocessors (CMPs), \\em Thread-Level Speculation (TLS) remains a promising technique for exploiting this highly multithreaded hardware to improve the performance of an individual program. However, with such speculatively-parallel execution the cache locality once enjoyed by the original uniprocessor execution is significantly disrupted: for TLS execution on a four-processor CMP, we find that the data-cache miss rates are nearly four-times those of the uniprocessor case, even though TLS execution utilizes four private data caches (i.e., four-fold greater cache capacity). We break down the TLS cache locality problem into instruction and data cache, execution stages, and parallel access patterns, and propose methods to improve cache locality in each of these areas. We find that for parallel regions across 13 SPECint applications our simple and low-cost techniques reduce data-cache misses by 38\\%, improve performance by 12.8\\%, and significantly improve scalability---further enhancing the feasibility of TLS as a way to capitalize on future CMPs.||||||||||||./pdfs/1568974956-IPDPS-paper-1.pdf",
    "A Compiler-based Communication Analysis Approach for Multiprocessor Systems|A Compiler-based Communication Analysis Approach for Multiprocessor Systems Shuyi Shao Alex K. Jones Rami Melhem In this paper we describe a compiler framework which can identify communication patterns for MPI-based parallel applications. This has the potential of providing significant performance benefits when connections can be established in the network prior to the actual communication operation. Our compiler uses a flexible and powerful communication pattern representation scheme that can capture the property of communication patterns and allows manipulations of these patterns. In this way, communication phases can be detected and logically separated within the application. Additionally, we extend the classification of static and dynamic communication patterns and operations to include persistent communications. Persistent communications appear dynamic, however, they remain unchanged for large segments of the application execution. Our compiler is capable of detecting both static and persistent communication patterns within an application. We show that for the NAS Parallel Benchmarks, 100\\% of the point-to-point communications can be classified as either static or persistent and, with the exception of IS, 100\\% of the collective were either static or persistent. By comparison to application trace data, the predicted LBMHD, CG and MG communication patterns have been verified.||||||||||||./pdfs/1568974965-IPDPS-paper-1.pdf",
    "An Authentication Protocol in Web-computing |An Authentication Protocol in Web-computing Siman Wong A web-computing system (WCS) allows a host with limited resources to perform CPU intensive tasks by outsourcing the computations to external clients. But not every client is trusted, and redundancy in task assignment and auditing of results are needed to ensure the integrity of the results. This raises the question as to the efficiency and reliability of the system as measured against a given unit of the host's auditing time or cost. In this paper we propose a WCS with low overhead and has favorable error rate compared to a majority-voting scheme with similar efficiency. We can reduce the error rate by re-authenticating the results without having to resubmit any jobs, and we have an auditing strategy that in many cases is probabilistically better than random sampling.||||||||||||./pdfs/1568974974-IPDPS-paper-1.pdf",
    "Compiler Assisted Dynamic Management of Registers for Network Processors |Compiler Assisted Dynamic Management of Registers for Network Processors Ryan Collins Fernando Alegre Xiaotong Zhuang Santosh Pande Modern network processors support high levels of parallelism in packet processing by supporting multiple threads that execute on a micro-engine. Threads switch context upon encountering long latency memory accesses and this way the parallelism and memory access can be overlapped. Context switches in the typical network processor architectures such as the IXP are designed to be very fast. However, the low overhead is partly achieved by leaving register management to programs, with minimal support from the hardware. The complexity of the multi-engine, multi-threaded environment makes manual register management a daunting task, which is better left to a compiler. However, a purely static analysis is unable to achieve full utilization of the register file due to conservative estimates of liveness. A register that is live across a context switch point must be considered live for the duration of all other threads, and so it must be assumed to be unavailable to other threads. In addition, aliasing further reduces the effectiveness of static analysis. The net effect is a large number of idle cycles that are still present after static optimization. We propose a dynamic solution that requires minimal software and hardware support. On the software side, we take a pre-allocated binary file and annotate the potential context switch instructions with information about the dead registers. On the hardware side, we try to rename the transfer registers and addresses to dead general purpose registers and update the usage of registers. We then replace the long-latency memory instructions with fast move instructions in the architecture using the dynamic context. The results show up to 51\\% reduction in idle cycles and up to 14\\% increase in the throughput for hand coded applications on Intel IXP 1200 network processor.||||||||||||./pdfs/1568974980-IPDPS-paper-1.pdf",
    "Cooperative Checkpointing Theory |Cooperative Checkpointing Theory Adam J. Oliner Larry Rudolph Ramendra K. Sahoo \\emph Cooperative checkpointing uses global knowledge of the state and health of the machine to improve performance and reliability by dynamically deciding when to skip checkpoint requests made by applications. Using results from cooperative checkpointing theory, this paper proves that periodic checkpointing is not expected to be competitive with the offline optimal. By leveraging probabilistic information about the future, cooperative checkpointing gives flexible algorithms that are optimally competitive. The results prove that simulating periodic checkpointing, by performing only every $d^ th $ checkpoint, is not competitive with the offline optimal in the worst case; a simple modification gives a provably competitive algorithm. Calculations using failure traces from a prototype of IBM's Blue Gene/L show an application using cooperative checkpointing may make progress 4 times faster than one using periodic checkpointing, under realistic conditions. We contribute an approach to providing large-scale system reliability through cooperative checkpointing and techniques for analyzing the approach.||||||||||||./pdfs/1568974982-IPDPS-paper-1.pdf",
    "The Interleaved Authentication for Filtering False Reports in Multipath Rou|The Interleaved Authentication for Filtering False Reports in Multipath Routing based Sensor Networks Youtao Zhang Jun Yang Hai T Vu In this paper, we consider filtering false reports in braided multipath routing sensor networks. While multipath routing provides better resilience to various faults in sensor networks, it has two problems regarding the authentication design. One is that, due to the large number of partially overlapped routing paths between the source and sink nodes, the authentication overhead could be very high if these paths are authenticated individually; the other is that false reports may escape the authentication check through the newly identified node association attack. In this paper we propose enhancements to solve both problems such that secure and efficient authentication can be achieved in multipath routing. The proposed scheme is (t+1)-resilient, i.e. it is secure with up to t compromised nodes. The upper bound number of hops that a false report may be forwarded in the network is O($t^2$).||||||||||||./pdfs/1568974985-IPDPS-paper-1.pdf",
    "MPI-IO/L: Efficient Remote I/O for MPI-IO via Logistical Networking |MPI-IO/L: Efficient Remote I/O for MPI-IO via Logistical Networking Jonghyun Lee Robert Ross Scott Atchley Micah Beck Rajeev Thakur Scientific applications often need to access remotely located files, but many remote I/O systems lack standard APIs that allow efficient and direct access from application codes. This work presents MPI-IO/L, a remote I/O facility for MPI-IO using Logistical Networking. This combination not only provides high-performance and direct remote I/O using the standard parallel I/O interface but also offers convenient management and sharing of remote files. We show the performance trade-offs with various remote I/O approaches implemented in the system, which can help scientists identify preferable I/O options for their own applications. We also discuss how Logistical Networking could be improved to work better with parallel I/O systems such as ROMIO.||||||||||||./pdfs/1568975000-IPDPS-paper-1.pdf",
    "Application Classification through Monitoring and Learning of Resource Cons|Application Classification through Monitoring and Learning of Resource Consumption Patterns Jian Zhang Renato Figueiredo Application awareness is an important factor of efficient resource scheduling. This paper introduces a novel approach for application classification based on the Principal Component Analysis (PCA) and the k-Nearest Neighbor (k-NN) classifier. This approach is used to assist scheduling in heterogeneous computing environments. It helps to reduce the dimensionality of the performance feature space and classify applications based on extracted features. The classification considers four dimensions: CPU-intensive, I/O and paging-intensive, network-intensive, and idle. Application class information and the statistical abstracts of the application behavior are learned over historical runs and used to assist multi-dimensional resource scheduling. This paper describes a prototype classifier for application-centric Virtual Machines. Experimental results show that scheduling decisions made with the assistance of the application class information, improved system throughput by 22.11\\% on average, for a set of three benchmark applications.||||||||||||./pdfs/1568975003-IPDPS-paper-1.pdf",
    "Evaluating I/O Characteristics and Methods for Storing Structured Scienti|Evaluating I/O Characteristics and Methods for Storing Structured Scientific Data Avery Ching Alok Choudhary Wei-keng Liao Lee Ward Neil Pundit Many large-scale scientific simulations generate large, structured multi-dimensional datasets. Data is stored at various intervals on high performance I/O storage systems for checkpointing, post-processing, and visualization. Data storage is very I/O intensive and can dominate the overall running time of an application, depending on the characteristics of the I/O access pattern. Our NCIO benchmark determines how I/O characteristics greatly affect performance (up to 2 orders of magnitude) and provides scientific application developers with guidelines for improvement. In this paper, we examine the impact of various I/O parameters and methods when using the MPI-IO interface to store structured scientific data in an optimized parallel file system.||||||||||||./pdfs/1568975004-IPDPS-paper-1.pdf",
    "Real-Time Task Mapping and Scheduling for Collaborative In-Network Processi|Real-Time Task Mapping and Scheduling for Collaborative In-Network Processing in DVS-Enabled Wireless Sensor Networks Yuan Tian Jarupan Boangoat Eylem Ekici Fusun Ozguner With the increasing importance of energy consumption considerations and new requirements of emerging applications, in-network processing of information gains recognition as a viable solution for Wireless Sensor Networks (WSNs). The required processing capability can be achieved through locally collaborative information processing among sensors. Task mapping and scheduling plays an important role in efficient collaborative information processing. Although task mapping and scheduling in wired networks of processors has been well studied in the past, its counterpart for WSNs remains largely unexplored. In this paper, a task mapping and scheduling solution for real-time applications in WSNs, Real-time Task Mapping and Scheduling (RT-MapS), is presented. RT-MapS incorporates wireless channel modeling, Hyper-DAG extension, concurrent task mapping, communication and computation scheduling, and Dynamic Voltage Scaling (DVS) methods. Simulation results show significant performance improvements compared with existing mechanisms in terms of providing deadline guarantee with minimum energy consumption.||||||||||||./pdfs/1568975006-IPDPS-paper-1.pdf",
    "Distributed Coloring in &Otilde; |Distributed Coloring in &Otilde; ( &#214; log n ) Bit Rounds Kishore Kothapalli Melih Onus Christian Scheideler Christian Schindelhauer We consider the well-known vertex coloring problem: given a graph $G$, find a coloring of the vertices so that no two neighbors in $G$ have the same color. Distributed algorithms that find a $(\\Delta+1)$-coloring in a logarithmic number of communication rounds, with high probability (w.h.p), are known since more than a decade. But what if the edges have orientations, i.e., the endpoints of an edge agree on its orientation? Interestingly, for the cycle in which all edges have the same orientation, we show that a simple randomized algorithm can achieve a 3-coloring with only $O(\\sqrt \\log n )$ rounds of bit transmissions w.h.p. This result is tight because we also show that the bit complexity of coloring an oriented cycle is $\\Omega(\\sqrt \\log n )$, w.h.p., no matter how many colors are allowed. The 3-coloring algorithm can be easily extended to provide a $(\\Delta+1)$-coloring for oriented graphs of maximum degree $\\Delta$ in $O(\\sqrt \\log n )$ rounds of bit transmissions, w.h.p., if $\\Delta$ is a constant, and the graph does not contain an oriented cycle of length less than $\\sqrt \\log n $. Using more complex algorithms, we show how to obtain an $O(\\Delta)$-coloring for arbitrary oriented graphs with maximum degree $\\Delta$, and with no oriented cycles of length at most $\\sqrt \\log n $, using essentially $O(\\log \\Delta+ \\sqrt \\log n )$ rounds of bit transmissions.||||||||||||./pdfs/1568975008-IPDPS-paper-1.pdf",
    "Achieving Strong Scaling with NAMD on Blue Gene/L |Achieving Strong Scaling with NAMD on Blue Gene/L Sameer Kumar Chao Huang Gheorghe Almasi Laxmikant V. Kale NAMD is a scalable molecular dynamics application, which has demonstrated its performance on several parallel computer architectures. Strong scaling is necessary for molecular dynamics as problem size is fixed, and a large number of iterations need to be executed to understand interesting biological phenomenon. The Blue Gene/L machine is a massive source of compute power. It consists of tens of thousands of embedded Power PC 440 processors. In this paper, we present several techniques to scale NAMD to 8192 processors of Blue Gene/L. These include topology specific optimizations, new messaging protocols, load-balancing, and overlap of computation and communication. We were able to achieve 1.2 TF of peak performance for cutoff simulations and 0.99 TF with PME.||||||||||||./pdfs/1568975014-IPDPS-paper-1.pdf",
    "Evaluation of UDDI as a Provider of Resource Discovery Services for OGSA-ba|Evaluation of UDDI as a Provider of Resource Discovery Services for OGSA-based Grids Edward Benson Glenn Wasson Marty Humphrey Grid computing involves networks of heterogeneous resources working in collaboration to solve problems that cannot be addressed by the resources of any one organization. A pervasive problem for Grid users is how best to discover the resources they need given dynamic Grid environments. UDDI, the Universal Description, Discovery and Integration framework, is an OASIS standard for publishing and querying discovery information for Web services, which to date, has received surprisingly little analysis as a discovery mechanism for Web service-based Grids, e.g. those based on the Open Grid Services Architecture (OGSA). This work identifies issues that must be addressed in order to make UDDI meet the requirements of OGSA discovery. We examine the performance implications of these issues using a freely available implementation of UDDI version 2. Based on our experimental results, we conclude that UDDI can be used for OGSA discovery, but the cost may be prohibitive for large Grids.||||||||||||./pdfs/1568975015-IPDPS-paper-1.pdf",
    "Quantifying and Reducing the Effects of Wrong-Path Memory References in Cac|Quantifying and Reducing the Effects of Wrong-Path Memory References in Cache-Coherent Multiprocessor Systems Resit Sendag Ayse Yilmazer Joshua J. Yi Augustus K. Uht High-performance multiprocessor systems built around out-of-order processors with aggressive branch predictors execute many memory references that turn out to be on a mispredicted branch path. Previous work that focused on uniprocessors showed that these wrong-path memory references may pollute the caches by bringing in data that are not needed on the correct execution path and by evicting useful data or instructions. Additionally, they may also increase the amount of cache and memory traffic. On the positive side, however, they may have a prefetching effect for memory references on the correct path. While computer architects have thoroughly studied the impact of wrong-path effects in uniprocessor systems, there is no previous work on its effects in multiprocessor systems. In this paper, we explore the effects of wrong-path memory references on the memory system behavior of shared-memory multiprocessor (SMP) systems for both broadcast and directory-based cache coherence. Our results show that these wrong-path memory references can increase the amount of cache-to-cache transfers by 32\\%, invalidations by 8\\% and 20\\% for broadcast and directory-based SMPs, respectively, and the number of writebacks by up to 67\\% for both systems. In addition to the extra coherence traffic, wrong-path memory references also increase the number of cache line state transitions by 21\\% and 32\\% for broadcast and directory-based SMPs, respectively. In order to reduce the performance impact of these wrong-path memory references, we introduce two simple mechanisms ? filtering wrong-path blocks that are not likely-to-be-used and wrong-path aware cache replacement ? that yield speedups of up to 37\\%.||||||||||||./pdfs/1568975017-IPDPS-paper-1.pdf",
    "Network Uncertainty in Selfish Routing |Network Uncertainty in Selfish Routing Chryssis Georgiou Theophanis Pavlides Anna Philippou We study the problem of selfish routing in the presence of incomplete network information. Our model consists of a number of users who wish to route their traffic on a network of $m$ parallel links with the objective of minimizing their latency. However, in doing so, they face the challenge of lack of precise information on the capacity of the network links. This uncertainty is modelled via a set of probability distributions over all the possibilities, one for each user. The resulting model is an amalgamation of the KP-model of~[Koutsoupias and Papadimitriou, 1999] and the congestion games with user-specific functions of~[Milchtaich, 1996]. We embark on a study of Nash equilibria and the price of anarchy in this new model. In particular, we propose polynomial-time algorithms for computing some special cases of pure Nash equilibria and we show that negative results of~[Milchtaich, 1996], for the non-existence of pure Nash equilibria in the case of three users, do not apply to our model. Consequently, we propose an interesting open problem in this area, that of the existence of pure Nash equilibria in the general case of our model. Furthermore, we consider appropriate notions for the social cost and the price of anarchy and obtain upper bounds for the latter. With respect to fully mixed Nash equilibria, we propose a method to compute them and show that when they exist they are unique. Finally we prove that the fully mixed Nash equilibrium maximizes the social welfare.||||||||||||./pdfs/1568975021-IPDPS-paper-1.pdf",
    "Performance Analysis of Parallel Programs via Message-passing Graph Travers|Performance Analysis of Parallel Programs via Message-passing Graph Traversal Matthew J. Sottile Vaddadi P. Chandu David A. Bader The ability to understand the factors contributing to parallel program performance are vital for understanding the impact of machine parameters on the performance of specific applications. We propose a methodology for analyzing the performance characteristics of parallel programs based on message-passing traces of their execution on a set of processors. Using this methodology, we explore how perturbations in both single processor performance and the messaging layer impact the performance of the traced run. This analysis provides a quantitative description of the sensitivity of applications to a variety of performance parameters to better understand the range of systems upon which an application can be expected to perform well. These performance parameters include operating system interference and variability in message latencies within the interconnection network layer.||||||||||||./pdfs/1568975027-IPDPS-paper-1.pdf",
    "Parallel ICA Methods for EEG Neuroimaging |Parallel ICA Methods for EEG Neuroimaging Dan B. Keith Christian C. Hoge Robert M. Frank Allen D. Malony \\textit HiPerSAT , a C++ library and tools, processes EEG data sets with ICA (Independent Component Analysis) methods. \\textit HiPerSAT uses \\textbf BLAS , \\textbf LAPACK , \\textbf MPI and \\textbf OpenMP to achieve a high performance solution that exploits parallel hardware. ICA is a class of methods for analyzing a large set of data samples and extracting independent components that explain the observed data. ICA is used in EEG research for data cleaning and separation of spatiotemporal patterns that may reflect different underlying neural processes. We present two ICA implementations (FastICA and Infomax) that exploit parallelism to provide an EEG component decomposition solution of higher performance and data capacity than current MATLAB-based implementations. Experimental results and the methodology used to obtain them are presented. Integrating HiPerSAT with \\textbf EEGLAB is described, as well as future plans for this research.||||||||||||./pdfs/1568975028-IPDPS-paper-1.pdf",
    "MPEG-2 Decoding in a Stream Programming Language |MPEG-2 Decoding in a Stream Programming Language Matthew Drake Hank Hoffmann Rodric Rabbah Saman Amarasinghe Image and video codecs are prevalent in multimedia devices, ranging from embedded systems, to desktop computers, to high-end servers such as HDTV editing consoles. It is not uncommon however that developers create and customize separate coder and decoder implementations for each of the architectures they target. This practice is time consuming and error prone, leading to code that is neither malleable nor portable. This paper describes an implementation of the MPEG-2 decoder using the StreamIt programming language. StreamIt is an architecture-independent stream language that aims to improve programmer productivity, while concomitantly exposing the inherent parallelism and communication topology of the application. The paper shows that MPEG is a good match for the streaming programming model and illustrates the malleability of the implementation using a simple modification to the decoder to support alternate color compression formats. StreamIt allows for modular application development, which increases code reuse, and reduces the complexity of the debugging process since stream components can be verified independently. This in turn leads to greater programmer productivity.||||||||||||./pdfs/1568975031-IPDPS-paper-1.pdf",
    "Early Evaluation of the Cray XT3 |Early Evaluation of the Cray XT3 Jeffrey S. Vetter Sadaf R. Alam Thomas H. Dunigan, Jr. Mark R. Fahey Philip C. Roth Patrick H. Worley Oak Ridge National Laboratory recently received delivery of a 5,294 processor Cray XT3. The XT3 is Cray?s third-generation massively parallel processing system. The system builds on a single processor node?built around the AMD Opteron?and uses a custom chip?called SeaStar?to provide interprocessor communication. In addition, the system uses a lightweight operating system on the compute nodes. This paper describes our initial experiences with the system, including micro-benchmark, kernel, and application benchmark results. In particular, we provide performance results for strategic Department of Energy applications areas including climate and fusion. We demonstrate experiments on the installed system, scaling applications up to 4,096 processors.||||||||||||./pdfs/1568975035-IPDPS-paper-1.pdf",
    "Exploiting Unbalanced Thread Scheduling for Energy and Performance on a CMP|Exploiting Unbalanced Thread Scheduling for Energy and Performance on a CMP of SMT Processors Matthew Devuyst Rakesh Kumar Dean M. Tullsen This paper explores thread scheduling on an increasingly popular architecture: chip multiprocessors with simultaneous multithreading cores. Conventional multiprocessor scheduling, applied to this architecture, will attempt to balance the thread load across cores. This research demonstrates that such an approach eliminates one of the big advantages of this architecture -- the ability to use unbalanced schedules to allocate the right amount of execution resources to each thread. However, accommodating unbalanced schedules creates several difficulties, the biggest being the fact that the search space of all schedules (both balanced and unbalanced) is much greater than that of the balanced schedules alone. This work proposes and evaluates scheduling policies that allow the system to identify and migrate toward good thread schedules, whether the best schedules are balanced or unbalanced.||||||||||||./pdfs/1568975042-IPDPS-paper-1.pdf",
    "Flexible Tardiness Bounds for Sporadic Real-Time Task Systems on Multiproce|Flexible Tardiness Bounds for Sporadic Real-Time Task Systems on Multiprocessors Umamaheswari Devi James H. Anderson The earliest-deadline-first (EDF) scheduling of a sporadic real-time task system on a multiprocessor may require that the total utilization of the task system, $U_ sum $, not exceed $(m+1)/2$ on $m$ processors if every deadline needs to be met. In recent work, we considered the alleviation of this under-utilization for task systems that can tolerate deadline misses by bounded amounts (i.e., bounded tardiness). We showed that if $U_ sum \\leq m$ and tasks are not pinned to processors, then the tardiness of each task is bounded under both preemptive and non-preemptive EDF. However, the tardiness bounds derived are applicable to every task in the task system, i.e., any task may incur maximum tardiness. In this paper, we consider supporting tasks whose tolerances to tardiness are less than that known to be possible under EDF. We propose a new scheduling policy, called EDF-hl, that is a variant of EDF, and show that under EDF-hl, any tardiness, including zero tardiness, can be ensured for a limited number of \\em privileged\\/ tasks, and that bounded tardiness can be guaranteed to the remaining tasks if their utilizations are restricted. EDFhl reduces to EDF in the absence of privileged tasks. The tardiness bound that we derive is a function of $U_ sum $, in addition to individual task parameters. Hence, tardiness for all tasks can be lowered by lowering $U_ sum $. A simulation-based evaluation of the tardiness bounds that are possible is provided.||||||||||||./pdfs/1568975051-IPDPS-paper-1.pdf",
    "On Efficient Distributed Deadlock Avoidance for Real-Time and Embedded Syst|On Efficient Distributed Deadlock Avoidance for Real-Time and Embedded Systems Cesar Sanchez Henny B. Sipma Zohar Manna Venkita Subramonian Christopher Gill Thread allocation is an important problem in distributed real-time and embedded (DRE) systems. A thread allocation policy that is too liberal may cause deadlock, while a policy that is too conservative limits potential parallelism, thus wasting resources. However, achieving (globally) optimal thread utilization, while avoiding deadlock, has been proven impractical in distributed systems: it requires too much communication between components. In previous work we showed that efficient local thread allocation protocols are possible if the protocols are parameterized by global static data, in particular by an annotation of the global call graph of all tasks to be performed by the system. We proved that absence of cyclic dependencies in this annotation guarantees absence of deadlock. In this paper we present an algorithm to compute optimal annotations, that is annotations that maximize parallelism while satisfying the condition of acyclicity. Moreover, we show that the condition of acyclicity is in fact tight and exhibits a rather surprising anomaly: if a cyclic dependency is present in the annotation of the call graph and a certain minimum number of threads is provided, deadlock is reachable. Thus, in the presence of cyclic dependencies, increasing the number of threads may introduce the possibility of deadlock in an originally deadlock free system.||||||||||||./pdfs/1568975052-IPDPS-paper-1.pdf",
    "IP over P2P: Enabling Self-configuring Virtual IP Networks for Grid Computi|IP over P2P: Enabling Self-configuring Virtual IP Networks for Grid Computing Arijit Ganguly Abhishek Agrawal P. Oscar Boykin Renato Figueiredo Peer-to-peer (P2P) networks have mostly focused on task oriented networking, where networks are constructed for single applications, i.e. file-sharing, DNS caching, etc. In this work, we introduce IPOP, a system for creating virtual IP networks on top of a P2P overlay. IPOP enables seamless access to Grid resources spanning multiple domains by aggregating them into a virtual IP network that is completely isolated from the physical network. The virtual IP network provided by IPOP supports deployment of existing IP-based protocols over a robust, self-configuring P2P overlay. We present implementation details as well as experimental measurement results taken from LAN, WAN, and Planet-Lab tests.||||||||||||./pdfs/1568975057-IPDPS-paper-1.pdf",
    "Infiniband Scalability in Open MPI |Infiniband Scalability in Open MPI Galen M. Shipman Tim S. Woodall Richard L. Graham Arthur B. Maccabe Patrick G. Bridges Infiniband is becoming an important interconnect technology in high performance computing. Recent efforts in large scale Infiniband deployments are raising scalability questions in the HPC community. Open MPI, a new open source implementation of the MPI standard targeted for production computing, provides several mechanisms to enhance Infiniband scalability. Initial comparisons with MVAPICH, the most widely used Infiniband MPI implementation, show similar performance but with much better scalability characteristics. Specifically, small message latency is improved by up to 10\\% in medium/large jobs and memory usage per host is reduced by as much as 300\\%. In addition, Open MPI provides predictable latency that is close to optimal without sacrificing bandwidth performance.||||||||||||./pdfs/1568975058-IPDPS-paper-1.pdf",
    "Concurrent Counting is Harder than Queuing |Concurrent Counting is Harder than Queuing Srikanta Tirthapura Costas Busch In both distributed counting and queuing, processors in a distributed system issue operations which are organized into a total order. In counting, each processor receives the rank of its operation in the total order, where as in queuing, a processor gets back the identity of its predecessor in the total order. Coordination applications such as totally ordered multicast can be solved using either distributed counting or queuing, and it would be very useful to definitively know which of counting or queuing is a harder problem. We conduct the first systematic study of the relative complexities of distributed counting and queuing in a concurrent setting. Our results show that concurrent counting is harder than concurrent queuing on a variety of processor interconnection topologies, including high diameter graphs such as the list and the mesh, and low diameter graphs such as the complete graph, perfect m-ary tree, and the hypercube. For all these topologies, we show that the concurrent delay complexity of a particular solution to queuing, the arrow protocol, is asymptotically smaller than a lower bound on the complexity of any solution to counting. As a consequence, we are able to definitively say that given a choice between applying counting or queuing to solve a distributed coordination problem, queuing is the better solution.||||||||||||./pdfs/1568975069-IPDPS-paper-1.pdf",
    "GPU-ABiSort: Optimal Parallel Sorting on Stream Architectures |GPU-ABiSort: Optimal Parallel Sorting on Stream Architectures Alexander Gre&szlig; Gabriel Zachmann In this paper, we present a novel approach for parallel sorting on stream processing architectures. It is based on adaptive bitonic sorting. For sorting $n$ values utilizing $p$ stream processor units, this approach achieves the optimal time complexity $O((n \\log n) / p)$. While this makes our approach competitive with common sequential sorting algorithms not only from a theoretical viewpoint, it is also very fast from a practical viewpoint. This is achieved by using efficient linear stream memory accesses (and by combining the optimal time approach with algorithms optimized for small input sequences). We present an implementation on modern programmable graphics hardware (GPUs). On recent GPUs, our optimal parallel sorting approach has shown to be remarkably faster than sequential sorting on the CPU, and it is also faster than previous non-optimal sorting approaches on the GPU for sufficiently large input sequences. Because of the excellent scalability of our algorithm with the number of stream processor units $p$ (up to $n / \\log^2 n$ or even $n / \\log n$ units, depending on the stream architecture), our approach profits heavily from the trend of increasing number of fragment processor units on GPUs, so that we can expect further speed improvement with upcoming GPU generations.||||||||||||./pdfs/1568975070-IPDPS-paper-1.pdf",
    "Shared Receive Queue based Scalable MPI Design for InfiniBand Clusters |Shared Receive Queue based Scalable MPI Design for InfiniBand Clusters Sayantan Sur Lei Chai Hyun-wook Jin Dhabaleswar K. Panda Clusters of several thousand nodes interconnected with InfiniBand, an emerging high-performance interconnect, have already appeared in the Top 500 list. The next-generation InfiniBand clusters are expected to be even larger with tens-of-thousands of nodes. A high-performance scalable MPI design is crucial for MPI applications in order to exploit the massive potential for parallelism in these very large clusters. MVAPICH is a popular implementation of MPI over InfiniBand based on its reliable connection oriented model. The requirement of this model to make communication buffers available for each connection imposes a memory scalability problem. In order to mitigate this issue, the latest InfiniBand standard includes a new feature called Shared Receive Queue (SRQ) which allows sharing of communication buffers across multiple connections. In this paper, we propose a novel MPI design which efficiently utilizes SRQs and provides very good performance. Our analytical model reveals that our proposed designs will take only 1/10th the memory requirement as compared to the original design on a cluster sized at 16,000 nodes. Performance evaluation of our design on our 8-node cluster shows that our new design was able to provide the same performance as the existing design while requiring much lesser memory. In comparison to tuned existing designs our design showed a 20\\% and 5\\% improvement in execution time of NAS Benchmarks (Class A) LU and SP, respectively. The High Performance Linpack was able to execute a much larger problem size using our new design, whereas the existing design ran out of memory.||||||||||||./pdfs/1568975076-IPDPS-paper-1.pdf",
    "Design and Analysis of a Multi-dimensional Data Sampling Service for Large |Design and Analysis of a Multi-dimensional Data Sampling Service for Large Scale Data Analysis Applications Xi Zhang Tahsin Kurc Joel Saltz Srinivasan Parthasarathy Sampling is a widely used technique to increase efficiency in database and data mining applications operating on large dataset. In this paper we present a scalable sampling implementation that supports efficient, multi-dimensional spatio-temporal sample generation on dynamic, large scale datasets stored on a storage cluster. The proposed algorithm leverages Hilbert space-filling curves in order to provide an approximate linear order of multidimensional data while maintaining spatial locality. This new implementation is then bootstrapped on top of our previous implementation, which efficiently samples large datasets along a single dimension (e.g., time), thereby realizing a service for spatio-temporal sampling. We evaluate the performance of our approach comparing it to the popular R-tree based technique. The experimental results show that our approach achieves up to an order of magnitude higher efficiency and scalability.||||||||||||./pdfs/1568975078-IPDPS-paper-1.pdf",
    "Multilevel Algorithms for Partitioning Power-Law Graphs |Multilevel Algorithms for Partitioning Power-Law Graphs Amine Abou-rjeili George Karypis Graph partitioning is an enabling technology for parallel processing as it allows for the effective decomposition of unstructured computations whose data dependencies correspond to a large sparse and irregular graph. Even though the problem of computing high-quality partitionings of graphs arising in scientific computations is to a large extent well-understood, this is far from being true for emerging HPC applications whose underlying computation involves graphs whose degree distribution follows a power-law curve. This paper presents new multilevel graph partitioning algorithms that are specifically designed for partitioning such graphs. It presents new clustering-based coarsening schemes that identify and collapse together groups of vertices that are highly connected. An experimental evaluation of these schemes on 10 different graphs show that the proposed algorithms consistently and significantly outperform existing state-of-the-art approaches.||||||||||||./pdfs/1568975080-IPDPS-paper-1.pdf",
    "Parallelizing Post-Placement Timing Optimization |Parallelizing Post-Placement Timing Optimization Jiyoun Kim Marios C. Papaefthymiou Jose L. Neves This paper presents an efficient modeling scheme and a partitioning heuristic for parallelizing VLSI post-placement timing optimization. Encoding the paths with timing violations into a task graph, our novel modeling scheme provides an efficient representation of the timing and spatial relations among timing optimization tasks. Our new partitioning algorithm then assigns the task graph into multiple sessions of parallel processes, so that interprocessor communication is completely eliminated during each session. This partitioning scheme is especially useful for parallelizing processes with heavily connected tasks and, therefore, high communication requirements. For circuits with 20--130 thousand cells, the partitioning heuristic achieves speedups in excess of 5$\\times$ without degrading solution quality by dynamically utilizing 1--8 processors.||||||||||||./pdfs/1568975085-IPDPS-paper-1.pdf",
    "Adaptive Connection Management for Scalable MPI over InfiniBand |Adaptive Connection Management for Scalable MPI over InfiniBand Weikuan Yu Qi Gao Dhabaleswar K. Panda Supporting scalable and efficient parallel programs is a major challenge in parallel computing with the widespread adoption of large-scale computer clusters and supercomputers. One of the pronounced scalability challenges is the management of connections between parallel processes, especially over connection-oriented interconnects such as VIA and InfiniBand. In this paper, we take on the challenge of designing efficient connection management for parallel programs over InfiniBand clusters. We propose adaptive connection management (ACM) to dynamically control the establishment of InfiniBand reliable connections (RC) based on the communication frequency between MPI processes. We have investigated two different ACM algorithms: an on-demand algorithm that starts with no InfiniBand RC connections; and a partial static algorithm with only $2*logN$ number of InfiniBand RC connections initially. We have designed and implemented both ACM algorithms in MVAPICH to study their benefits. Two mechanisms have been exploited for the establishment of new RC connections: one using InfiniBand unreliable datagram and the other using InfiniBand connection management. For both mechanisms, MPI communication issues, such as progress rules, reliability and race conditions are handled to ensure efficient and light-weight connection management. Our experimental results indicate that ACM algorithms can benefit parallel programs in terms of the process initiation time, the number of active connections, and the resource usage. For parallel programs on a 16-node cluster, they can reduce the process initiation time by 15\\% and the initial memory usage by 18\\%.||||||||||||./pdfs/1568975086-IPDPS-paper-1.pdf",
    "Assembling Genomes on Large-Scale Parallel Computers |Assembling Genomes on Large-Scale Parallel Computers Anantharaman Kalyanaraman Scott J. Emrich Patrick S. Schnable Srinivas Aluru Assembly of large complex genomes from tens of millions of short genomic fragments is computationally demanding requiring hundreds of gigabytes of memory and tens of thousands of CPU hours. New gene-enrichment sequencing strategies are expected to further exacerbate this situation. In this paper, we present a massively parallel genome assembly framework. The unique features of our approach include space-efficient and on-demand algorithms that consume only linear space, and heuristic strategies that reduce the number of expensive pairwise sequence alignments while maintaining assembly quality. As part of the ongoing national efforts in maize genome sequencing, we applied our assembly framework to the largest available maize genomic data. We report the partitioning of more than 1.6 million fragments of over 1.25 billion nucleotides total size into genomic islands in 2 hours on 1,024 processors of an IBM BlueGene/L supercomputer.||||||||||||./pdfs/1568975088-IPDPS-paper-1.pdf",
    "Topology-aware Task Mapping for Reducing Communication Contention on Large |Topology-aware Task Mapping for Reducing Communication Contention on Large Parallel Machines Tarun Agarwal Amit Sharma Laxmikant V. Kale Communication latencies constitute a significant factor in the performance of parallel applications. With techniques such as wormhole routing, the variation in no-load latencies became insignificant, i.e., the no-load latencies for far-away processors were not significantly higher (and too small to matter) than those for nearby processors. Contention in the network is then left as the major factor affecting latencies. With networks such as Fat-Trees of hypercubes, with number of wires growing as $P\\log P$, even this is not a very significant factor. However, for torus and grid networks now being used in large machines such as BlueGene/L and the Cray XT3, such contention becomes an issue. We quantify the effect of this contention with benchmarks that vary the number of hops traveled by each communicated byte. We then demonstrate a process mapping strategy that minimizes the impact of topology by heuristically minimizing the total number of hop-bytes communicated. This strategy, and its variants, are implemented in an adaptive runtime system in Charm++ and Adaptive MPI, so it is available for a broad class of applications.||||||||||||./pdfs/1568975095-IPDPS-paper-1.pdf",
    "Dynamic Structured Partitioning for Parallel Scientific Applications with P|Dynamic Structured Partitioning for Parallel Scientific Applications with Pointwise Varying Workloads Sumir Chandra Manish Parashar Jaideep Ray Parallel implementations of scientific applications involving the simulation of reactive flow on structured grids are challenging, since the underlying phenomena include transport processes with uniform computational loads as well as reactive processes having pointwise varying workloads. As a result, traditional parallelization approaches that assume homogeneous loads are not suitable for these simulations. This paper presents ``\\textit Dispatch '', a dynamic structured partitioning strategy that has been applied to parallel uniform and adaptive formulations of simulations with computational heterogeneity. \\textit Dispatch maintains the computational weights associated with pointwise processes in a distributed manner, computes the local workloads and partitioning thresholds, and performs in-situ locality-preserving load balancing. The experimental evaluation of \\textit Dispatch using an illustrative 2-D reactive-diffusion kernel demonstrates improvement in load distribution and overall application performance.||||||||||||./pdfs/1568975099-IPDPS-paper-1.pdf",
    "Sim-X: Parallel System Software for Interactive Multi-Experiment Computatio|Sim-X: Parallel System Software for Interactive Multi-Experiment Computational Studies Siu-man Yau Eitan Grinspun Vijay Karamcheti Denis Zorin Advances in high-performance computing have led to the broad use of \\em computational studies in everyday engineering and scientific applications. A single study may require thousands of \\em computational experiments , each corresponding to individual runs of simulation software with different parameter settings; in complex studies, the pattern of parameter changes is complex and may have to be adjusted by the user based on partial simulation results. Unfortunately, existing tools have limited high-level support for managing large ensembles of simultaneous computational experiments. In this paper, we present a system architecture for interactive computational studies targeting two goals. The first is to provide a framework for high-level user interaction with computational studies, rather than individual experiments; the second is to maximize the size of the studies that can be performed at close to interactive rates. We describe a prototype implementation of the system and demonstrate performance improvements obtained using our approach for a simple model problem.||||||||||||./pdfs/1568975100-IPDPS-paper-1.pdf",
    "A Virtual Network (ViNe) Architecture for Grid Computing |A Virtual Network (ViNe) Architecture for Grid Computing Maur&iacute;cio Tsugawa Jos&eacute; A. B. Fortes This paper describes a virtual networking approach for Grids called ViNe. It enables symmetric connectivity among Grid resources and allows existing applications to run unmodified. Novel features of the ViNe architecture include: easy virtual networking administration; support for physical private networks and support for multiple independent virtual networks in the same infrastructure. The requirements of an application-friendly virtual network environment are presented and it is shown how the proposed solution meets them. Qualitative arguments are provided to justify all design decisions. Also presented is an experimental evaluation of the round-trip latencies and bandwidths achieved by a reference implementation. Measurements are reported for WAN-scenarios involving three different institutions. Under favorable conditions, ViNe bandwidths are within 90 to 100\\% of the available physical network bandwidth.||||||||||||./pdfs/1568975101-IPDPS-paper-1.pdf",
    "Parallel Algorithms for Inductance Extraction of VLSI Circuits |Parallel Algorithms for Inductance Extraction of VLSI Circuits Hemant Mahawar Vivek Sarin Inductance extraction involves estimating the mutual inductance in a VLSI circuit. Due to increasing clock speed and diminishing feature sizes of modern VLSI circuits, the effects of inductance are increasingly felt during the testing and verification stages. Hence, there is a need for fast and accurate inductance extraction software. A generalized approach for inductance extraction requires the solution of a dense complex symmetric linear system that models mutual inductive effects among circuit elements. Iterative methods are used to solve the system without explicit computation of the matrix itself. Fast hierarchical techniques are used to compute approximate matrix-vector products with the dense system matrix. This work presents an overview of a new parallel software package for inductance extraction of large VLSI circuits. The technique uses a combination of the solenoidal basis method and effective preconditioning schemes to solve the linear system. Fast Multipole Method (FMM) is used to compute approximate matrix-vector products with the inductance matrix. By formulating the preconditioner as a dense matrix similar to the coefficient matrix, we are able to use FMM for the preconditioning step as well. A two-tier parallelization scheme allows an efficient parallel implementation using both OpenMP and MPI directives simultaneously. The experiments conducted on various multiprocessor machines demonstrate the portability and parallel performance of the software.||||||||||||./pdfs/1568975102-IPDPS-paper-1.pdf",
    "Using Virtual Grids to Simplify Application Scheduling |Using Virtual Grids to Simplify Application Scheduling Richard Huang Henri Casanova Andrew A. Chien Users and developers of grid applications have access to increasing numbers of resources. While more resources generally mean higher capabilities for an application, they also raise the issue of application scheduling scalability. First, even polynomial time scheduling heuristics may take a prohibitively long time to compute a schedule. Second, and perhaps more critical, it may not be possible to gather all the resource information needed by a scheduling algorithm in a scalable manner. Our application focus is scientific workflows, which can be represented as Directed Acyclic Graphs (DAGs). Our claim is that, in future resource-rich environments, simple scheduling algorithms may be sufficient to achieve good workflow performances. We introduce a scalable scheduling approach that uses a resource abstraction called a virtual grid (VG). Our simulations of a range of typical DAG structures and resources demonstrate that a simple greedy scheduling heuristic combined with the virtual grid abstraction is as effective and more scalable than more complex heuristic DAG scheduling algorithms on large-scale platforms.||||||||||||./pdfs/1568975105-IPDPS-paper-1.pdf",
    "Hash-based Proximity Clustering for Load Balancing in Heterogeneous DHT Net|Hash-based Proximity Clustering for Load Balancing in Heterogeneous DHT Networks Haiying Shen Cheng-zhong Xu DHT networks based on consistent hashing functions have an inherent load uneven distribution problem. The objective of DHT load balancing is to balance the workload of the network nodes in proportion to their capacity so as to eliminate traffic bottleneck. It is challenging because of the dynamism nature of DHT networks and time-varying load characteristics. In this paper, we present a hash-based proximity clustering approach for load balancing in heterogeneity DHTs. In the approach, DHT nodes are classified as regular nodes and supernodes according to their computing and networking capacities. Regular nodes are grouped and associated with supernodes via consistent hashing of their physical proximity information on the Internet. The supernodes form a self-organized and churn resilient auxiliary network for load balancing. The hierarchical structure facilitates the design and implementation of a locality-aware randomized load balancing algorithm. The algorithm introduces a factor of randomness in the load balancing processes in a range of neighborhood so as to deal with both the proximity and dynamism. Simulation results show the superiority of the approach, in comparison with a number of other DHT load balancing algorithms. The approach performs no worse than existing proximity-aware algorithms and exhibits strong resilience to the effect of churn. It also greatly reduces the overhead of resilient randomized load balancing algorithms due to the use of proximity information.||||||||||||./pdfs/1568975144-IPDPS-paper-1.pdf",
    "Auto-Pipe and the X Language: A Pipeline Design Tool and Description Langua|Auto-Pipe and the X Language: A Pipeline Design Tool and Description Language Mark A. Franklin Eric J. Tyson James Buckley Patrick Crowley \\emph Auto-Pipe is a tool that aids in the design, evaluation and implementation of applications that can be executed on computational pipelines (and other topologies) using a set of heterogeneous devices including multiple processors and FPGAs. It has been developed to meet the needs arising in the domains of communications, computation on large datasets, and real time streaming data applications. This paper introduces the \\emph Auto-Pipe design flow and the \\emph X design language, and presents sample applications. The applications include the Triple-DES encryption standard and a subset of the signal-processing pipeline for VERITAS, a high-energy gamma-ray astrophysics experiment. These applications are discussed and their description in \\emph X is presented. From the \\emph X description, simulations of alternative system designs and stage-to-device assignments are obtained and analyzed, and the optimal assignment is presented. The complete system will permit production of executable code and bit maps that may be downloaded onto real devices. Future work required to complete the \\emph Auto-Pipe design tool is discussed.||||||||||||./pdfs/1568975145-IPDPS-paper-1.pdf",
    "On Collaborative Content Distribution using Multi-Message Gossip |On Collaborative Content Distribution using Multi-Message Gossip Coby Fernandess Dahlia Malkhi We study epidemic schemes in the context of collaborative data delivery. In this context, multiple chunks of data reside at different nodes, and the challenge is to simultaneously deliver all chunks to all nodes. Here we explore the inter-operation between the gossip of multiple, simultaneous message-chunks. In this setting, interacting nodes must select which chunk, among many, to exchange in every communication round. We provide an efficient solution that possesses the inherent robustness and scalability of gossip. Our approach maintains the simplicity of gossip, and has low message, connections and computation overhead. Because our approach differs from solutions proposed by network coding, we are able to provide insight into the tradeoffs and analysis of the problem of collaborative content distribution. We formally analyze the performance of the algorithm, demonstrating its efficiency with high probability.||||||||||||./pdfs/1568975148-IPDPS-paper-1.pdf",
    "A Study of the On-Chip Interconnection Network for the IBM Cyclops64 Multi-|A Study of the On-Chip Interconnection Network for the IBM Cyclops64 Multi-Core Architecture Ying Ping Zhang Taikyeong Jeong Fei Chen Haiping Wu Ronny Nitzsche Guang R. Gao The designs of high-performance processor architectures are moving toward the integration of a large number of multiple processing cores on a single chip. The IBM Cyclops-64 (C64) is a petaflop supercomputer built on multi-core system-on-a-chip technology. Each C64 chip employs a multistage pipelined crossbar switch as its on-chip interconnection network to provide high bandwidth and low latency communication between the 160 thread processing cores, the on-chip SRAM memory banks, and other components. In this paper, we present a study of the architecture and performance of the C64 on-chip interconnection network through simulation. Our experimental results provide observations on the network behavior: (1) Dedicated channels can be created between any output port to input port of the C64 crossbar with latency as low as 7 cycles. The C64 crossbar has the potential reach the full hardware bandwidth, and exhibit a non-blocking behavior; (2) The C64 crossbar is a stable network; (3) The network logic design appears to provide a reasonable opportunity for sharing the channel bandwidth between traffic in either direction; (4) A simple circular neighbor arbitration scheme can achieve competitive performance level comparing to the complex segmented LRU (Least Recently Used) matrix arbitration scheme without losing the fairness. (5) Application-driven benchmarks provide comparable results to synthetic workloads.||||||||||||./pdfs/1568975150-IPDPS-paper-1.pdf",
    "Cooperative Load Balancing for a Network of Heterogeneous Computers |Cooperative Load Balancing for a Network of Heterogeneous Computers Satish Penmatsa Anthony T. Chronopoulos In this paper we present a game theoretic approach to solve the static load balancing problem in a distributed system which consists of heterogeneous computers connected by a single channel communication network. We use a cooperative game to model the load balancing problem. Our solution is based on the Nash Bargaining Solution (NBS) which provides a Pareto optimal solution for the distributed system and is also a fair solution. An algorithm for computing the NBS is derived for the proposed cooperative load balancing game. Our scheme is compared with that of other existing schemes under simulations with various system loads and configurations. We show that the solution of our scheme is near optimal and is superior to the other schemes in terms of fairness.||||||||||||./pdfs/1568975251-HCW-paper-1.pdf",
    "Wrekavoc: a Tool for Emulating Heterogeneity |Wrekavoc: a Tool for Emulating Heterogeneity Louis-claude Canon Emmanuel Jeannot Computer science and especially heterogeneous distributed computing is an experimental science. Simulation, emulation, or in-situ implementation are complementary methodologies to conduct experiments in this context. In this paper we address the problem of defining and controlling the heterogeneity of a platform. We evaluate the proposed solution, called Wrekavoc, with micro-benchmark and by implementing algorithms of the literature.||||||||||||./pdfs/1568975328-HCW-paper-1.pdf",
    "Integrating Heterogeneous Information Services Using JNDI |Integrating Heterogeneous Information Services Using JNDI Dirk Gorissen Piotr Wendykier Dawid Kurzyniec Vaidy Sunderam The capability to announce and discover resources is a foundation for heterogene ous computing systems. Independent projects have adopted custom implementations of information services, which are not interoperable and induce substantial maintenance costs. In this paper, we propose an alternative methodology. We suggest that it is possible to reuse existing naming service deployments and combine them into complex, scalable, hierarchical, distributed federations, by using appropriate client-side integration middleware that unifies service access and hides heterogeneity behind a common API. We investigate a JNDI-based approach, and describe in detail two newly implemented JNDI service providers, which enable unified access to 1) Jini lookup services, and 2) Harness Distributed Naming Services. We claim that these two technologies, along with others already accessible through JNDI such as e.g. DNS and LDAP, offer features suitable for use in hierarchical heterogeneous information systems.||||||||||||./pdfs/1568975343-HCW-paper-1.pdf",
    "A Brokering Framework for Large-Scale Heterogeneous Systems |A Brokering Framework for Large-Scale Heterogeneous Systems Xin Bai Ladislau Boloni Dan C. Marinescu Howard Jay Siegel Rose A. Daley I-jeng Wang In this paper we discuss the role of a broker in a market-oriented resource llocation model for large-scale heterogeneous systems. The simplified model is based upon a three party ystem, provider-broker-consumer. The allocation of resources is determined by their price, their utility to the onsumer, and by the satisfaction of the consumer. The role of the broker is to add societal objectives to resource llocation algorithms and to mediate between greedy consumers and selfish providers. A simulation experiment was onducted to study the transient and the steady-state behavior of several performance measures, including the verage consumer satisfaction, the average utility, and the hourly revenue.||||||||||||./pdfs/1568975349-HCW-paper-1.pdf",
    "The Impact of Heterogeneity on Master-slave On-line Scheduling |The Impact of Heterogeneity on Master-slave On-line Scheduling Jean-francois Pineau Yves Robert Fr&eacute;d&eacute;ric Vivien In this paper, we assess the impact of heterogeneity for scheduling independent tasks on master-slave platforms. We assume a realistic one-port model where the master can communicate with a single slave at any time-step. We target on-line scheduling problems, and we focus on simpler instances where all tasks have the same size. While such problems can be solved in polynomial time on homogeneous platforms, we show that there does not exist any optimal deterministic algorithm for heterogeneous platforms. Whether the source of heterogeneity comes from computation speeds, or from communication bandwidths, or from both, we establish lower bounds on the competitive ratio of any deterministic algorithm. We provide such bounds for the most important objective functions: the minimization of the makespan (or total execution time), the minimization of the maximum response time (difference between completion time and release time), and the minimization of the sum of all response times. Altogether, we obtain nine theorems which nicely assess the impact of heterogeneity on on-line scheduling. These theoretical contributions are complemented on the practical side by the implementation of several heuristics on a small but fully heterogeneous MPI platform. Our (preliminary) results show the superiority of those heuristics which fully take into account the relative capacity of the communication links.||||||||||||./pdfs/1568975357-HCW-paper-1.pdf",
    "FIFO Scheduling of Divisible Loads with Return Messages under the One-port |FIFO Scheduling of Divisible Loads with Return Messages under the One-port Model Olivier Beaumont Loris Marchal Veronika Rehn Yves Robert This paper deals with scheduling divisible load applications on star networks, in presence of return messages. This work is a follow-on of previous studies, where the same problem was considered under the two-port model, where a given processor can simultaneously send and receive messages. Here, we concentrate on the one-port model, where a processor can either send or receive a message at a given time step. The problem of scheduling divisible load on star platforms turns out to be very difficult as soon as return messages are involved. Unfortunately, we have not been able to assess its complexity, but we provide an optimal solution in the special (but important) case of FIFO communication schemes. We also provide an explicit formula for the optimal number of load units that can be processed by a FIFO ordering on a bus network. Finally, we provide a set of MPI experiments to assess the accuracy and usefulness of our results in a real framework.||||||||||||./pdfs/1568975369-HCW-paper-1.pdf",
    "A Task Duplication Based Bottom-Up Scheduling Algorithm for Heterogeneous E|A Task Duplication Based Bottom-Up Scheduling Algorithm for Heterogeneous Environments Doruk Bozda&#287; Umit Catalyurek F&uuml;sun Ozg&uuml;ner We propose a new duplication-based DAG scheduling algorithm for heterogeneous computing environments. Contrary to the traditional approaches, proposed algorithm traverses the DAG in a bottom-up fashion while taking advantage of task duplication and task insertion. Experimental results on random DAGs and three different application DAGs show that the makespans generated by the proposed DBUS algorithm are much better than those generated by the existing algorithms, HEFT, HCPFD and HCNF.||||||||||||./pdfs/1568975396-HCW-paper-1.pdf",
    "Scheduling of Tasks with Batch-shared I/O on Heterogeneous Systems |Scheduling of Tasks with Batch-shared I/O on Heterogeneous Systems Nagavijayalakshmi Vydyanathan Gaurav Khanna Umit Catalyurek Tahsin Kurc P. Sadayappan Joel Saltz This paper proposes a novel strategy that uses hypergraph partitioning and K-way iterative mapping-refinement heuristics for scheduling a batch of data-intensive tasks with batch-shared I/O behavior on heterogeneous collections of storage and compute clusters. The strategy formulates the sharing of files among tasks as a hypergraph to minimize the I/O overheads due to transferring of the same set of files multiple times and employs a K-way iterative mapping-refinement scheme to adapt to the heterogeneity of compute clusters and storage networks in the system. We evaluate the proposed approach through real experiments and simulations on application scenarios from two application domains; satellite data processing and biomedical imaging. Our experimental results show that our approach can achieve significant performance improvement over algorithms such as HPS, Shortest Job First, MinMin, MaxMin and Sufferage for workloads with high degree of shared I/O among tasks.||||||||||||./pdfs/1568975404-HCW-paper-1.pdf",
    "Using SCTP to Hide Latency in MPI Programs |Using SCTP to Hide Latency in MPI Programs Humaira Kamal Brad Penoff Mike Tsai Edith Vong Alan Wagner A difficulty in using heterogeneous collections of geographically distributed machines across wide area networks for parallel computing is the huge variability in message latency that is orders of magnitude larger than parallel programs executing on dedicated systems. This variability is in part due to the underlying network bandwidth and latency which can vary dramatically according to network conditions. Although such an environment is not suitable for many message passing programs there are those programs that can take advantage of it. Using SCTP (Stream Control Transmission Protocol) for MPI, we show how to reduce the effect of latency on task farm programs to allow them to effectively execute in high latency environments. SCTP is a recently standardized transport level protocol that has a number of features that make it well-suited to MPI and our goal is to reduce the effect of latency on MPI programs in wide area networks. We take advantage of SCTP's improved congestion control as well as its ability to have multiple independent message streams over a single connection to eliminate the head of line blocking that can occur in TCP-based middleware. The use of streams required a novel use of MPI tags to identify independent streams rather than different types of messages. We describe the design of a task farm template that exploits streams, uses buffering and pipelining of task requests to improve its performance under network loss and variable latency. We use these techniques to improve the performance of two real-world MPI programs: a robust correlation matrix computation and mpiBLAST.||||||||||||./pdfs/1568975405-HCW-paper-1.pdf",
    "Scheduling Multiple DAGs onto Heterogeneous Systems |Scheduling Multiple DAGs onto Heterogeneous Systems Henan Zhao Rizos Sakellariou The problem of scheduling a single DAG onto heterogeneous systems has been studied extensively. In this paper, we focus on the problem of scheduling more than one DAG at the same time onto a set of heterogeneous resources. The aim is not only to optimize the overall makespan, but also to achieve fairness, defined on the basis of the slowdown that each DAG would experience as a result of competing for resources with other DAGs. Two policies particularly focussing to deliver fairness are presented and evaluated along with another four policies that can be used to schedule multiple DAGs.||||||||||||./pdfs/1568975410-HCW-paper-1.pdf",
    "Plan Switching: An Approach to Plan Execution in Changing Environments |Plan Switching: An Approach to Plan Execution in Changing Environments Han Yu Dan C. Marinescu Annie S. Wu Howard Jay Siegel Rose A. Daley I-jeng Wang The execution of a complex task in any environment requires planning. Planning is the process of constructing an activity graph given by the current state of the system, a goal state, and a set of activities. If we wish to execute a complex computing task in a heterogeneous computing environment with autonomous resource providers, we should be able to adapt to changes in the environment. A possible solution is to construct a family of activity graphs beforehand and investigate the means of switching from one member of the family to another when the execution of one activity graph fails. In this paper, we study the conditions when plan switching is feasible. Then we introduce an approach for plan switching and report the simulation results of this approach.||||||||||||./pdfs/1568975411-HCW-paper-1.pdf",
    "An Economy-driven Mapping Heuristic for Hierarchical Master-Slave Applicati|An Economy-driven Mapping Heuristic for Hierarchical Master-Slave Applications in Grid Systems Nadia Ranaldo Eugenio Zimeo In heterogeneous distributed systems, such as Grids, a resource broker is responsible of automatically selecting resources, and mapping application tasks to them. A crucial aspect of resource broker design, especially in a next commercial exploitation of grid systems, in which economy theories for resource management will be applied, is the support to task mapping based on the fulfilment of Quality of Service (QoS) constraints. The paper presents an economy-driven mapping heuristic, called time minimization, for mapping and scheduling the tasks assigned to the slaves of a master-slave application in a hierarchical and heterogeneous distributed system. The validity and accuracy of such heuristic are tested by implementing it in a resource broker of a hierarchical grid middleware used for running a real world application.||||||||||||./pdfs/1568975414-HCW-paper-1.pdf",
    "Node-Disjoint Paths in Hierarchical Hypercube Networks |Node-Disjoint Paths in Hierarchical Hypercube Networks Ruei Yu Wu Gerard J. Chang Gen Huey Chen The hierarchical hypercube network is suitable for massively parallel systems. An appealing property of this network is the low number of connections per processor, which can facilitate the VLSI design and fabrication of the system. Other alluring features include symmetry and logarithmic diameter, which imply easy and fast algorithms for communication. In this paper, a maximal number of node-disjoint paths are constructed between every two distinct nodes of the hierarchical hypercube network. Their maximal length is not greater than max $2^ m+1 $ + 2m+ 1, $2^ m+1 $ +m+ 4 , where $2^ m+1 $ is the diameter.||||||||||||./pdfs/1568975652-PDSEC-paper-1.pdf",
    "Linyphi: An IPv6-Compatible Implementation of SSR|Linyphi: An IPv6-Compatible Implementation of SSR Pengfei Di Massimiliano Marcon Thomas Fuhrmann Scalable Source Routing (SSR) is a self-organizing routing protocol designed for supporting peer-to-peer applications. It is especially suited for networks that do not have a well crafted structure, e.g. ad-hoc and mesh-networks. SSR is based on the combination of source routes and a virtual ring structure. This ring is used in a Chord-like manner to obtain source routes to destinations that are not yet in the respective router cache. This approach makes SSR more efficient than flooding-based, ad-hoc routing protocols like AODV or DSR. As a consequence, SSR can provide routing for very large mesh network clouds without requiring any centralized administration. Moreover, SSR directly provides the semantics of a structured routing overlay. In this paper we present Linyphi, an implementation of SSR for wireless access routers. Linyphi combines IPv6 and SSR so that unmodified IPv6 hosts have transparent connectivity to both the Linyphi mesh network and the IPv4/v6 Internet. This allows peer-to-peer applications to directly benefit from other peers in the neighborhood without the need to route through the respective Internet service provider. We give a basic outline of the implementation and demonstrate its suitability in real-world mesh network scenarios. Linyphi is available for download.||||||||||||./pdfs/1568976188-HOTP2P-paper-1.pdf",
    "A Scalable Algorithm to Monitor Chord-based P2P Systems at Runtime |A Scalable Algorithm to Monitor Chord-based P2P Systems at Runtime Andreas Binzenh&ouml;fer Gerald Kunzmann Robert Henjes Peer-to-peer (p2p) systems are a highly decentralized, fault tolerant, and cost effective alternative to the classic client-server architecture. Yet companies hesitate to use p2p algorithms to build new applications. Due to the decentralized nature of such a p2p system the carrier does not know anything about the current size, performance, and stability of its application. In this paper we present an entirely distributed and scalable algorithm to monitor a running p2p network. The snapshot of the system enables a telecommunication carrier to gather information about the current performance parameters of the running system as well as to react to discovered errors.||||||||||||./pdfs/1568976195-HOTP2P-paper-1.pdf",
    "Neighbourhood Maps: Decentralised Ranking in Small-World P2P Networks |Neighbourhood Maps: Decentralised Ranking in Small-World P2P Networks Matteo Dell'amico Reputation in P2P networks is an important tool to encourage cooperation among peers. It is based on ranking of peers according to their past behaviour. In large-scale real world networks, a global centralised knowledge about all nodes is neither affordable nor practical. For this reason, reputation ranking is often based on local history knowledge available on the evaluating node. This criterion is not optimal, since it ignores useful data about interactions with other peers. We propose a simple, scalable and decentralised method, called ``neighbourhood maps'', that approximates rankings calculated using link-analysis techniques, exploiting the short-distance characteristics of small-world networks. We test our algorithms using data from the OpenPGP web-of-trust, a real-world network of trust relationships.||||||||||||./pdfs/1568976503-HOTP2P-paper-1.pdf",
    "A Case for Exploit-Robust and Attack-Aware Protocol RFCs |A Case for Exploit-Robust and Attack-Aware Protocol RFCs Venkat Pathamsetty Prabhaker Mateti A large number of vulnerabilities occur because protocol implementations failed to anticipate illegal packets. RFCs typically define what constitute ``right packets relevant to the protocol and they specify what the response should be for such packets. They are often ambiguous and remain silent on what the protocol implementation should do for packets which deviate from the specification. Implementers must and, by and large, do faithfully implement an RFC. However, implementers usually take any silence in a specification as ``design freedom. Even though the protocol implementers are network specialists, they often are not knowledgeable in network security and cryptography issues, past exploits and common attack techniques that can impact the security of a protocol module, and consequently, the whole system. This paper systematically discusses vulnerabilities that can be attributed to protocol designs, inadequacies of RFCs, and omissions of the protocol implementers. Using specific examples, we point out how ambiguities in protocol RFCs have lead to security vulnerabilities. We correlate various types of security vulnerabilities with the way the RFCs are written. We make a case for such exploit-robust and attack-aware RFCs, and recommend the features for a better RFC, called eRFC (Enhanced RFC). We offer advice to RFC writers, implementers and RFC approval bodies. The most effective solution to reducing network security incidents is to fix the RFCs in such a way that the implementers are forced to write an exploit-robust implementation, irrespective of their security knowledge and expertise.||||||||||||./pdfs/1568976522-SSN-paper-1.pdf",
    "Lightweight Emulation to Study Peer-to-Peer Systems |Lightweight Emulation to Study Peer-to-Peer Systems Lucas Nussbaum Olivier Richard The current methods used to test and study peer-to-peer systems (namely modeling, simulation, or execution on real testbeds) often show limits regarding scalability, realism and accuracy. This paper describes and evaluates P2PLab, our framework to study peer-to-peer systems by combining emulation (use of the real studied application within a configured synthetic environment) and virtualization. P2PLab is scalable (it uses a distributed network model) and has good virtualization characteristics (many virtual nodes can be executed on the same physical node by using process-level virtualization). Experiments with the BitTorrent file-sharing system complete this paper and demonstrate the usefulness of this platform.||||||||||||./pdfs/1568976551-HOTP2P-paper-1.pdf",
    "Improving Cooperation in Peer-to-Peer Systems Using Social Networks |Improving Cooperation in Peer-to-Peer Systems Using Social Networks Wenyu Wang Li Zhao Ruixi Yuan Rational and selfish nodes in P2P systems usually lack effective incentives to cooperate, contributing to the increase of free-riders, and degrading the system performance. Various attacks such as whitewashing, collusion, and software cracking pose great challenges on distributed reputation management. To tackle these problems, we propose to build a social network on P2P system, and use the strength of social connections to facilitate transactions in P2P system. The small world character of social networks makes it feasible for nodes to locate resources and conduct transactions while maintain limited local memory history. Such distributed memory combined by relationship between peers constructs a powerful reputation management network, which could have better performance than shared history system and is more robust under various attacks. Our simulation and analysis show that the social network model can greatly incent cooperation in P2P networks and enormously reduce the memory cost.||||||||||||./pdfs/1568976582-HOTP2P-paper-1.pdf",
    "Using Incentives to Increase Availability in a DHT |Using Incentives to Increase Availability in a DHT Fabio Picconi Pierre Sens Distributed Hash Tables (DHTs) provide a means to build a completely decentralized, large-scale persistent storage service from the individual storage capacities contributed by each node of the peer-to-peer overlay. However, persistence can only be achieved if nodes are highly available, that is, if they stay most of the time connected to the overlay. In this paper we present an incentives-based mechanism to increase the availability of DHT nodes, thereby providing better data persistence for DHT users. High availability increases a node?s reputation, which translates into access to more DHT resources and a better Quality-of-Service. The mechanism required for tracking a node?s reputation is completely decentralized, and is based on certificates reporting a node?s availability which are generated and signed by the node?s neighbors. An audit mechanism deters collusive neighbors from generating fake certificates to take advantage of the system.||||||||||||./pdfs/1568976627-HOTP2P-paper-1.pdf",
    "Interceptor: Middleware-level Application Segregation and Scheduling for P2|Interceptor: Middleware-level Application Segregation and Scheduling for P2P Systems Cosimo Anglano Very large size Peer-to-Peer systems are often required to implement efficient and scalable services, but usually they can be built only by assembling resources contributed by many independent users. Among the guarantees that must be provided to convince these users to join the P2P system, particularly important is the ability of ensuring that P2P applications and services run on their nodes will not unacceptably degrade the performance of their own applications because of an excessive resource consumption. In this paper we present \\emph Interceptor , a middleware-level application segregation and scheduling system that is able to strictly enforce quantitative limitations on node resource usage and, at same time, to make P2P applications achieve satisfactory performance even in face of these limitations.||||||||||||./pdfs/1568976628-HOTP2P-paper-1.pdf",
    "Simulating and Optimizing A Peer-to-Peer Computing Framework |Simulating and Optimizing A Peer-to-Peer Computing Framework Jean-baptiste Ernst-desmulier Julien Bourgeois Minh Thanh Ngo Fran&ccedil;ois Spies J&eacute;rome Verbeke The aim of P2P computing is to build virtual computing systems dedicated to large-scale computational problems.JXTA proposes an underlying infrastructure on which JNGI, one of the first P2P decentralized computing frameworks is built. In order to test this framework, we have built a tool named P2PPerf, which allows us to study the behavior of JNGI and to optimize it according to our simulation results.||||||||||||./pdfs/1568976629-HOTP2P-paper-1.pdf",
    "A Formal Framework for the Performance Analysis of P2P Networks Protocols |A Formal Framework for the Performance Analysis of P2P Networks Protocols Angelo Spognardi Roberto Di Pietro In this paper we propose a formal framework based on the Markov Chains to prove the performance of P2P protocols. Despite the proposal of several protocols for P2P networks, sometimes there is a lack of a formal demonstration of their performance: experimental simulations are the most used method to evaluate their performance, such as the average length of a lookup. In this paper we introduce a versatile model for the analysis of P2P protocols. We employ this model to formally prove which is the average lookup length for two sample protocols: BaRT and Koorde. We verify the effectiveness of the proposed framework also via extensive simulations.||||||||||||./pdfs/1568976643-HOTP2P-paper-1.pdf",
    "Privacy-aware Presence Management in Instant Messaging Systems |Privacy-aware Presence Management in Instant Messaging Systems Karsten Loesing Markus Dorsch Martin Grote Knut Hildebrandt Maximilian R&ouml;glinger Matthias Sehr Christian Wilms Guido Wirtz Information about online presence allows participants of instant messaging (IM) systems to determine whether their prospective communication partners will be able to answer their requests in a timely manner, or not. This makes IM more personal and closer than other forms of communication such as e-mail. On the other hand, revelation of presence constitutes a potential of misuse by untrustworthy entities, e.g.\\ generation of presence logs. We argue that current IM systems do not take reasonable precautions to protect presence information. We propose an IM system designed to be robust against attacks to disclose a user's presence. It stores presence information in a distributed hash table (DHT) in a way that is only detectable and applicable for intended users and even not comprehensible for the DHT nodes. We apply an anonymous communication network to protect the users' physical addresses.||||||||||||./pdfs/1568976646-HOTP2P-paper-1.pdf",
    "Fault and Intrusion Tolerance of Wireless Sensor Networks |Fault and Intrusion Tolerance of Wireless Sensor Networks Liang-min Wang Jian-feng Ma Chao Wang Alex Chichung Kot The following three questions should be answered in developing new topology with more powerful ability to tolerate node-failure in wireless sensor network. First, what is node-failure tolerance of topologies? Second, how to evaluate this tolerance ability? Third, which type of topologies is more efficient in tolerating node-failure? Without giving the answers, the existing work regards fault-tolerance topology as the multiply connected graph, and use the connectivity of the graph as the standard to evaluate tolerance ability. In this paper, we argue that fault tolerance of topologies is not equivalent to the connectivity of multiply connected graph by illustrating two concrete examples. Then the definition of node-failure tolerance is presented. According fault and intrusion, the two sources of failure nodes, we define fault tolerance and intrusion tolerance as the standards to evaluate the tolerance ability of topologies, and analyze the tolerance performance of hierarchical structure of wireless sensor network by using these standards. Finally, the function relation between hierarchical topology and its tolerance abilities of fault and intrusion is obtained, and an obvious corollary is that fault tolerance increase with the ratio of cluster head hierarchical structure, but with the intrusion tolerance decreasing.||||||||||||./pdfs/1568976649-SSN-paper-1.pdf",
    "Modeling Malware Propagation in Gnutella Type Peer-to-Peer Networks |Modeling Malware Propagation in Gnutella Type Peer-to-Peer Networks Krishna Kumar Ramachandran Biplab Sikdar A key emerging and popular communication paradigm, primarily employed for information dissemination, is peer-to-peer (P2P) networking. In this paper, we model the spread of malware in decentralized, Gnutella type of peer-to-peer networks. Our study reveals that the existing bound on the spectral radius governing the possibility of an epidemic outbreak needs to be revised in the context of a P2P network. We formulate an analytical model that emulates the mechanics of a decentralized Gnutella type of peer network and study the spread of malware on such networks. We show analytically, that a framework which does not incorporate the behavioral characteristics of peers ends up over estimating the epidemic threshold metric, $ \\cal R _0$. This in turn results in false positives, an undesirable feature.We also characterize the conditions under which the network may reach a malware free equilibrium and validate our theoretical results with numerical simulations.||||||||||||./pdfs/1568976651-HOTP2P-paper-1.pdf",
    "Base Line Performance Measurements of Access Controls For Libraries and Mo|Base Line Performance Measurements of Access Controls For Libraries and Modules Jason W Kim Vassilis Prevelakis Having reliable security in systems is of the utmost importance. However, the existing framework of writing, distributing and linking against code in the form of libraries and/or modules does a very poor job of keeping track of who has access to what code and who can call what function. The status-quo is insufficient for a variety of reasons. As the amount of code written that represents some kind of a rights-protected entity increases, we need a systematic, easily adopted framework for designating who has access to what code, and under which conditions. While adding access controls to libraries and modules (as well as functions held securely within them), we also give regard to the performance characteristics and ease-of-use considerations. In this vein, we discuss the design and implementation of a framework (called SecModule) used for generating (and using) libraries under access controls, as well as performance measurements of invoking functions that are held inside the protected library.||||||||||||./pdfs/1568976672-SSN-paper-1.pdf",
    "Optimizing the Finger Table in Chord-like DHTs |Optimizing the Finger Table in Chord-like DHTs Giovanni Chiola Gennaro Cordasco Luisa Gargano Alberto Negro Vittorio Scarano The Chord protocol is the best known example of implementation of logarithmic complexity routing for structured peer-to-peer networks. Its routing algorithm, however, does not provide an optimal trade-off between resources exploited (the size of the ``finger table'') and performance (the average/worst-case number of hops to reach destination). Cordasco et al. showed that a finger table based on Fibonacci distances provides lower number of hops with fewer table entries. In this paper we generalize this result, showing how to construct an improved finger table when the objective is to reduce the number of hops, possibly at the expense of an increased size of the finger table. Our results can also be exploited to guarantee low routing time in case a fraction of nodes is assumed to fail.||||||||||||./pdfs/1568976681-HOTP2P-paper-1.pdf",
    "Energy-Efficient ID-based Group Key Agreement Protocols for Wireless Networ|Energy-Efficient ID-based Group Key Agreement Protocols for Wireless Networks Chik How Tan Joseph Chee Ming Teo One useful application of wireless networks is for secure group communication, which can be achieved by running a Group Key Agreement (GKA) protocol. One well-known method of providing authentication in GKA protocols is through the use of digital signatures. Traditional certificate-based signature schemes require users to receive and verify digital certificates before verifying the signatures but this process is not required in ID-based signature schemes. In this paper, we present an energy-efficient ID-based authenticated GKA protocol and four energy-efficient ID-based authenticated dynamic protocols, namely Join, Leave, Merge and Partition protocol, to handle dynamic group membership events, which are frequent in wireless networks. We provide complexity and energy cost analysis of our protocols and show that our protocols are more energy-efficient and suitable for wireless networks.||||||||||||./pdfs/1568976683-SSN-paper-1.pdf",
    "Model-based Evaluation of Search Strategies in peer-to-peer Networks |Model-based Evaluation of Search Strategies in peer-to-peer Networks Rossano Gaeta Matteo Sereno This paper exploits a previously developed analytical modeling framework to compare several variations of the basic flooding search strategy in unstructured decentralized peer-to-peer (P2P) networks. The model predictions are used to compute system-oriented performance indexes (the average and the coefficient of variation of the number of query messages) as well as user-oriented measures (the probability of finding at least one replica of a resource, the average search time). The trade-off between the optimization of system-oriented measures and the improvement of user-oriented quality indexes is investigated for several variations of the basic flooding strategy suggesting that adding control parameters to the basic flooding mechanism might prove beneficial in this class of systems.||||||||||||./pdfs/1568976692-HOTP2P-paper-1.pdf",
    "Network Intrusion Detection with Semantics-Aware Capability |Network Intrusion Detection with Semantics-Aware Capability Walter Scheirer Mooi Choo Chuah Malicious network traffic, including widespread worm activity, is a growing threat to Internet-connected networks and hosts. In this paper, we propose a network intrusion detection system (NIDS) with semantics-aware capability. Our NIDS segregates suspicious traffic from the regular traffic flow, extracts binary code from the suspicious traffic, and performs semantic analysis on it to identify potential threats. Our contributions in this work are threefold: (a) we believe our prototype is the first NIDS that provides semantics-aware capability, (b) our implementation is more efficient than what is reported in previously published semantic detection work, (c) our designed templates can capture polymorphic shellcodes with added sequences of stack and mathematic operations.||||||||||||./pdfs/1568976694-SSN-paper-1.pdf",
    "Checkpointing and Rollback-Recovery Protocol for Mobile Systems with MW Ses|Checkpointing and Rollback-Recovery Protocol for Mobile Systems with MW Session Guarantee Jerzy Brzezinski Anna Kobusinska Michal Szychowiak In the mobile environment, weak consistency replication of shared data is the key to obtaining high data availability, good access performance, and good scalability. Therefore new class of consistency models, called session guarantees, recommended for mobile environment, has been introduced. Session guarantees, called also client-centric consistency models, have been proposed to define required properties of the system regarding consistency from the client?s point of view. Unfortunately, none of proposed consistency protocols providing session guarantees is resistant to server failures. Therefore, in this paper checkpointing and rollback-recovery protocol rVsMW, which preserves Monotonic Writes session guarantee is presented. The recovery protocol is integrated with the underlying consistency protocol by integrating operations of taking checkpoints with coherence operations of VsSG protocol.||||||||||||./pdfs/1568976697-SSN-paper-1.pdf",
    "A Correctness Proof of the SRP Protocol |A Correctness Proof of the SRP Protocol Huabing Yang Xingyuan Zhang Yuanyuan Wang The correctness of a routing protocol can be divided into two parts, a liveness property proof and a safety property proof. The former requires that route(s) should be discovered and data be transmitted successfully, while the latter requires that the discovered routes have some desired characters such as containing only benign nodes. While safety properties are relatively easier to prove, the proof of liveness properties is usually harder. This paper presented a liveness proof of a secure routing protocol, SRP in Isabelle/HOL. The liveness property proved says that if a data package needs to be sent, then it will be sent and then received, and finally, the sender will receive an acknowledgement sent back by the receiver. There are three main contributions in this paper. Firstly, a liveness property is proved for a secure routing protocol, and this has never been done before. Secondly, our validation model can deal with arbitrarily many nodes including malicious ones, and nodes are allowed to move randomly. Thirdly, a \\emph fail set is defined to restrict the attackers' actions, so that the safety properties used to prove the liveness property can be established. The paper explains why it is reasonable to prevent malicious nodes from performing the events in \\emph fail set.||||||||||||./pdfs/1568976698-SSN-paper-1.pdf",
    "Detecting Selective Forwarding Attacks in Wireless Sensor Networks |Detecting Selective Forwarding Attacks in Wireless Sensor Networks Bo Yu Bin Xiao Selective forwarding attacks may corrupt some mission-critical applications such as military surveillance and forest fire monitoring. In these attacks, malicious nodes behave like normal nodes in most time but selectively drop sensitive packets, such as a packet reporting the movement of the opposing forces. Such selective dropping is hard to detect. In this paper, we propose a lightweight security scheme for detecting selective forwarding attacks. The detection scheme uses a multi-hop acknowledgement technique to launch alarms by obtaining responses from intermediate nodes. This scheme is efficient and reliable in the sense that an intermediate node will report any abnormal packet loss and suspect nodes to both the base station and the source node. To the best of our knowledge, this is the first paper that presents a detailed scheme for detecting selective forwarding attacks in the environment of sensor networks. The simulation results show that even when the channel error rate is 15\\%, simulating very harsh radio conditions, the detection accuracy of the proposed scheme is over 95\\%.||||||||||||./pdfs/1568976710-SSN-paper-1.pdf",
    "A Note on Broadcast Encryption Key Management with Applications to Large Sc|A Note on Broadcast Encryption Key Management with Applications to Large Scale Emergency Alert Systems Guoqiang Shu David Lee Mihalis Yannakakis Emergency alerting capability is crucial for the prompt response to natural disasters and terrorist attacks. The emerging network infrastructure and secure broadcast techniques enable prompt and secure delivery of emergency notification messages. With the ubiquitous deployment of alert systems, scalability and heterogeneity pose new challenges for the design of secure broadcast schemes. In this paper we discuss the key generation problem with the goal of minimizing the total number of keys which need to be generated by the alert center and distributed to the users. Two encryption schemes, zero message scheme and extended header scheme, are modeled formally. For both schemes we show the equivalence of the general optimal key generation (OKG) problem and the bipartite clique cover (BCC) problem, and show that OKG problem is NP-Hard. The result is then generalized to the case with resource constraints, and we provide a heuristic algorithm for solving the restricted BCC (and OKG) problem.||||||||||||./pdfs/1568976714-SSN-paper-1.pdf",
    "Analysis of BGP Prefix Origins During Google's May 2005 Outage |Analysis of BGP Prefix Origins During Google's May 2005 Outage Tao Wan Paul C. Van Oorschot Google went down for 15 to 60 minutes around 22:10, May 07, 2005 UTC. This was explained by Google as having been caused by internal DNS misconfigurations. Another vulnerable protocol which could have caused such service outage is BGP. To pursue the latter possibility further, we explore how BGP was functioning during that period of time using the RouteViews BGP data set. Interestingly, our investigation reveals that one Autonomous System (i.e., AS174 operated by Cogent), which is apparently independent from Google, mysteriously originated routes for one of the IP prefixes assigned to Google (64.233.161.0/24) immediately prior to the service outage. As a result, 49.1\\% of ASes re-advertising routes for 64.233.161.0/24 switched to the incorrect path. Those poisoned ASes directly serve 1500 IP prefixes, and span a broad range of geographic locations. Since this erroneous prefix origination apparently has not occurred previously, or after this specific instance, we consider that it might have been the result of malicious activity (e.g., compromise of one or more BGP speakers) and contributed at least partially to Google's service outage.||||||||||||./pdfs/1568976719-SSN-paper-1.pdf",
    "Automated Refinement of Security Protocols |Automated Refinement of Security Protocols Anders M. Hagalisletto The design of security protocols is usually performed manually by pen and paper, by experts in security. Assumptions are rarely specified explicitly. We present a new way to approach security specification: The protocol is refined fully automated into a specification that contains assumptions sufficient to execute the protocol. As a result, the protocol designer using our method does not have to be a security expert to design a protocol, and can learn immediately how the protocol should work in practice.||||||||||||./pdfs/1568976728-SSN-paper-1.pdf",
    "Preserving Source Location Privacy in Monitoring-Based Wireless Sensor Netw|Preserving Source Location Privacy in Monitoring-Based Wireless Sensor Networks Yong Xi Loren Schwiebert Weisong Shi While a wireless sensor network is deployed to monitor certain events and pinpoint their locations, the location information is intended only for legitimate users. However, an eavesdropper can monitor the traffic and deduce the approximate location of monitored objects in certain situations. We first describe a successful attack against the flooding-based phantom routing, proposed in the seminal work by Celal Ozturk, Yanyong Zhang, and Wade Trappe. Then, we propose GROW (Greedy Random Walk), a two-way random walk, i.e., from both source and sink, to reduce the chance an eavesdropper can collect the location information. We improve the delivery rate by using local broadcasting and greedy forwarding. Privacy protection is verified under a backtracking attack model. The message delivery time is a little longer than that of the broadcasting-based approach, but it is still acceptable if we consider the enhanced privacy preserving capability of this new approach. At the same time, the energy consumption is less than half the energy consumption of flooding-base phantom routing, which is preferred in a low duty cycle, environmental monitoring sensor network.||||||||||||./pdfs/1568976729-SSN-paper-1.pdf",
    "Simulation of a Hybrid Model for Image Denoising |Simulation of a Hybrid Model for Image Denoising Ricolindo Carino Ioana Banicescu Hyeona Lim Neil Williams Seongjai Kim We propose a new model for image denoising which is a hybrid of the total variation model and the Laplacian mean-curvature model. An efficient numerical procedure to compute the hybrid model is also presented. The hybrid model and its computational procedure introduce a number of parameters. As a preliminary step to the synthesis of a method for selecting optimal parameters, the hybrid model was simulated on a number of known images with synthetically added noise. The parallel simulation code was easily composed from existing serial code and a dynamic load balancing tool. The estimated parallel efficiency of the simulation is in excess of 96\\% on 32 processors of a general-purpose Linux cluster||||||||||||./pdfs/1568977070-PDSEC-paper-1.pdf",
    "Coordinate Transformation - A Solution for the Privacy Problem of Location |Coordinate Transformation ? A Solution for the Privacy Problem of Location Based Services? Andreas Gutscher Protecting location information of mobile users in Location Based Services (LBS) is a very important but quite difficult and still largely unsolved problem. Location information has to be protected against unauthorized access not only from users but also from service providers storing and processing the location data, without restricting the functionality of the system. This paper discusses why existing privacy enhancing techniques are insufficient to solve this problem and proposes a new approach basing on coordinate transformations. It shows how location information can be rendered illegible in such a way that it is still possible to perform processing operations required by LBS.||||||||||||./pdfs/1568977171-SSN-paper-1.pdf",
    "Honeypot Back-propagation for Mitigating Spoofing Distributed Denial-of-Ser|Honeypot Back-propagation for Mitigating Spoofing Distributed Denial-of-Service Attacks Sherif Khattab Rami Melhem Daniel Moss&eacute; Taieb Znati The Denial-of-Service (DoS) attack remains a challenging problem in the current Internet. In a DoS defense mechanism, a honeypot acts as a decoy within a pool of servers, whereby any packet received by the honeypot is most likely an attack packet. We have previously proposed the roaming honeypots scheme to enhance this mechanism by camouflaging the honeypots within the server pool, thereby making their locations highly unpredictable. In roaming honeypots, each server acts as a honeypot for some periods of time, or honeypot epochs, the duration of which is determined by a pseudo-random schedule shared among servers and legitimate clients. In this paper, we propose a honeypot back-propagation scheme to trace back attack sources when attacks occur. Based on this scheme, the reception of a packet by a roaming honeypot triggers the activation of a DAG of honeypot sessions rooted at the honeypot under attack towards attack sources. The formation of this tree is achieved in a hierarchical fashion: first at the Autonomous system (AS) level and then at the router level within an AS if needed. The proposed scheme supports incremental deployment and provides deployment incentives for ISPs. Through ns-2 simulations, we show how the proposed scheme enhances the performance of a vanilla Pushback defense by obtaining accurate attack signatures and acting promptly once an attack is detected.||||||||||||./pdfs/1568977212-SSN-paper-1.pdf",
    "On the Performance of Parallel Normalized Explicit Preconditioned Conjugate|On the Performance of Parallel Normalized Explicit Preconditioned Conjugate Gradient Type Methods George A. Gravvanis Konstantinos M. Giannoutakis A new class of parallel normalized preconditioned conjugate gradient type methods in conjunction with normalized approximate inverses algorithms, based on normalized approximate factorization procedures, for solving sparse linear systems of irregular structure, which are derived from the finite element method of a two dimensional boundary value problem, is introduced. Parallel normalized explicit preconditioned conjugate gradient - type methods for distributed memory systems based on the block row distribution (for the vectors and the explicit approximate inverse), using Message Passing Interface (MPI) communication library, is also presented with theoretical estimates on speedups and efficiency, in order to examine the parallel behavior of these methods using normalized explicit approximate inverses as the suitable pre-conditioner. Collective communications have been utilized at the synchronization points and non blocking communications have been used, where the exchanging of messages can be overlapped with computations, where applicable. Application of the methods on a two dimensional boundary value problem is discussed and numerical results are given, concerning the parallel performance in terms of speedups and efficiency.||||||||||||./pdfs/1568977264-PDSEC-paper-1.pdf",
    "Efficient Parallel Implementation of a Weather Derivatives Pricing Algorith|Efficient Parallel Implementation of a Weather Derivatives Pricing Algorithm based on the Fast Gauss Transform Yusaku Yamamoto CDD weather derivatives are widely used to hedge weather risks and their fast and accurate pricing is an important problem in financial engineering. In this paper, we propose an efficient parallelization strategy of a pricing algorithm for the CDD derivatives. The algorithm uses the fast Gauss transform to compute the expected payoff of the derivative and has proved faster and more accurate than the conventional Monte Carlo method. However, speeding up the algorithm on a distributed-memory parallel computer is not straight-forward because na\\ \\i ve parallelization will require a large amount of inter-processor communication. Our new parallelization strategy exploits the structure of the fast Gauss transform and thereby reduces the amount of inter-processor communication considerably. Numerical experiments show that our strategy achieves up to 50\\% performance improvement over the na\\ \\i ve one on an 16-node Mac G5 cluster and can compute the price of a representative CDD derivative in 7 seconds. This speed is adequate for almost any applications.||||||||||||./pdfs/1568977378-PDSEC-paper-1.pdf",
    "Parallel implementation and performance characterization of MUSCLE |Parallel implementation and performance characterization of MUSCLE Xi Deng Eric Li Jiulong Shan Wenguang Chen Multiple sequence alignment is a fundamental and very computationally intensive task in molecular biology. MUSCLE, a new algorithm for creating multiple alignments of protein sequences, achieves a highest rank in accuracy and the fastest speed compared to ClustalW as well as T-Coffee, some widely used tools in multiple sequence alignment. To further accelerate the computations, we present the parallel implementation of MUSCLE in this paper. It is decomposed into several independent modules, which are parallelized with different OpenMP paradigms. We also conduct detailed performance characterization on symmetric multiple processor systems. The experiments show that MUSCLE scales well with the increase of processors, and achieves up to 15.x speedup on 16-way shared memory multiple processor system.||||||||||||./pdfs/1568977852-PDSEC-paper-1.pdf",
    "Towards a Parallel Framework of Grid-based Numerical Algorithms on DAGs |Towards a Parallel Framework of Grid-based Numerical Algorithms on DAGs Zeyao Mo Aiqing Zhang Xiaolin Cao This paper presents a parallel framework of grid-based numerical algorithms where data dependencies between grid zones can be modeled by a directed acyclic graph (DAG). The construction of DAG for numerical algorithms for solution of partial differential equations varying from the Boltzmann transport equation to the linearly convection-dominated fluids is presented. The framework consists of three parts on how to partition, order and calculate the vertices of digraph. Numerical results using hundreds of processors on two parallel machines show the efficiencies and moderate scalability of this framework.||||||||||||./pdfs/1568978540-PDSEC-paper-1.pdf",
    "Multiple Sequence Alignment by Quantum Genetic Algorithm |Multiple Sequence Alignment by Quantum Genetic Algorithm Layeb Abdesslem Meshoul Souham Batouche Mohamed In this paper we describe a new approach for the well known problem in bioinformatics: Multiple Sequence Alignment (MSA). MSA is fundamental task as it represents an essential platform to conduct other tasks in bioinformatics such as the construction of phylogenetic trees, the structural and functional prediction of new protein sequences. Our approach merges between the classical genetic algorithm and some principles of the quantum computing like interference, measure, superposition, etc. It differs from other genetic methods of the literature by using a small population size and a less iteration required to find good quality alignments thanks to the used quantum principles: state superposition, interference, quantum mutation and quantum crossover. Another attractive feature of this method is its ability to provide an extensible platform for evaluating different objective functions. Experiments on a wide range of data sets have shown the effectiveness of the proposed approach and its ability to achieve good quality solutions comparing to those given by other popular multiple alignment programs.||||||||||||./pdfs/1568978640-PDSEC-paper-1.pdf",
    "High-Performance Computing in Remotely Sensed Hyperspectral Imaging: The Pi|High-Performance Computing in Remotely Sensed Hyperspectral Imaging: The Pixel Purity Index Algorithm as a Case Study Antonio Plaza David Valencia Javier Plaza The incorporation of last-generation sensors to airborne and satellite platforms is currently producing a nearly continual stream of high-dimensional data, and this explosion in the amount of collected information has rapidly created new processing challenges. For instance, hyperspectral imaging is a new technique in remote sensing that generates hundreds of spectral bands at different wavelength channels for the same area on the surface of the Earth. The price paid for such a wealth of spectral information available from latest-generation sensors is the enormous amounts of data that they generate. In recent years, several efforts have been directed towards the incorporation of high-performance computing (HPC) models in remote sensing missions. This paper explores three HPC-based paradigms for efficient information extraction from remote sensing data using the Pixel Purity Index (PPI) algorithm (available from the popular Kodak?s Research Systems ENVI software) as a case study for algorithm optimization. The three considered approaches are: 1) Commodity cluster-based parallel computing; 2) Distributed computing using heterogeneous networks of workstations; and 3) FPGA-based hardware implementations. Combined, these parts deliver an excellent snapshot of the state-of-the-art in those areas, and offer a thoughtful perspective on the potential and emerging challenges of adapting HPC models to remote sensing problems.||||||||||||./pdfs/1568978661-PDSEC-paper-1.pdf",
    "Coordinated Checkpoint from Message Payload in Pessimistic Sender-Based Mes|Coordinated Checkpoint from Message Payload in Pessimistic Sender-Based Message Logging Mehdi Aminian Mohammad K. Akbari Bahman Javadi Execution of MPI applications on Clusters and Grid deployments suffers from node and network failure that motivates the use of fault tolerant MPI implementations. Two category techniques have been introduced to make these systems fault-tolerant. The first one is checkpoint-based technique and the other one is called log-based recovery protocol. Sender-based pessimistic logging which falls in the second category is harnessing from huge amount of messages payloads which must be kept in volatile memory. In this paper we present a Coordinated Checkpoint from Message Payload (CCMP) to reduce the aforementioned overhead. The proposed method was examined by MPICH-V2, a public domain platform implementing pessimistic logging with uncoordinated checkpoint. Experimental results demonstrated the reduction of run-time for NPB benchmarks in both fault-free and faulty environments.||||||||||||./pdfs/1568978669-PDSEC-paper-1.pdf",
    "Reducing Reconfiguration Time of Reconfigurable Computing Systems in Integr|Reducing Reconfiguration Time of Reconfigurable Computing Systems in Integrated Temporal Partitioning and Physical Design Framework Farhad Mehdipour Morteza Saheb Zamani Hamid Reza Ahmadifar Mehdi Sedighi Kazuaki Murakami In reconfigurable systems, reconfiguration latency is a very important factor impact the system performance. In this paper, a framework is proposed that integrates the temporal partitioning and physical design phases to perform a static compilation process for reconfigurable computing systems. A temporal partitioning algorithm is proposed which attempts to decrease the time of reconfiguration on a partially reconfigurable hardware. This algorithm attempts to find similar single or pair of operations between subsequent partitions. Considering similar pairs instead of single nodes brings about less complexity for routing process. By using this technique, smaller reconfiguration bit-stream is obtained, which directly decreases the reconfiguration overhead time at the run-time. A complementary algorithm attempts to increase the similarity of subsequent partitions by searching for similar pairs and using a technique called dummy node insertion. An incremental physical design process based on similar configurations produced in the partitioning stage improves the metrics over iterations.||||||||||||./pdfs/1568978671-PDSEC-paper-1.pdf",
    "The General Matrix Multiply-Add Operation on 2D Torus |The General Matrix Multiply-Add Operation on 2D Torus Ahmed Sherif Zekri Stanislav G. Sedukhin In this paper, the index space of the (\\textit n $\\times$\\textit n )-matrix multiply-add problem $C=C+A\\cdot B$ is represented as a 3D \\textit n $\\times$\\textit n $\\times$\\textit n torus. All possible modular time-scheduling functions to activate the computation and data rolling inside the 3D torus index space are determined. To maximize efficiency when solving a single problem, we mapped the computations at the index points into the 2D \\textit n $\\times$\\textit n toroidal array processor. All optimal 2D data allocations that solve the problem in $n$ multiply-add-roll steps are obtained. The well known Cannon's algorithm is one of the 2D resulting allocations. We used the optimal data allocations to describe all variants of the general matrix multiply-add operation (GEMM) on the 2D toroidal array processor. By controling the movement of data, the transposition operation is avoided in 75\\% of the GEMM variants. However, only one explicit matrix transpose is needed for the remaining 25\\%. Ultimately, we described four versions of the GEMM operation covering the possible layouts of the initially loaded data into the array processor.||||||||||||./pdfs/1568978675-PDSEC-paper-1.pdf",
    "Tree Partition based Parallel Frequent Pattern mining on Shared Memory Syst|Tree Partition based Parallel Frequent Pattern mining on Shared Memory Systems Dehao Chen Chunrong Lai Wei Hu Wenguang Chen Yimin Zhang Weimin Zheng in this paper, we present a tree-partition algorithm for parallel mining of frequent patterns. Our work is based on FP-Growth algorithm, which is constituted of tree-building stage and mining stage. The main idea is to build only one FP-Tree in the memory, partition it into several independent parts and distribute them to different threads. A heuristic algorithm is devised to balance the workload. Our algorithm can not only alleviate the impact of locks during the tree-building stage, but also avoid the overhead that do great harm to the mining stage. We present the experiments on different kinds of datasets and compare the results with other parallel approaches. The results suggest that our approach has great advantage in efficiency, especially on certain kinds of datasets. As the number of processors increases, our parallel algorithm shows good scalability.||||||||||||./pdfs/1568978681-PDSEC-paper-1.pdf",
    "Parallelization of Module Network Structure Learning and Performance Tuning|Parallelization of Module Network Structure Learning and Performance Tuning on SMP Hongshan Jiang Chunrong Lai Wenguang Chen Yurong Chen Wei Hu Weimin Zheng Yimin Zhang As an extension of Bayesian network, module network is an appropriate model for inferring causal network of a mass of variables from insufficient evidences. However learning such a model is still a time-consuming process. In this paper, we propose a parallel implementation of module network learning algorithm using OpenMP. We propose a static task partitioning strategy which distributes sub-search-spaces over worker threads to get the tradeoff between load-balance and software-cache-contention. To overcome performance penalties derived from shared-memory contention, we adopt several optimization techniques such as memory pre-allocation, memory alignment and static function usage. These optimizations have different patterns of influence on the sequential performance and the parallel speedup. Experiments validate the effectiveness of these optimizations. For a 2,200 nodes dataset, they enhance the parallel speedup up to 88\\%, together with a 2X sequential performance improvement. With resource contentions reduced, workload imbalance becomes the main hurdle to parallel scalability and the program behaviors more stable in various platforms.||||||||||||./pdfs/1568978688-PDSEC-paper-1.pdf",
    "Parallel Calculation of Volcanoes for Cryptographic Uses |Parallel Calculation of Volcanoes for Cryptographic Uses Santi Martinez Rosana Tomas Concepcio Roig Magda Valls Ramiro Moreno Elliptic curve cryptosystems are nowadays widely used in the design of many security devices. Nevertheless, since not every elliptic curve is useful for cryptographic purposes, mechanisms for providing good curves are highly needed. The generation of the volcano graph of elliptic curves can help to provide such good curves. However, this procedure turns out to be very expensive when performed sequentially. Hence, a parallel application for the calculation of such volcano graphs is proposed in this paper. In order to obtain high efficiency, a theoretical analysis is provided for obtaining an accurate granularity and for giving the appropriate number of tasks to be created. Experimental results show the benefits obtained in the speedup when executing the application in a cluster of workstations with message-passing for the generation of different volcano graphs. By the use of simulation, we study the scalability of the implementation and show that a speedup of more than 80 can be achieved in some cases.||||||||||||./pdfs/1568978695-PDSEC-paper-1.pdf",
    "Parallelisation of a Simulation Tool for Casting and Solidification Process|Parallelisation of a Simulation Tool for Casting and Solidification Processes on Windows Platforms Carsten Clauss Silke Schuch Rainer Finocchiaro Stefan Lankes Thomas Bemmerl Since the beginning of computational engineering, the numerical simulation of physical processes is an essential element in the area of high performance computing. Thus, also the domain of metal foundry demands the computational simulation of casting and solidification processes. A popular software tool for this purpose has been developed by the RWP GmbH in Roetgen, Germany. This tool, named WinCast, is a complete software suite, which contains modules for pre-, main- and post-processing of simulation data sets. A core module of WinCast is TFB, which determines the chronological temperature distribution of a casting process based on a finite-element-method and a Gauss-Seidel solver. With the increasing demand for even higher precision of the simulation results on one hand, and a growing need for even larger data sets on the other hand, the parallelisation of this module became inevitable. In this paper, we present our work accomplished to parallelise the solving algorithm of this module. We have chosen an MPI based master-slave approach for compute clusters by using a self-developed MPI library for Windows platforms.||||||||||||./pdfs/1568978697-PDSEC-paper-1.pdf",
    "Conjugate Gradient Sparse Solvers: Performance-Power Characteristics |Conjugate Gradient Sparse Solvers: Performance-Power Characteristics Konrad Malkowski Ingyu Lee Padma Raghavan Mary Jane Irwin We characterize the performance and power attributes of the conjugate gradient (CG) sparse solver which is widely used in scientific applications. We use cycle-accurate simulations with SimpleScalar and Wattch, on a processor and memory architecture similar to the configuration of a node of the BlueGene/L. We first demonstrate that substantial power savings can be obtained without performance degradation if low power modes of caches can be utilized. We next show that if Dynamic Voltage Scaling (DVS) can be used, power and energy savings are possible, but these are realized only at the expense of performance penalties. We then consider two simple memory subsystem optimizations, namely memory and level-2 cache prefetching. We demonstrate that when DVS and low power modes of caches are used with these optimizations, performance can be improved significantly with reductions in power and energy. For example, execution time is reduced by 23\\%, power by 55\\% and energy by 65\\% in the final configuration at 500MHz relative to the original at 1GHz. We also use our codes and the CG NAS benchmark code to demonstrate that performance and power profiles can vary significantly depending on matrix properties and the level of code tuning. These results indicate that architectural evaluations can benefit if traditional benchmarks are augmented with codes more representative of tuned scientific applications.||||||||||||./pdfs/1568979430-HPPAC-paper-1.pdf",
    "Integrated Link/CPU Voltage Scaling for Reducing Energy Consumption of Para|Integrated Link/CPU Voltage Scaling for Reducing Energy Consumption of Parallel Sparse Matrix Applications Seung Woo Son Konrad Malkowski Guilin Chen Mahmut Kandemir Padma Raghavan Reducing power consumption is quickly becoming a first-class optimization metric for many high-performance parallel computing platforms. One of the techniques employed by many prior proposals along this direction is voltage scaling and past research used it on different components such as networks, CPUs, and memories. In contrast to most of the existent efforts on voltage scaling that target a single component (CPU, network or memory components), this paper proposes and experimentally evaluates a voltage/frequency scaling algorithm that considers CPU and communication links in a mesh network at the same time. More specifically, it scales voltages/frequencies of both CPUs in the network and the communication links among them in a coordinated fashion (instead of one after another) such that energy savings are maximized without impacting execution time. Our experiments with several tree-based sparse matrix computations reveal that the proposed integrated voltage scaling approach is very effective in practice and brings 13\\% and 17\\% energy savings over the pure CPU and pure communication link voltage scaling schemes, respectively. The results also show that our savings are consistent with the different network sizes and different sets of voltage/frequency levels.||||||||||||./pdfs/1568979434-HPPAC-paper-1.pdf",
    "Profile-based Optimization of Power Performance by using Dynamic Voltage Sc|Profile-based Optimization of Power Performance by using Dynamic Voltage Scaling on a PC cluster Yoshihiko Hotta Mitsuhisa Sato Hideaki Kimura Satoshi Matsuoka Taisuke Boku Daisuke Takahashi Currently, several of the high performance processors used in a PC cluster have a DVS (Dynamic Voltage Scaling) architecture that can dynamically scale processor voltage and frequency. Adaptive scheduling of the voltage and frequency enables us to reduce power dissipation without a performance slowdown during communication and memory access. In this paper, we propose a method of profiled-based power-performance optimization by DVS scheduling in a high-performance PC cluster. We divide the program execution into several regions and select the best gear for power efficiency. Selecting the best gear is not straightforward since the overhead of DVS transition is not free. We propose an optimization algorithm to select a gear using the execution and power profile by taking the transition overhead into account. We have built and designed a power-profiling system, PowerWatch. With this system we examined the effectiveness of our optimization algorithm on two types of power-scalable clusters (Crusoe and Turion). According to the results of benchmark tests, we achieved almost 40\\% reduction in terms of EDP (energy-delay product) without performance impact (less than 5\\%) compared to results using the standard clock frequency.||||||||||||./pdfs/1568979506-HPPAC-paper-1.pdf",
    "Online Strategies for High-Performance Power-Aware Thread Execution on Emer|Online Strategies for High-Performance Power-Aware Thread Execution on Emerging Multiprocessors Matthew Curtis-maury James Dzierwa Christos D. Antonopoulos Dimitrios S. Nikolopoulos Granularity control is an effective means for trading power consumption with performance on dense shared memory multiprocessors, such as multi-SMT and multi-CMP systems. With granularity control, the number of threads used to execute an application, or part of an application, is changed, thereby also changing the amount of work done by each active thread. In this paper, we analyze the energy/performance trade-off of varying thread granularity in parallel benchmarks written for shared memory systems. We use physical experimentation on a real multi-SMT system and a power estimation model based on the die areas of processor components and component activity factors obtained from a hardware event monitor. We also present HPPATCH, a runtime algorithm for live tuning of thread granularity, which attempts to simultaneously reduce both execution time and processor power consumption.||||||||||||./pdfs/1568979520-HPPAC-paper-1.pdf",
    "Shubac: A Searchable P2P Network Utilizing Dynamic Paths for Client/Server |Shubac: A Searchable P2P Network Utilizing Dynamic Paths for Client/Server Anonymity Aharon Brodie Cheng-zhong Xu A general approach to achieve anonymity on P2P networks is to construct an indirect path between client and server for each data transfer. The indirection, together with randomness in the selection of intermediate nodes, provides a guarantee of anonymity to some extent. It, however, comes at the cost of a large communication overhead. In this paper, we present Shubac, a searchable, anonymous peer to peer (P2P) overlay network. It implements a flexible dynamic path approach that shrinks paths in size to reduce overhead and delays and meanwhile reconfigures paths dynamically throughout a communication to maintain a high level of privacy. This dynamic path approach enables Shubac to make a good tradeoff between anonymity and efficiency.||||||||||||./pdfs/1568979734-SSN-paper-1.pdf",
    "Dynamic Power Saving in Fat-Tree Interconnection Networks Using On/Off Lin|Dynamic Power Saving in Fat-Tree Interconnection Networks Using On/Off Links Marina Alonso Salvador Coll Juan-miguel Martinez Vicente Santonja Pedro Lopez Jose Duato Current trends in high-performance parallel computers show that fat-tree interconnection networks are one of the most popular topologies. The particular characteristics of this topology, that provide multiple alternative paths for each source/destination pair, make it an excellent candidate for applying power consumption reduction techniques. Such techniques are being increasingly applied in computer systems and the interconnection network is not an exception, since its contribution to the system power budget is not negligible. In this paper, we present a mechanism that dynamically switches on and off network links as a function of traffic. The mechanism is designed to guarantee network connectivity, according to the underlying routing algorithm. In this way, the default routing algorithm can be used regardless of the power saving actions taken, thus simplifying router design. Our simulation results show that significant network power consumption reductions can be obtained at no cost. Latency remains the same although the number of operating network links is dynamically adjusted.||||||||||||./pdfs/1568979735-HPPAC-paper-1.pdf",
    "Making a Case for a Green500 List |Making a Case for a Green500 List Sushant Sharma Chung-hsing Hsu Wu-chun Feng For decades now, the notion of ``performance'' has been synonymous with ``speed'' (as measured in FLOPS, short for floating-point operations per second). Unfortunately, this particular focus has led to the emergence of supercomputers that consume egregious amounts of electrical power and produce so much heat that extravagant cooling facilities must be constructed to ensure proper operation. In addition, the emphasis on speed as the performance metric has caused other performance metrics to be largely ignored, e.g., reliability, availability, and usability. As a consequence, all of the above has led to an extraordinary increase in the total cost of ownership (TCO) of a supercomputer. Despite the importance of the TOP500 List, we argue that the list makes it much more difficult for the high-performance computing (HPC) community to focus on performance metrics other than speed. Therefore, to raise awareness to other performance metrics of interest, e.g., energy efficiency for improved reliability, we propose a Green500 List and discuss the potential metrics that would be used to rank supercomputing systems on such a list.||||||||||||./pdfs/1568979740-HPPAC-paper-1.pdf",
    "Power-Performance Efficiency of Asymmetric Multiprocessors for Multi-thread|Power-Performance Efficiency of Asymmetric Multiprocessors for Multi-threaded Scientific Applications Ryan E. Grant Ahmad Afsahi Recently, under a fixed power budget, asymmetric multiprocessors (AMP) have been proposed to improve the performance of multi-threaded applications compared to symmetric multiprocessors. An AMP is a multiprocessor system in which its processors are not operating at the same frequency. Power consumption has become an important design constraint in servers and high-performance server clusters. This paper explores the power-performance efficiency of Hyper-Threaded (HT) AMP servers, and proposes a new scheduling algorithm that can be used to reduce the overall power consumption of a server while maintaining a high level of performance. Prototyping AMPs on a commercial 4-way SMP server, we show that on average 15.6\\% energy savings and 6.1\\% slowdown for the HT-disabled case, and 7.1\\% energy savings and 4.8\\% slowdown for the HT-enabled case can be achieved across NAS and SPEC OpenMP applications.||||||||||||./pdfs/1568979810-HPPAC-paper-1.pdf",
    "Compiler And Runtime Support For Predictive Control Of Power And Cooling |Compiler And Runtime Support For Predictive Control Of Power And Cooling Henry G. Dietz William R. Dieter The low cost of clusters built using commodity components has made it possible for many more users to purchase their own supercomputer. However, even modest-sized clusters make significant demands on the power and cooling infrastructure. Minimizing impact of problems after they are detected is not as effective as avoiding problems altogether. This paper is about achieving the best system performance by predicting and avoiding power and cooling problems. Although measuring power and thermal properties of a code is not trivial, the primary issue is making predictions sufficiently in advance so that they can be used to drive predictive, rather than just reactive, control at runtime. This paper presents new compiler analysis supporting interprocedural power prediction and a variety of other compiler and runtime technologies making feed-forward control feasible. The techniques apply to most computer systems, but some properties specific to clusters and parallel supercomputing are used where appropriate.||||||||||||./pdfs/1568979811-HPPAC-paper-1.pdf",
    "MegaProto/E: Power-Aware High-Performance Cluster with Commodity Technology|MegaProto/E: Power-Aware High-Performance Cluster with Commodity Technology Taisuke Boku Mitsuhisa Sato Daisuke Takahashi Hiroshi Nakashima Hiroshi Nakamura Satoshi Matsuoka Yoshihiko Hotta In our research project named ``Mega-Scale Computing Based on Low-Power Technology and Workload Modeling'', we have been developing a prototype cluster not based on ASIC or FPGA but instead only using commodity technology. Its packaging is extremely compact and dense, and its performance/power ratio is very high. Our latest prototype cluster unit named ``MegaProto/E'' with 16 Transmeta Efficeon processors achieves 32 GFlops of peak performance, which is 2.2-fold greater than that of the old one. The cluster unit is equipped with an independent dual network of Gigabit Ethernet, including dual 24-port switches. The maximum power consumption of the cluster unit is 320 W, which is comparable with that of today's high-end PC servers for high performance clusters. Performance evaluation using NPB kernels and HPL shows that the performance of MegaProto/E exceeds that of a dual-Xeon server in all the benchmarks, and its performance ratio ranges from 1.3 to 3.7. These results reveal that our solution of implementing a number of ultra low-power processors in compact packaging is an excellent way to achieve extremely high performance in applications with a certain degree of parallelism.||||||||||||./pdfs/1568979853-HPPAC-paper-1.pdf",
    "Parallel Genetic Algorithm for SPICE Model Parameter Extraction |Parallel Genetic Algorithm for SPICE Model Parameter Extraction Yiming Li Yen-yu Cho Models of simulation program with integrated circuit emphasis (SPICE) are currently playing a central role in the connection between circuit design and chip fabrication communities. An automatic model parameter extraction system that simultaneously integrates evolutionary and numerical optimization techniques for optimal characterization of very large scale integration (VLSI) devices has recently been advanced. In this paper, to accelerate the extraction process, a parallelization of the genetic algorithm (GA) for VLSI device equivalent circuit model parameter extraction is developed. The GA implemented in the extraction system is mainly parallelized with a diffusion scheme on a PC-based Linux cluster with message passing interface libraries. Parallelization of GA is governed by many factors, which affect the quality of extracted parameters and its efficiency. The diffusion GA is superior to an isolated GA, and the superiority of the diffusion GA is significant when the number of devices to be optimized is increased. Theoretical estimation and preliminary implementation show that there is an optimal number of processors with respect to the number of devices to be extracted. Benchmark results, such as speedup and efficiency including accuracy of extraction are presented and discussed for different sets of realistic multiple VLSI devices to show the robustness and efficiency of the method. We believe that the practical implementation of the parallel GA approach benefits the engineering of SPICE model parameter extraction in modern electronic industry.||||||||||||./pdfs/1568980400-PDSEC-paper-1.pdf",
    "Plan-Based Replication for Fault-Tolerant Multi-Agent Systems |Plan-Based Replication for Fault-Tolerant Multi-Agent Systems Alessandro De Luna Almeida Samir Aknine Jean-pierre Briot Jacques Malenfant The growing importance of multi-agent applications and the need for a higher quality of service in these systems justify the increasing interest in fault-tolerant multi-agent systems. In this article, we propose an original method for providing dependability in multi-agent systems through replication. Our method is different from other works because our research focuses on building an automatic, adaptive and predictive replication policy where critical agents are replicated to avoid failures. This policy is determined by taking into account the criticality of the plans of the agents, which contain the collective and individual behaviors of the agents in the application. The set of replication strategies applied at a given moment to an agent is then fine-tuned gradually by the replication system so as to reflect the dynamicity of the multi-agent system.||||||||||||./pdfs/16-DPDNS-paper-1.pdf",
    "Vision for Liquid Architecture |Vision for Liquid Architecture Roger D. Chamberlain Ron K. Cytron Jason E. Fritts John W. Lockwood In the liquid architecture project, we are exploring ways in which architectural flexibility can be exploited to improve the execution properties of individual applications. Here, we report on successes we have had to date in this area, and present our vision of where this research should proceed into the future.||||||||||||./pdfs/16-NSFNGS-paper-1.pdf",
    "A Multiple Task Allocation Framework for Biological Sequence Comparison in |A Multiple Task Allocation Framework for Biological Sequence Comparison in a Grid Environment Azzedine Boukerche Marcelo S. Sousa Alba C. M. A. De Melo The evolution of DNA sequencing techniques generated huge sequence repositories and hence the need for efficient algorithms to compare them. The increase search speed, heuristic algorithms like BLAST were developed and are widely used. In order to further reduce BLAST execution time, this paper evaluates an adaptive task allocation framework to perform BLAST searches in a grid environment against segmented genetic databases segments. Our results present very good speedups and also show that no single task allocation strategy is able to achieve the lowest execution times for all scenarios. Also, our results show that the proposed adaptive strategy was able to deal with the heterogeneous and non-dedicated nature of a grid.||||||||||||./pdfs/17-NIDISC-paper-1.pdf",
    "Statistical Sampling of Microarchitecture Simulation |Statistical Sampling of Microarchitecture Simulation Thomas F. Wenisch Roland E. Wunderlich Babak Falsafi James C. Hoe Current software-based microarchitecture simulators are many orders of magnitude slower than the hardware they simulate. Hence, most microarchitecture design studies draw their conclusions from drastically truncated benchmark simulations that are often inaccurate and misleading. The Sampling Microarchitecture Simulation (SMARTS) framework is an approach to enable fast and accurate performance measurements of full-length benchmarks. SMARTS accelerates simulation by selectively measuring in detail only an appropriate benchmark subset. SMARTS prescribes a statistically sound procedure for configuring a systematic sampling simulation run to achieve a desired quantifiable confidence in estimates. Analysis of the SPEC CPU2000 benchmark suite shows that CPI can be estimated to within $\\pm$ 3\\% with 99.7\\% confidence by measuring fewer than 50 million instructions per benchmark. In practice, inaccuracy in microarchitectural state initialization introduces an additional uncertainty which we empirically bound to $\\sim$2\\% for the tested benchmarks. We present two implementations of SMARTS that both achieve an average error of only 0.64\\% on CPI. SMARTSim constructs accurate model state through functional warming?continuously warming large microarchitectural structures (e.g., caches and the branch predictor) while functionally simulating the billions of instructions between measurements?reducing average simulation turnaround from 5.5 days to 7.0 hours. TurboSMARTSim replaces functional warming with live-points?checkpoints that store a bare minimum of functionally-warmed state for accurate simulation of a limited execution window?further reducing average turnaround to 91 seconds.||||||||||||./pdfs/17-NSFNGS-paper-1.pdf",
    "Modeling User Perceived Unavailability due to Long Response Times |User Perceived Unavailability due to Long Response Times Magnos Martinello Mohamed Kaaniche Karama Kanoun Carlos Aguilar Melchor In this paper, we introduce a simple analytical modeling approach for computing service unavailability due to long response time, for infinite and finite single-server systems as well as multi-server systems. Closed-form equations of system unavailability based on the conditional response time distributions are derived and sensitivity analyses are carried out to analyze the impact of long response time on service unavailability. The evaluation provides practical quantitative results that can help distributed system developers in design decisions.||||||||||||./pdfs/18-DPDNS-paper-1.pdf",
    "A Physical Particle and Plane Framework for Load Balancing in Multiprocesso|A Physical Particle and Plane Framework for Load Balancing in Multiprocessors Navid Imani Hamid Sarbazi Azad Different models for load balancing have been proposed before, each of which has its own features and advantages when considered for a specific scenario. Yet, nearly all of the existing techniques have assumed an oversimplified model of the system which is often not the case of the real world. In this paper, a new gradient based algorithm for dynamic load balancing on multiprocessors is proposed. This algorithm is an analogy of a classical physical model of a Particle \\& Plane system which operates based on the classic laws of physics dictated by the nature.||||||||||||./pdfs/18-NIDISC-paper-1.pdf",
    "Designing Next Generation Data-Centers with Advanced Communication Protocol|Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services P. Balaji K. Vaidyanathan S. Narravula H. -w. Jin D. K. Panda Current data-centers rely on TCP/IP over Fast- and Gigabit-Ethernet for data communication even within the cluster environment for cost-effective designs, thus limiting their maximum capacity. Together with raw performance, such data-centers also lack in efficient support for intelligent services, such as requirements for caching documents, managing limited physical resources, load-balancing, controlling overload scenarios, and prioritization and QoS mechanisms, that are becoming a common requirement today. On the other hand, the System Area Network (SAN) technology is making rapid advances during the recent years. Besides high performance, these modern interconnects are providing a range of novel features and their support in hardware (e.g., RDMA, atomic operations, QoS support). In this paper, we address the capabilities of these current generation SAN technologies in addressing the limitations of existing data-centers. Specifically, we present a novel framework comprising of three layers (communication protocol support, data-center service primitives and advanced data-center services) that work together to tackle the issues associated with existing data-centers. We also present preliminary results in the various aspects of the framework, which demonstrate close to an order of magnitude performance benefits achievable by our framework as compared to existing data-centers in several cases.||||||||||||./pdfs/18-NSFNGS-paper-1.pdf",
    "Predicting Failures of Computer Systems: A Case Study for a Telecommunicati|Predicting Failures of Computer Systems: A Case Study for a Telecommunication System Felix Salfner Michael Schieschke Miroslaw Malek The goal of online failure prediction is to forecast imminent failures while the system is running. This paper compares Similar Events Prediction (SEP) with two other well-known techniques for online failure prediction: a straightforward method that is based on a reliability model and Dispersion Frame Technique (DFT). SEP is based on recognition of failure-prone patterns utilizing a semi-Markov chain in combination with clustering. We applied the approaches to real data of a commercial telecommunication system. Results are presented in terms of precision, recall, F-measure and accumulated runtime-cost. The results suggest a significantly improved forecasting performance.||||||||||||./pdfs/19-DPDNS-paper-1.pdf",
    "I/O Conscious Algorithm Design and Systems Support for Data Analysis on Eme|I/O Conscious Algorithm Design and Systems Support for Data Analysis on Emerging Architectures G. Buehrer A. Ghoting Xi Zhang S. Tatikonda S. Parthasarathy T. Kurc J. Saltz Advances in data collection and storage technologies have given rise to large dynamic data stores. In order to effectively manage and mine such stores on modern and emerging architectures, one must consider both designing effective middleware support and re-architecting algorithms, to derive performance that commensurates with technological advances. In this article, we present a top-down view of how one can achieve this goal for next generation data analysis centers. Specifically, we present a case study on frequent pattern algorithms, and show how such algorithms can be re-structured to be cache, memory and I/O conscious. Furthermore, motivated by such algorithms, we present a services oriented middleware framework for the derivation of high performance on next generation architectures.||||||||||||./pdfs/19-NSFNGS-paper-1.pdf",
    "A Parallel Memetic Algorithm Applied to the Total Tardiness Machine Schedul|A Parallel Memetic Algorithm Applied to the Total Tardiness Machine Scheduling Problem Vin&iacute;cius Jacques Garcia Paulo Morelato Fran&ccedil;a Alexandre De Sousa Mendes Pablo Moscato This work proposes a parallel memetic algorithm applied to the total tardiness single machine scheduling problem. Classical models of parallel evolutionary algorithms and the general structure of memetic algorithms are discussed. The classical model of global parallel genetic algorithm was used to model the global parallel memetic analogue where the parallelization is only applied to the individual optimization phase of the algorithm. Computational tests show the efficiency of the parallel approach when compared to the sequential version. A set of eight instances, with sizes ranging from 56 up to 323 jobs and with known optimal solutions, is used for the comparisons.||||||||||||./pdfs/2-NIDISC-paper-1.pdf",
    "Virtual Playgrounds: Managing Virtual Resources in the Grid |Virtual Playgrounds: Managing Virtual Resources in the Grid K. Keahey J. Chase I. Foster Large Grid deployments increasingly require abstractions and methods decoupling the work of resource providers and resource consumers to implement scalable management methods. We proposed the abstraction of a Virtual Workspace (VW) describing a virtual execution environment that can be made dynamically available to authorized Grid clients by using well-defined protocols. Virtual workspaces provide resources in controllable ways that are independent of how a resource is consumed. A Virtual Playground may combine many such workspaces, as well as other aspects of virtual environments, such as networking and storage, to form virtual Grids. In this paper, we report on the goals and progress of the Virtual Playground Project and put in context the research to date.||||||||||||./pdfs/20-NSFNGS-paper-1.pdf",
    "Partial and dynamic Reconfiguration of FPGAs : a top down design methodolog|Partial and dynamic Reconfiguration of FPGAs : a top down design methodology for an automatic implementation Florent Berthelot Fabienne Nouvel Dominique Houzet Dynamic reconfiguration of FPGAs enables systems to adapt to changing demands. This paper concentrates on how to take into account specificities of partially reconfigurable components during the high level Adequation Algorithm Architecture process. We present a method which generates automatically the design for both partially and fixed parts of FPGAs. The runtime reconfiguration manager which monitors dynamic reconfigurations, uses prefetching technic to minimize reconfiguration latency of runtime reconfiguration. We demonstrate the benefits of this approach through the design of a dynamic reconfigurable MC-CDMA transmitter implemented on a Xilinx Virtex2. This methodology is architecture's manufacturers independant and can be applied to different FPGAs.||||||||||||./pdfs/20-RAW-paper-1.pdf",
    "Web Server Protection by Customized Instruction Set |Web Server Protection by Customized Instruction Set Bernhard Fechner J&ouml;rg Keller Andreas Wohlfeld We present a novel technique to secure the execution of a processor against the execution of malicious code (trojans, viruses). The main idea is to permute parts of the opcode values so that it gets a different semantic meaning. A virus which does not know the permutation is not able to execute and will cause a failure such as segmentation violation, whereby the execution of malicious code is prevented. The permutation is realized by a lookup table. We develop several variants that require only small changes to microprocessors. We sketch how to bootstrap a system such that all intended applications (including operating system) are reversely permuted, and can execute as intended. While this will be cumbersome for typical personal computers, it will work for web servers, because the number of applications and frequency of installation is lower. Furthermore, web servers are particularly endangered: they cannot be protected as good as personal computers, because by the very nature of their duty they are more openly connected with the internet than any other computer in an organization's network.||||||||||||./pdfs/21-DPDNS-paper-1.pdf",
    "The GHS Grid Scheduling System: Implementation and Performance Comparison |The GHS Grid Scheduling System: Implementation and Performance Comparison Ming Wu Xian-he Sun Effective task scheduling and deployment is hard to achieve in a Grid environment, where computing resources are heterogamous and shared between local and Grid users without a central control. Current scheduling systems, such as AppLeS, use NWS (Network Weather Service) for short-term estimation of resource availability and do not address the influence of the variation of resource availability in task scheduling. These inherent limitations prevent existing scheduling systems from working effectively to solve large-scale tasks in a Grid environment. Adopting APST (AppLeS Parameter Sweep Template) as the deployment environment, we have developed a task scheduling system for large-scale applications based on our recent results in performance prediction and task scheduling. Preliminary experimental results show that the newly developed system works well and is significantly more appropriate for large applications than existing systems.||||||||||||./pdfs/21-NSFNGS-paper-1.pdf",
    "A Simulation Study of the Effects of Multi-path Approaches in e-Commerce Ap|A Simulation Study of the Effects of Multi-path Approaches in e-Commerce Applications Paolo Romano Francesco Quaglia Bruno Ciciani Response time is a key factor of any e-Commerce application, and a set of solutions have been proposed to provide low response time despite network congestions or failures. Being them mostly based on caching of Web objects and replication of DBMS managed data at the edges, or at intermediate points, of the Web infrastructure, they reveal effective when handling client requests only performing read access to application data. However, any update request typically needs to be redirected to the origin DBMSs, hence not taking advantage from data replication and related client proximity. In order to alleviate the effects of network congestions or failures, we have proposed a multi-path protocol that increases the likelihood for the update request to be processed along a responsive (e.g. failure free) network path in between the client location and the origin DBMS sites. In this paper we present an extensive simulation study of the effects of such a multi-path approach on the client perceived response time. The study relies on both Brite generated network topologies and the NLANR graph. Also, well known realistic TCP models are used to capture the effects of network delays during both normal and anomalous (i.e. packet loss affected) operation mode.||||||||||||./pdfs/22-DPDNS-paper-1.pdf",
    "On Improving Performance and Energy Profiles of Sparse Scientific Applicati|On Improving Performance and Energy Profiles of Sparse Scientific Applications Konrad Malkowski Ingyu Lee Padma Raghavan Mary Jane Irwin In many scientific applications, the majority of the execution time is spent within a few basic \\emph sparse kernels such as sparse matrix vector multiplication (SMV). Such sparse kernels can utilize only a fraction of the available processing speed because of their relatively large number of data accesses per floating point operation, and limited data locality and data re-use. Algorithmic changes and tuning of codes through blocking and loop unrolling schemes can improve performance but such tuned versions are typically not available in benchmark suites such as the SPEC CFP 2000. In this paper, we consider sparse SMV kernels with different levels of tuning that are representative of this application space. We emulate certain memory subsystem optimizations using SimpleScalar and Wattch to evaluate improvements in performance and energy metrics. We also characterize how such an evaluation can be affected by the interplay between code tuning and memory subsystem optimizations. Our results indicate that the optimizations reduce execution time by over 40\\%, and the energy by over 85\\%, when used with power control modes of CPUs and caches. Furthermore, the relative impact of the same set of memory subsystem optimizations can vary significantly depending on the level of code tuning. Consequently, it may be appropriate to augment traditional benchmarks by tuned kernels typical of high performance sparse scientific codes to enable comprehensive evaluations of future systems.||||||||||||./pdfs/22-NSFNGS-paper-1.pdf",
    "Dynamic Resource Allocation of Computer Clusters with Probabilistic Workloa|Dynamic Resource Allocation of Computer Clusters with Probabilistic Workloads Marwan Sleiman Lester Lipsky Robert Sheahan Real-time resource scheduling is an important factor for improving the performance of cluster computing. In many distributed and parallel processing systems, particularly real-time systems, it is desirable and more efficient for jobs to finish as close to a target time as possible. This work models the execution time for such a stochastic environment and proposes a dynamic algorithm for optimizing the job completion times by dynamically allocating resources to jobs that are behind schedule and taking resources from jobs that are ahead of schedule. We validate our analytical model with simulations that represent the real computing environment. The results of our simulations show that our alternative is the best estimate to predict the time remaining by using earlier data. Emphasis is placed on where variance enters the system and how well it can be controlled. Also our dynamic algorithm involves modifying the architecture to help reduce the peak number of servers used to execute a job and thus optimize the computation cost.||||||||||||./pdfs/23-DPDNS-paper-1.pdf",
    "An Automated Approach to Improve Communication-Computation Overlap in Clust|An Automated Approach to Improve Communication-Computation Overlap in Clusters Lewis Fishgold Anthony Danalis Lori Pollock Martin Swany Applications that execute on parallel clusters face scalability concerns due to the high communication overhead that is usually associated with such environments. Modern network technologies that support Remote Direct Memory Access (RDMA) can offer true zero copy communication and reduce communication overhead by overlapping it with computation. For this approach to be effective the parallel application using the cluster must be structured in a way that enables communication computation overlapping. Unfortunately, the trade-off between maintainability and performance often leads to a structure that prevents exploiting the potential for communication computation overlapping. This paper describes a sourceto- source optimizing transformation that can be performed by an automatic (or semi-automatic) system in order to restructure MPI codes towards maximizing communication-computation overlapping.||||||||||||./pdfs/23-NSFNGS-paper-1.pdf",
    "Evaluating a Clock Synchronization for Dependable Sensor Networks |Evaluating a Clock Synchronization for Dependable Sensor Networks Spiro Trikaliotis Georg Lukas A synchronized clock is an important prerequisite for many distributed algorithms. This clock is used to give an occured before relationship, as well as for synchronizing distributed actions. There are many clock synchronization algorithms with varying precisions and assumptions on the underlying network topology. In this paper, a synchronization protocol is presented which achieves a high precision in the order of $20 \\mu s$ to $30 \\mu s$ in a one-hop wireless environment, and a multiple of this value for multi-hop wireless networks, such as sensor networks. The protocol works reliably even if message losses occur, which is very likely in wireless networks. For this, it utilizes redundancy in the sent time information. This protocol is implemented and evaluated on standard PC hardware running RT-Linux/Free, and an outline of the extension for multi-hop scenarios is given.||||||||||||./pdfs/24-DPDNS-paper-1.pdf",
    "Decentralized Runtime Analysis of Multithreaded Applications |Decentralized Runtime Analysis of Multithreaded Applications Koushik Sen Abhay Vardhan Gul Agha Grigore Rosu Violations of a number of common safety properties of multithreaded programs--such as atomicity and absence of dataraces--cannot be observed by looking at the linear execution trace. We characterize a class of such properties, called \\textit robust properties , and define a simple but expressive epistemic logic to specify them. We then develop an efficient algorithm to automatically monitor and predict violations of robust safety properties. Our algorithm is based on capturing the causal structure of a computation through a mechanism similar to vector clock updates. The algorithm automatically synthesizes decentralized monitors to evaluate the information at each thread and to detect and predict safety violations. Based on this approach, a tool named \\textsc DAME has been developed and evaluated on some simple examples.||||||||||||./pdfs/24-NSFNGS-paper-1.pdf",
    "Construction of Efficient OR-based Deletion-tolerant Coding Schemes |Construction of Efficient OR-based Deletion-tolerant Coding Schemes Peter Sobe Kathrin Peter Fault--tolerant data layouts for storage systems are based on the principle to add redundancy to groups of data blocks and store them in different fault regions. Commonly, XOR-based codes are used with an optimal redundancy overhead but with the disadvantage of relatively high calculation costs. We present a scheme that encodes input data in a highly redundant code and exploits that redundancy for a fault tolerance scheme. It allows to recalculate missed bits in fewer steps than needed for XOR-based schemes. This simple and efficient en- and decoding requires an appropriate hardware architecture or a highly parallel microprocessor architecture. Particularly, disjunctions over many input bits must be calculated, e.g. by wide OR-gates or busses that are driven by multiple logic input lines. The high redundant encoding is combined with data compression for separated data streams, each stream dedicated to a storage device. The compression not only eliminates the introduced redundancy of the used code, it also eliminates redundancy in the input data.||||||||||||./pdfs/25-DPDNS-paper-1.pdf",
    "Aligning Traces for Performance Evaluation |Aligning Traces for Performance Evaluation Todd Mytkowicz Amer Diwan Matthias Hauswirth Peter F. Sweeney For many performance analysis problems, the ability to reason across traces is invaluable. However, due to non-determinism in the OS and virtual machines, even two identical runs of an application yield slightly different traces. For example, it is unlikely that two identical runs of an application will suffer context switches at exactly the same points. These sorts of variations across traces make it difficult to reason across traces. This paper describes and evaluates an algorithm, Dynamic Time Warping (DTW), that can be used to align traces, thus enabling us to reason across traces. While DTW comes from prior work our use of DTW is novel. Also we describe and evaluate an enhancement to DTW that significantly improves the quality of its alignments. Our results show that for applications whose performance varies significantly over time, DTW does a great job at aligning the traces. For applications whose performance stays largely constant for significant periods of time, the original DTW does not perform well; however, our enhanced DTW performs much better.||||||||||||./pdfs/25-NSFNGS-paper-1.pdf",
    "Architecture of a Multi-Context FPGA Using a hybrid Multiple-Valued/Binary |Architecture of a Multi-Context FPGA Using a hybrid Multiple-Valued/Binary Context Switching Signal Yoshihiro Nakatani Masanori Hariyama Michitaka Kameyama Multi-context FPGAs have multiple memory bits per configuration bit forming configuration planes for fast switching between contexts. Large amount of memory causes significant overhead in area and power consumption. This paper presents two key technologies. The first is a floating-gate-MOS functional pass gate that merges storage and switching functions area-efficiently. The second is the use of a hybrid multiple-valued/binary context switching signal that eliminates redundancy of a conventional multi-context (MC) switch with high scalability. The transistor count of the proposed MC-switch is reduced to 7\\% in comparison with that of a SRAM-based one.||||||||||||./pdfs/25-RAW-paper-1.pdf",
    "Model-driven Generative Techniques for Scalable Performability Analysis of |Model-driven Generative Techniques for Scalable Performability Analysis of Distributed Systems Arundhati Kogekar Dimple Kaul Aniruddha Gokhale Paul Vandal Upsorn Praphamontripong Swapna Gokhale Jing Zhang Yuehua Lin Jeffrey Gray The ever increasing societal demand for the timely availability of newer and feature-rich but highly dependable network-centric applications imposes the need for these applications to be constructed by the composition, assembly and deployment of off-the-shelf infrastructure and domain-specific services building blocks. Service Oriented Architecture (SOA) is an emerging paradigm to build applications in this manner by defining a choreography of loosely coupled building blocks. However, current research in SOA does not yet address the performability (i.e., performance and dependability) challenges of these modern applications. Our research is developing novel mechanisms to address these challenges. We initially focus on the composition and configuration of the infrastructure hosting the individual services. We illustrate the use of domain-specific modeling languages and model weavers to model infrastructure composition using middleware building blocks, and to enhance these models with the desired performability attributes. We also demonstrate the use of generative tools that synthesize metadata from these models for performability validation using analytical, simulation and empirical benchmarking tools.||||||||||||./pdfs/26-NSFNGS-paper-1.pdf",
    "A High Level SoC Power Estimation Based on IP Modeling |A High Level SoC Power Estimation Based on IP Modeling David Elleouet Nathalie Julien Dominique Houzet Current electronic system design requires to be concerned with power consumption consideration. However, in a lot of design tools, the application power consumption budget is estimated after RTL synthesis. We propose in this article a methodology based on measurements which allows to model the application power consumption with architectural and algorithmic parameters. So, the modeled applications can be added in a library in order to help the system designer to determine early in the design flow the best adequacy between high performances and low power consumption.||||||||||||./pdfs/26-RAW-paper-1.pdf",
    "Techniques Supporting threadprivate in OpenMP |Techniques Supporting threadprivate in OpenMP Xavier Martorell Marc Gonzalez Alejandro Duran Jairo Balart Roger Ferrer Eduard Ayguade Jesus Labarta This paper presents the alternatives available to support threadprivate data in OpenMP and evaluates them. We show how current compilation systems rely on custom techniques for implementing thread-local data. But in fact the ELF binary specification currently supports data sections that become threadprivate by default. ELF naming for such areas is Thread-Local Storage (TLS). Our experiments demonstrate that implementing threadprivate based on the TLS support is very easy, and more efficient. This proposal goes in the same line as the future implementation of OpenMP on the GNU compiler collection. In addition, our experience with the use of threadprivate in OpenMP applications shows that usually it is better to avoid it. This is because threadprivate variables reside in common blocks and they impede the compiler to fully optimize the code. So it is better to keep threadprivate as a temporary technique only to ease porting MPI codes to OpenMP.||||||||||||./pdfs/27-HIPS-paper-1.pdf",
    "Engineering Reliability into Hybrid Systems via Rich Design Models: Recent |Engineering Reliability into Hybrid Systems via Rich Design Models: Recent Results and Current Directions Somo Banerjee Leslie Cheung Leana Golubchik Nenad Medvidovic Roshanak Roshandel Gaurav Sukhatme Software reliability techniques are aimed at reducing or eliminating failures in software systems. Reliability in software systems has traditionally been measured during or after system implementation. However, software engineering methodology lays stress on doing the ``correct things'' early on in the software development lifecycle in order to curb development and maintenance costs. In this paper, we argue that reliability of a software system should be assessed throughout the system?s life span, starting with the software architecture level. Our research goal is to estimate the reliability of software systems in early design stages, which we believe involves the ability to reason about numerous uncertainties that exist in this stage, including uncertainty due to lack of execution artifacts. Our proposed approach is to develop techniques that will couple software architectural models with a suite of stochastic reliability estimation models and allow us to reason about these uncertainties. In this paper, we present our recent results using our technique for reliability estimation of software components at the level of software architecture. Another important part of this paper is the discussion of our ongoing research efforts and open research problems in this area.||||||||||||./pdfs/27-NSFNGS-paper-1.pdf",
    "Power Consumption Comparison for Regular Wireless Topologies using Fault-To|Power Consumption Comparison for Regular Wireless Topologies using Fault-Tolerant Beacon Vector Routing Luke Demoracski Dimiter R. Avresky Fault-tolerant Beacon Vector Routing (FBVR) is an efficient technique for routing in the presence of node failures. Several common wireless topologies exist that can be used with this technique. This paper compares the power consumption of various regular topologies using FBVR and makes appropriate recommendations. The topology types include Mesh, Torus, Communication Graph, and F-Cycle Ring (FCR). An existing analytical method for power consumption prediction is used. The results of this analytical method are compared against simulation results, which match closely, showing a high level of confidence in the power consumption results.||||||||||||./pdfs/28-DPDNS-paper-1.pdf",
    "The Monitoring Request Interface (MRI) |The Monitoring Request Interface (MRI) Edmond Kereku Michael Gerndt In this paper we present MRI, a high level interface for selective monitoring of code regions and data structures in single and multiprocessor environments. MRI keeps transparent the available monitoring resources from the performance analysis tools and can electively generate monitoring results as online profile information, or as postmortem traces. MRI is the first step toward a standard monitoring interface which can be used by a broad range of performance analysis tools, from profiler tools, trace producers and visualizers, up to complex automatic performance analyzers. We also present an implementation of MRI for SMPs which transparently use a simulation backend and a PAPI backend to obtain performance data.||||||||||||./pdfs/28-HIPS-paper-1.pdf",
    "Babylon v2.0:Middleware for Distributed, Parallel, and Mobile Java Applicat|Babylon v2.0:Middleware for Distributed, Parallel, and Mobile Java Applications Willem Van Heiningen Tim Brecht Steve Macdonald Babylon v2.0 is a collection of tools and services that provide a 100\\% Java compatible environment for developing, running and managing parallel, distributed and mobile Java applications. It incorporates features like object migration, asynchronous method invocation and remote class loading while providing an easy-to-use interface. Additionally, Babylon v2.0 enables Java applications to seamlessly create and interact with remote objects while protecting those objects from other applications by implementing access restrictions and separate name spaces. This paper describes the most important programming features of the Babylon v2.0 system, using a heat diffusion example to show how they are used in practice. The potential cluster computing benefits of the system are demonstrated with experimental results which show that sequential Java applications can achieve significant performance benefits from using Babylon v2.0 to parallelize their work across a cluster of workstations.||||||||||||./pdfs/29-HIPS-paper-1.pdf",
    "Sharing Resources with Artificial Ants |Sharing Resources with Artificial Ants Christophe Gu&eacute;ret Nicolas Monmarch&eacute; Mohamed Slimane As networks are growing up , more and more information becomes available every day. Despite the presence of software enabling communications and content sharing, they are not always shared among people inside networks. We present here an architecture aimed at helping people to share informations and find collaborators inside an organization. It is part of our PIAF framework, an intelligent agent system we use to develop recommender and personalization software. The main contribution of this paper is the introduction of principles of stigmergy and artificial ants to model data flows in a social network.||||||||||||./pdfs/3-NIDISC-paper-1.pdf",
    "Modeling and Executing Master-Worker applications |Modeling and Executing Master-Worker applications Hinde Lilia Bouziane Christian P&eacute;rez Thierry Priol This paper describes work in progress to extend component models to support Master-Worker applications and to let them to be executed on Grid infrastructures. The proposed approach is generic enough to be applied to existing component models such as the OMG CORBA and the ObjectWeb FRACTAL component models. One objective of our research is to relieve Grid application designers of managing low level programming and implementation aspects. With the proposed approach, a designer has only to cope with the description of an abstract view of the application architecture in which he has to specify what the master and the workers have to do while leaving the system environment to manage the low level aspects such as communication between the master and the workers.||||||||||||./pdfs/31-HIPS-paper-1.pdf",
    "Tree-based Overlay Networks for Scalable Applications |Tree-based Overlay Networks for Scalable Applications Dorian C. Arnold Gary D. Pack Barton P. Miller The increasing availability of high-performance computing systems with thousands, tens of thousands, and even hundreds of thousands of computational nodes is driving the demand for programming models and infrastructures that allow effective use of such large-scale environments. Tree-based Overlay Networks (TB\\=ONs) have proven to provide such a model for distributed tools like performance profilers, parallel debuggers, system monitors and system administration tools. We demonstrate that the extensibility and flexibility of the TB\\=ON distributed computing model, along with its performance characteristics, make it surprisingly general, particularly for applications outside the tool domain. We describe many interesting applications and commonly-used algorithms for which TB\\=ONs are well-suited and provide a new (non-tool) case study, a distributed implementation of the \\it mean-shift algorithm commonly used in computer vision to delineate arbitrarily shaped clusters in complex, multi-modal feature spaces.||||||||||||./pdfs/33-HIPS-paper-1.pdf",
    "Performance and Power Analysis of Time-multiplexed Execution on Dynamically|Performance and Power Analysis of Time-multiplexed Execution on Dynamically Reconfigurable Processor Yohei Hasegawa Shohei Abe Shunsuke Kurotaki Vu Manh Tuan Naohiro Katsura Takuro Nakamura Takashi Nishimura Hideharu Amano Dynamically Reconfigurable Processor (DRP) developed by NEC Electronics is a coarse grain reconfigurable processor that selects a datapath called a context from the on-chip repository of sixteen circuit configurations at run-time. The time-multiplexed execution based on the multicontext functionality is expected to drastically improve area and power efficiency. To demonstrate the impact of the time-multiplexed execution, we have implemented several stream applications on DRP with various context sizes. Throughout the evaluation based on real application designs, we analyzed the impact of the time-multiplexed execution on performance and power dissipation quantitatively.||||||||||||./pdfs/33-RAW-paper-1.pdf",
    "2D Defragmentation Heuristics for Hardware Multitasking on Reconfigurable D|2D Defragmentation Heuristics for Hardware Multitasking on Reconfigurable Devices Julio Septi&eacute;n Hortensia Mecha Daniel Mozos Jes&uacute;s Tabero This paper focuses on the fragmentation problem produced in 2D run-time reconfigurable FPGAs when hardware multitasking management is considered. Though allocation heuristics can take fragmentation into account when a new task arrives, the free area becomes inevitably fragmented as the tasks finish and exit the FPGA. The main contributions of our work are a fragmentation metric able to estimate when the FPGA fragmentation status has become critical, and several heuristics to decide when to perform defragmentation and how to perform it. This defragmentation heuristics can be of a preventive kind, driven by alarms that fire when isolated islands appear or a high fragmentation status is reached. It can be also an on-demand process produced when a task allocation fails though there is enough free area in the FPGA to accommodate it.||||||||||||./pdfs/41-RAW-paper-1.pdf",
    "Reconfiguration of Embedded Java Applications |Reconfiguration of Embedded Java Applications Jo&atilde;o Cl&aacute;udio Soares Otero Fl&aacute;vio Rech Wagner Luigi Carro This work presents the development of a coarse grain reconfigurable unit to be coupled to a native Java microcontroller, which is designed for an optimized execution of the embedded application. Code fragments to be accelerated through this unit are identified by profiling the application. The unit is able to explore ILP in a simple way and allows for Java compatibility, while also reducing the number of executed instructions, thus improving the performance with simultaneous energy savings. In many cases, as demonstrated by experiments, it also allows for smaller power consumption.||||||||||||./pdfs/42-RAW-paper-1.pdf",
    "Multi-Clock Pipelined Design of an IEEE 802.11a Physical Layer Transmitter |Multi-Clock Pipelined Design of an IEEE 802.11a Physical Layer Transmitter Maryam Mizani Daler Rakhmatov Among different wireless LAN technologies 802.11a has recently become popular due to its high throughput, large system capacity, and relatively long range. In this paper, we propose a reconfigurable architecture for the 802.11a physical layer transmitter, which has low latency and low power consumption due to its pipelined structure. Data from the MAC layer can continuously flow through the pipeline without excessive buffering and handshaking within the physical layer. Dynamically reconfiguring this architecture to work at any data rate supported by 802.11a (eight different modes) can be performed within a few cycles, simply by adjusting the period of two clock signals and changing the value of a 3-bit control signal. Our architecture, prototyped on a Xilinx Virtex-II Pro FPGA, occupies the area of 2059 slices and is estimated to consume 500 $mW$. These figures can be improved substantially in custom ASIC implementations.||||||||||||./pdfs/45-RAW-paper-1.pdf",
    "Multi-level Reconfigurable Architectures in the Switch Model |Multi-level Reconfigurable Architectures in the Switch Model Sebastian Lange Martin Middendorf In this paper we study multi-level dynamically reconfigurable architectures. These are extensions of standard reconfigurable architectures where ordinary reconfiguration operations correspond to the lowest reconfiguration level. On each higher reconfiguration level the reconfiguration capabilities of the reconfigurable resources that are available on the level directly below can be reconfigured. We show that the problem to find optimal reconfigurations with an arbitrary number of reconfiguration levels can be found in polynomial time for the switch cost model. The problem of finding the optimal number of reconfiguration levels is shown to be solvable in polynomial time on homogenous multi-level architectures but it becomes NP-hard for heterogenous multi-level architectures. Moreover, we present experimental results for some example problems on a simple test architecture.||||||||||||./pdfs/46-RAW-paper-1.pdf",
    "Power-Dependable Transactions in Mobile Networks |Power-Dependable Transactions in Mobile Networks Ami Marowka David Sem&eacute; We define a Quality-of-Power-Service (QoPS) metric to evaluate the efficiency of power-aware routing protocols in wireless ad-hoc networks. The aim of power management of routing protocols is to prolong the life-time of individual nodes in wireless network and thus to increase the delivery rate of Unicast transactions. QoPS metric is applied to different location-based Unicast transaction protocols. The results confirm that powerrelative distribution of data streams in multi-paths Unicast transaction protocols consume substantially less energy from individual nodes than from other distribution methods. The locality distribution phenomenon discovered by the simulations explains, on the one hand, the long lifetime of large, dense, and highly degree wireless networks, and on the other hand, the short lifetime of small, sparse, and low degree networks.||||||||||||./pdfs/5-DPDNS-paper-1.pdf",
    "Mapping DSP Applications on Processor Systems with Coarse-Grain Reconfigura|Mapping DSP Applications on Processor Systems with Coarse-Grain Reconfigurable Hardware Michalis D. Galanis Gregory Dimitroulakos Costas E. Goutis In this paper, we present performance results from mapping five real-world DSP applications on an embedded system-on-chip that incorporates coarse-grain reconfigurable logic with an instruction-set processor. The reconfigurable logic is realized by a 2-Dimensional Array of Processing Elements. A mapping flow for improving application?s performance by accelerating critical software parts, called kernels, on the Coarse-Grain Reconfigurable Array is proposed. Profiling is performed for detecting critical kernel code. For mapping the detected kernels on the reconfigurable logic a priority-based mapping algorithm has been developed. The experiments for three different instances of a generic system show that the speedup from executing kernels on the Reconfigurable Array ranges from 9.9 to 151.1, with an average value of 54.1, relative to the kernels? execution on the processor. Important overall application speedups, due to the kernels? acceleration, have been reported for the five applications. These overall performance improvements range from 1.3 to 3.7, with an average value of 2.3, relative to an all-software execution.||||||||||||./pdfs/5-RAW-paper-1.pdf",
    "Speech Silicon AM: An FPGA-Based Acoustic Modeling Pipeline for Hidden Mark|Speech Silicon AM: An FPGA-Based Acoustic Modeling Pipeline for Hidden Markov Model based Speech Recognition Jeffrey W. Schuster Raymond Hoare Kshitij Gupta This paper presents the design of a FPGA-based hardware co-processor capable of performing continuous speech recognition on medium sized vocabularies in real-time. The system is based on models derived through analysis of the SPHINX 3 large vocabulary continuous speech recognition engine designed by CMU. By creating a custom, input-driven pipeline for performing the calculations we were able to maximize the throughput of the system while simultaneously minimizing the number of pipeline stalls. By using embedded multiply-accumulate ASIC cells in the FPGA and using advanced placement techniques we were able to reach post place-and-route speeds even greater than those necessary for real-time operation while operating at maximum workload. Further, we use ?input control vectors?, rather than internal finite state machines, to shut down portions of the pipeline when they were not in use to help mitigate power consumption. These results combined with the ability to reprogram the system for different recognition tasks serve to create a system capable of performing real-time speech recognition in a vast array of environments. We synthesized our hardware to a Xilinx Virtex 4 SX and a Xilinx Spartan 3 FPGA. Functional verification was through post place-and-route simulations.||||||||||||./pdfs/51-RAW-paper-1.pdf",
    "Distributed Monte Carlo Simulation of Light Transportation in Tissue |Distributed Monte Carlo Simulation of Light Transportation in Tissue Andrew J. Page Shirley Coyle Thomas M. Keane Thomas J. Naughton Charles Markham Tomas Ward A distributed Monte Carlo simulation which models the propagation of light through tissue has been developed. It will allow for improved calibration of medical imaging devices for investigating tissue oxygenation in the white matter of the cerebral cortex. The application can distribute the simulation over an unbounded number of processors in parallel. We have found that this application is highly parallelisable resulting in up to 97% efficiency at 60 processors running on a homogeneous Java distributed system. A distributed system with 150 heterogeneous processors was used to simulate the paths of photons in a brain tissue model. We found that the source illumination footprint has an effect on the distribution of photons in the head and that lasers do produce a small beam in a highly scattering medium. This application will help researchers to improve the accuracy of their experiments.||||||||||||./pdfs/510-JAVAPDC-paper-1.pdf",
    "The Benefits of Java and Jini in the JGrid System |The Benefits of Java and Jini in the JGrid System Szabolcs Pota Zoltan Juhasz The Java language and platform have been considered by many as natural candidate for creating grid systems. The platform-independent runtime environment, safe and high-level language and its built-in support for networking and security are very valuable features. Despite its potential and the many proof-of-concept systems developed, the grid community is turning to web services technology as its implementation base. In this paper, we show that Java, by joining forces with Jini Technology can provide a very appealing technology base for highly dynamic grid systems. The key properties of Java and Jini technology are examined with reference to their role in grids. Then, the JGrid Jini-based service-oriented grid system is overviewed describing its key concepts, services and how it extends Jini to address some of the unique requirements of grid systems.||||||||||||./pdfs/511-JAVAPDC-paper-1.pdf",
    "Parallel Implementation of the Replica Exchange Molecular Dynamics Algorith|Parallel Implementation of the Replica Exchange Molecular Dynamics Algorithm on Blue Gene/L M. Eleftheriou A. Rayshubski J. W. Pitera B. G. Fitch R. Zhou R. S. Germain The Replica Exchange method is a popular approach for studying the folding thermodynamics of small to modest size proteins in explicit solvent, since it is easily parallelized. However, Replica Exchange can become computationally expensive for large-scale studies, due to the number of replicas needed as well as interprocessor communication requirements both between and within replicas. In this paper we discuss an implementation of Replica Exchange Molecular Dynamics on Blue Gene/L for performing large scale simulation studies of systems of biological interest. The algorithm is tuned with an awareness of the physical network topology and hardware performance features of the Blue Gene/L architecture. Performance measurements for Replica Exchange using the Blue Matter Molecular Dynamics application are presented on Blue Gene/L hardware with up to 256 replicas simulated on 8,192 compute nodes. Both scalability and performance are achieved with this implementation.||||||||||||./pdfs/53-HiCOMB-paper-1.pdf",
    "ReConfigME: A Detailed Implementation of an Operating System for Reconfigur|ReConfigME: A Detailed Implementation of an Operating System for Reconfigurable Computing Grant Wigley David Kearney Mark Jasiunas Reconfigurable computing applications have traditionally had the exclusive use of the field programmable gate array, primarily because the logic densities of the available devices have been relatively similar in size compared to the application. But with the modern FPGA expanding beyond 10 million system gates, and through the use of dynamic reconfiguration, it has become feasible for several applications to share a single high density device. However, developing applications that share a device is difficult as the current design flow assumes the exclusive use of the FPGA resources. As a consequence, the designer must ensure that resources have been allocated for all possible combinations of loaded applications at design time. If the sequence of application loading and unloading is not known in advance, all resource allocation cannot be performed at design time because the availability of resources changes dynamically. In this paper we present an implementation of an operating system that has the ability to share its FPGA resources dynamically among multiple executing applications.||||||||||||./pdfs/55-RAW-paper-1.pdf",
    "Parallel Implementation of a Quartet-Based Algorithm for Phylogenetic Analy|Parallel Implementation of a Quartet-Based Algorithm for Phylogenetic Analysis B. B. Zhou D. Chu M. Tarawneh P. Wang C. Wang A. Y. Zomaya R. P. Brent This paper describes a parallel implementation of our recently developed algorithm for phylogenetic analysis on the IBM BlueGene/L cluster. This algorithm constructs evolutionary trees for a given set of DNA or protein sequences based on the topological information of every possible quartet trees. Our experimental results showed that it has several advantages over many popular algorithms. By distributing the quartet weights evenly across the processing nodes and making effective use of a fast collective network on the IBM BlueGene/L cluster, we are able to achieve a close to linear speedup even when the number of processors involved in the computation is large.||||||||||||./pdfs/56-HiCOMB-paper-1.pdf",
    "Investigation into Programmability for Layer 2 Protocol Frame Delineation A|Investigation into Programmability for Layer 2 Protocol Frame Delineation Architectures Ciaran Toal Sakir Sezer This paper presents the design and study of reconfigurable architectures for two data-link layer frame delineation techniques used for ATM and GFP. The architectures are targeted to Altera Stratix II FPGA technology and are investigated in terms of performance and area. This work addresses the potential for incorporating programmability into custom purpose architectures that could enable the same processing hardware to be used for processing multiple protocols.||||||||||||./pdfs/57-RAW-paper-1.pdf",
    "A Method to Improve Structural Modeling Based on Conserved Domain Clusters |A Method to Improve Structural Modeling Based on Conserved Domain Clusters Fa Zhang Lin Xu Bo Yuan Homology modeling requires an accurate alignment between a query sequence and its homologs with known three-dimensional (3D) information. Current structural modeling techniques largely use entire protein chains as templates, which are selected based only on their sequence alignments with the queries. Protein can be largely described as combinations of conserved domains, and already more than two-third of the known protein domains can be found in the Protein Data Bank (PDB). We presented a method to improve structural modeling based on conserved domain clusters. First, we searched and mapped all the InterPro domains in the entire PDB, partitioned and clustered homologous domains into the domain-based template library. For each of the resulting clusters created, a multiple structural alignment was generated based only on the 3D coordinates of all the residues involved. Then we used the structural alignments as anchors to increase the alignment accuracy between a query and its templates, and consequently improve the quality of predicted structure for query protein. We implemented the method on DAWNING 4000A cluster system. The preliminary results show that our domain-based template library and the structure-anchored alignment protocol can be used for the partial prediction for a majority of known protein sequences with better qualities.||||||||||||./pdfs/60-HiCOMB-paper-1.pdf",
    "Bio-Sequence Database Scanning on a GPU |Bio-Sequence Database Scanning on a GPU Weiguo Liu Bertil Schmidt Gerrit Voss Andre Schroder Wolfgang Muller-wittig Protein sequences with unknown functionality are often compared to a set of known sequences to detect functional similarities. Efficient dynamic programming algorithms exist for this problem, however current solutions still require significant scan times. These scan time requirements are likely to become even more severe due to the rapid growth in the size of these databases. In this paper, we present a new approach to bio-sequence database scanning using computer graphics hardware to gain high performance at low cost. To derive an efficient mapping onto this type of architecture, we have reformulated the Smith-Waterman dynamic programming algorithm in terms of computer graphics primitives. Our OpenGL implementation achieves a speedup of approximately sixteen on a high-end graphics card over available straightforward and optimized CPU Smith-Waterman implementations.||||||||||||./pdfs/66-HiCOMB-paper-1.pdf",
    "Design and Analysis of Matching Circuit Architectures for a Closest Match L|Design and Analysis of Matching Circuit Architectures for a Closest Match Lookup Kieran Mclaughlin Friederich Kupzog Holger Blume Sakir Sezer Tobias Noll John Mccanny This paper investigates the implementation of a number of circuits used to perform a high speed closest value match lookup. The design is targeted particularly for use in a search trie, as used in various networking lookup applications, but can be applied to many other areas where such a match is required. A range of different designs have been considered and implemented on FPGA. A detailed description of the architectures investigated is followed by an analysis of the synthesis results.||||||||||||./pdfs/71-RAW-paper-1.pdf",
    "RTOS Extensions for Dynamic Hardware / Software Monitoring and Configuratio|RTOS Extensions for Dynamic Hardware / Software Monitoring and Configuration Management. Yvan Eustache Jean-philippe Diguet Milad El Khodary We present our solution for a flexible and unified implementation of self-adaptive systems on reconfigurable architectures. This approach is based on a couple of local and global reconfiguration managers. In this paper we describe how the managers are implemented in the context of an usual RTOS and the new services we add for hardware and software monitoring, reconfiguration decision and reconfiguration control which also includes hardware and software interface modeling.||||||||||||./pdfs/72-RAW-paper-1.pdf",
    "Securing Embedded Programmable Gate Arrays in Secure Circuits |Securing Embedded Programmable Gate Arrays in Secure Circuits Nicolas Valette Lionel Torres Gilles Sassatelli Frederic Bancel The purpose of this article is to propose a survey of possible approaches for implementing embedded reconfigurable gate arrays into secure circuits. A standard secure interfacing architecture is proposed and motivations justifying such an approach are discussed. This paper also lists all features offered by FPGA vendors (Field Programmable Gate Array) aiming at securing those circuits according to different concerns. This article emphasizes on configuration memory programming which is probably the weakest point of using programmable devices on a secure context.||||||||||||./pdfs/74-RAW-paper-1.pdf",
    "FPGA Implementation of a License Plate Recognition SoC using Automatically |FPGA Implementation of a License Plate Recognition SoC using Automatically Generated Streaming Accelerators Nikolaos Bellas Sek Chai Malcolm Dwyer Dan Linzmeier Modern FPGA platforms provide the hardware and software infrastructure for building a bus-based System on Chip (SoC) that meet the applications requirements. The designer can customize the hardware by selecting from a large number of pre-defined peripherals and fixed IP functions and by providing new hardware, typically expressed using RTL. Hardware accelerators that provide application-specific extensions to the computational capabilities of a system is an efficient mechanism to enhance the performance and reduce the power dissipation. What is missing is an integrated approach to identify the computationally critical parts of the application and to create accelerators starting from a high level representation with a minimal design effort. In this paper, we present an automation methodology and a tool that generates accelerators. We apply the methodology on an FPGA-based license plate recognition (LPR) system used in law enforcement. The accelerators process streaming data and support a programming model which can naturally express a large number of embedded applications resulting in efficient hardware implementations. We show that we can achieve an overall LPR application speed up from 1.2x to 2.6x, thus enabling real-time functionality under realistic road scenes.||||||||||||./pdfs/78-RAW-paper-1.pdf",
    "Platform-based FPGA Architecture: Designing High-Performance and Low-Power |Platform-based FPGA Architecture: Designing High-Performance and Low-Power Routing Structure for Realizing DSP Applications Konstantinos Siozios Konstantinos Tatas Dimitrios Soudris Antonios Thanailakis The novel design of an efficient FPGA interconnection architecture with multiple Switch Boxes (SB) and hardwired connections for realizing data intensive applications (i.e. DSP applications), is introduced. For that purpose, after exhaustive exploration, we modify the routing architecture through efficient selection of the appropriate switch box with hardwired connections, taking into account the statistical and spatial routing restrictions of DSP applications mapped onto FPGA. More specifically, we propose a new technique for selecting the appropriate combination of switch boxes, depending on the localized performance and power consumption requirements of each specific region of FPGA architecture. In order to perform the mapping, we developed a novel algorithm, which takes into account the modified architectural routing features. This algorithm was implemented within a new tool called EX-VPR. Using a number of DSP applications, extensive comparison study of various combinations of switch boxes in terms of total power consumption, performance, Power�Delay product prove the effectiveness of the proposed approach.||||||||||||./pdfs/79-RAW-paper-1.pdf",
    "Accelerating CABAC Encoding for Multi-standard Media with Configurability |Accelerating CABAC Encoding for Multi-standard Media with Configurability Oskar Flordal Di Wu Dake Liu This paper presents the study of how to accelerate CABAC encoding for emerging heterogeneous multimedia applications. The latest image and video compression standards such as JPEG2000 and H.264 both have adopted Context Adaptive Binary Arithmetic Coding to achieve performance enhancement. However, CABAC requires high computing power. After investigating computational complexity of CABAC coding, firstly, instruction level acceleration is elaborated. Secondly, a configurable accelerator for CABAC encoding in multiple standards is proposed. Benchmarking performance and implementation cost is also addressed.||||||||||||./pdfs/8-RAW-paper-1.pdf",
    "Design Space Exploration for Low-Power Reconfigurable Fabrics |Design Space Exploration for Low-Power Reconfigurable Fabrics Gayatri Mehta Raymond R. Hoare Justin Stander Alex K. Jones This paper presents a parameterizable, coarse-grained, reconfigurable fabric model that attempts to maintain Field Programmable Gate Array (FPGA)-like programmability and Computer Aided Design (CAD), with Application Specific Integrated Circuit (ASIC)-like power characteristics for Digital Signal Processing (DSP) style applications. Using this model, architectural design space decisions are explored in order to define an energy-efficient fabric. The impact on energy and performance due to the variation of different parameters such as datawidth and interconnection flexibility has been studied. The multiplexer cardinality usage has also been studied by mapping some of the signal and image processing applications onto the fabric. The results point to the use of power optimized 32-bit width computational elements interconnected by low cardinality multiplexers like 4:1 multiplexers.||||||||||||./pdfs/80-RAW-paper-1.pdf",
    "Exploiting Dynamic Reconfiguration Techniques: The 2D-VLIW Approach |Exploiting Dynamic Reconfiguration Techniques: The 2D-VLIW Approach Ricardo Santos Rodolfo Azevedo Guido Araujo Fast reconfiguration is a mandatory feature for reconfigurable computing architectures. Research in this area has been increasingly focusing on new reconfiguration techniques that can sustain the target performance goal. For reconfigurable pipelined architectures, the challenge is to allow the simultaneous execution, at the same stage, of configuration and computation tasks. In this context, this paper presents a new dynamic reconfiguration technique, based on a configuration cache, that tackles this challenge by configuring and executing operations on functional units during the execution stage. This approach is implemented in a pipelined reconfigurable multiple-issue architecture called 2D-VLIW. Our dynamic reconfiguration technique takes advantage of the 2D-VLIW pipelined execution by starting reconfiguration concurrently to activities like reading operand registers and executing operations.||||||||||||./pdfs/81-RAW-paper-1.pdf",
    "Applying Single Processor Algorithms to Schedule Tasks on Reconfigurable De|Applying Single Processor Algorithms to Schedule Tasks on Reconfigurable Devices Respecting Reconfiguration Times Florian Dittmann Marcelo G&ouml;tz In the single machine environment, several scheduling algorithms exist that allow to quantify schedules with respect to feasibility, optimality, etc. In contrast, reconfigurable devices execute tasks in parallel, which intentionally collides with the single machine principle and seems to require new methods and evaluation strategies for scheduling. However, the reconfiguration phases of adaptable architectures usually take place sequentially. Run-time adaptation is realized using an exclusive port, which again is occupied for some reasonable time during reconfiguration. We have to handle the duration and the sequential exclusiveness of reconfiguration phases. Here, we can find an analogy to the single machine environment, as both scenarios must derive a sequential schedule for an exclusive resource. Thus, we investigate the appliance of single processor scheduling algorithms to task reconfiguration on reconfigurable systems in this paper. We determine necessary adaptations and propose methods to evaluate the scheduling algorithms.||||||||||||./pdfs/82-RAW-paper-1.pdf",
    "A Cost-Effective Context Memory Structure for Dynamically Reconfigurable Pr|A Cost-Effective Context Memory Structure for Dynamically Reconfigurable Processors Masayasu Suzuki Yohei Hasegawa Vu Manh Tuan Shohei Abe Hideharu Amano Multicontext reconfigurable processors can switch its configuration in a single clock cycle by providing a context memory in each of the processing elements. Although these processors have proven to be powerful in many applications, the number of contexts is often not enough. The context translation table which translates the global instruction pointer, or the global logical context number, into a local physical context number is proposed to realize a larger application while reducing the actual context memories. Our evaluation using NEC Electronics' DRP-1 shows that the proposed method is effective when the size of the tile is small and the number of context is large. In the most efficient case, the required number of contexts is reduced to 25\\%, and the total amount of configuration data becomes 6.9\\%. The template configuration method which extends this idea harnesses the power of multicontext devices by storing basic contexts as \\textit templates and combining them to form the actual contexts. While effective in theory, our evaluation shows that the return in adopting such mechanisms in more finer processors as the DRP-1 is minimal where the size of the context memory adds up relative to the number of processing units.||||||||||||./pdfs/84-RAW-paper-1.pdf",
    "Performance of FPGA Implementation of Bit-split Architecture for Intrusion |Performance of FPGA Implementation of Bit-split Architecture for Intrusion Detection Systems Hong-jip Jung Zachary K. Baker Viktor K. Prasanna The use of reconfigurable hardware for network security applications has recently made great strides forward as Field-Programmable Gate Array (FPGA) devices have provided larger and faster resources. The performance of an Intrusion Detection System is dependent on two metrics: throughput and the total number of patterns that can fit on a device. In this paper, we consider the FPGA implementation details of the bit-split string-matching architecture. The bit-split algorithm allows large hardware state machines to be converted into a form with much higher memory efficiency. We have extended the architecture to satisfy the requirements of the IDS state-of-the-art. We show that the architecture can be effectively optimized for FPGA implementation by making some changes to the parameters governing the pattern loading within the modules as well new interface hardware for communicating with an external controller. The overall performance (bandwidth * number of patterns) is competitive against other memory-based FPGA string matching architectures.||||||||||||./pdfs/85-RAW-paper-1.pdf",
    "Exploiting dynamic reconfiguration of platform FPGAs: Implementation issues|Exploiting dynamic reconfiguration of platform FPGAs: Implementation issues Miguel L. Silva Jo&atilde;o Canas Ferreira The effective use of dynamic reconfiguration requires the designer to address many implementation issues. The market introduction of feature-full platform FPGAs equipped with embedded CPU blocks expands the number of situations where dynamic reconfiguration may be applied to improve overall performance and logic utilization. The paper compares the design of two similar systems supporting dynamic reconfiguration and the issues that were addressed in their implementation. The first system supports 32-bit data transfers between CPU and the dynamically reconfigurable circuits. The other implementation supports 64-bit transfers, but its effective use is more complicated and several restrictions must be taken into account. The work includes a performance comparison of the two designs on several simple tasks, including pattern matching, image processing and hashing.||||||||||||./pdfs/86-RAW-paper-1.pdf",
    "Selection of Instruction Set Extensions for an FPGA Embedded Processor Core|Selection of Instruction Set Extensions for an FPGA Embedded Processor Core Brian F. Veale John K. Antonio Monte P. Tull Sean A. Jones A design process is presented for the selection of a set of instruction set extensions for the PowerPC 405 processor that is embedded into the Xilinx Virtex Family of FPGAs. The instruction set of the PowerPC 405 is extended by selecting additional instructions from the full 32-bit PowerPC instruction set architecture (ISA), of which the PowerPC 405 ISA is a subset. The selected instructions are supported in hardware using the reconfigurable resources of the FPGA. The proposed design process gathers execution statistics for a target application through profiling or simulation. These statistics are then used to estimate the speedup that would be achieved if selected instructions from the full PowerPC ISA are added to the ISA of the PowerPC 405 processor. An experimental study of two embedded benchmarks show significant speedup when this approach is used to extend the PowerPC 405 processor to support various floating-point operations through the use of floating-point cores developed by QinetiQ.||||||||||||./pdfs/90-RAW-paper-1.pdf",
    "Dynamically Reconfigurable Cache Architecture Using Adaptive Block Allocati|Dynamically Reconfigurable Cache Architecture Using Adaptive Block Allocation Policy Milene Barbosa Carvalho Lu&iacute;s Fabrcio Wanderley G&oacute;es Carlos Augusto Paiva Da Silva Martins In this paper, we present a dynamically reconfigurable cache architecture using adaptive block allocation policy analyzed by means of simulation. Our main objectives are: to propose a reconfigurable cache architecture and to propose, implement and analyze the performance of an adaptive cache block allocation policy. First, we present a proposal of the reconfigurable cache architecture that can adapt according to the workload. Then we present our adaptive policy and do performance tests comparing our cache architecture with some set associative configurations. In these tests, we use some traces from BYU Trace Distribution Center of SPEC 2000 Benchmark. Finally, we analyze the results based on metrics like cache miss ratio, response time, etc. Our main contributions are: the proposal of a dynamically reconfigurable cache architecture; proposal, development and implementation of an adaptive cache block allocation policy.||||||||||||./pdfs/93-RAW-paper-1.pdf",
    "Dynamic Configuration Steering for a Reconfigurable Superscalar Processor |Dynamic Configuration Steering for a Reconfigurable Superscalar Processor Nick A. Mould Brian F. Veale Monte P. Tull John K.antonio A new dynamic vector approach for the selection and management of the configuration of a reconfigurable superscalar processor is proposed. This new method improves on previous work that used steering vectors to guide the selection of functional units to be loaded into the processor. Dependencies among instructions in the instruction buffer are analyzed to enable a new scoring method. The dynamic vector technique is shown to reduce the amount of reconfiguration required while preserving execution resources. Simulation results reveal that, given enough configurable space, the configuration of the processor approaches a stable state.||||||||||||./pdfs/94-RAW-paper-1.pdf",
    "RAW Keynote 2: New Horizons of Very High Performance Computing (VHPC): Hurd|RAW Keynote 2: New Horizons of Very High Performance Computing (VHPC): Hurdles and Chances Reiner Hartenstein Reconfigurable Computing (RC) delivers the success story of the century. First launched by the hardware / software co-design scene by adopting FPGAs for embedded system design, now a huge second wave has reached a wide variety of scientific computing communities. Google's yawdropping hit rates illustrate the pervasiveness of Reconfigurable Computing, now also being adopted by supercomputing (Cray, sgi, etc.). From FPGA usage as accelerators, speed-up factors by up to four orders of magnitude and more are reported, as well as floor space requirements and electricity invoice amounts reduced by one order of magnitude and more. This is astonishing, since FPGAs and rDPAs have a substantially lower clock speed than microprocessors and an effective integration density being lower by four orders of magnitude: the Reconfigurable Computing Paradox. Algorithmic cleverness is the secret of success, based on software to configware migration mechanisms, striving away from memory-cycle-hungry instruction stream-based computing paradigms. Even higher speedup is achievable by using coarse-grained reconfigurable datapath arrays (rDPAs) available from a number of start-ups. With automatically partitioning configware / software cocompilers the desktop personal supercomputer is near. The main benefit of RC, having replaced the use of hardwired accelerators, is their flexibility by non-procedural programmability. This also contributes to more recent developments in system architecture, which rely on processes of evolution, self-organization, adaptation and fault tolerance. The main hurdles on the way to heart-stopping new horizons of cheap highest performance are CS-related educational deficits causing the configware / software chasm and a methodology fragmentation between the different cultures of application domains. Since the von Neumann paradigm is loosing its dominance by emerging reconfigurable main processors using hardwired von Neumann coprocessors as auxiliary clerks, it is time for a curricular upgrade. Current CS curricula do not sufficiently meet their transdisciplinary responsibility. The talk gives a survey on fundamental issues in RC and on new directions in CS-related curricula, focused on a dual paradigm organic computing approach.||||||||||||./pdfs/998-RAW-paper-1.pdf",
    "RAW Keynote 1: The Outer Limits: Reconfigurable Computing in Space and In |RAW Keynote 1: The Outer Limits: Reconfigurable Computing in Space and In Orbit Maya Gokhale Programmable hardware offers unique opportunities for flexible control and processing on board spacecrafts and satellites. Space missions dictate stringent requirements on size, weight, power, versatility and performance of the on-board data acquisition and computing resources. Reconfigurable FPGA-based devices, with the programmability of software and and speed/size approaching application-specific integrated circuits, make it possible to control and communicate with sensors, as well as process scientific data right on the spacecraft, sending only relevant information back home over the low bandwidth communications link. Challenges to computing in harsh space environments abound: vibration, thermal cycling, heat dissipation, and radiation all take their toll on space electronics. In spite of these barriers, notable experiments in reconfigurable computing for space applications are being undertaken. These include NASA's reconfigurable scalable computing project, intended for planetary rovers, cameras, and other sensors; the Queensland University FedSat and its successors, using FPGAs for near real-time image processing, communications, and navigation; and the Cibola Flight Experiment, a Los Alamos National Laboratory experiment in on-orbit signal processing using radiation-tolerant FPGAs. This talk will discuss the perils and possibilities of reconfigurable computing at the outer limits.||||||||||||./pdfs/999-RAW-paper-1.pdf"
);