TechTalks from event: IEEE IPDPS 2011

Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without logging in. For other talks, you will need to log in using the email you registered with for IPDPS 2011. Note 2: Many of the talks (those without a thumbnail next to their description below) have yet to be uploaded. Some of them were not recorded because of technical problems. We are working with the corresponding authors to upload the self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.

SESSION: Best Papers

  • Online Adaptive Code Generation and Tuning Authors: Ananta N Tiwari (University of Maryland at College Park, USA); Jeffrey K. Hollingsworth (University of Maryland, USA)
    In this paper, we present a runtime compilation and tuning framework for parallel programs. We extend our prior work on our auto-tuner, Active Harmony, for tunable parameters that require code generation (for example, different unroll factors). For such parameters, our auto-tuner generates and compiles new code on-the-fly. Effectively, we merge traditional feedback directed optimization and just-in-time compilation. We show that our system can leverage available parallelism in today’s HPC platforms by evaluating different code-variants on different nodes simultaneously. We evaluate our system on two parallel applications and show that our system can improve runtime execution by up to 46% compared to the original version of the program.
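    The core search loop of such a tuner can be sketched in a few lines. This is an illustrative model, not Active Harmony's API: `make_variant` stands in for on-the-fly code generation of a new variant, and candidates are timed sequentially here rather than on separate nodes in parallel.

    ```python
    import time

    def make_variant(unroll):
        # Illustrative "code generation": return a dot-product kernel
        # specialized for a given unroll factor (a stand-in for
        # compiling a new code variant on the fly).
        def kernel(xs, ys):
            n = len(xs)
            total = 0.0
            i = 0
            while i + unroll <= n:
                for j in range(unroll):    # body a compiler would unroll
                    total += xs[i + j] * ys[i + j]
                i += unroll
            while i < n:                   # remainder loop
                total += xs[i] * ys[i]
                i += 1
            return total
        return kernel

    def tune(candidates, xs, ys):
        # Time each generated variant and keep the fastest; the real
        # system evaluates different variants on different nodes at once.
        best, best_time = None, float("inf")
        for unroll in candidates:
            kernel = make_variant(unroll)
            start = time.perf_counter()
            kernel(xs, ys)
            elapsed = time.perf_counter() - start
            if elapsed < best_time:
                best, best_time = unroll, elapsed
        return best

    data = [float(i) for i in range(10_000)]
    best_unroll = tune([1, 2, 4, 8], data, data)
    ```

    All variants compute the same result; only their loop structure (and hence their running time) differs, which is what makes the parameter safe to tune online.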
  • GLocks: Efficient Support for Highly-Contended Locks in Many-Core CMPs Authors: José Luis Abellán (University of Murcia, Spain); Juan Fernández (University of Murcia, Spain); Manuel E. Acacio (Universidad de Murcia, Spain)
    Synchronization is of paramount importance to exploit thread-level parallelism on many-core CMPs. In these architectures, synchronization mechanisms usually rely on shared variables to coordinate multithreaded access to shared data structures, thus avoiding data dependency conflicts. Lock synchronization is known to be a key limitation to performance and scalability. On the one hand, lock acquisition through busy waiting on shared variables generates additional coherence activity which interferes with applications. On the other hand, lock contention causes serialization which results in performance degradation. This paper proposes and evaluates GLocks, a hardware-supported implementation for highly-contended locks in the context of many-core CMPs. GLocks use a token-based message-passing protocol over a dedicated network built on state-of-the-art technology. This approach skips the memory hierarchy to provide a non-intrusive, extremely efficient and fair lock implementation with negligible impact on energy consumption or die area. A comprehensive comparison against the most efficient shared-memory-based lock implementation for a set of microbenchmarks and real applications quantifies the benefits of GLocks. Performance results show an average reduction of 42% and 14% in execution time, an average reduction of 76% and 23% in network traffic, and also an average reduction of 78% and 28% in the energy-delay^2 product (ED^2P) metric for the full CMP for the microbenchmarks and the real applications, respectively. In light of our performance results, we can conclude that GLocks satisfy our initial working hypothesis. GLocks minimize cache-coherence network traffic due to lock synchronization, which translates into reduced power consumption and execution time.
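    GLocks themselves are a hardware mechanism, but the token-based protocol behind them can be modeled in software. In this hypothetical sketch, a `Queue` stands in for the dedicated on-chip network: exactly one token circulates, possession of it grants the critical section, and the roughly FIFO ordering of blocked receivers models the protocol's fairness.

    ```python
    import threading
    import queue

    class TokenLock:
        # Software analogy of a token-based lock: one token travels
        # through a channel (the Queue stands in for the dedicated
        # network); whoever holds the token owns the critical section,
        # so no busy waiting on a shared memory variable is needed.
        def __init__(self):
            self._channel = queue.Queue()
            self._channel.put("token")     # exactly one token exists

        def acquire(self):
            self._channel.get()            # block until the token arrives

        def release(self):
            self._channel.put("token")     # pass the token onward

    counter = 0

    def worker(lock, iters):
        global counter
        for _ in range(iters):
            lock.acquire()
            counter += 1                   # highly-contended critical section
            lock.release()

    lock = TokenLock()
    threads = [threading.Thread(target=worker, args=(lock, 1000)) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    ```

    The point of the hardware version is that token hand-off happens over a dedicated network rather than through the cache-coherence protocol, eliminating the coherence traffic this software analogy still incurs.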
  • Profiling Heterogeneous Multi-GPU Systems to Accelerate Cortically Inspired Learning Algorithms Authors: Andrew Nere (University of Wisconsin - Madison, USA); Atif Hashmi (University of Wisconsin - Madison, USA); Mikko Lipasti (University of Wisconsin - Madison, USA)
    Recent advances in neuroscientific understanding make parallel computing devices modeled after the human neocortex a plausible, attractive, fault-tolerant, and energy-efficient possibility. Such attributes have once again sparked an interest in creating learning algorithms that aspire to reverse-engineer many of the abilities of the brain. In this paper we describe a GPGPU-accelerated extension to an intelligent learning model inspired by the structural and functional properties of the mammalian neocortex. Our cortical network, like the brain, exhibits massive amounts of processing parallelism, making today’s GPGPUs a highly attractive and readily-available hardware accelerator for such a model. Furthermore, we consider two inefficiencies inherent to our initial design: multiple kernel-launch overhead and poor utilization of GPGPU resources. We propose optimizations such as a software work-queue structure and pipelining the hierarchical layers of the cortical network to mitigate such problems. Our analysis provides important insight into the GPU architecture details including the number of cores, the memory system, and the global thread scheduler. Additionally, we create a runtime profiling tool for our parallel learning algorithm which proportionally distributes the cortical network across the host CPU as well as multiple GPUs, whether homogeneous or heterogeneous, that may be available to the system. Using the profiling tool with these optimizations on Nvidia’s CUDA framework, we achieve up to 60x speedup over a single-threaded CPU implementation of the model.
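    The software work-queue optimization (replacing many small per-unit kernel launches with persistent workers draining one shared queue, amortizing per-launch overhead) can be illustrated host-side. The names and structure below are illustrative, not taken from the paper's CUDA implementation.

    ```python
    import queue
    import threading

    def process_column(column):
        # Stand-in for one cortical column's computation, which would
        # otherwise require its own "kernel launch".
        return sum(column)

    def run_with_work_queue(columns, num_workers=4):
        # All units of work go into a shared queue drained by a fixed
        # pool of persistent workers, so per-call overhead is paid once
        # per worker instead of once per unit of work.
        work = queue.Queue()
        results = [None] * len(columns)
        for idx, col in enumerate(columns):
            work.put((idx, col))

        def worker():
            while True:
                try:
                    idx, col = work.get_nowait()
                except queue.Empty:
                    return                 # queue drained; worker exits
                results[idx] = process_column(col)

        workers = [threading.Thread(target=worker) for _ in range(num_workers)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        return results

    columns = [[i, i + 1, i + 2] for i in range(100)]
    out = run_with_work_queue(columns)
    ```

    The same idea on a GPU keeps thread blocks resident and feeds them work items, avoiding repeated launch latency and improving occupancy when individual units are small.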
  • PHAST: Hardware-Accelerated Shortest Path Trees Authors: Daniel Delling (Microsoft Research Silicon Valley, USA); Andrew Goldberg (Microsoft Research Silicon Valley, USA); Andreas Nowatzyk (Microsoft Research Silicon Valley, USA)
    We present a novel algorithm to solve the nonnegative single-source shortest path problem on road networks and other graphs with low highway dimension. After a quick preprocessing phase, we can compute all distances from a given source in the graph with essentially a linear sweep over all vertices. Because this sweep is independent of the source, we are able to reorder vertices in advance to exploit locality. Moreover, our algorithm takes advantage of features of modern CPU architectures, such as SSE and multi-core. Compared to Dijkstra’s algorithm, our method needs fewer operations, has better locality, and is better able to exploit parallelism at multi-core and instruction levels. We gain additional speedup when implementing our algorithm on a GPU, where our algorithm is up to three orders of magnitude faster than Dijkstra’s algorithm on a high-end CPU. This makes applications based on all-pairs shortest-paths practical for continental-sized road networks. Several algorithms, such as computing the graph diameter, exact arc flags, or centrality measures (exact reaches or betweenness), can be greatly accelerated by our method.
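    PHAST's actual preprocessing is based on contraction hierarchies; as a simplified illustration of the "source-independent linear sweep" idea only, the sketch below precomputes a topological order of a DAG once, then answers any single-source query with one pass that relaxes each vertex's incoming arcs in that fixed order.

    ```python
    def topological_order(n, arcs):
        # One-time, source-independent preprocessing: a processing order
        # in which every arc goes from an earlier to a later vertex.
        indeg = [0] * n
        succs = [[] for _ in range(n)]
        for u, v, _ in arcs:
            indeg[v] += 1
            succs[u].append(v)
        order = [u for u in range(n) if indeg[u] == 0]
        for u in order:
            for v in succs[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    order.append(v)
        return order

    def sweep_sssp(n, arcs, order, source):
        # Per-query work: one linear pass over vertices in the fixed
        # order, relaxing each vertex's incoming arcs. The sequential,
        # predictable access pattern is what gives the locality and
        # SIMD/multi-core friendliness the abstract describes.
        preds = [[] for _ in range(n)]
        for u, v, w in arcs:
            preds[v].append((u, w))
        dist = [float("inf")] * n
        dist[source] = 0
        for v in order:
            for u, w in preds[v]:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
        return dist

    arcs = [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 5), (2, 3, 8)]
    order = topological_order(4, arcs)
    dist = sweep_sssp(4, arcs, order, source=0)
    ```

    In PHAST the role of this DAG is played by the downward graph of a contraction hierarchy, so the same sweep works on general road networks, not just acyclic graphs.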