IEEE IPDPS 2011
TechTalks from event: IEEE IPDPS 2011
Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without requiring log-in. For other talks, you will need to log-in using the email you registered for IPDPS 2011. Note 2: Many of the talks (those without a thumbnail next to the their description below) are yet to be uploaded. Some of them were not recorded because of technical problems. We are working with the corresponding authors to upload the self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.
SESSION: Best Papers
Online Adaptive Code Generation and TuningIn this paper, we present a runtime compilation and tuning framework for parallel programs. We extend our prior work on our auto-tuner, Active Harmony, for tunable parameters that require code generation (for example, different unroll factors). For such parameters, our auto-tuner generates and compiles new code on-the-?y. Effectively, we merge traditional feedback directed optimization and just-in-time compilation. We show that our system can leverage available parallelism in todayâ€™s HPC platforms by evaluating different code-variants on different nodes simultaneously. We evaluate our system on two parallel applications and show that our system can improve runtime execution by up to 46% compared to the original version of the program.
GLocks: Efficient Support for Highly-Contended Locks in Many-Core CMPsSynchronization is of paramount importance to exploit thread-level parallelism on many-core CMPs. In these architectures, synchronization mechanisms usually rely on shared variables to coordinate multithreaded access to shared data structures thus avoiding data dependency con?icts. Lock synchronization is known to be a key limitation to performance and scalability. On the one hand, lock acquisition through busy waiting on shared variables generates additional coherence activity which interferes with applications. On the other hand, lock contention causes serialization which results in performance degradation. This paper proposes and evaluates GLocks, a hardware-supported implementation for highly-contended locks in the context of many-core CMPs. GLocks use a token-based message-passing protocol over a dedicated network built on state-of-the-art technology. This approach skips the memory hierarchy to provide a non-intrusive, extremely ef?cient and fair lock implementation with negligible impact on energy consumption or die area. A comprehensive comparison against the most ef?cient shared-memory-based lock implementation for a set of microbenchmarks and real applications quanti?es the goodness of GLocks. Performance results show an average reduction of 42% and 14% in execution time, an average reduction of 76% and 23% in network traf?c, and also an average reduction of 78% and 28% in energy-delay^2 product (ED^2P) metric for the full CMP for the microbenchmarks and the real applications, respectively. In light of our performance results, we can conclude that GLocks satisfy our initial working hypothesis. GLocks minimize cache-coherence network traf?c due to lock synchronization which translates into reduced power consumption and execution time.
Profiling Heterogeneous Multi-GPU Systems to Accelerate Cortically Inspired Learning AlgorithmsRecent advances in neuroscienti?c understanding make parallel computing devices modeled after the human neocortex a plausible, attractive, fault-tolerant, and energy-ef?cient possibility. Such attributes have once again sparked an interest in creating learning algorithms that aspire to reverse-engineer many of the abilities of the brain. In this paper we describe a GPGPU-accelerated extension to an intelligent learning model inspired by the structural and functional properties of the mammalian neocortex. Our cortical network, like the brain, exhibits massive amounts of processing parallelism, making todayâ€™s GPGPUs a highly attractive and readily-available hardware accelerator for such a model. Furthermore, we consider two inef?ciencies inherent to our initial design: multiple kernel-launch overhead and poor utilization of GPGPU resources. We propose optimizations such as a software work-queue structure and pipelining the hierarchical layers of the cortical network to mitigate such problems. Our analysis provides important insight into the GPU architecture details including the number of cores, the memory system, and the global thread scheduler. Additionally, we create a runtime pro?ling tool for our parallel learning algorithm which proportionally distributes the cortical network across the host CPU as well as multiple GPUs, whether homogeneous or heterogeneous, that may be available to the system. Using the pro?ling tool with these optimizations on Nvidiaâ€™s CUDA framework, we achieve up to 60x speedup over a singlethreaded CPU implementation of the model.
PHAST: Hardware-Accelerated Shortest Path TreesWe present a novel algorithm to solve the nonnegative single-source shortest path problem on road networks and other graphs with low highway dimension. After a quick preprocessing phase, we can compute all distances from a given source in the graph with essentially a linear sweep over all vertices. Because this sweep is independent of the source, we are able to reorder vertices in advance to exploit locality. Moreover, our algorithm takes advantage of features of modern CPU architectures, such as SSE and multi-core. Compared to Dijkstraâ€™s algorithm, our method needs fewer operations, has better locality, and is better able to exploit parallelism at multi-core and instruction levels. We gain additional speedup when implementing our algorithm on a GPU, where our algorithm is up to three orders of magnitude faster than Dijkstraâ€™s algorithm on a highend CPU. This makes applications based on all-pairs shortest-paths practical for continental-sized road networks. Several algorithms, such as computing the graph diameter, exact arc ?ags, or centrality measures (exact reaches or betweenness), can be greatly accelerated by our method.