TechTalks from event: IEEE IPDPS 2011

Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without logging in. For other talks, you will need to log in using the email address you used to register for IPDPS 2011. Note 2: Many of the talks (those without a thumbnail next to their description below) have yet to be uploaded. Some of them were not recorded because of technical problems. We are working with the corresponding authors to upload self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.

IPDPS 2011 Keynotes and Panels

  • Keynote: Algorithm Engineering for Scalable Parallel External Sorting Authors: Peter Sanders, Karlsruhe Institute of Technology
    The talk describes algorithm engineering (AE) as a methodology for algorithmic research where design, analysis, implementation, and experimental evaluation of algorithms form a feedback cycle driving the development of efficient algorithms. Additional important components of the methodology include realistic models, algorithm libraries, and collections of realistic benchmark instances. We use one main example throughout the talk: sorting huge data sets using many multi-core processors and disks. The described system broke records for the GraySort and MinuteSort sorting benchmarks and helped with the record for the JouleSort benchmark. (A toy sketch of the run-formation and merge steps that external sorting builds on appears after this list.)
  • 25th Year Panel: LOOKING BACK Authors: Moderator: Yves Robert, Ecole Normale Supérieure de Lyon, France
    The 25th year of IPDPS gives us the opportunity to look back and attempt to assess what has gone wrong, what has gone well, and what came as a surprise in the field of parallel and distributed processing. The panel members will give a few examples of striking events that took place in their area (covering Algorithms / Applications / Architectures / Software). They will also give a short statement on how they would summarize the evolution of the field as a whole over the last 25 years. Panelists: William (Bill) Dally, Stanford & NVIDIA; Jack Dongarra, University of Tennessee & Oak Ridge National Laboratory; Satoshi Matsuoka, Tokyo Institute of Technology, Japan; Rob Schreiber, HP Labs, Palo Alto; Horst Simon, NERSC, Lawrence Berkeley National Laboratory; Uzi Vishkin, University of Maryland.
  • Keynote: Architecture-aware Algorithms and Software for Peta and Exascale Computing Authors: Jack Dongarra, University of Tennessee & Oak Ridge National Laboratory
    In this talk we examine how high performance computing has changed over the last ten years and look toward the future in terms of trends. These changes have had and will continue to have a major impact on our software. Some of the software and algorithm challenges have already been encountered, such as management of communication and memory hierarchies through a combination of compile-time and run-time techniques, but the increased scale of computation, depth of memory hierarchies, range of latencies, and increased run-time environment variability will make these problems much harder. We will look at five areas of research that will have an important impact on the development of software and algorithms. We will focus on the following themes: redesign of software to fit multicore and hybrid architectures, automatically tuned application software, exploiting mixed precision for performance, the importance of fault tolerance, and communication-avoiding algorithms. (A small mixed-precision iterative-refinement sketch appears after this list.)
  • 25th Year Panel: WHAT'S AHEAD Authors: Moderator: Per Stenström, Chalmers University of Technology, Sweden
    Parallel computing has become ubiquitous, spanning challenging computational problems in science, business-driven computing, and mobile computing. The scope has widened dramatically over the last decade. This panel will debate and speculate on how the parallel computing landscape is expected to change in the years to come. Areas of focus will include: Computing platforms: How will we be able to maintain the performance growth of the past, and what will be the major challenges in the next 10 years and beyond? What technical barriers are anticipated, and what disruptive technologies are around the corner? Software: How will software infrastructures evolve to meet performance requirements in the next 10 years and beyond? How will we ever be able to hide the obstacles of parallelism from the masses, and what is the road forward? Algorithms: What will be the major computational problems to tackle in the next 10 years and beyond? What are the most challenging algorithmic problems to solve? Applications: What will be the next wave of grand challenge problems to focus on in the next 10 years and beyond? What will be the major performance-driving applications in the general and mobile computing domains? Panelists: Doug Burger, Microsoft Research; Wen-mei Hwu, University of Illinois, Urbana-Champaign; Vipin Kumar, University of Minnesota; Kunle Olukotun, Stanford University; David Padua, University of Illinois, Urbana-Champaign; Burton Smith, Microsoft.
  • Keynote: Power, Programmability, and Granularity: The Challenges of ExaScale Computing Authors: William (Bill) Dally, Stanford/NVIDIA
    Reaching an ExaScale computer by the end of the decade, and enabling the continued performance scaling of smaller systems, requires significant research breakthroughs in three key areas: power efficiency, programmability, and execution granularity. To build an ExaScale machine within a power budget of 20MW requires a 200-fold improvement in energy per instruction: from 2nJ to 10pJ. Only 4x is expected from improved technology; the remaining 50x must come from improvements in architecture and circuits. To program a machine of this scale requires more productive parallel programming environments that make parallel programming as easy as sequential programming is today. Finally, problem size and memory size constraints prevent the continued use of weak scaling, requiring these machines to extract parallelism at very fine granularity, down to the level of a few instructions. This talk will discuss these challenges and current approaches to address them. (A back-of-the-envelope check of the energy figures appears after this list.)
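
As a companion to the Sanders keynote, below is a minimal, in-memory sketch of the run-formation and k-way merge steps that external sorting builds on. The record-setting sorter itself spreads this work over many multi-core nodes and disks; none of that machinery is reproduced here, and the function name and data are made up for illustration.

    import heapq

    # Toy external-sort skeleton: split the input into sorted "runs" (which a real
    # external sorter would write to disk), then combine all runs in one k-way merge.
    def external_sort_sketch(items, run_size):
        runs = [sorted(items[i:i + run_size]) for i in range(0, len(items), run_size)]
        return list(heapq.merge(*runs))

    print(external_sort_sketch([9, 3, 7, 8, 1, 5, 2, 6, 4], run_size=3))
    # [1, 2, 3, 4, 5, 6, 7, 8, 9]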
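
One theme listed in the Dongarra keynote, exploiting mixed precision for performance, can be illustrated with iterative refinement: solve cheaply in single precision, then correct the answer using double-precision residuals. The sketch below is only a minimal illustration with a made-up test matrix; for brevity it re-solves with the single-precision matrix each iteration rather than reusing a factorization as a real implementation would.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
    b = rng.standard_normal(n)

    A32 = A.astype(np.float32)
    # fast low-precision solve, promoted back to double precision
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)

    for _ in range(5):                                 # refine in double precision
        r = b - A @ x                                  # residual computed in float64
        x += np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)

    print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))   # relative residual after refinement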
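
The energy figures in the Dally keynote can be sanity-checked with simple arithmetic. Assuming (hypothetically) a sustained rate of 10^18 operations per second against the quoted 20 MW budget, the budget works out to roughly 20 pJ per operation; the reading that the talk's 10 pJ-per-instruction target leaves headroom for memory and communication is an interpretation, not a claim from the abstract.

    # Back-of-the-envelope check of the ExaScale energy budget quoted in the talk.
    ops_per_second = 1e18          # assumed sustained rate for an ExaScale machine
    power_budget_watts = 20e6      # 20 MW budget from the abstract

    joules_per_op = power_budget_watts / ops_per_second
    print(joules_per_op * 1e12)    # ~20 pJ per operation at the full budget

    # The abstract targets ~10 pJ per instruction versus ~2 nJ today, a ~200x gap...
    print(2e-9 / 10e-12)           # ~200
    # ...of which only ~4x is expected from technology, so ~50x must come from
    # architecture and circuits:
    print(200 / 4)                 # 50.0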

SESSION: Best Papers

  • Online Adaptive Code Generation and Tuning Authors: Ananta N Tiwari (University of Maryland at College Park, USA); Jeffrey K. Hollingsworth (University of Maryland, USA)
    In this paper, we present a runtime compilation and tuning framework for parallel programs. We extend our prior work on our auto-tuner, Active Harmony, for tunable parameters that require code generation (for example, different unroll factors). For such parameters, our auto-tuner generates and compiles new code on the fly. Effectively, we merge traditional feedback-directed optimization and just-in-time compilation. We show that our system can leverage available parallelism in today's HPC platforms by evaluating different code variants on different nodes simultaneously. We evaluate our system on two parallel applications and show that it can improve runtime execution by up to 46% compared to the original version of the program. (A toy version of the tuning loop appears after this session's listing.)
  • GLocks: Efficient Support for Highly-Contended Locks in Many-Core CMPs Authors: José Luis Abellán (University of Murcia, Spain); Juan Fernández (University of Murcia, Spain); Manuel E. Acacio (Universidad de Murcia, Spain)
    Synchronization is of paramount importance to exploit thread-level parallelism on many-core CMPs. In these architectures, synchronization mechanisms usually rely on shared variables to coordinate multithreaded access to shared data structures, thus avoiding data dependency conflicts. Lock synchronization is known to be a key limitation to performance and scalability. On the one hand, lock acquisition through busy waiting on shared variables generates additional coherence activity which interferes with applications. On the other hand, lock contention causes serialization, which results in performance degradation. This paper proposes and evaluates GLocks, a hardware-supported implementation for highly-contended locks in the context of many-core CMPs. GLocks use a token-based message-passing protocol over a dedicated network built on state-of-the-art technology. This approach skips the memory hierarchy to provide a non-intrusive, extremely efficient and fair lock implementation with negligible impact on energy consumption or die area. A comprehensive comparison against the most efficient shared-memory-based lock implementation for a set of microbenchmarks and real applications quantifies the goodness of GLocks. Performance results show an average reduction of 42% and 14% in execution time, an average reduction of 76% and 23% in network traffic, and an average reduction of 78% and 28% in the energy-delay^2 product (ED^2P) metric for the full CMP for the microbenchmarks and the real applications, respectively. In light of our performance results, we conclude that GLocks satisfy our initial working hypothesis: GLocks minimize cache-coherence network traffic due to lock synchronization, which translates into reduced power consumption and execution time. (A software sketch of the token-passing idea appears after this session's listing.)
  • Profiling Heterogeneous Multi-GPU Systems to Accelerate Cortically Inspired Learning Algorithms Authors: Andrew Nere (University of Wisconsin - Madison, USA); Atif Hashmi (University of Wisconsin - Madison, USA); Mikko Lipasti (University of Wisconsin - Madison, USA)
    Recent advances in neuroscientific understanding make parallel computing devices modeled after the human neocortex a plausible, attractive, fault-tolerant, and energy-efficient possibility. Such attributes have once again sparked an interest in creating learning algorithms that aspire to reverse-engineer many of the abilities of the brain. In this paper we describe a GPGPU-accelerated extension to an intelligent learning model inspired by the structural and functional properties of the mammalian neocortex. Our cortical network, like the brain, exhibits massive amounts of processing parallelism, making today's GPGPUs a highly attractive and readily-available hardware accelerator for such a model. Furthermore, we consider two inefficiencies inherent to our initial design: multiple kernel-launch overhead and poor utilization of GPGPU resources. We propose optimizations such as a software work-queue structure and pipelining the hierarchical layers of the cortical network to mitigate such problems. Our analysis provides important insight into GPU architecture details including the number of cores, the memory system, and the global thread scheduler. Additionally, we create a runtime profiling tool for our parallel learning algorithm which proportionally distributes the cortical network across the host CPU as well as multiple GPUs, whether homogeneous or heterogeneous, that may be available to the system. Using the profiling tool with these optimizations on Nvidia's CUDA framework, we achieve up to 60x speedup over a single-threaded CPU implementation of the model. (A sketch of the proportional work split appears after this session's listing.)
  • PHAST: Hardware-Accelerated Shortest Path Trees Authors: Daniel Delling (Microsoft Research Silicon Valley, USA); Andrew Goldberg (Microsoft Research Silicon Valley, USA); Andreas Nowatzyk (Microsoft Research Silicon Valley, USA)
    We present a novel algorithm to solve the nonnegative single-source shortest path problem on road networks and other graphs with low highway dimension. After a quick preprocessing phase, we can compute all distances from a given source in the graph with essentially a linear sweep over all vertices. Because this sweep is independent of the source, we are able to reorder vertices in advance to exploit locality. Moreover, our algorithm takes advantage of features of modern CPU architectures, such as SSE and multi-core. Compared to Dijkstra's algorithm, our method needs fewer operations, has better locality, and is better able to exploit parallelism at multi-core and instruction levels. We gain additional speedup when implementing our algorithm on a GPU, where it is up to three orders of magnitude faster than Dijkstra's algorithm on a high-end CPU. This makes applications based on all-pairs shortest paths practical for continental-sized road networks. Several algorithms, such as computing the graph diameter, exact arc flags, or centrality measures (exact reaches or betweenness), can be greatly accelerated by our method. (A toy sketch of the linear sweep appears after this session's listing.)
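
As a companion to the Online Adaptive Code Generation and Tuning paper, here is a toy version of the tuning loop: time a handful of candidate variants of a kernel and keep the fastest. In the paper the variants come from on-the-fly code generation and compilation (for example, different unroll factors) and are evaluated on different nodes in parallel; none of that machinery is reproduced here, and the blocking parameter below is purely hypothetical.

    import time

    def make_variant(block):
        # Hypothetical tunable kernel: sum a list in chunks of 'block' elements.
        def run(data):
            return sum(sum(data[i:i + block]) for i in range(0, len(data), block))
        return run

    data = list(range(1_000_000))
    timings = {}
    for block in (64, 256, 1024, 4096):        # candidate parameter values (made up)
        variant = make_variant(block)
        start = time.perf_counter()
        variant(data)
        timings[block] = time.perf_counter() - start

    best = min(timings, key=timings.get)       # keep the fastest variant found so far
    print("fastest block size:", best)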
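
The GLocks paper describes a hardware, token-based message-passing lock carried over a dedicated on-chip network. Below is a rough software analogue of the token idea only: a token circulates among cores in a fixed ring, and a core enters its critical section only while holding it. This illustrates the protocol's fairness, not GLocks' hardware design or performance.

    import threading

    NUM_CORES = 4
    token = [threading.Event() for _ in range(NUM_CORES)]
    token[0].set()                              # core 0 starts with the token
    trace = []

    def core(cid, iterations):
        for _ in range(iterations):
            token[cid].wait()                   # block until the token arrives
            token[cid].clear()
            trace.append(cid)                   # critical section
            token[(cid + 1) % NUM_CORES].set()  # pass the token to the next core

    threads = [threading.Thread(target=core, args=(i, 3)) for i in range(NUM_CORES)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(trace)                                # strictly round-robin acquisition order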
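
The Nere et al. paper proportionally distributes the cortical network across the host CPU and whatever GPUs are present, based on profiled throughput. Here is a small sketch of that splitting policy in the same spirit; the device names and throughput numbers are made up.

    def proportional_split(total_units, throughput):
        # Assign work units to each device in proportion to its measured throughput.
        total = sum(throughput.values())
        shares = {dev: int(total_units * rate / total) for dev, rate in throughput.items()}
        # hand any rounding leftovers to the fastest device
        shares[max(throughput, key=throughput.get)] += total_units - sum(shares.values())
        return shares

    print(proportional_split(10_000, {"cpu": 1.0, "fast_gpu": 9.0, "slow_gpu": 5.0}))
    # {'cpu': 666, 'fast_gpu': 6001, 'slow_gpu': 3333}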
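
The heart of the PHAST query, as described in the abstract, is a single linear sweep over all vertices in a precomputed, source-independent order. The toy sketch below shows such a sweep on a tiny graph whose arcs all point from earlier to later vertices in the scan order; in PHAST that order and arc set come from the preprocessing phase combined with a small search from the source, which is omitted here, so treat this purely as an illustration of the sweep step.

    import math

    scan_order = ["s", "a", "b", "c"]           # hypothetical precomputed vertex order
    incoming = {                                # arcs (u -> v, weight), with u scanned before v
        "a": [("s", 2)],
        "b": [("s", 7), ("a", 1)],
        "c": [("a", 6), ("b", 2)],
    }

    dist = {v: math.inf for v in scan_order}
    dist["s"] = 0
    for v in scan_order:                        # one pass settles every vertex
        for u, w in incoming.get(v, []):
            dist[v] = min(dist[v], dist[u] + w)

    print(dist)                                 # {'s': 0, 'a': 2, 'b': 3, 'c': 5}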

Special NSF-SEES Presentation

  • NSF Science, Engineering and Education for Sustainability (SEES) Initiative Authors: Krishna Kant, National Science Foundation
    SEES (Science, Engineering and Education for Sustainability) describes an investment area established by NSF in 2010 to enable discoveries needed to inform actions that lead to environmental, energy and societal sustainability. SEES is expected to be a multi-year effort that will address challenges in climate and energy research and education using a systems-based approach to understanding, predicting, and reacting to change in the linked natural, social, and built environment. In FY 2011, NSF plans to encourage interdisciplinary research and education on energy sustainability, with a particular emphasis on the socioeconomic and environmental implications. Because the SEES initiative is very much an interdisciplinary program, the purpose of this presentation is to give the IPDPS community a briefing on existing and upcoming funding opportunities and how they can start to build communities to take advantage of these opportunities.