TechTalks from event: IEEE IPDPS 2011

Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without logging in. For other talks, you will need to log in using the email address you registered with for IPDPS 2011. Note 2: Many of the talks (those without a thumbnail next to their description below) are yet to be uploaded. Some of them were not recorded because of technical problems. We are working with the corresponding authors to upload the self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.

Intel Platinum Patron Night

  • Architecting Parallel Software: Design patterns in practice and teaching Authors: Michael Wrinn, Intel
    Design patterns can systematically identify reusable elements in software engineering, and have been particularly effective in codifying practice in object-oriented software. A team of researchers centered at UC Berkeley’s Parallel Computing Laboratory continues to investigate a design pattern approach to parallel software; the effort has matured to the point that an undergraduate course was delivered on the topic in Fall 2010. This talk will briefly describe the pattern language itself, then demonstrate its application in examples from both image processing and game design.
  • Teaching Parallelism Using Games Authors: Ashish Amresh, Intel; Amit Jindal, Intel
    Academic institutions do not have to spend on expensive multi-core hardware to support game-based courses that teach parallelism. We will discuss which teaching methodologies educators can use to integrate a parallel computing curriculum into a game engine. We will talk about the full game development process, from game design to game engineering, and how parallelism is critical to it. We will show five game demos that mirror current trends in the industry and how educators can use these games in the classroom. We will also show the learning outcomes and which parallelism topics are appropriate to teach students at various levels. We will demonstrate how to take games running serially and modify them to run in parallel.
  • Starting Your Future Career at Intel Authors: Dani Napier, Intel; Lauren Dankiewicz, Intel
    Intel's Dani Napier will introduce why Intel is a great place to work: it is challenging, has great benefits, and is abundant with rewarding growth opportunities. She will expand on why parallelism is crucial to Intel's growth strategy and give an overview of the various types of jobs at Intel in which knowledge of parallel and distributed processing applies. Finally, Dani will explain the new hire development process and why Intel is the company that will help you become successful in your desired career path. Lauren Dankiewicz will discuss her background from the University of California, Berkeley. She gives an insightful and humorous commentary on the interview process at Intel, drawing similarities to dating. Lauren describes the excitement, the uncertainty, and what it takes to make the right choice! Listen to this fun and engaging real-life clip of how an intern became a full-time employee at Intel.
  • Opening Remarks Authors:
    Intel Platinum Patron Night will be held on Thursday evening, 5:30-8:30pm, in the Kuskokwim Ballroom. This will be an exciting opportunity for IPDPS attendees to network and learn about the Intel Academic Community’s free resources to support parallel computing research and teaching. Intel recruiters will share information about engineering internships and careers for recent college graduates.

25th Year IPDPS Celebration

SESSION 4: Runtime Systems

  • A Study of Speculative Distributed Scheduling on the Cell/B.E. Authors: Pieter Bellens (Barcelona Supercomputing Center, Spain); Josep M. Perez (Barcelona Supercomputing Center, Spain); Rosa M. Badia
    Star Superscalar’s (StarSs) programming model converts a sequential application in C or Fortran into an efficient parallel program. The resulting parallel code is highly dynamic in the sense that data analysis and task scheduling occur at runtime, while the application executes. In this paper we compare this approach to the strategy adopted by other multi-core programming environments. The price to pay for dynamic scheduling and dependence tracking is higher runtime overhead. We propose a distributed scheduler for Task Dependence Graphs (TDGs) to attenuate the scheduling cost in heterogeneous multi-core architectures. This scheduler allows the cores to speculatively select tasks from a conservative estimate of the TDG. In case of conflicts or lack of tasks, a lightweight centralized scheduler services the faulting core, after which the latter resumes its participation in the distributed scheme. Experiments with Cell Superscalar (CellSs) on a representative set of benchmarks demonstrate the reduction in runtime overhead achieved by the distributed scheduler. This reduction in runtime overhead carries over directly to a performance improvement for a large fraction of the benchmarks.
  • Exploiting Data Similarity to Reduce Memory Footprints Authors: Susmit Biswas (Lawrence Livermore National Laboratory, USA); Bronis R. de Supinski (Lawrence Livermore National Laboratory, USA
    Memory size has long limited large-scale applications on high-performance computing (HPC) systems. Since compute nodes frequently do not have swap space, physical memory often limits problem sizes. Increasing core counts per chip and power density constraints, which limit the number of DIMMs per node, have exacerbated this problem. Further, DRAM constitutes a significant portion of overall HPC system cost. Therefore, instead of adding more DRAM to the nodes, mechanisms to manage memory usage more efficiently, preferably transparently, could increase effective DRAM capacity and thus the benefit of multicore nodes for HPC systems. MPI application processes often exhibit significant data similarity. These data regions occupy multiple physical locations across the individual rank processes within a multicore node and thus offer a potential savings in memory capacity. These regions, primarily residing in heap, are dynamic, which makes them difficult to manage statically. Our novel memory allocation library, SBLLmalloc, automatically identifies identical memory blocks and merges them into a single copy. Our implementation is transparent to the application and does not require any kernel modifications. Overall, we demonstrate that SBLLmalloc reduces the memory footprint of a range of MPI applications by 32.03% on average and up to 60.87%. Further, SBLLmalloc supports problem sizes for IRS over 21.36% larger than using standard memory management techniques, thus significantly increasing effective system size. Similarly, SBLLmalloc requires 43.75% fewer nodes than standard memory management techniques to solve an AMG problem.
  • The Evaluation of an Effective Out-of-core Run-Time System in the Context of Parallel Mesh Generation Authors: Andriy Kot (College of William and Mary, USA); Andrey N Chernikov (College of William and Mary, USA); Nikos Chrisochoides (Coll
    We present an out-of-core run-time system that supports effective parallel computation of large irregular and adaptive problems, in particular unstructured mesh generation (PUMG). PUMG is a highly challenging application due to intensive memory accesses, unpredictable communication patterns, and variable and irregular data dependencies reflecting the unstructured spatial connectivity of mesh elements. Our runtime system makes it possible to transform the footprint of parallel applications from wide and shallow into narrow and deep by extending the memory utilization to the out-of-core level. It simplifies and streamlines the development of otherwise highly time-consuming out-of-core applications as well as the conversion of existing applications. It utilizes disk, network and memory hierarchy to achieve high utilization of computing resources without sacrificing performance with PUMG. The runtime system combines different programming paradigms: multi-threading within the nodes using an industrial-strength software framework, one-sided active messages among the nodes, and an out-of-core subsystem for managing large datasets. We performed an evaluation on traditional parallel platforms to stress test all layers of the run-time system using three different PUMG methods with significantly varying communication and synchronization patterns. We demonstrated high overlap in computation, communication, and disk I/O, which results in good performance when computing large out-of-core problems. The runtime system adds very small overhead (up to 18% on most configurations) when computing in-core, which means performance is not compromised.
  • Enriching 3-D video games on multicores Authors: Romain Cledat (Georgia Institute of Technology, USA); Tushar Kumar (Georgia Institute of Technology, USA); Jaswanth Sreeram (Ge
    The introduction of multicore processors on desktops and other personal computing platforms has given rise to multiple interesting end-user application possibilities. One important trend is the increased presence of resource-hungry applications like gaming and multimedia applications. One of the key distinguishing factors of these applications is that they are amenable to variable semantics (i.e., multiple possibilities of results), unlike traditional applications wherein a fixed, unique answer is expected. For example, varying degrees of image processing improve picture quality; different model complexities used in game physics allow different degrees of realism during game play, and so on. The goal of this paper is to demonstrate that scalable semantics in applications such as video games can be enriched with optional tasks that can be launched and thus adapt to the amount of available resources at runtime. We propose a C/C++ API that allows the programmer to define how the current semantics of a program can be opportunistically enriched, as well as the underlying runtime system that orchestrates the different computations. We show how this infrastructure can be used to enrich a well-known game called Quake 3. Our results show that it is possible to perform significant enrichment without degrading the application’s performance by utilizing additional cores.
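
The idea in the last abstract above, optional tasks that enrich a result only when spare cores exist, can be illustrated with a minimal sketch. This is not the paper's C/C++ API, which the abstract does not describe; it is a Python illustration, and `run_frame`, `spare_cores`, and the three task functions are hypothetical names invented here.

```python
from concurrent.futures import ThreadPoolExecutor


def run_frame(mandatory, optional, spare_cores):
    """Run the mandatory task; launch optional tasks only on spare cores.

    Hypothetical helper: `optional` is a priority-ordered list of callables,
    and `spare_cores` is how many cores are assumed idle for this frame.
    Returns (baseline result, list of enrichment results).
    """
    launched = optional[:max(spare_cores, 0)]
    if not launched:
        # No spare capacity: the program keeps its baseline semantics.
        return mandatory(), []
    with ThreadPoolExecutor(max_workers=len(launched)) as pool:
        futures = [pool.submit(task) for task in launched]
        base = mandatory()  # baseline work runs alongside the optional tasks
        extras = [f.result() for f in futures]
    return base, extras


# Illustrative stand-ins for game work (assumed, not taken from Quake 3).
def physics_step():
    return "basic physics"


def extra_particles():
    return "extra particles"


def soft_shadows():
    return "soft shadows"


if __name__ == "__main__":
    enrichments = [extra_particles, soft_shadows]
    print(run_frame(physics_step, enrichments, 0))
    print(run_frame(physics_step, enrichments, 1))
```

With zero spare cores the frame yields only the baseline result; with one spare core the highest-priority enrichment is added. A real runtime would presumably also have to bound how long optional work may delay the frame, which this sketch leaves out.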