TechTalks from event: IEEE IPDPS 2011

Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without logging in. For other talks, you will need to log in using the email address you registered with for IPDPS 2011.
Note 2: Many of the talks (those without a thumbnail next to their description below) are yet to be uploaded. Some of them were not recorded because of technical problems. We are working with the corresponding authors to upload the self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.

Intel Platinum Patron Night

  • Architecting Parallel Software: Design patterns in practice and teaching Authors: Michael Wrinn, Intel
    Design patterns can systematically identify reusable elements in software engineering, and they have been particularly effective in codifying practice in object-oriented software. A team of researchers centered at UC Berkeley’s Parallel Computing Laboratory continues to investigate a design pattern approach to parallel software; the effort has matured to the point that an undergraduate course on the topic was delivered in Fall 2010. This talk will briefly describe the pattern language itself, then demonstrate its application in examples from both image processing and game design (a small sketch of one such pattern follows this list).
  • Teaching Parallelism Using Games Authors: Ashish Amresh, Intel; Amit Jindal, Intel
    Academic institutions do not have to spend money on expensive multi-core hardware to support game-based courses that teach parallelism. We will discuss the teaching methodologies educators can use to integrate a parallel computing curriculum into a game engine. We will walk through the full game development process, from game design to game engineering, and show how parallelism is critical. We will show five game demos that mirror current industry trends and explain how educators can use these games in the classroom. We will also cover the learning outcomes and which parallelism topics are appropriate for students at various levels. Finally, we will demonstrate how to take games that run serially and modify them to run in parallel (a sketch of this step follows this list).
  • Starting Your Future Career at Intel Authors: Dani Napier, Intel; Lauren Dankiewicz, Intel
    Intel's Dani Napier will explain why Intel is a great place to work: it's challenging, has great benefits, and is abundant with rewarding growth opportunities. She will expand on why parallelism is crucial to Intel's growth strategy and give an overview of the various types of jobs at Intel in which knowledge of parallel and distributed processing applies. Finally, Dani will explain the new-hire development process and why Intel is the company that will help you succeed in your desired career path. Lauren Dankiewicz will discuss her background at the University of California, Berkeley. She gives an insightful and humorous commentary on the interview process at Intel, drawing similarities to dating. Lauren describes the excitement, the uncertainty, and what it takes to make the right choice! Listen to this fun and engaging real-life account of how an intern became a full-time employee at Intel.
  • Opening Remarks
    Intel Platinum Patron Night will be held on Thursday evening, 5:30-8:30pm, in the Kuskokwim Ballroom. This will be an exciting opportunity for IPDPS attendees to network and learn about the Intel Academic Community’s free resources to support parallel computing research and teaching. Intel recruiters will share information about engineering internships and careers for recent college graduates.
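
For the “Architecting Parallel Software” talk above, here is a minimal sketch of the Map pattern, one of the patterns in the Berkeley pattern language, applied to image processing. This is an illustrative assumption rather than material from the talk: OpenMP stands in for whatever framework the course used, and the threshold kernel and pixel layout are invented.

```cpp
// A minimal sketch of the Map pattern: the same independent operation is
// applied to every pixel, so the loop needs no synchronization. OpenMP is
// an assumption here; the threshold kernel is invented for illustration.
#include <cstdint>
#include <vector>

void threshold(std::vector<std::uint8_t>& pixels, std::uint8_t cutoff) {
    #pragma omp parallel for
    for (long i = 0; i < (long)pixels.size(); ++i)
        pixels[i] = pixels[i] >= cutoff ? 255 : 0;
}
```

The point of the pattern is structural: once a computation is recognized as a Map, every element is independent and the parallelization is mechanical.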
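
For the “Teaching Parallelism Using Games” talk above, here is a sketch of the serial-to-parallel step the speakers describe: taking a per-entity update loop, typical of a game engine frame, and running it in parallel. The Entity type and physics update are hypothetical stand-ins, not code from the talk's demos.

```cpp
// A sketch of parallelizing a serial game-loop update step with OpenMP.
// The Entity type and physics update are hypothetical stand-ins for a
// real engine's per-frame work.
#include <vector>

struct Entity { float x, y, vx, vy; };

void update_entities(std::vector<Entity>& entities, float dt) {
    // Each entity is updated independently, so the serial loop can be
    // parallelized by adding a single pragma.
    #pragma omp parallel for
    for (long i = 0; i < (long)entities.size(); ++i) {
        entities[i].x += entities[i].vx * dt;
        entities[i].y += entities[i].vy * dt;
    }
}
```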

25th Year IPDPS Celebration

SESSION 16: Programming Environments and Tools

  • Reconciling Sampling and Direct Instrumentation for Unintrusive Call-Path Profiling of MPI Programs Authors: Zoltan Szebenyi (Jülich Supercomputing Centre, Germany); Todd Gamblin (Lawrence Livermore National Laboratory, USA); Marti
    We can profile the performance behavior of parallel programs at the level of individual call paths through sampling or direct instrumentation. While we can easily control measurement dilation by adjusting the sampling frequency, the statistical nature of sampling and the difficulty of accessing the parameters of sampled events make it unsuitable for obtaining certain communication metrics, such as the size of message payloads. Alternatively, direct instrumentation, which is preferable for capturing message-passing events, can excessively dilate measurements, particularly for C++ programs, which often have many short but frequently called class member functions. Thus, we combine these techniques in a unified framework that exploits the strengths of each approach while avoiding their weaknesses: we use direct instrumentation to intercept MPI routines while we record the execution of the remaining code through low-overhead sampling (a sketch of this interception mechanism follows this session's list). One of the main technical hurdles we mastered was the inexpensive and portable determination of call-path information during the invocation of MPI routines. We show that the overhead of our implementation is sufficiently low to support substantial performance improvement of a C++ fluid-dynamics code.
  • A Practical Approach for Performance Analysis of Shared Memory Programs Authors: Bogdan Marius Tudor (National University of Singapore, Singapore); Yong Meng Teo (National University of Singapore, Singapore)
    Parallel programming has transcended from HPC into the mainstream, enabled by a growing number of programming models, languages, and methodologies, as well as the availability of multicore systems. However, performance analysis of parallel programs is still difficult, especially for large and complex programs, or for applications developed using different programming models. This paper proposes a simple analytical model for studying the speedup of shared-memory programs on multicore systems. The proposed model derives the speedup and speedup loss from data dependency and memory overhead for various configurations of threads, cores, and memory access policies in UMA and NUMA systems (an illustration of this style of model follows this session's list). The model is practical because it uses only generally available and non-intrusive inputs derived from the trace of the operating system run-queue and hardware event counters. Using six OpenMP HPC dwarfs from the NPB benchmark, our model differs from measurement results on average by 9% on UMA and 11% on NUMA. Our analysis shows that speedup loss is dominated by memory contention, especially for larger problem sizes. For the worst-performing structured grid dwarf on UMA, memory contention accounts for up to 99% of the speedup loss. Based on this insight, we apply our model to determine the optimal number of cores that alleviates memory contention, maximizing speedup and reducing execution time.
  • Single Node On-Line Simulation of MPI Applications with SMPI Authors: Pierre-Nicolas Clauss (Nancy University, France); Mark Lee Stillwell (INRIA, France); Stéphane Genaud (University of Str
    Simulation is a popular approach for predicting the performance of MPI applications on platforms that are not at one’s disposal. It is also a way to teach the principles of parallel programming and high-performance computing to students without access to a parallel computer. In this work we present SMPI, a simulator for MPI applications that uses on-line simulation, i.e., the application is executed, but part of the execution takes place within a simulation component. SMPI simulations account for network contention in a fast and scalable manner. SMPI also implements an original and validated piecewise-linear model for data transfer times between cluster nodes (the idea is sketched after this session's list). Finally, SMPI simulations of large-scale applications on large-scale platforms can be executed on a single node thanks to techniques that reduce the simulation’s compute time and memory footprint. These contributions are validated via a large set of experiments in which SMPI is compared to popular MPI implementations with a view to assessing its accuracy, scalability, and speed.
  • Patus: A Code Generation and Autotuning Framework For Parallel Iterative Stencil Computations on Modern Microarchitectures Authors: Matthias Christen (University of Basel, Switzerland); Olaf Schenk (University of Basel, Switzerland); Helmar Burkhart (Universi
    Stencil calculations comprise an important class of kernels in many scientific computing applications, ranging from simple PDE solvers to constituent kernels in multigrid methods as well as image processing applications. In such solvers, stencil kernels are often the dominant part of the computation, and an efficient parallel implementation of the kernel is therefore crucial to reducing the time to solution. However, on current complex hardware microarchitectures, meticulous architecture-specific tuning is required to elicit the machine’s full compute power. We present PATUS, a code generation and auto-tuning framework for stencil computations targeted at multi- and manycore processors, such as multicore CPUs and graphics processing units. It makes it possible to generate compute kernels from a specification of the stencil operation and a parallelization and optimization strategy, and it leverages the autotuning methodology to optimize strategy-dependent parameters for the given hardware architecture (a baseline example of such a kernel appears below).
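
For the call-path profiling talk above, here is a sketch of how direct instrumentation of MPI routines can work through the standard PMPI profiling interface: the wrapper observes the message payload size, which sampling alone cannot, then forwards to the real implementation. The call-path recording the paper describes is stubbed out as a comment; this illustrates the mechanism, not the authors' implementation.

```cpp
// A sketch of direct instrumentation via the standard PMPI profiling
// interface: the wrapper sees the message payload size, which sampling
// alone cannot observe, then forwards to the real implementation. The
// call-path recording described in the paper is stubbed out below.
#include <mpi.h>
#include <cstdio>

extern "C" int MPI_Send(const void* buf, int count, MPI_Datatype type,
                        int dest, int tag, MPI_Comm comm) {
    int elem_size = 0;
    PMPI_Type_size(type, &elem_size);         // bytes per element
    long payload = (long)elem_size * count;   // message payload in bytes
    // record_payload_for_current_call_path(payload);  // hypothetical hook
    std::fprintf(stderr, "MPI_Send: %ld bytes to rank %d\n", payload, dest);
    return PMPI_Send(buf, count, type, dest, tag, comm);
}
```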
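
For the shared-memory performance analysis talk above: the abstract does not reproduce the paper's equations, so the following is a loudly hypothetical illustration of the general shape of such a model, in which ideal speedup equal to the core count is discounted by data-dependency and memory-contention overhead terms.

```cpp
// A hypothetical illustration (not the paper's actual equations) of an
// analytical speedup model: ideal speedup equal to the core count is
// discounted by measured data-dependency and memory-contention overheads.
#include <cstdio>

double predicted_speedup(int cores, double dep_overhead, double mem_overhead) {
    // Each overhead term is the fraction of extra execution time it adds
    // at this thread/core configuration; in the paper's methodology such
    // inputs come from run-queue traces and hardware event counters.
    return cores / (1.0 + dep_overhead + mem_overhead);
}

int main() {
    // e.g., 8 cores, 10% dependency stalls, 60% memory contention
    std::printf("predicted speedup: %.2f\n", predicted_speedup(8, 0.10, 0.60));
    return 0;
}
```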
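
For the SMPI talk above, here is a sketch of the piecewise-linear data-transfer-time idea: latency and effective bandwidth coefficients depend on which message-size segment a transfer falls into (e.g., eager vs. rendezvous protocol ranges). All boundaries and coefficients below are invented for illustration and are not SMPI's calibrated values.

```cpp
// A sketch of a piecewise-linear transfer-time model: latency and
// bandwidth coefficients change with the message-size segment. All
// boundaries and coefficients here are invented, not SMPI's values.
#include <cstdio>

double transfer_time(long bytes) {
    // T(s) = alpha_i + beta_i * s on each size segment i
    if (bytes < 1024)   return 1e-6 + bytes * 2.0e-9;  // small messages
    if (bytes < 65536)  return 5e-6 + bytes * 1.2e-9;  // medium messages
    return 2e-5 + bytes * 1.0e-9;                      // large messages
}

int main() {
    std::printf("1 MiB transfer: %.3f ms\n", transfer_time(1L << 20) * 1e3);
    return 0;
}
```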
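
For the PATUS talk above, here is a baseline 5-point Jacobi stencil, the kind of kernel a PATUS stencil specification describes. This hand-written OpenMP version is only a reference sketch under that assumption, not generated or tuned PATUS output.

```cpp
// A baseline 5-point Jacobi stencil on an n x n row-major grid, the kind
// of kernel a stencil specification describes. This hand-written OpenMP
// version is a reference sketch, not generated or auto-tuned output.
#include <vector>

void jacobi_step(const std::vector<double>& in, std::vector<double>& out, int n) {
    // Update interior points from the four axis neighbors; boundary
    // values are left unchanged.
    #pragma omp parallel for
    for (int i = 1; i < n - 1; ++i)
        for (int j = 1; j < n - 1; ++j)
            out[i * n + j] = 0.25 * (in[(i - 1) * n + j] + in[(i + 1) * n + j]
                                   + in[i * n + j - 1] + in[i * n + j + 1]);
}
```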