TechTalks from event: IEEE IPDPS 2011

Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without logging in. For other talks, you will need to log in using the email address you registered with for IPDPS 2011. Note 2: Many of the talks (those without a thumbnail next to their description below) have yet to be uploaded. Some of them were not recorded because of technical problems; we are working with the corresponding authors to upload self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.

SESSION 3: Hardware-Software Interaction

  • A Novel Power Management for CMP Systems in Data-intensive Environment Authors: Pengju Shang (University of Central Florida, USA); Jun Wang (University of Central Florida, USA)
    Today's emerging data-intensive applications consist of non-uniform CPU- and I/O-intensive workloads, so power management strategies must consider both CPU and I/O effects. Scaling down the processor's frequency based only on its busy/idle ratio cannot fully exploit opportunities for saving power. Our experiments show that besides the busy and idle states, each processor may also have I/O wait phases in which it waits for I/O operations to complete. During these phases the completion time is determined by the I/O subsystem rather than the CPU, so scaling the processor to a lower frequency does not affect performance but saves additional power. In addition, the CPU's reaction to I/O operations may be significantly affected by several factors, such as I/O type (sync or async) and instruction/job-level parallelism, and cannot be accurately modeled via physical laws as mechanical or chemical systems can. In this paper, we propose a novel power management scheme called MAR (modeless, adaptive, rule-based) for multiprocessor systems to minimize CPU power consumption under performance constraints. By using richer feedback factors, e.g., the I/O wait ratio, MAR is able to accurately describe the relationships among core frequencies, performance, and power consumption. We adopt a modeless control approach to reduce the complexity of system modeling. MAR is designed for CMP (Chip Multi-Processor) systems, employing multi-input/multi-output (MIMO) control theory and per-core DVFS (Dynamic Voltage and Frequency Scaling). Our extensive experiments on a physical test bed demonstrate that, for the SPEC benchmarks and a data-intensive (TPC-C) benchmark, MAR achieves 93.6-96.2% of the efficiency of the ideal power-saving strategy calculated offline. Compared with baseline solutions, MAR saves 22.5-32.5% more power while keeping performance loss comparable, at about 1.8-2.9%. In addition, simulation results show the efficiency of our design for various CMP configurations. (An illustrative control-loop sketch follows this session's list.)
  • Characterization of System Services and Their Performance Impact in Multicore Nodes Authors: Seetharami R Seelam (IBM Research, USA); Liana L Fong (IBM TJ Watson Research Center, USA); John Divirgilio (IBM, USA); Brian F
    The performance of parallel applications on large-scale systems is shown to degrade disproportionately due to interference from system services. This interference from system services is known as jitter. However, there is limited understanding of the sources and patterns of jitter on multi-core systems. In this paper, we identify and characterize jitter sources in terms of their amplitude and execution-interval distributions on multi-core IBM Power systems running UNIX-based general-purpose operating systems: AIX and Linux. Our analysis shows that there are various kinds of jitter sources and that their execution varies drastically between different cores and between hardware threads within each core for practical reasons. This in-depth knowledge of jitter events is leveraged to devise effective approaches to mitigating the jitter impact on application performance in large-scale systems. Moreover, such knowledge provides useful insights for new generations of operating system designs, such as multikernels or satellite kernels, for multi-core systems. (A minimal fixed-work jitter probe follows this session's list.)
  • Automatic Recognition of Performance Idioms in Scientific Applications Authors: Jiahua He (University of California, San Diego, USA); Allan Snavely (University of California, San Diego, USA); Rob F. Van der Wijngaart
    Basic data flow patterns that we call performance idioms, such as stream, transpose, reduction, random access, and stencil, are common in scientific numerical applications. We hypothesize that a small number of idioms can cover most programming constructs that dominate the execution time of scientific codes and can be used to approximate application performance. To test these hypotheses, we propose an automatic idiom recognition method and implement it based on the open-source compiler Open64. With the NAS Parallel Benchmarks (NPB) as a case study, the prototype system is about 90% accurate compared with idiom classification by a human expert. Our results show that the above five idioms suffice to cover 100% of the six NPB codes (MG, CG, FT, BT, SP and LU). We also compared the performance of our idiom benchmarks with their corresponding instances in the NPB codes on two different platforms using different methods. The approximation accuracy is up to 96.6%. The contribution is to show that a small set of idioms can cover more complex codes, that idioms can be recognized automatically, and that suitably defined idioms may approximate application performance. (The five idioms are written out as tiny C kernels after this session's list.)
  • Iso-energy-efficiency: An approach to power-constrained parallel computation Authors: Shuaiwen Song (Virginia Tech, USA); Chun-Yi Su (Virginia Tech, USA); Rong Ge (Marquette University, USA); Abhinav Vishnu (Pacific Northwest National Laboratory, USA)
    Future large-scale high-performance supercomputer systems require high energy efficiency to achieve exaflops computational power and beyond. Despite the need to understand energy efficiency in high-performance systems, there are few techniques to evaluate it at scale. In this paper, we propose a system-level iso-energy-efficiency model to analyze, evaluate, and predict the energy-performance behavior of data-intensive parallel applications with various execution patterns running on large-scale power-aware clusters. Our analytical model helps users explore the effects of machine- and application-dependent characteristics on system energy efficiency and isolate efficient ways to scale system parameters (e.g., processor count, CPU power/frequency, workload size, and network bandwidth) to balance energy use and performance. We derive our iso-energy-efficiency model and apply it to the NAS Parallel Benchmarks on two power-aware clusters. Our results indicate that the model accurately predicts total system energy consumption within 5% error on average for parallel applications with various execution and communication patterns. We demonstrate effective use of the model for various application contexts and in scalability decision-making. (A sketch of the general form of such a model follows this session's list.)
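
For the MAR talk above: a minimal sketch of an I/O-wait-aware, rule-based DVFS control loop on Linux. This is our illustration of the general idea, not the paper's MAR controller; the thresholds and frequency values are placeholders, and writing scaling_setspeed assumes the "userspace" cpufreq governor and root privileges.

    #include <stdio.h>
    #include <unistd.h>

    /* Parse cpu0's busy, idle, and iowait jiffies from /proc/stat. */
    static int read_cpu0(unsigned long long *busy, unsigned long long *idle,
                         unsigned long long *iowait)
    {
        unsigned long long u, n, s, id, io;
        char line[256];
        FILE *f = fopen("/proc/stat", "r");
        if (!f) return -1;
        while (fgets(line, sizeof line, f)) {
            if (sscanf(line, "cpu0 %llu %llu %llu %llu %llu",
                       &u, &n, &s, &id, &io) == 5) {
                fclose(f);
                *busy = u + n + s; *idle = id; *iowait = io;
                return 0;
            }
        }
        fclose(f);
        return -1;
    }

    /* Assumes the "userspace" cpufreq governor; needs root. */
    static void set_freq_khz(unsigned khz)
    {
        FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed", "w");
        if (f) { fprintf(f, "%u\n", khz); fclose(f); }
    }

    int main(void)
    {
        unsigned long long b0, i0, w0, b1, i1, w1;
        if (read_cpu0(&b0, &i0, &w0)) return 1;
        for (;;) {
            sleep(1);                                   /* control period */
            if (read_cpu0(&b1, &i1, &w1)) return 1;
            double total = (double)((b1 - b0) + (i1 - i0) + (w1 - w0));
            if (total > 0) {
                double busy = (b1 - b0) / total;
                double iow  = (w1 - w0) / total;
                /* Rule table (placeholder values): iowait-dominated periods
                 * are I/O-bound, so a low frequency costs little performance;
                 * busy-dominated periods want full speed. */
                if (iow > 0.5)       set_freq_khz(800000);
                else if (busy > 0.8) set_freq_khz(2400000);
                else                 set_freq_khz(1600000);
            }
            b0 = b1; i0 = i1; w0 = w1;
        }
    }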
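
For the jitter characterization talk: a minimal fixed-work jitter probe, a standard measurement technique rather than the authors' toolchain. Each iteration does identical work, so samples far above the minimum expose interruptions by system services; the excess over the noise floor is the jitter amplitude.

    #include <stdio.h>
    #include <time.h>

    #define SAMPLES 100000
    #define WORK    20000

    static volatile double sink;   /* defeats dead-code elimination */

    static double now_us(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
    }

    int main(void)
    {
        static double dt[SAMPLES];
        double min = 1e30;
        for (int i = 0; i < SAMPLES; i++) {
            double t0 = now_us();
            double x = 0.0;
            for (int j = 0; j < WORK; j++)   /* fixed quantum of work */
                x += j * 0.5;
            sink = x;
            dt[i] = now_us() - t0;
            if (dt[i] < min) min = dt[i];
        }
        /* Report samples inflated well beyond the noise floor. */
        for (int i = 0; i < SAMPLES; i++)
            if (dt[i] > 2.0 * min)
                printf("sample %d: %.1f us (amplitude %.1f us)\n",
                       i, dt[i], dt[i] - min);
        return 0;
    }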
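
For the performance idioms talk: the five idioms named in the abstract, written as tiny C kernels. These are our own examples of the patterns; the paper's contribution is recognizing such patterns automatically in the compiler. N, the arrays, and the index map are placeholders.

    #define N 1024
    double a[N], b[N], c[N][N], d[N][N];
    int    idx[N];   /* arbitrary index map for random access */

    /* stream: unit-stride read/write */
    void stream(void)        { for (int i = 0; i < N; i++) a[i] = 2.0 * b[i]; }

    /* transpose: strided access on one side */
    void transpose(void)     { for (int i = 0; i < N; i++)
                                   for (int j = 0; j < N; j++) c[i][j] = d[j][i]; }

    /* reduction: many reads collapse to one value */
    double reduction(void)   { double s = 0; for (int i = 0; i < N; i++) s += b[i];
                               return s; }

    /* random access: data-dependent addressing */
    void random_access(void) { for (int i = 0; i < N; i++) a[idx[i]] += b[i]; }

    /* stencil: neighbor accesses in a grid */
    void stencil(void)       { for (int i = 1; i < N - 1; i++)
                                   for (int j = 1; j < N - 1; j++)
                                       c[i][j] = 0.25 * (d[i-1][j] + d[i+1][j]
                                                       + d[i][j-1] + d[i][j+1]); }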
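
For the iso-energy-efficiency talk: one plausible formulation of the general shape of such a model, by analogy with classical isoefficiency. This is our sketch, not the paper's exact derivation; W is workload, p the node count, P_s and P_d static and dynamic power, T_c compute time, and T_o parallel overhead.

    % Our sketch of the general shape, not the paper's exact model.
    \[
      E(p, W) \;=\; p \,\Bigl( P_s \bigl( T_c(p, W) + T_o(p, W) \bigr)
                               + P_d \, T_c(p, W) \Bigr)
    \]
    % Energy efficiency is useful work per joule; the iso-energy-efficiency
    % question is how W must scale with p to hold it constant:
    \[
      \mathrm{EE}(p, W) \;=\; \frac{W}{E(p, W)}, \qquad
      \mathrm{EE}\bigl(p, W(p)\bigr) \;=\; \mathrm{EE}(1, W_1)
      \quad \text{for all } p .
    \]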

SESSION 19: Storage Systems and Memory

  • H-Code: A Hybrid MDS Array Code to Optimize Partial Stripe Writes in RAID-6 Authors: Chentao Wu (Virginia Commonwealth University, USA); Shenggang Wan (Huazhong University of Science and Technology, P.R. China);
    RAID-6 is widely used to tolerate concurrent failures of any two disks, providing a higher level of reliability with the support of erasure codes. Among many implementations, one class of codes, called Maximum Distance Separable (MDS) codes, aims to offer data protection against disk failures with optimal storage efficiency. Typical MDS codes comprise horizontal and vertical codes. Due to their horizontal parity, horizontal codes may incur fewer I/O operations in most cases for a partial stripe write (an I/O operation that writes new data or updates data on a subset of the disks in an array) in a row, but they suffer from an unbalanced I/O distribution and from high single-write complexity. Vertical codes improve single-write complexity compared to horizontal codes, but they still suffer from poor partial stripe write performance. In this paper, we propose a new XOR-based MDS array code, named Hybrid Code (H-Code), which optimizes partial stripe writes for RAID-6 by taking advantage of both horizontal and vertical codes. H-Code is a solution for an array of (p + 1) disks, where p is a prime number. Unlike other codes that take a dedicated anti-diagonal parity strip, H-Code uses a special anti-diagonal parity layout and distributes the anti-diagonal parity elements among the disks in the array, which achieves a more balanced I/O distribution. On the other hand, the horizontal parity of H-Code ensures that a partial stripe write to contiguous data elements in a row shares the same row parity chain, which achieves optimal partial stripe write performance. Not only within a row but also within a stripe, H-Code offers optimal partial stripe write complexity for two contiguous data elements and, to the best of our knowledge, optimal partial stripe write performance among all MDS codes. Specifically, compared to the RDP and EVENODD codes, H-Code reduces I/O cost by up to 15.54% and 22.17%, respectively. Overall, H-Code has optimal storage efficiency, optimal encoding/decoding computational complexity, and optimal complexity for both single writes and partial stripe writes. (The XOR parity update underlying such writes is sketched after this session's list.)
  • LACIO: A New Collective I/O Strategy for Parallel I/O Systems Authors: Yong Chen (Oak Ridge National Laboratory, USA); Xian-He Sun (Illinois Institute of Technology, USA); Rajeev Thakur (Argonne National Laboratory, USA)
    Parallel applications benefit considerably from the rapid advance of processor architectures and the massive computational capability available, but their performance suffers from the large latency of I/O accesses. Poor I/O performance has been identified as a critical cause of the low sustained performance of parallel systems. Collective I/O is widely considered a critical solution that exploits the correlation among I/O accesses from multiple processes of a parallel application and optimizes I/O performance. However, the conventional collective I/O strategy makes its optimization decisions based on the logical file layout, to avoid multiple file system calls, and does not take the physical data layout into consideration. Yet it is the physical data layout that decides the actual I/O access locality and concurrency. In this study, we propose a new collective I/O strategy that is aware of the underlying physical data layout. We confirm that the new Layout-Aware Collective I/O (LACIO) effectively improves the performance of current parallel I/O systems with the help of noncontiguous file system calls. It holds promise for improving the I/O performance of parallel systems. (A standard collective I/O call, the interface such a strategy optimizes beneath, is sketched after this session's list.)
  • Using Shared Memory to Accelerate MapReduce on Graphics Processing Units Authors: Feng Ji (North Carolina State University, USA); Xiaosong Ma (NC State University, USA)
    Modern General Purpose Graphics Processing Units (GPGPUs) provide high degrees of parallelism in computation and memory access, making them suitable for data-parallel applications such as those using the elastic MapReduce model. Yet designing a MapReduce framework for GPUs faces significant challenges brought by their multi-level memory hierarchy. Due to the absence of atomic operations in earlier generations of GPUs, existing GPU MapReduce frameworks have problems handling input/output data with varied or unpredictable sizes. Also, existing frameworks mostly utilize a single level of memory, i.e., the relatively spacious yet slow global memory. In this work, we explore the potential benefit of enabling a GPU MapReduce framework to use multiple levels of the GPU memory hierarchy. We propose a novel GPU data staging scheme for MapReduce workloads, tailored to the GPU memory hierarchy. Centering on the efficient utilization of the fast but very small shared memory, we designed and implemented a GPU MapReduce framework whose key techniques include (1) shared memory staging area management, (2) thread-role partitioning, and (3) intra-block thread synchronization. We carried out an evaluation with five popular MapReduce workloads and studied their performance under different GPU memory usage choices. Our results reveal that exploiting GPU shared memory is highly promising for the Map phase (with an average 2.85x speedup over using global memory only), while in the Reduce phase the benefit of using shared memory is much less pronounced, due to the high input-to-output ratio. In addition, when compared to Mars, an existing GPU MapReduce framework, our system brings a significant speedup in the Map/Reduce phases. (The staging idea is sketched, as a CPU analogy, after this session's list.)
  • Unified Signatures for Improving Performance in Transactional Memory Authors: Woojin Choi (University of Southern California/Information Sciences Institute, USA); Jeffrey Draper (University of Southern California/Information Sciences Institute, USA)
    Transactional Memory (TM) promises to increase programmer productivity by making it easier to write correct parallel programs. In fulfilling this goal, a TM system should maximize its performance with limited hardware resources. Conflict detection is an essential element for maintaining correctness among concurrent transactions in a TM system. Hardware signatures have been proposed as an area-efficient method for detecting conflicts. However, signatures can degrade TM performance by falsely declaring conflicts. Hence, increasing the quality of signatures within a given hardware budget is a crucial issue for TM to be adopted as a mainstream programming model. In this paper, we propose a simple and effective signature design: the unified signature. Instead of using separate read- and write-signatures, as is often done in TM systems, we implement a single signature to track all read- and write-accesses. By merging the read- and write-signatures, a unified signature effectively enlarges the signature size without additional overhead. Within the constraints of a given hardware budget, a TM system with a unified signature outperforms a baseline system with the same hardware budget by reducing the number of falsely detected conflicts. Even though the unified signature scheme incurs read-after-read dependencies, we show that these false dependencies do not negate the benefit of unified signatures for practical signature sizes. A TM system with 2K-bit unified signatures achieves an average speedup of 22% over baseline TM systems. (A Bloom-filter signature sketch follows this session's list.)
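
For the H-Code talk: the XOR read-modify-write at the heart of any parity-based partial stripe write, in simplified form. H-Code's actual contribution is the parity layout described above, not this primitive; CHUNK is a placeholder element size.

    #include <stddef.h>
    #include <stdint.h>

    #define CHUNK 4096   /* bytes per stripe element (placeholder) */

    /* new_parity = old_parity XOR old_data XOR new_data.
     * Updating one data element costs a read of the old data, a read of
     * the old parity, and writes of both -- once per parity chain the
     * element belongs to (row parity and anti-diagonal parity). */
    void parity_update(uint8_t *parity, const uint8_t *old_data,
                       const uint8_t *new_data)
    {
        for (size_t i = 0; i < CHUNK; i++)
            parity[i] ^= old_data[i] ^ new_data[i];
    }

    /* Two *contiguous* elements sharing one row parity chain, as H-Code
     * guarantees, need only one such row-parity update chain instead of
     * two disjoint ones -- the source of the partial-stripe savings. */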
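
For the LACIO talk: an ordinary MPI-IO collective write, the interface beneath which a layout-aware strategy like LACIO operates. This sketch is standard MPI-IO, not LACIO itself; the file name and block size are placeholders.

    #include <mpi.h>

    #define BLOCK 1048576   /* 1 MiB per rank (placeholder) */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        static char buf[BLOCK];
        for (int i = 0; i < BLOCK; i++) buf[i] = (char)rank;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        /* Collective: all ranks participate in one call, so the library
         * can aggregate and reorder requests globally -- the decision a
         * layout-aware strategy refines using physical data layout. */
        MPI_File_write_at_all(fh, (MPI_Offset)rank * BLOCK, buf, BLOCK,
                              MPI_CHAR, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }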
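
For the GPU MapReduce talk: a CPU-side analogy of the shared-memory staging idea (the real framework runs inside CUDA thread blocks). Map emissions land in a small, fast staging buffer and spill to the large "global" output only when it fills; all names and sizes here are our own.

    #include <stdio.h>
    #include <string.h>

    #define STAGE_CAP  64        /* tiny fast buffer (like GPU shared memory) */
    #define GLOBAL_CAP 100000    /* large slow buffer (like global memory)    */

    typedef struct { int key; int val; } KV;

    static KV  stage[STAGE_CAP];
    static int stage_n;
    static KV  global_out[GLOBAL_CAP];
    static int global_n;

    static void flush_stage(void)          /* the only "slow" path */
    {
        memcpy(global_out + global_n, stage, stage_n * sizeof(KV));
        global_n += stage_n;
        stage_n = 0;
    }

    static void emit(int key, int val)     /* called by the map function */
    {
        if (stage_n == STAGE_CAP)
            flush_stage();
        stage[stage_n].key = key;
        stage[stage_n].val = val;
        stage_n++;
    }

    int main(void)
    {
        const char *text = "to be or not to be";
        /* Toy map: emit (word length, 1) for each word. */
        for (int i = 0, len = 0; ; i++) {
            if (text[i] == ' ' || text[i] == '\0') {
                if (len) emit(len, 1);
                len = 0;
                if (!text[i]) break;
            } else len++;
        }
        flush_stage();
        printf("emitted %d pairs\n", global_n);
        return 0;
    }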
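
For the unified signatures talk: a minimal Bloom-filter signature in software, our sketch rather than the paper's hardware design. A unified signature dedicates one bit vector to both read- and write-set addresses, so the same budget yields a larger filter and fewer false conflicts, at the cost of also flagging read-after-read overlaps. The hash functions are placeholders.

    #include <stdint.h>
    #include <stdbool.h>

    #define SIG_BITS 2048   /* 2K-bit signature, as in the paper's results */

    typedef struct { uint8_t bits[SIG_BITS / 8]; } sig_t;

    /* Two cheap hashes over a cache-block address (placeholders). */
    static unsigned h1(uintptr_t a) { return (unsigned)((a >> 6) % SIG_BITS); }
    static unsigned h2(uintptr_t a) { return (unsigned)(((a >> 6) * 2654435761u) % SIG_BITS); }

    /* Unified use: every transactional load OR store inserts here. */
    static void sig_insert(sig_t *s, uintptr_t addr)
    {
        s->bits[h1(addr) / 8] |= 1u << (h1(addr) % 8);
        s->bits[h2(addr) / 8] |= 1u << (h2(addr) % 8);
    }

    /* Membership test; may return a false positive (false conflict),
     * never a false negative. */
    static bool sig_member(const sig_t *s, uintptr_t addr)
    {
        return ((s->bits[h1(addr) / 8] >> (h1(addr) % 8)) & 1) &&
               ((s->bits[h2(addr) / 8] >> (h2(addr) % 8)) & 1);
    }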

SESSION 4: Runtime Systems

  • A Study of Speculative Distributed Scheduling on the Cell/B.E. Authors: Pieter Bellens (Barcelona Supercomputing Center, Spain); Josep M. Perez (Barcelona Supercomputing Center, Spain); Rosa M. Badia (Barcelona Supercomputing Center, Spain)
    Star Superscalar's (StarSs) programming model converts a sequential application in C or Fortran into an efficient parallel program. The resulting parallel code is highly dynamic in the sense that data analysis and task scheduling occur at runtime, while the application executes. In this paper we compare this approach to the strategies adopted by other multi-core programming environments. The price to pay for dynamic scheduling and dependence tracking is higher runtime overhead. We propose a distributed scheduler for Task Dependence Graphs (TDGs) to attenuate the scheduling cost in heterogeneous multi-core architectures. This scheduler allows the cores to speculatively select tasks from a conservative estimate of the TDG. In case of conflicts or lack of tasks, a lightweight centralized scheduler services the faulting core, after which the latter resumes its participation in the distributed scheme. Experiments with Cell Superscalar (CellSs) on a representative set of benchmarks demonstrate the reduction in runtime overhead achieved by the distributed scheduler. This reduction carries over directly to a performance improvement for a large fraction of the benchmarks. (A dependence-count scheduling skeleton follows this session's list.)
  • Exploiting Data Similarity to Reduce Memory Footprints Authors: Susmit Biswas (Lawrence Livermore National Laboratory, USA); Bronis R. de Supinski (Lawrence Livermore National Laboratory, USA)
    Memory size has long limited large-scale applications on high-performance computing (HPC) systems. Since compute nodes frequently do not have swap space, physical memory often limits problem sizes. Increasing core counts per chip and power-density constraints, which limit the number of DIMMs per node, have exacerbated this problem. Further, DRAM constitutes a significant portion of overall HPC system cost. Therefore, instead of adding more DRAM to the nodes, mechanisms to manage memory usage more efficiently, preferably transparently, could increase the effective DRAM capacity and thus the benefit of multicore nodes for HPC systems. MPI application processes often exhibit significant data similarity: data regions that occupy multiple physical locations across the individual rank processes within a multicore node and thus offer potential savings in memory capacity. These regions, primarily residing in the heap, are dynamic, which makes them difficult to manage statically. Our novel memory allocation library, SBLLmalloc, automatically identifies identical memory blocks and merges them into a single copy. Our implementation is transparent to the application and does not require any kernel modifications. Overall, we demonstrate that SBLLmalloc reduces the memory footprint of a range of MPI applications by 32.03% on average and up to 60.87%. Further, SBLLmalloc supports problem sizes for IRS over 21.36% larger than standard memory management techniques allow, thus significantly increasing effective system size. Similarly, SBLLmalloc requires 43.75% fewer nodes than standard memory management techniques to solve an AMG problem. (A duplicate-page detection sketch follows this session's list.)
  • The Evaluation of an Effective Out-of-core Run-Time System in the Context of Parallel Mesh Generation Authors: Andriy Kot (College of William and Mary, USA); Andrey N Chernikov (College of William and Mary, USA); Nikos Chrisochoides (College of William and Mary, USA)
    We present an out-of-core run-time system that supports effective parallel computation of large irregular and adaptive problems, in particular parallel unstructured mesh generation (PUMG). PUMG is a highly challenging application due to intensive memory accesses, unpredictable communication patterns, and variable and irregular data dependencies reflecting the unstructured spatial connectivity of mesh elements. Our runtime system allows the footprint of parallel applications to be transformed from wide and shallow into narrow and deep by extending memory utilization to the out-of-core level. It simplifies and streamlines the development of otherwise highly time-consuming out-of-core applications, as well as the conversion of existing applications. It utilizes the disk, network, and memory hierarchy to achieve high utilization of computing resources without sacrificing performance for PUMG. The runtime system combines different programming paradigms: multi-threading within the nodes using an industrial-strength software framework, one-sided active messages among the nodes, and an out-of-core subsystem for managing large datasets. We performed an evaluation on traditional parallel platforms to stress-test all layers of the run-time system, using three different PUMG methods with significantly varying communication and synchronization patterns. We demonstrate high overlap among computation, communication, and disk I/O, which results in good performance when computing large out-of-core problems. The runtime system adds small overhead (at most 18% on most configurations) when computing in-core, which means performance is not compromised. (A minimal out-of-core block manager is sketched after this session's list.)
  • Enriching 3-D video games on multicores Authors: Romain Cledat (Georgia Institute of Technology, USA); Tushar Kumar (Georgia Institute of Technology, USA); Jaswanth Sreeram (Georgia Institute of Technology, USA)
    The introduction of multicore processors on desktops and other personal computing platforms has given rise to multiple interesting end-user application possibilities. One important trend is the increased presence of resource-hungry applications like gaming and multimedia. A key distinguishing factor of these applications is that they are amenable to variable semantics (i.e., multiple acceptable results), unlike traditional applications where a fixed, unique answer is expected. For example, varying degrees of image processing improve picture quality, and different model complexities in game physics allow different degrees of realism during game play. The goal of this paper is to demonstrate that applications with scalable semantics, such as video games, can be enriched with optional tasks that are launched opportunistically and thus adapt to the amount of resources available at runtime. We propose a C/C++ API that allows the programmer to define how the current semantics of a program can be opportunistically enriched, as well as the underlying runtime system that orchestrates the different computations. We show how this infrastructure can be used to enrich the well-known game Quake 3. Our results show that it is possible to perform significant enrichment without degrading the application's performance, by utilizing additional cores. (A toy optional-task API is sketched after this session's list.)
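
For the StarSs scheduling talk: a single-threaded skeleton of dependence-count task scheduling over a task dependence graph, the substrate on which the paper's speculative distributed scheme (with its centralized fallback) is built. The tiny diamond TDG is our own example.

    #include <stdio.h>

    #define MAX_TASKS 8
    #define MAX_SUCC  4

    typedef struct {
        int deps;                 /* unfinished predecessors        */
        int nsucc;
        int succ[MAX_SUCC];       /* indices of dependent tasks     */
        const char *name;
    } task_t;

    static task_t tdg[MAX_TASKS];
    static int ready[MAX_TASKS], nready;

    static void complete(int t)   /* run a task, release successors */
    {
        printf("run %s\n", tdg[t].name);
        for (int i = 0; i < tdg[t].nsucc; i++) {
            int s = tdg[t].succ[i];
            if (--tdg[s].deps == 0)
                ready[nready++] = s;   /* successor is now schedulable */
        }
    }

    int main(void)
    {
        /* Tiny diamond TDG: A -> {B, C} -> D */
        tdg[0] = (task_t){0, 2, {1, 2}, "A"};
        tdg[1] = (task_t){1, 1, {3},    "B"};
        tdg[2] = (task_t){1, 1, {3},    "C"};
        tdg[3] = (task_t){2, 0, {0},    "D"};
        ready[nready++] = 0;
        while (nready > 0)
            complete(ready[--nready]);
        return 0;
    }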
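
For the SBLLmalloc talk: a sketch of the detection half of content-based merging, our illustration rather than SBLLmalloc itself. It hashes each page of a region, then memcmps hash-equal pages to find identical ones that a merging allocator could collapse into a single copy-on-write page.

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define PAGE   4096
    #define NPAGES 256

    /* FNV-1a hash of one page, used to cheaply group candidates. */
    static uint64_t page_hash(const unsigned char *p)
    {
        uint64_t h = 1469598103934665603ull;
        for (int i = 0; i < PAGE; i++) { h ^= p[i]; h *= 1099511628211ull; }
        return h;
    }

    int main(void)
    {
        static unsigned char region[NPAGES][PAGE];   /* stand-in for a heap */
        static uint64_t h[NPAGES];
        for (int i = 0; i < NPAGES; i++)
            memset(region[i], i, PAGE);              /* make pages distinct */
        memset(region[7], 3, PAGE);                  /* pages 3 and 7 now identical */

        int dups = 0;
        for (int i = 0; i < NPAGES; i++)
            h[i] = page_hash(region[i]);
        for (int i = 0; i < NPAGES; i++)
            for (int j = i + 1; j < NPAGES; j++)
                if (h[i] == h[j] && memcmp(region[i], region[j], PAGE) == 0)
                    dups++;                          /* mergeable pair found */
        printf("duplicate page pairs: %d\n", dups);  /* prints 1 */
        return 0;
    }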
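
For the out-of-core runtime talk: a minimal out-of-core block manager, our sketch of the general technique the abstract describes; the paper's runtime layers multithreading and one-sided messaging on top of such a subsystem. A fixed set of in-memory slots caches disk blocks; touching an uncached block evicts a victim with pwrite() and reloads with pread(). The file name and sizes are placeholders.

    #include <fcntl.h>
    #include <unistd.h>

    #define BLK    4096
    #define SLOTS  4           /* in-core slots */
    #define BLOCKS 16          /* total dataset */

    static int  fd;
    static char slot[SLOTS][BLK];
    static int  resident[SLOTS];      /* block held by each slot, -1 if none */
    static int  hand;

    static char *get_block(int b)     /* in-core pointer for block b */
    {
        for (int s = 0; s < SLOTS; s++)
            if (resident[s] == b) return slot[s];
        int s = hand++ % SLOTS;       /* trivial round-robin eviction */
        if (resident[s] >= 0)         /* write victim back to disk */
            pwrite(fd, slot[s], BLK, (off_t)resident[s] * BLK);
        pread(fd, slot[s], BLK, (off_t)b * BLK);
        resident[s] = b;
        return slot[s];
    }

    int main(void)
    {
        fd = open("backing.dat", O_RDWR | O_CREAT, 0600);
        ftruncate(fd, (off_t)BLOCKS * BLK);
        for (int s = 0; s < SLOTS; s++) resident[s] = -1;
        for (int b = 0; b < BLOCKS; b++)   /* touch 16 blocks via 4 slots */
            get_block(b)[0] = (char)b;
        for (int s = 0; s < SLOTS; s++)    /* flush what is still resident */
            if (resident[s] >= 0)
                pwrite(fd, slot[s], BLK, (off_t)resident[s] * BLK);
        close(fd);
        return 0;
    }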
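
For the game enrichment talk: a toy version of an optional-task API in the spirit of the abstract. The names and signatures here (enrich_try, workers_free) are hypothetical, invented for illustration, not the paper's actual C/C++ API. An optional task runs only if a spare worker exists; the frame never blocks on it.

    #include <pthread.h>
    #include <unistd.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int workers_free = 2;          /* spare cores this frame */

    static void *runner(void *fn_)
    {
        void (*fn)(void) = (void (*)(void))fn_;
        fn();
        pthread_mutex_lock(&lock);
        workers_free++;                   /* worker is free again */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    /* Launch fn only if a spare worker exists; report whether it ran. */
    static int enrich_try(void (*fn)(void))
    {
        pthread_mutex_lock(&lock);
        if (workers_free == 0) { pthread_mutex_unlock(&lock); return 0; }
        workers_free--;
        pthread_mutex_unlock(&lock);
        pthread_t t;
        pthread_create(&t, NULL, runner, (void *)fn);
        pthread_detach(t);
        return 1;
    }

    static void fancy_particles(void) { usleep(1000); }   /* optional eye candy */

    int main(void)
    {
        for (int frame = 0; frame < 100; frame++) {
            /* ... mandatory game update and render here ... */
            if (!enrich_try(fancy_particles)) {
                /* skipped this frame: semantics degrade gracefully */
            }
            usleep(16000);                /* ~60 FPS frame budget */
        }
        return 0;
    }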