TechTalks from event: IEEE IPDPS 2011

Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without logging in. For other talks, you will need to log in using the email you registered for IPDPS 2011. Note 2: Many of the talks (those without a thumbnail next to their description below) are yet to be uploaded. Some of them were not recorded because of technical problems. We are working with the corresponding authors to upload the self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.

SESSION 11: Multiprocessing and Concurrency

  • Hardware-based Job Queue Management for Manycore Architectures and OpenMP Environments Authors: Junghee Lee (Georgia Institute of Technology, USA); Chrysostomos Nicopoulos (University of Cyprus, Cyprus); Yongjae Lee (Georgi
    The seemingly interminable dwindle of technology feature sizes well into the nano-scale regime has afforded computer architects with an abundance of computational resources on a single chip. The Chip Multi-Processor (CMP) paradigm is now seen as the de facto architecture for years to come. However, in order to efficiently exploit the increasing number of on-chip processing cores, it is imperative to achieve and maintain efficient utilization of the resources at run time. Uneven and skewed distribution of workloads misuses the CMP resources and may even lead to such undesired effects as traffic and temperature hotspots. While existing techniques rely mostly on software for the undertaking of load balancing duties and exploit hardware mainly for synchronization, we will demonstrate that there are wider opportunities for hardware support of load balancing in CMP systems. Based on this fact, this paper proposes IsoNet, a conflict-free dynamic load distribution engine that exploits hardware aggressively to reinforce massively parallel computation in manycore settings. Moreover, the proposed architecture provides extensive fault-tolerance against both CPU faults and intra-IsoNet faults. The hardware takes charge of both (1) the management of the list of jobs to be executed, and (2) the transfer of jobs between processing elements to maintain load balance. Experimental results show that, unlike the existing popular techniques of blocking and job stealing, IsoNet is scalable with as many as 1024 processing cores.
  • HK-NUCA: Boosting Data Searches in Dynamic Non-Uniform Cache Architectures for Chip Multiprocessors Authors: Javier Lira (Universitat Politècnica de Catalunya, Spain); Carlos Molina (Universitat Rovira i Virgili, Spain); Antonio
    The exponential increase in the cache sizes of multicore processors (CMPs), accompanied by growing on-chip wire delays, makes it difficult to implement traditional caches with single and uniform access latencies. Non-Uniform Cache Architecture (NUCA) designs have been proposed to address this problem. NUCA divides the whole cache memory into smaller banks and allows nearer cache banks to have lower access latencies than farther banks, thus mitigating the effects of the cache’s internal wires. Traditionally, NUCA organizations have been classified as static (S-NUCA) and dynamic (D-NUCA). While in S-NUCA a data block is mapped to a unique bank in the NUCA cache, D-NUCA allows a data block to be mapped in multiple banks. Besides, D-NUCA designs are dynamic in the sense that data blocks may migrate towards the cores that access them most frequently. Recent works consider D-NUCA a promising design; however, in order to obtain significant performance benefits, they used a non-affordable access scheme to find data in the NUCA cache. In this paper, we propose a novel and implementable data search algorithm for D-NUCA designs in CMP architectures, which is called HK-NUCA (Home Knows where to find data within the NUCA cache). It exploits migration features by providing fast and power-efficient accesses to data located close to the requesting core. Moreover, HK-NUCA implements an efficient and cost-effective search mechanism to reduce miss latency and on-chip network contention. We show that using HK-NUCA as the data search mechanism in a D-NUCA design reduces the energy consumed per memory request by about 40%, and achieves an average performance improvement of 6%.
  • Power Token Balancing: Adapting CMPs to Power Constraints for Parallel Multithreaded Workloads Authors: Juan M. Cebrián (University of Murcia, Spain); Juan L. Aragón (University of Murcia, Spain); Stefanos Kaxiras (Un
    In recent years virtually all processor architectures employ multiple cores per chip (CMPs). It is possible to use legacy (i.e., single-core) power saving techniques in CMPs which run either sequential applications or independent multithreaded workloads. However, new challenges arise when running parallel shared-memory applications. In the latter case, sacrificing some performance in a single core (thread) in order to be more energy-efficient might unintentionally delay the rest of the cores (threads) due to synchronization points (locks/barriers), thereby harming the performance of the whole application. CMPs increasingly face thermal and power-related problems during their typical use. Such problems can be solved by setting a power budget for the processor/core. This paper initially studies the behavior of different techniques to match a predefined power budget in a CMP processor. While legacy techniques work properly for thread-independent/multi-programmed workloads, parallel workloads exhibit the problem of independently adapting the power of each core in a thread-dependent scenario. In order to solve this problem we propose a novel mechanism, Power Token Balancing (PTB), aimed at accurately matching an external power constraint by balancing the power consumed among the different cores using a power token-based approach while optimizing the energy efficiency. We can use power (seen as tokens or coupons) from non-critical threads for the benefit of critical threads. PTB runs transparently for thread-independent/multiprogrammed workloads and can also be used as a spinlock detector based on power patterns. Results show that PTB matches a predefined power budget more accurately (total energy consumed over the budget is reduced to 8% for a 16-core CMP) than DVFS with only a 3% energy increase. Finally, we can trade accuracy on matching the power budget for energy efficiency, reducing energy by 4% with 20% accuracy.
  • A Very Fast Simulator For Exploring The Many-Core Future Authors: Olivier Certner (INRIA, France); Zheng Li (INRIA, France); Arun Raman (Princeton University, USA); Olivier Temam (INRIA Futurs,
    Although multi-core architectures with a large number of cores (“many-cores”) are considered the future of computing systems, there are currently few practical tools to quickly explore both their design and general program scalability. In this paper, we present SiMany, a discrete-event-based many-core simulator able to support more than a thousand cores while being orders of magnitude faster than existing flexible approaches. One of the difficult challenges for a reasonably realistic many-core simulation is to model faithfully the potentially high concurrency a program can exhibit. SiMany uses a novel virtual time synchronization technique, called spatial synchronization, to achieve this goal in a completely local and distributed fashion, which diminishes interactions and preserves locality. Compared to previous simulators, it raises the level of abstraction by focusing on modeling concurrent interactions between cores, which enables fast coarse comparisons of high-level architecture design choices and parallel programs performance. Sequential pieces of code are executed natively for maximal speed. We exercise the simulator with a set of dwarf-like task-based benchmarks with dynamic control flow and irregular data structures. Scalability results are validated through comparison with a cycle-level simulator up to 64 cores. They are also shown consistent with well-known benchmark characteristics. We finally demonstrate how SiMany can be used to efficiently compare the benchmarks’ behavior over a wide range of architectural organizations, such as polymorphic architectures and network of clusters.
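
The Power Token Balancing abstract above describes moving power "tokens" from non-critical threads to critical ones under a global budget. The following is only a toy sketch of that idea, not the mechanism from the paper: the function name `balance_power_tokens`, the equal per-core share, and the rule that a core is "critical" when its demand exceeds that share are all assumptions made for illustration.

```python
def balance_power_tokens(demand, budget):
    """Toy token balancer (hypothetical; not the PTB hardware mechanism).

    demand: power each core would like to draw in this interval.
    budget: global power budget shared by all cores.
    Returns the power granted to each core; the total never exceeds budget.
    """
    n = len(demand)
    share = budget / n  # assumption: every core starts with an equal share

    # Each core first gets what it asks for, capped at its own share.
    grant = [min(d, share) for d in demand]

    # Cores below their share donate the unused tokens to a common pool.
    spare = sum(share - g for g in grant)

    # Cores above their share (treated here as "critical") still need more.
    need = [d - g for d, g in zip(demand, grant)]
    total_need = sum(need)

    if total_need > 0:
        # Hand out the pool proportionally to the remaining need.
        scale = min(1.0, spare / total_need)
        grant = [g + nd * scale for g, nd in zip(grant, need)]
    return grant


if __name__ == "__main__":
    print(balance_power_tokens([10, 30, 50, 10], budget=80))
```

In this sketch the two lightly loaded cores keep what they asked for, and their unused share is split between the two heavily loaded cores without the total exceeding the budget.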

SESSION 21: Numerical Algorithms

  • QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators Authors: Emmanuel Agullo (INRIA / LaBRI, France); Cédric Augonnet (LaBRI / University of Bordeaux / INRIA Bordeaux Sud-Ouest, Fra
    One of the major trends in the design of exascale architectures is the use of multicore nodes enhanced with GPU accelerators. Exploiting all resources of a hybrid accelerators-based node at their maximum potential is thus a fundamental step towards exascale computing. In this article, we present the design of a highly efficient QR factorization for such a node. Our method is in three steps. The first step consists of expressing the QR factorization as a sequence of tasks of well chosen granularity that will aim at being executed on a CPU core or a GPU. We show that we can efficiently adapt high-level algorithms from the literature that were initially designed for homogeneous multicore architectures. The second step consists of designing the kernels that implement each individual task. We use CPU kernels from previous work and present new kernels for GPUs that complement kernels already available in the MAGMA library. We show the impact on performance of these GPU kernels. In particular, we present the benefits of new hybrid CPU/GPU kernels. The last step consists of scheduling these tasks on the computational units. We present two alternative approaches, respectively based on static and dynamic scheduling. In the case of static scheduling, we exploit the a priori knowledge of the schedule to perform successive optimizations leading to very high performance. We, however, highlight the lack of portability of this approach and its limitations to relatively simple algorithms on relatively homogeneous nodes. Alternatively, by relying on an efficient runtime system, StarPU, in charge of ensuring data availability and coherency, we can schedule more complex algorithms on complex heterogeneous nodes with much higher productivity. In this latter case, we show that we can achieve high performance in a portable way thanks to a fine interaction between the application and the runtime system. We demonstrate that the obtained performance is very close to the theoretical upper bounds that we obtained using Linear Programming.
  • Two-Stage Tridiagonal Reduction for Dense Symmetric Matrices using Tile Algorithms on Multicore Architectures Authors: Piotr Luszczek (University of Tennessee, USA); Hatem Ltaief (University of Tennessee, USA); Jack Dongarra (University of Tennes
    While successful implementations have already been written for one-sided transformations (e.g., QR, LU and Cholesky factorizations) on multicore architecture, getting high performance for two-sided reductions (e.g., Hessenberg, tridiagonal and bidiagonal reductions) is still an open and difficult research problem due to expensive memory-bound operations occurring during the panel factorization. The processor-memory speed gap continues to widen, which has even further exacerbated the problem. This paper focuses on an efficient implementation of the tridiagonal reduction, which is the first algorithmic step toward computing the spectral decomposition of a dense symmetric matrix. The original matrix is translated into a tile layout, i.e., a high performance data representation, which substantially enhances data locality. Following a two-stage approach, the tile matrix is then transformed into band tridiagonal form using compute intensive kernels. The band form is further reduced to the required tridiagonal form using a left-looking bulge chasing technique to reduce memory traffic and memory contention. A dependence translation layer associated with a dynamic runtime system allows for scheduling and overlapping tasks generated from both stages. The obtained tile tridiagonal reduction significantly outperforms the state-of-the-art numerical libraries (10X against multithreaded LAPACK with optimized MKL BLAS and 2.5X against the commercial numerical software Intel MKL) from medium to large matrix sizes.
  • An Auto-tuned Method for Solving Large Tridiagonal Systems on the GPU Authors: Andrew Davidson (University of California, Davis, USA); Yao Zhang (University of California, Davis, USA); John D. Owens (Univer
    We present a multi-stage method for solving large tridiagonal systems on the GPU. Previously, large tridiagonal systems could not be efficiently solved due to the limitation of on-chip shared memory size. We tackle this problem by splitting the systems into smaller ones and then solving them on-chip. The multi-stage characteristic of our method, together with various workloads and GPUs of different capabilities, obligates an auto-tuning strategy to carefully select the switch points between computation stages. In particular, we show two ways to effectively prune the tuning space and thus avoid an impractical exhaustive search: (1) apply algorithmic knowledge to decouple tuning parameters, and (2) estimate search starting points based on GPU architecture parameters. We demonstrate that auto-tuning is a powerful tool that improves the performance by up to 5x, saves 17% and 32% of execution time on average respectively over static and dynamic tuning, and enables our multi-stage solver to outperform the Intel MKL tridiagonal solver on many parallel tridiagonal systems by 6-11x.
  • A communication-avoiding, hybrid-parallel, rank-revealing orthogonalization method Authors: Mark Hoemmen (Sandia National Laboratories, USA)
    Orthogonalization consumes much of the run time of many iterative methods for solving sparse linear systems and eigenvalue problems. Commonly used algorithms, such as variants of Gram-Schmidt or Householder QR, have performance dominated by communication. Here, “communication” includes both data movement between the CPU and memory, and messages between processors in parallel. Our Tall Skinny QR (TSQR) family of algorithms requires asymptotically fewer messages between processors and data movement between CPU and memory than typical orthogonalization methods, yet achieves the same accuracy as Householder QR factorization. Furthermore, in block orthogonalizations, TSQR is faster and more accurate than existing approaches for orthogonalizing the vectors within each block (“normalization”). TSQR’s rank-revealing capability also makes it useful for detecting deflation in block iterative methods, for which existing approaches sacrifice performance, accuracy, or both. We have implemented a version of TSQR that exploits both distributed-memory and shared-memory parallelism, and supports real and complex arithmetic. Our implementation is optimized for the case of orthogonalizing a small number (5–20) of very long vectors. The shared-memory parallel component uses Intel’s Threading Building Blocks, though its modular design supports other shared-memory programming models as well, including computation on the GPU. Our implementation achieves speedups of 2 times or more over competing orthogonalizations. It is available now in the development branch of the Trilinos software package, and will be included in the 10.8 release.
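
The auto-tuned tridiagonal solver abstract above mentions splitting a large system into smaller ones that are then solved independently. The abstract does not name the splitting technique, so the sketch below substitutes one step of parallel cyclic reduction (PCR) as an illustration, followed by the sequential Thomas algorithm on each half. The function names are hypothetical, the code is plain sequential Python rather than GPU code, and it assumes a diagonally dominant system with `a[0] == 0` and `c[-1] == 0`.

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system with sub-, main and super-diagonals a, b, c."""
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):  # forward elimination
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def pcr_step(a, b, c, d):
    """One PCR step: afterwards equation i couples only x[i-2], x[i], x[i+2]."""
    n = len(b)
    na, nb, nc, nd = [0.0] * n, list(b), [0.0] * n, list(d)
    for i in range(n):  # every i is independent, hence parallelisable
        if i > 0:  # eliminate x[i-1] using equation i-1
            k1 = a[i] / b[i - 1]
            na[i] = -a[i - 1] * k1
            nb[i] -= c[i - 1] * k1
            nd[i] -= d[i - 1] * k1
        if i < n - 1:  # eliminate x[i+1] using equation i+1
            k2 = c[i] / b[i + 1]
            nc[i] = -c[i + 1] * k2
            nb[i] -= a[i + 1] * k2
            nd[i] -= d[i + 1] * k2
    return na, nb, nc, nd


def solve_by_splitting(a, b, c, d):
    """Split into even- and odd-indexed subsystems, then solve each one."""
    na, nb, nc, nd = pcr_step(a, b, c, d)
    x = [0.0] * len(b)
    for start in (0, 1):
        idx = list(range(start, len(b), 2))
        if not idx:
            continue
        sub = thomas([na[i] for i in idx], [nb[i] for i in idx],
                     [nc[i] for i in idx], [nd[i] for i in idx])
        for i, value in zip(idx, sub):
            x[i] = value
    return x
```

After the PCR step the even-indexed and odd-indexed unknowns no longer depend on each other, so the two half-size systems could be handed to separate workers; repeating the step would yield four quarter-size systems, and so on.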

SESSION 22: Fault Tolerance

  • Flease - Lease Coordination Without a Lock Server Authors: Björn Kolbeck (Zuse Institute Berlin, Germany); Mikael Högqvist (Zuse Institute Berlin, Germany); Jan Stender (Zuse I
    Large-scale distributed systems often require scalable and fault-tolerant mechanisms to coordinate exclusive access to shared resources such as files, replicas or the primary role. The best known algorithms to implement distributed mutual exclusion with leases, such as Multipaxos, are complex, difficult to implement, and rely on stable storage to persist lease information. In this paper we present FLEASE, an algorithm for fault-tolerant lease coordination in distributed systems that is simpler than Multipaxos and does not rely on stable storage. The evaluation shows that FLEASE can be used to implement scalable, decentralized lease coordination that outperforms a central lock service implementation by an order of magnitude.
  • Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications Authors: Amina Guermouche (University of Paris South 11, France); Thomas Ropars (INRIA, France); Elisabeth Brunet (Télécom
    As reported by many recent studies, the mean time between failures of future post-petascale supercomputers is likely to decrease compared to the current situation. The most popular fault tolerance approach for MPI applications on HPC platforms relies on coordinated checkpointing, which raises two major issues: a) global restart wastes energy since all processes are forced to roll back even in the case of a single failure; b) checkpoint coordination may slow down the application execution because of congestion on I/O resources. Alternative approaches based on uncoordinated checkpointing and message logging require logging all messages, imposing a high memory/storage occupation and a significant overhead on communications. It has recently been observed that many MPI HPC applications are send-deterministic, which allows the design of new fault tolerance protocols. In this paper, we propose an uncoordinated checkpointing protocol for send-deterministic MPI HPC applications that (i) logs only a subset of the application messages and (ii) does not require systematically restarting all processes when a failure occurs. We first describe our protocol and prove its correctness. Through experimental evaluations, we show that its implementation in MPICH2 has a negligible overhead on application performance. Then we perform a quantitative evaluation of the properties of our protocol using the NAS Benchmarks. Using a clustering approach, we demonstrate that this protocol actually succeeds in combining the two expected properties: a) it logs only a small fraction of the messages and b) it reduces by a factor approaching 2 the average number of processes to roll back compared to coordinated checkpointing.
  • Minimal Obstructions for the Coordinated Attack Problem and Beyond Authors: Tristan Fevat (Aix-Marseille Université, France); Emmanuel Godard (Pims, Cnrs Umi, France)
    We consider the well-known Coordinated Attack Problem, where two generals have to decide on a common attack when their messengers can be captured by the enemy. Informally, this problem represents the difficulty of reaching agreement in the presence of communication faults. We consider here only omission faults (loss of message), but contrary to previous studies, we do not restrict the way messages can be lost, i.e., we use no specific failure metric. Our contribution is threefold. First, we introduce the study of arbitrary patterns of failure (“omission schemes”), proposing notions and notations that proved very convenient to handle. In the large subclass of omission schemes where the double simultaneous omission can never happen, we characterize which ones are obstructions for the Coordinated Attack Problem. We then present some interesting applications. We show for the first time that the well-studied omission scheme, where at most one message can be lost at each round, is a kind of least worst case environment for the Coordinated Attack Problem. We also extend our study to networks of arbitrary size. In particular, we address an open question of Santoro and Widmayer about the Consensus Problem in communication networks with omission faults.
  • Scheduling Parallel Iterative Applications on Volatile Resources Authors: Henri Casanova (University of Hawaii at Manoa, USA); Fanny Dufossé (LIP, ENS Lyon, France); Yves Robert (ENS Lyon, Franc
    In this paper we study the execution of iterative applications on volatile processors such as those found on desktop grids. We develop master-worker scheduling schemes that attempt to achieve good trade-offs between worker speed and worker availability. A key feature of our approach is that we consider a communication model where the bandwidth capacity of the master for sending application data to workers is limited. This limitation makes the scheduling problem more difficult both in a theoretical sense and in a practical sense. Furthermore, we consider that a processor can be in one of three states: available, down, or temporarily preempted by its owner. This preempted state also complicates the scheduling problem. In practical settings, e.g., desktop grids, master bandwidth is limited and processors are temporarily reclaimed. Consequently, addressing the aforementioned difficulties is necessary for successfully deploying master-worker applications on volatile platforms. Our first contribution is to determine the complexity of the scheduling problem in its off-line version, i.e., when processor availability behaviors are known in advance. Even with this knowledge, the problem is NP-hard, and cannot be approximated within a factor 8/7. Our second contribution is a closed-form formula for the expectation of the time needed by a worker to complete a set of tasks. This formula relies on a Markovian assumption for the temporal availability of processors, and is at the heart of some heuristics that aim at favoring “reliable” processors in a sensible manner. Our third contribution is a set of heuristics, which we evaluate in simulation. Our results provide guidance to selecting the best strategy as a function of processor state availability versus average task duration.
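
The last abstract above models each processor as a three-state Markov chain (available, temporarily preempted, down) and derives a closed-form expected completion time. That formula is not given in the abstract, so the sketch below only estimates the same kind of quantity by Monte Carlo simulation. The state names, the one-unit-of-work-per-step rule, and the assumption that a processor going down loses all progress on the current set of tasks are illustrative choices, not taken from the paper.

```python
import random

UP, RECLAIMED, DOWN = "up", "reclaimed", "down"


def next_state(state, trans, rng):
    """Draw the next state from the transition probabilities of `state`."""
    r, acc = rng.random(), 0.0
    for nxt, p in trans[state].items():
        acc += p
        if r < acc:
            return nxt
    return nxt  # guard against floating-point rounding in the probabilities


def completion_time(trans, work_units, rng, max_steps=1_000_000):
    """Time steps one worker needs to finish `work_units`, starting available."""
    state, done, t = UP, 0, 0
    while done < work_units:
        if t >= max_steps:
            raise RuntimeError("work did not finish; is DOWN absorbing?")
        t += 1
        if state == UP:
            done += 1  # one unit of work per step while available
        elif state == DOWN:
            done = 0  # assumption: a crash loses all progress
        # RECLAIMED: the owner is using the machine, progress is paused
        state = next_state(state, trans, rng)
    return t


def expected_completion_time(trans, work_units, runs=2000, seed=0):
    """Monte Carlo estimate of the mean completion time."""
    rng = random.Random(seed)
    total = sum(completion_time(trans, work_units, rng) for _ in range(runs))
    return total / runs


if __name__ == "__main__":
    flaky = {
        UP: {UP: 0.90, RECLAIMED: 0.08, DOWN: 0.02},
        RECLAIMED: {UP: 0.50, RECLAIMED: 0.50},
        DOWN: {UP: 0.30, DOWN: 0.70},
    }
    print(expected_completion_time(flaky, work_units=10))
```

An estimate like this could rank workers by expected completion time rather than by raw speed, which is the trade-off between speed and availability that the abstract describes.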