TechTalks from event: IEEE IPDPS 2011

Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without requiring log-in. For other talks, you will need to log-in using the email you registered for IPDPS 2011. Note 2: Many of the talks (those without a thumbnail next to the their description below) are yet to be uploaded. Some of them were not recorded because of technical problems. We are working with the corresponding authors to upload the self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.

SESSION 4: Runtime Systems

  • A Study of Speculative Distributed Scheduling on the Cell/B.E. Authors: Pieter Bellens (Barcelona Supercomputing Center, Spain); Josep M. Perez (Barcelona Supercomputing Center, Spain); Rosa M. Badia
    Star Superscalar’s (StarSs) programming model converts a sequential application in C or Fortran into an ef?cient parallel program. The resulting parallel code is highly dynamic in the sense that data analysis and task scheduling occur at runtime, while the application executes. In this paper we compare this approach to the strategy adopted by other multi-core programming environments. The prize to pay for dynamic scheduling and dependence tracking is higher runtime overhead. We propose a distributed scheduler for Task Dependence Graphs (TDGs) to attenuate the scheduling cost in heterogeneous multi-core architectures. This scheduler allows the cores to speculatively select tasks from a conservative estimate of the TDG.In case of con?icts or lack of tasks a lightweight centralized scheduler services the faulting core after which the latter resumes its participation in the distributed scheme. Experiments with Cell Superscalar (CellSs) on a representative set of benchmarks demonstrate the reduction in runtime overhead achieved by the distributed scheduler. This reduction in runtime overhead carries over directly to a performance improvement for a large fraction of the benchmarks.
  • Exploiting Data Similarity to Reduce Memory Footprints Authors: Susmit Biswas (Lawrence Livermore National Laboratory, USA); Bronis R. de Supinski (Lawrence Livermore National Laboratory, USA
    Memory size has long limited large-scale applications on high-performance computing (HPC) systems. Since compute nodes frequently do not have swap space, physical memory often limits problem sizes. Increasing core counts per chip and power density constraints, which limit the number of DIMMs per node, have exacerbated this problem. Further, DRAM constitutes a signi?cant portion of overall HPC system cost. Therefore, instead of adding more DRAM to the nodes, mechanisms to manage memory usage more ef?ciently —preferably transparently— could increase effective DRAM capacity and thus the bene?t of multicore nodes for HPC systems. MPI application processes often exhibit signi?cant data similarity. These data regions occupy multiple physical locations across the individual rank processes within a multicore node and thus offer a potential savings in memory capacity. These regions, primarily residing in heap, are dynamic, which makes them dif?cult to manage statically. Our novel memory allocation library, SBLLmallocShort, automatically identi?es identical memory blocks and merges them into a single copy. Our implementation is transparent to the application and does not require any kernel modi?cations. Overall, we demonstrate that SBLLmalloc reduces the memory footprint of a range of MPI applications by 32:03% on average and up to 60:87%. Further, SBLLmalloc supports problem sizes for IRS over 21:36% larger than using standard memory management techniques, thus signi?cantly increasing effective system size. Similarly, SBLLmalloc requires 43:75% fewer nodes than standard memory management techniques to solve an AMG problem.
  • The Evaluation of an Effective Out-of-core Run-Time System in the Context of Parallel Mesh Generation Authors: Andriy Kot (College of William and Mary, USA); Andrey N Chernikov (College of William and Mary, USA); Nikos Chrisochoides (Coll
    We present an out-of-core run-time system that supports effective parallel computation of large irregular and adaptive problems, in particular unstructured mesh generation (PUMG). PUMG is a highly challenging application due to intensive memory accesses, unpredictable communication patterns, and variable and irregular data dependencies re?ecting the unstructured spatial connectivity of mesh elements. Our runtime system allows to transform the footprint of parallel applications from wide and shallow into narrow and deep by extending the memory utilization to the out-of-core level. It simpli?es and streamlines the development of otherwise highly time consuming out-of-core applications as well as the converting of existing applications. It utilizes disk, network and memory hierarchy to achieve high utilization of computing resources without sacri?cing performance with PUMG. The runtime system combines different programming paradigms: multi-threading within the nodes using industrial strength software framework, one-sided active messages among the nodes, and an out-of-core subsystem for managing large datasets. We performed an evaluation on traditional parallel platforms to stress test all layers of the run-time system using three different PUMG methods with signi?cantly varying communication and synchronization patterns. We demonstrated high overlap in computation, communication, and disk I/O which results in good performance when computing large out-of-core problems. The runtime system adds very small overhead (up to 18% on most con?gurations) when computing in-core which means performance is not compromised.
  • Enriching 3-D video games on multicores Authors: Romain Cledat (Georgia Institute of Technology, USA); Tushar Kumar (Georgia Institute of Technology, USA); Jaswanth Sreeram (Ge
    The introduction of multicore processors on desktops and other personal computing platforms has given rise to multiple interesting end-user application possibilities. One important trend is the increased presence of resource hungry applications like gaming and multimedia applications. One of the key distinguishing factors of these applications is that they are amenable to variable semantics (ie, multiple possibilities of results) unlike traditional applications wherein a ?xed, unique answer is expected. For example, varying degrees of image processing improves picture quality; different model complexities used in game physics allow different degrees of realism during game play, and so on. The goal of this paper is to demonstrate that scalable semantics in applications such as video games can be enriched with optional tasks that can be launched and thus adapt to the amount of available resources at runtime. We propose a C/C++ API that allows the programmer to de?ne how the current semantics of a program can be opportunistically enriched, as well as the underlying runtime system that orchestrates the different computations We show how this infrastructure can be used to enrich a well known game called Quake 3. Our results show that it is possible to perform signi?cant enrichment without degrading the application’s performance by utilizing additional cores.

SESSION 20: Operating Systems and Resource Management

  • Reducing fragmentation on torus-connected supercomputers Authors: Wei Tang (Illinois Institute of Technology, USA); Zhiling Lan (Illinois Institute of Technology, USA); Narayan Desai (Argonne N
    Torus-based networks are prevalent on leadership-class petascale systems, providing a good balance between network cost and performance. The major disadvantage of this network architecture is its susceptibility to fragmentation. Many studies have attempted to reduce resource fragmentation in this architecture. Although the approaches suggested can make good allocation decisions reducing fragmentation at job start time, none of them considers a job’s walltime, which can cause resource fragmentation when neighboring jobs do not complete closely. In this paper, we propose a walltime-aware job allocation strategy, which adjacently packs jobs that ?nish around the same time, in order to minimize resource fragmentation caused by job length, discrepancy. Event-driven simulations using real job traces from a production Blue Gene/P system at Argonne National Laboratory demonstrate that our walltime-aware strategy can effectively reduce system fragmentation and improve overall system performance.
  • Co-Analysis of RAS Log and Job Log on Blue Gene/P Authors: Ziming Zheng (Illinois Institute of Technology, USA); Li Yu (Illinois Institute of Technology, USA); Wei Tang (Illinois Institu
    With the growth of system size and complexity, reliability has become of paramount importance for petascale systems. Reliability, Availability, and Serviceability (RAS) logs have been commonly used for failure analysis. However, analysis based on just the RAS logs has proved to be insuf?cient in understanding failures and system behaviors. To overcome the limitation of this existing methodologies, we analyze the Blue Gene/P RAS logs and the Blue Gene/P job logs in a cooperative manner. From our co-analysis effort, we have identi?ed a dozen important observations about failure characteristics and job interruption characteristics on the Blue Gene/P systems. These observations can signi?cantly facilitate the research in fault resilience of large-scale systems.
  • Decal: Transparent Checkpointing and Process Migration of OpenCL Applications Authors: Hiroyuki Takizawa (Tohoku University, Japan); Kentaro Koyama (Tohoku University, Japan); Katsuto Sato (Tohoku University, Japan
    In this paper, we propose a new transparent checkpoint/restart (CPR) tool, named CheCL, for high-performance and dependable GPU computing. CheCL can perform CPR on an OpenCL application program without any modi?cation and recompilation of its code. A conventional checkpointing system fails to checkpoint a process if the process uses OpenCL. Therefore, in CheCL, every API call is forwarded to another process called an API proxy, and the API proxy invokes the API function; two processes, an application process and an API proxy, are launched for an OpenCL application. In this case, as the application process is not an OpenCL process but a standard process, it can be safely checkpointed. While CheCL intercepts all API calls, it records the information necessary for restoring OpenCL objects. The application process does not hold any OpenCL handles, but CheCL handles to keep such information. Those handles are automatically converted to OpenCL handles and then passed to API functions. Upon restart, OpenCL objects are automatically restored based on the recorded information. This paper demonstrates the feasibility of transparent checkpointing of OpenCL programs including MPI applications, and quantitatively evaluates the runtime overheads. It is also discussed that CheCL can enable process migration of OpenCL applications among distinct nodes, and among different kinds of compute devices such as a CPU and a GPU.

SESSION 5: Routing and Communication

  • On Nonblocking Folded-Clos Networks in Computer Communication Environments Authors: Xin Yuan (Florida State University, USA)
    Folded-Clos networks, also referred to as fat-trees, have been widely used as interconnects in large scale high performance computing clusters. The switching capability of such interconnects in computer communication environments, however, is not well understood. In particular, the concept of nonblocking interconnects, which is often used by system vendors, has only been studied in the telephone communication environment with the assumption of a centralized controller. Such “nonblocking”networks do not support nonblocking communications in computer communication environments where the network control is distributed. This paper theoretically analyzes the conditions for folded-Clos networks to achieve nonblocking communications in computer communication environments with various routing schemes including deterministic routing and adaptive routing, and establishes nonblocking conditions.
  • vFtree - A Fat-tree Routing Algorithm using Virtual Lanes to Alleviate Congestion Authors: Wei Lin Guay (Simula Research Laboratory, Norway); Bartosz Bogdanski (Simula Research Laboratory, Norway); Sven-Arne Reinemo (S
    It is a well known fact that multiple virtual lanes can improve performance in interconnection networks, but this knowledge has had little impact on real clusters. Currently, a large number of clusters using In?niBand is based on fat-tree topologies that can be routed deadlock-free using only one virtual lane. Consequently, all the remaining virtual lanes are left unused. In this paper we suggest an enhancement to the fat-tree algorithm that utilizes virtual lanes to improve performance when hot-spots are present. Even though the bisection bandwidth in a fat-tree is constant, hot-spots are still possible and they will degrade performance for ?ows not contributing to them due to head-of-line blocking. Such a situation may be alleviated through adaptive routing or congestion control, however, these methods are not yet readily available in In?niBand technology. To remedy this problem, we have implemented an enhanced fat-tree algorithm in OpenSM that distributes traf?c across all available virtual lanes without any con?guration needed. We evaluated the performance of the algorithm on a small cluster and did a large-scale evaluation through simulations. In a congested environment, results show that we are able to achieve throughput increases up to 38% on a small cluster and from 221% to 757% depending on the hot-spot scenario for a 648-port simulated cluster.
  • Measuring Temporal Lags in Delay-Tolerant Networks Authors: Arnaud Casteigts (University of Ottawa, Canada); Paola Flocchini (University of Ottawa, Canada); Bernard Mans (Macquarie Univer
    Delay-tolerant networks (DTNs) are characterized by a possible absence of end-to-end communication routes at any instant. In most cases, however, a form of connectivity can be established over time and space. This particularity leads to consider the relevance of a given route not only in terms of hops (topological length), but also in terms of time (temporal length). The problem of measuring temporal distances between individuals in a social network was recently addressed, based on a posteriori analysis of interaction traces. This paper focuses on the distributed version of this problem, asking whether every node in a network can know precisely and in real time how out-of-date it is with respect to every other. Answering af?rmatively is simple when contacts between the nodes are punctual, using the temporal adaptation of vector clocks provided in (Kossinets et al., 2008). It becomes more dif?cult when contacts have a duration and can overlap in time with each other. We demonstrate that the problem remains solvable with arbitrarily long contacts and non-instantaneous (though invariant and known) propagation delays on edges. This is done constructively by extending the temporal adaptation of vector clocks to non-punctual causality. The second part of the paper discusses how the knowledge of temporal lags could be used as a building block to solve more concrete problems, such as the construction of foremost broadcast trees or network backbones in periodically-varying DTNs.