TechTalks from event: IEEE IPDPS 2011

Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without logging in. For other talks, you will need to log in using the email address with which you registered for IPDPS 2011. Note 2: Many of the talks (those without a thumbnail next to their description below) are yet to be uploaded. Some of them were not recorded because of technical problems. We are working with the corresponding authors to upload self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.

SESSION 15: Distributed Systems and Networks

  • Critical Bubble Scheme: An Efficient Implementation of Globally-aware Network Flow Control Authors: Lizhong Chen (University of Southern California, USA); Ruisheng Wang (University of Southern California, USA); Timothy M. Pinkston (University of Southern California, USA)
    Network flow control mechanisms that are aware of global conditions can potentially achieve higher performance than flow control mechanisms that are only locally aware. Owing to high implementation overhead, globally-aware flow control mechanisms in their purest form are seldom adopted in practice, leading to less efficient simplified implementations. In this paper, we propose an efficient implementation of a globally-aware flow control mechanism, called the Critical Bubble Scheme, and apply it successfully to k-ary n-cube networks for the general class of buffer occupancy-based network flow control techniques. Simulation results show that the proposed scheme can reduce the buffer access portion of packet latency by as much as 77%, leading to much lower average packet latency at medium and high network loads while sustaining an 11% throughput improvement after network saturation.
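    As a reading aid only: the abstract's central idea, as we understand it, is to enforce a global "keep at least one buffer free in the ring" invariant through purely local checks, by marking one free buffer slot as critical. Below is a minimal single-ring sketch of that gating rule; the class and method names are ours, and in the real scheme the critical-bubble marker migrates between routers as packets advance.

      # Toy single-ring model of injection gating under a critical bubble.
      class RingChannel:
          def __init__(self, num_slots):
              self.free_slots = num_slots
              self.holds_critical_bubble = True   # one marked free slot per ring

          def can_inject(self):
              # Local check standing in for the global invariant: a newly
              # injected packet must never consume the last (critical) free slot.
              if self.holds_critical_bubble:
                  return self.free_slots > 1
              return self.free_slots > 0

          def inject(self):
              assert self.can_inject()
              self.free_slots -= 1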
  • A Scalable Reverse Lookup Scheme using Group-based Shifted Declustering Layout Authors: Junyao Zhang (University of Central Florida, USA); Pengju Shang (University of Central Florida, USA); Jun Wang (University of Central Florida, USA)
    Recent years have witnessed an increasing demand for super data clusters, which have reached the petabyte scale and can consist of thousands or tens of thousands of storage nodes at a single site. For this architecture, reliability is becoming a great concern. In order to achieve high reliability, data recovery and node reconstruction are a must. Although extensive research has investigated how to sustain high performance and high reliability in the face of node failures at large scale, a reverse lookup problem, namely finding the list of objects stored on a failed node, remains open. This is especially true for storage systems with high requirements for data integrity and availability, such as scientific research data clusters. Existing solutions are either time-consuming or expensive. Replication-based block placement can be used to realize fast reverse lookup, but such schemes are designed for centralized, small-scale storage architectures. In this paper, we propose a fast and efficient reverse lookup scheme named Group-based Shifted Declustering (G-SD) layout that is able to locate the whole content of a failed node. G-SD extends our previous shifted declustering layout and applies to large-scale file systems. Our mathematical proofs and real-life experiments show that G-SD is a scalable reverse lookup scheme that is up to one order of magnitude faster than existing schemes.
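    The enabling observation behind fast reverse lookup is that a deterministic, computable placement function can be inverted without consulting central metadata. The sketch below uses a made-up round-robin-with-shift placement, not the actual G-SD layout, and a brute-force inverse, purely to illustrate the principle.

      # Hypothetical deterministic placement: replica r of object o in a
      # group of n nodes lands on node (o + r * shift(o)) % n.
      def placement(obj_id, replica, n):
          shift = 1 + (obj_id // n) % (n - 1)    # vary the stride per stripe
          return (obj_id + replica * shift) % n

      def reverse_lookup(failed_node, num_objects, replicas, n):
          # Placement is a pure function, so the failed node's object list
          # can be recomputed rather than read from metadata servers.
          return [(o, r) for o in range(num_objects)
                         for r in range(replicas)
                         if placement(o, r, n) == failed_node]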
  • Deadlock-Free Oblivious Routing for Arbitrary Topologies Authors: Jens Domke (TU Dresden, Germany); Torsten Hoefler (University of Illinois at Urbana-Champaign, USA); Wolfgang E. Nagel (TU Dresden, Germany)
    Efficient deadlock-free routing strategies are crucial to the performance of large-scale computing systems. Many methods exist, but it remains a challenge to achieve the lowest latency and highest bandwidth for irregular or unstructured high-performance networks. We investigate a novel routing strategy based on the single-source-shortest-path routing algorithm and extend it to use virtual channels to guarantee deadlock-freedom. We show that this algorithm achieves minimal latency and high bandwidth with only a small number of virtual channels and can be implemented in practice. We demonstrate that the problem of finding the minimal number of virtual channels needed to route a general network deadlock-free is NP-complete, and we propose different heuristics to solve the problem. We implement all proposed algorithms in the Open Subnet Manager of InfiniBand and compare the number of needed virtual channels and the bandwidths of multiple real and artificial network topologies that are established in practice. Our approach allows the existing virtual channels to be used more effectively, increasing the effective bandwidth by up to a factor of two. Application benchmarks show an improvement of up to 95%. Our routing scheme is not limited to InfiniBand; it can be deployed on existing InfiniBand installations to increase network performance transparently, without modifications to the user applications.
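    To picture the virtual-channel step: compute the routes, build the channel dependency graph they induce, and move any route whose dependencies would close a cycle onto a higher virtual channel. The greedy first-fit sketch below (using networkx) is one plausible heuristic of this kind, not necessarily one of the heuristics the paper proposes.

      import networkx as nx

      def assign_virtual_channels(routes, max_vc):
          """routes: {route_id: [n0, n1, ...]}; returns {route_id: vc}."""
          layers = [nx.DiGraph() for _ in range(max_vc)]
          vc_of = {}
          for rid, path in routes.items():
              channels = list(zip(path, path[1:]))
              deps = list(zip(channels, channels[1:]))   # channel dependencies
              for vc, layer in enumerate(layers):
                  added = [d for d in deps if not layer.has_edge(*d)]
                  layer.add_edges_from(added)
                  if nx.is_directed_acyclic_graph(layer):
                      vc_of[rid] = vc                    # acyclic: deadlock-free
                      break
                  layer.remove_edges_from(added)         # cycle: try next layer
              else:
                  raise RuntimeError(f"more than {max_vc} virtual channels needed")
          return vc_of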
  • RDMA Capable iWARP over Datagrams Authors: Ryan E Grant (Queen's University, Canada); Mohammad J Rashti (Queen's University, Canada); Ahmad Afsahi (Queen's University, Canada)
    iWARP is a state-of-the-art high-speed connection-based RDMA networking technology for Ethernet networks, providing InfiniBand-like zero-copy and one-sided communication capabilities over Ethernet. Despite the benefits offered by iWARP, many datacenter and web-based applications that rely on datagram-based semantics (mostly through UDP/IP), such as stock-market trading and media-streaming applications, cannot take advantage of it because the iWARP standard is defined only over reliable, connection-oriented transports. This paper presents an RDMA model that functions over both reliable and unreliable datagrams. The ability to use datagrams significantly expands the application space served by iWARP and can bring the scalability advantages of a connectionless transport to iWARP. In our previous work, we developed an iWARP datagram solution using send/receive semantics, showing excellent memory scalability and performance benefits over the current TCP-based iWARP. In this paper, we demonstrate an improved iWARP design that provides true RDMA semantics over datagrams. Specifically, because traditional RDMA semantics do not map well to unreliable communication, we propose RDMA Write-Record, the first and only method capable of supporting RDMA Write over both unreliable and reliable datagrams. We demonstrate through a proof-of-concept software implementation that datagram-iWARP is feasible for real-world applications. Our proposed RDMA Write-Record method has been designed with data loss in mind and can provide superior performance under packet loss. Micro-benchmarks show that RDMA-capable datagram-iWARP achieves up to a 256% increase in large-message bandwidth and up to a 24.4% improvement in small-message latency over traditional iWARP. For application results we focus on streaming applications, showing a 24% improvement in memory usage and up to a 74% improvement in performance, although the proposed approach is also applicable to the HPC domain.
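    Going only by the abstract, the key departure from an ordinary RDMA Write is that over an unreliable transport the receiver must learn which writes actually arrived. A toy model of that receiver-side bookkeeping follows; the class and method names are hypothetical, not the paper's API.

      # Toy receiver model: each datagram carries (offset, payload); the
      # receiver applies the write and records the range that landed, so
      # the application can detect holes left by lost datagrams.
      class WriteRecordRegion:
          def __init__(self, size):
              self.mem = bytearray(size)
              self.arrived = []                 # (offset, length) records

          def on_datagram(self, offset, payload):
              self.mem[offset:offset + len(payload)] = payload
              self.arrived.append((offset, len(payload)))

          def valid_ranges(self):
              # Polled by the application to see which regions are usable.
              return sorted(self.arrived)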

SESSION 16: Programming Environments and Tools

  • Reconciling Sampling and Direct Instrumentation for Unintrusive Call-Path Profiling of MPI Programs Authors: Zoltan Szebenyi (Jülich Supercomputing Centre, Germany); Todd Gamblin (Lawrence Livermore National Laboratory, USA); Martin Schulz (Lawrence Livermore National Laboratory, USA)
    We can profile the performance behavior of parallel programs at the level of individual call paths through sampling or direct instrumentation. While we can easily control measurement dilation by adjusting the sampling frequency, the statistical nature of sampling and the difficulty of accessing the parameters of sampled events make it unsuitable for obtaining certain communication metrics, such as the size of message payloads. Alternatively, direct instrumentation, which is preferable for capturing message-passing events, can excessively dilate measurements, particularly for C++ programs, which often have many short but frequently called class member functions. We therefore combine these techniques in a unified framework that exploits the strengths of each approach while avoiding their weaknesses: we use direct instrumentation to intercept MPI routines while recording the execution of the remaining code through low-overhead sampling. One of the main technical hurdles we mastered was the inexpensive and portable determination of call-path information during the invocation of MPI routines. We show that the overhead of our implementation is sufficiently low to support substantial performance improvement of a C++ fluid-dynamics code.
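    In miniature, the hybrid looks like a sampling timer covering ordinary code plus wrappers around the communication routines. The Python sketch below is only an analogy (the paper's tool works on native MPI codes), and the record_mpi_event sink is a stub of ours.

      import collections
      import signal

      samples = collections.Counter()

      def on_sample(signum, frame):
          # Statistical part: record the current call path at each timer tick.
          path = []
          while frame is not None:
              path.append(frame.f_code.co_name)
              frame = frame.f_back
          samples[tuple(reversed(path))] += 1

      signal.signal(signal.SIGPROF, on_sample)
      signal.setitimer(signal.ITIMER_PROF, 0.01, 0.01)   # ~100 Hz sampling

      def record_mpi_event(name, args):
          # Stub sink; a real tool would attach the current call path and
          # message parameters (e.g., payload size) to the event here.
          samples[("MPI", name)] += 1

      def instrument(mpi_call):
          # Direct part: intercept MPI routines to capture exact arguments
          # that sampling cannot observe.
          def wrapper(*args, **kwargs):
              record_mpi_event(mpi_call.__name__, args)
              return mpi_call(*args, **kwargs)
          return wrapper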
  • A Practical Approach for Performance Analysis of Shared Memory Programs Authors: Bogdan Marius Tudor (National University of Singapore, Singapore); Yong Meng Teo (National University of Singapore, Singapore)
    Parallel programming has transcended HPC and gone mainstream, enabled by a growing number of programming models, languages and methodologies, as well as the availability of multicore systems. However, performance analysis of parallel programs is still difficult, especially for large and complex programs, or for applications developed using different programming models. This paper proposes a simple analytical model for studying the speedup of shared-memory programs on multicore systems. The proposed model derives the speedup and speedup loss from data dependency and memory overhead for various configurations of threads, cores and memory access policies in UMA and NUMA systems. The model is practical because it uses only generally available and non-intrusive inputs derived from the trace of the operating system run-queue and from hardware event counters. Using six OpenMP HPC dwarfs from the NPB benchmark, our model differs from measurement results by 9% on average on UMA and 11% on NUMA. Our analysis shows that speedup loss is dominated by memory contention, especially for larger problem sizes. For the worst-performing structured-grid dwarf on UMA, memory contention accounts for up to 99% of the speedup loss. Based on this insight, we apply our model to determine the optimal number of cores that alleviates memory contention, maximizing speedup and reducing execution time.
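    Purely to illustrate the shape of such a model: an Amdahl-style speedup with an added memory-contention term. The functional form and coefficients below are our assumptions, not the paper's equations, which are driven by run-queue traces and hardware counters.

      def predicted_speedup(n_cores, serial_frac, contention_per_core):
          # Normalized time: serial part + parallel part + a contention
          # penalty that grows with the number of active cores.
          t = (serial_frac
               + (1.0 - serial_frac) / n_cores
               + contention_per_core * n_cores)
          return 1.0 / t

      # Example: 10% serial work, small contention coefficient.
      for n in (1, 2, 4, 8, 16):
          print(n, round(predicted_speedup(n, 0.10, 0.005), 2))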
  • Single Node On-Line Simulation of MPI Applications with SMPI Authors: Pierre-Nicolas Clauss (Nancy University, France); Mark Lee Stillwell (INRIA, France); Stéphane Genaud (University of Strasbourg, France)
    Simulation is a popular approach for predicting the performance of MPI applications on platforms that are not at one's disposal. It is also a way to teach the principles of parallel programming and high-performance computing to students without access to a parallel computer. In this work we present SMPI, a simulator for MPI applications that uses on-line simulation, i.e., the application is executed but part of the execution takes place within a simulation component. SMPI simulations account for network contention in a fast and scalable manner. SMPI also implements an original and validated piecewise linear model for data transfer times between cluster nodes. Finally, SMPI simulations of large-scale applications on large-scale platforms can be executed on a single node thanks to techniques that reduce the simulation's compute time and memory footprint. These contributions are validated via a large set of experiments in which SMPI is compared to popular MPI implementations to assess its accuracy, scalability, and speed.
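    The piecewise-linear transfer-time idea can be sketched as a per-segment latency/bandwidth pair selected by message size. The breakpoints and coefficients below are made-up placeholders; SMPI calibrates its segments against real measurements.

      # Each segment: (max_size_bytes, latency_seconds, seconds_per_byte).
      SEGMENTS = [
          (4096,          5e-6, 1.0e-9),    # small-message regime
          (65536,        15e-6, 0.8e-9),    # medium
          (float("inf"), 60e-6, 0.6e-9),    # large / rendezvous regime
      ]

      def transfer_time(size_bytes, segments=SEGMENTS):
          for max_size, latency, per_byte in segments:
              if size_bytes <= max_size:
                  return latency + size_bytes * per_byte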
  • Patus: A Code Generation and Autotuning Framework For Parallel Iterative Stencil Computations on Modern Microarchitectures Authors: Matthias Christen (University of Basel, Switzerland); Olaf Schenk (University of Basel, Switzerland); Helmar Burkhart (University of Basel, Switzerland)
    Stencil calculations comprise an important class of kernels in many scientific computing applications, ranging from simple PDE solvers to constituent kernels in multigrid methods as well as image processing applications. In such solvers, stencil kernels are often the dominant part of the computation, and an efficient parallel implementation of the kernel is therefore crucial to reducing the time to solution. However, on today's complex hardware microarchitectures, meticulous architecture-specific tuning is required to elicit the machine's full compute power. We present PATUS, a code generation and autotuning framework for stencil computations targeted at multi- and manycore processors such as multicore CPUs and graphics processing units. It makes it possible to generate compute kernels from a specification of the stencil operation and a parallelization and optimization strategy, and it leverages the autotuning methodology to optimize strategy-dependent parameters for the given hardware architecture.
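    For readers unfamiliar with the kernel class: below is the naive reference form of a 5-point 2-D Jacobi stencil, the kind of loop nest such a framework takes as a specification and regenerates with architecture-specific blocking, vectorization and unrolling, whose parameters the autotuner then searches. Plain Python, shown only for the access pattern.

      def jacobi_step(u, u_new, nx, ny):
          # 5-point stencil: each interior point becomes the average of its
          # four neighbors; u and u_new are ny-by-nx nested lists.
          for j in range(1, ny - 1):
              for i in range(1, nx - 1):
                  u_new[j][i] = 0.25 * (u[j - 1][i] + u[j + 1][i]
                                        + u[j][i - 1] + u[j][i + 1])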

SESSION 1: Resource Management

  • Power-aware replica placement and update strategies in tree networks Authors: Anne Benoit (ENS Lyon, France); Paul Renaud-Goud (LIP, ENS Lyon, France); Yves Robert (ENS Lyon, France)
    This paper deals with optimal strategies for placing replicas in tree networks, with the twin objectives of minimizing the total cost of the servers and/or optimizing power consumption. The client requests are known beforehand, and some servers are assumed to pre-exist in the tree. Without power consumption constraints, the total cost is an arbitrary function of the number of existing servers that are reused and of the number of new servers. Whenever creating and operating a new server costs more than reusing an existing one (a very natural assumption), cost-optimal strategies have to trade off between reusing resources and load-balancing requests on new servers. We provide an optimal dynamic programming algorithm that returns the optimal cost, thereby extending known results for the case without pre-existing servers. With power consumption constraints, we assume that servers operate under a set of M different modes depending upon the number of requests that they have to process. In practice M is a small number, typically 2 or 3, depending upon the number of allowed voltages. Power consumption includes a static part, proportional to the total number of servers, and a dynamic part, proportional to a constant exponent of the server mode, which depends upon the power model. The cost function becomes more complicated: it takes into account reuse and creation as before, but also upgrading or downgrading an existing server from one mode to another. We show that with an arbitrary number of modes, the power minimization problem is NP-complete, even without a cost constraint and without static power. Still, we provide an optimal dynamic programming algorithm that returns the minimal power given a threshold value on the total cost; it has exponential complexity in the number of modes M, and its practical usefulness is limited to small values of M. Nevertheless, experiments conducted with this algorithm show that it can process large trees in reasonable time, despite its worst-case complexity.
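    To make the dynamic-programming flavor concrete, here is a toy skeleton for a heavily simplified variant: one server type of capacity W, a per-node placement cost, no pre-existing servers and no power modes (all of which the paper does handle). Each subtree returns a table mapping "requests forwarded up to the ancestors" to the minimum cost achieving it.

      def solve(node, children, requests, W, cost):
          # Returns {residual_up: min_cost} for the subtree rooted at node.
          table = {requests.get(node, 0): 0}
          for child in children.get(node, []):
              child_tab = solve(child, children, requests, W, cost)
              merged = {}
              for r1, c1 in table.items():
                  for r2, c2 in child_tab.items():
                      r, c = r1 + r2, c1 + c2
                      if c < merged.get(r, float("inf")):
                          merged[r] = c
              table = merged
          # Option: place one server here, absorbing up to W requests.
          best = dict(table)
          for r, c in table.items():
              r2, c2 = r - min(r, W), c + cost[node]
              if c2 < best.get(r2, float("inf")):
                  best[r2] = c2
          return best                # at the root, best[0] is the optimum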
  • Minimum Cost Resource Allocation for meeting job requirements Authors: Venkatesan T Chakaravarthy (IBM Research - India, India); Sambuddha Roy (IBM Research - India, India); Yogish Sabharwal (IBM Research - India, India)
    We consider the problem of allocating resources for completing a collection of jobs. Each resource is specified by a start-time, a finish-time, the capacity of resource available, and an associated cost; each job is specified by a start-time, a finish-time, and the amount of the resource required (its demand) during this interval. A feasible solution is a multiset of resources (i.e., multiple units of each resource may be picked) such that at any point in time, the sum of the capacities offered by the resources is at least the total demand of the jobs active at that point in time. The cost of the solution is the sum of the costs of the resources included in the solution (taking into account the units of the resources). The goal is to find a feasible solution of minimum cost. This problem arises naturally in many scenarios. For example, given a set of jobs, we would like to allocate resources such as machines, memory or bandwidth in order to complete all the jobs. This problem generalizes a covering version of the knapsack problem, which is known to be NP-hard. We present a constant-factor approximation algorithm for this problem based on a primal-dual approach.
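    As a stand-in for intuition only, here is a greedy coverage-per-cost heuristic for the same covering problem; it is not the paper's primal-dual algorithm and carries no approximation guarantee as written.

      def cover(demand, resources):
          # demand: required capacity per time slot;
          # resources: (start, end, capacity, cost) tuples, end exclusive.
          # Multiple units of each resource may be picked (a multiset).
          residual = list(demand)
          chosen = []
          while any(residual):
              def ratio(res):
                  s, e, cap, cost = res
                  return sum(min(cap, residual[t]) for t in range(s, e)) / cost
              best = max(resources, key=ratio)
              if ratio(best) == 0:
                  raise ValueError("demand cannot be covered")
              s, e, cap, _ = best
              for t in range(s, e):
                  residual[t] = max(0, residual[t] - cap)
              chosen.append(best)
          return chosen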
  • Power and Performance Management in Priority-type Cluster Computing Systems Authors: Kaiqi Xiong (North Carolina State University, USA)
    Cluster computing not only improves performance but also increases power consumption. It is a challenge to increase the performance of a cluster computing system while simultaneously reducing its power consumption. In this paper, we consider a collection of cluster computing resources owned by a service provider that hosts an enterprise application for multiple classes of business customers, whose requests are distinguished by different request characteristics and service requirements. We start by developing a method to compute the average end-to-end delay and the average energy consumption for multiple customer classes in such an application. We then present approaches for optimizing the average end-to-end delay subject to a constraint on average energy consumption, and for optimizing the average end-to-end energy consumption subject to constraints on the average end-to-end delay, for all classes together and for each class individually. Moreover, a service provider processes customer service requests according to a service level agreement (SLA), a contract agreed between a customer and the provider. It has become important and commonplace to prioritize multiple customer services in favor of customers who are willing to pay higher fees. We propose an approach for minimizing the total cost of the cluster computing resources allocated by the service provider to ensure multiple-priority customer service guarantees. Our simulations demonstrate that the proposed approaches are efficient and accurate for power management and performance guarantees in priority-type cluster computing systems.
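    As a back-of-envelope illustration of the delay-versus-cost trade-off: size a server pool with an M/M/1-style approximation until the mean response time meets a class's SLA. Everything below, including treating the pool as one fast server, is a simplification of ours; the paper models priorities, multiple classes and energy jointly.

      def servers_for_sla(arrival_rate, service_rate, delay_sla):
          # Smallest n whose pooled capacity meets the SLA, treating the
          # pool as a single fast server of rate n * service_rate.
          n = 1
          while True:
              capacity = n * service_rate
              if capacity > arrival_rate:
                  mean_delay = 1.0 / (capacity - arrival_rate)   # M/M/1 form
                  if mean_delay <= delay_sla:
                      return n
              n += 1

      # Example: 900 req/s arriving, 100 req/s per server, 50 ms SLA -> 10.
      print(servers_for_sla(900.0, 100.0, 0.05))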
  • Willow: A Control System For Energy And Thermal Adaptive Computing Authors: Krishna Kant (National Science Foundation, USA); Muthukumar Murugan (University of Minnesota, USA); David Du (University of Minnesota, USA)
    The increasing energy demand, coupled with emerging sustainability concerns, requires a re-examination of power/thermal issues in data centers from the perspective of short-term energy deficiencies. Such energy-deficient scenarios arise for a variety of reasons, including variable energy supply from renewable sources and inadequate power, thermal and cooling capacities. In this paper we propose a hierarchical control scheme that adapts assignments of tasks to servers in a way that can cope with varying energy limitations while still providing the necessary QoS. The rescheduling of tasks on different servers has direct (migration-related) and indirect (changed traffic patterns) network energy impacts that we also consider. We show the stability of our scheme and evaluate its performance via detailed simulations and experiments.
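    The hierarchical structure can be pictured as an outer loop apportioning the currently available energy among server groups and inner loops shedding or migrating work when a group exceeds its share. The proportional-share policy and all names below are our illustration, not the paper's controller.

      def apportion_budget(groups, total_budget):
          # Outer level: split the available power in proportion to demand.
          demand = sum(g.power_demand() for g in groups)
          for g in groups:
              g.enforce(total_budget * g.power_demand() / demand)

      class ServerGroup:
          def __init__(self, tasks):
              self.tasks = tasks                 # each task: (priority, power)

          def power_demand(self):
              return sum(power for _, power in self.tasks)

          def enforce(self, budget):
              # Inner level: keep high-priority tasks; defer the rest (a real
              # system would migrate them to a group with spare budget).
              self.tasks.sort(key=lambda t: t[0], reverse=True)
              while self.tasks and self.power_demand() > budget:
                  self.tasks.pop()               # defer lowest-priority task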