TechTalks from event: IEEE IPDPS 2011

Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without logging in. For other talks, you will need to log in using the email address you registered with for IPDPS 2011. Note 2: Many of the talks (those without a thumbnail next to their description below) are yet to be uploaded. Some of them were not recorded because of technical problems. We are working with the corresponding authors to upload the self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.

SESSION 5: Routing and Communication

  • On Nonblocking Folded-Clos Networks in Computer Communication Environments Authors: Xin Yuan (Florida State University, USA)
    Folded-Clos networks, also referred to as fat-trees, have been widely used as interconnects in large scale high performance computing clusters. The switching capability of such interconnects in computer communication environments, however, is not well understood. In particular, the concept of nonblocking interconnects, which is often used by system vendors, has only been studied in the telephone communication environment with the assumption of a centralized controller. Such “nonblocking” networks do not support nonblocking communications in computer communication environments where the network control is distributed. This paper theoretically analyzes the conditions for folded-Clos networks to achieve nonblocking communications in computer communication environments with various routing schemes including deterministic routing and adaptive routing, and establishes nonblocking conditions.
  • vFtree - A Fat-tree Routing Algorithm using Virtual Lanes to Alleviate Congestion Authors: Wei Lin Guay (Simula Research Laboratory, Norway); Bartosz Bogdanski (Simula Research Laboratory, Norway); Sven-Arne Reinemo (S
    It is well known that multiple virtual lanes can improve performance in interconnection networks, but this knowledge has had little impact on real clusters. Currently, a large number of clusters using InfiniBand are based on fat-tree topologies that can be routed deadlock-free using only one virtual lane. Consequently, all the remaining virtual lanes are left unused. In this paper we suggest an enhancement to the fat-tree algorithm that utilizes virtual lanes to improve performance when hot-spots are present. Even though the bisection bandwidth in a fat-tree is constant, hot-spots are still possible, and they will degrade performance for flows not contributing to them due to head-of-line blocking. Such a situation may be alleviated through adaptive routing or congestion control; however, these methods are not yet readily available in InfiniBand technology. To remedy this problem, we have implemented an enhanced fat-tree algorithm in OpenSM that distributes traffic across all available virtual lanes without any configuration needed. We evaluated the performance of the algorithm on a small cluster and did a large-scale evaluation through simulations. In a congested environment, results show that we are able to achieve throughput increases up to 38% on a small cluster and from 221% to 757% depending on the hot-spot scenario for a 648-port simulated cluster.
  • Measuring Temporal Lags in Delay-Tolerant Networks Authors: Arnaud Casteigts (University of Ottawa, Canada); Paola Flocchini (University of Ottawa, Canada); Bernard Mans (Macquarie Univer
    Delay-tolerant networks (DTNs) are characterized by a possible absence of end-to-end communication routes at any instant. In most cases, however, a form of connectivity can be established over time and space. This particularity leads one to consider the relevance of a given route not only in terms of hops (topological length), but also in terms of time (temporal length). The problem of measuring temporal distances between individuals in a social network was recently addressed, based on a posteriori analysis of interaction traces. This paper focuses on the distributed version of this problem, asking whether every node in a network can know precisely and in real time how out-of-date it is with respect to every other. Answering affirmatively is simple when contacts between the nodes are punctual, using the temporal adaptation of vector clocks provided in (Kossinets et al., 2008). It becomes more difficult when contacts have a duration and can overlap in time with each other. We demonstrate that the problem remains solvable with arbitrarily long contacts and non-instantaneous (though invariant and known) propagation delays on edges. This is done constructively by extending the temporal adaptation of vector clocks to non-punctual causality. The second part of the paper discusses how the knowledge of temporal lags could be used as a building block to solve more concrete problems, such as the construction of foremost broadcast trees or network backbones in periodically-varying DTNs.
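
The simple case mentioned in the temporal-lags abstract above (punctual contacts) can be illustrated with a small sketch. This is not the paper's algorithm: it is a minimal, hypothetical illustration that assumes instantaneous, symmetric contacts, zero propagation delay, and contacts processed in chronological order. The names `Node`, `contact`, and `lag` are invented for this example.

```python
import math


class Node:
    """A node that records, for every other node, the latest departure
    date of information that has reached it so far (hypothetical sketch)."""

    def __init__(self, name, names):
        self.name = name
        # -inf means "nothing from that node has ever reached us".
        self.view = {v: -math.inf for v in names}

    def lag(self, other, now):
        """How out-of-date this node is with respect to `other` at time `now`."""
        if other == self.name:
            return 0
        return now - self.view[other]


def contact(a, b, t):
    """Punctual, symmetric contact between a and b at time t.

    Each node stamps its own entry with t, then both keep the
    entry-wise maximum of the two views (vector-clock style merge).
    """
    a.view[a.name] = t
    b.view[b.name] = t
    merged = {v: max(a.view[v], b.view[v]) for v in a.view}
    a.view = dict(merged)
    b.view = dict(merged)


names = ["a", "b", "c"]
nodes = {n: Node(n, names) for n in names}
contact(nodes["a"], nodes["b"], 1)  # a meets b at time 1
contact(nodes["b"], nodes["c"], 3)  # b meets c at time 3
print(nodes["c"].lag("a", 5))  # 4: the freshest news of a left a at time 1
```

In this sketch, c learns about a only through b, so its lag with respect to a at time 5 is measured from the a-b contact at time 1. Contacts with a duration, which the abstract identifies as the harder case, are deliberately out of scope here.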

SESSION 6: Self Stabilization and Security

  • A Lightweight Method for Automated Design of Convergence Authors: Ali Ebnenasir (Michigan Technological University, USA); Aly Farahat (Michigan Technological University, USA)
    Design and verification of Self-Stabilizing (SS) network protocols are difficult tasks, in part because of the requirement that an SS protocol must recover to a set of legitimate states from any state in its state space (when perturbed by transient faults). Moreover, distribution issues exacerbate the design complexity of SS protocols, as processes should take local actions that result in global recovery/convergence of a network protocol. As such, most existing design techniques focus on protocols that are locally-correctable. To facilitate the design of finite-state SS protocols (that may not necessarily be locally-correctable), this paper presents a lightweight formal method supported by a software tool that automatically adds convergence to nonstabilizing protocols. We have used our method/tool to automatically generate several SS protocols with up to 40 processes (and 3^40 states) in a few minutes on a regular PC. Surprisingly, our tool has automatically synthesized both protocols that are the same as their manually-designed versions and new solutions for well-known problems in the literature (e.g., Dijkstra’s token ring). Moreover, the proposed method has helped us reveal flaws in a manually designed SS protocol.
  • Snap-Stabilizing Committee Coordination Authors: Borzoo Bonakdarpour (University of Waterloo, Canada); Stéphane Devismes (Université Joseph Fourier, France); Fran
    In this paper, we propose two snap-stabilizing distributed algorithms for the committee coordination problem. In this problem, a committee consists of a set of processes and committee meetings are synchronized, so that each process participates in at most one committee meeting at a time. Snap-stabilization is a versatile technique that allows the design of algorithms that efficiently tolerate transient faults. Indeed, after a finite number of such faults (e.g., memory corruptions, message losses, etc.), a snap-stabilizing algorithm immediately operates correctly, without any external intervention. We design snap-stabilizing committee coordination algorithms enriched with some desirable properties related to concurrency, (weak) fairness, and a stronger synchronization mechanism called 2-Phase Discussion Time. From previous papers, we know that (1) in the general case, (weak) fairness cannot be achieved in committee coordination, and (2) it becomes feasible provided that each process waits for meetings infinitely often. Nevertheless, we show that even under this latter assumption, it is impossible to implement a fair solution that allows maximal concurrency. Hence, we propose two orthogonal snap-stabilizing algorithms, each satisfying 2-phase discussion time, and either maximal concurrency or fairness. The algorithm implementing fairness requires that every process waits for meetings infinitely often. Moreover, for this algorithm, we introduce and evaluate a new efficiency criterion called the degree of fair concurrency. This criterion shows that even if it does not satisfy maximal concurrency, our snap-stabilizing fair algorithm still allows a high level of concurrency.
  • SC-OA: A Secure and Efficient Scheme for Origin Authentication of Interdomain Routing in Cloud Computing Networks Authors: Z. Le (Jiangxi University of Finance and Economics, China); N. Xiong (Georgia State University, USA); B. Yang (Jiangxi Universit
    IP prefix hijacking is one of the top threats in the cloud computing Internets. Based on cryptography, many schemes for preventing prefix hijacks have been proposed. Securing binding between IP prefix and its owner underlies these schemes. We believe that a scheme for securing this binding should try to satisfy these seven critical requirements: no key escrow, no other secure channel, defending against Malicious Key Issuer (MKI) in the phase of prefix announcement, defending against MKI in the phase of key issuing, no certificate, in-band delegation attestation, and in-band public key witness. In this paper, we propose a new scheme, Origin Authentication based on Self-Certified public keys (SC-OA), using self-certified public keys to authenticate origin autonomous systems. To the best of our knowledge, it is the first work for securing prefix ownership using self-certified public keys to achieve an efficient and secure scheme that satisfies all seven requirements. The analyses show that SC-OA can defend against regular prefix, subprefix, unassigned prefix, interception-based, and MKI hijacking, and improve performance in many aspects. It will be pushed ahead to practical deployment for preventing prefix hijacks.
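
The convergence requirement described in the first abstract of this session (recovery to a legitimate state from any state) can be illustrated with Dijkstra's classic K-state token ring, which that abstract cites as a well-known example. The sketch below is a hand-written textbook version, not output of the paper's tool; the helper names (`privileged`, `step`, `stabilize`), the random central scheduler, and the step cap are assumptions made for illustration.

```python
import random


def privileged(x):
    """Indices of processes that hold a privilege (token) in ring state x.

    Process 0 is privileged when its value equals its predecessor's (the
    last process); every other process is privileged when its value
    differs from its predecessor's.
    """
    return [
        i for i in range(len(x))
        if (x[i] == x[-1] if i == 0 else x[i] != x[i - 1])
    ]


def step(x, K, i):
    """Let privileged process i make its move (modifies x in place)."""
    if i == 0:
        x[0] = (x[0] + 1) % K
    else:
        x[i] = x[i - 1]


def stabilize(x, K, rng, max_steps=10_000):
    """Run from an arbitrary state until exactly one privilege remains.

    A central scheduler (assumed here) picks one privileged process per
    step. Returns the legitimate state reached and the number of steps.
    """
    x = list(x)
    for steps in range(max_steps):
        priv = privileged(x)
        if len(priv) == 1:
            return x, steps
        step(x, K, rng.choice(priv))
    raise RuntimeError("did not converge within max_steps")


rng = random.Random(0)
n = K = 6  # K >= n is the usual sufficient condition
start = [rng.randrange(K) for _ in range(n)]  # arbitrary (possibly faulty) state
final, steps = stabilize(start, K, rng)
print(start, "->", final, "after", steps, "steps")
```

Once a single privilege remains, each further move passes it to the next process and never creates a second one, which is the closure half of self-stabilization. Checking this by enumeration quickly becomes impractical: a protocol with 40 three-state processes, as mentioned in the abstract, has 3^40 (about 1.2 × 10^19) global states.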

SESSION 7: Numerical Algorithms

  • Automatic Library Generation for BLAS3 on GPUs Authors: Huimin Cui (Institute of Computing Technology, P.R. China); Lei Wang (Institute of Computing Technology, Chinese Academy of Sci
    High-performance libraries, the performance-critical building blocks for high-level applications, will assume greater importance on modern processors as they become more complex and diverse. However, automatic library generators are still immature, forcing library developers to tune libraries manually to meet their performance objectives. We are developing a new script-controlled compilation framework to help domain experts reduce much of the tedious and error-prone nature of manual tuning, by enabling them to leverage their expertise and reuse past optimization experiences. We focus on demonstrating improved performance and productivity obtained through using our framework to tune BLAS3 routines on three GPU platforms: up to 5.4x speedups over CUBLAS achieved on NVIDIA GeForce 9800, 2.8x on GTX285, and 3.4x on Fermi Tesla C2050. Our results highlight the potential benefits of exploiting domain expertise and the relations between different routines (in terms of their algorithms and data structures).
  • Redesign of Higher-Level Matrix Algorithms for Multicore and Distributed Architectures and Applications in Quantum Monte Carlo Simulation Authors: Che-Rung Lee (National Tsing Hua University, Taiwan); Zhaojun Bai (University of California, Davis, USA)
    A matrix operation is referred to as a hard-to-parallel matrix operation (HPMO) if it has serial bottlenecks that are hard to parallelize. Otherwise, it is referred to as an easy-to-parallel matrix operation (EPMO). Empirical evidence shows that the performance scalability of an HPMO is significantly poorer than that of an EPMO on multicore and distributed architectures. As a result, higher-level algorithms for applications that target multicore and distributed architectures should avoid using HPMOs as their computational kernels. In this paper, as a case study, we present an HPMO-avoiding algorithm for the Green’s function calculation in quantum Monte Carlo simulation. The original algorithm utilizes the QR-decomposition with column pivoting (QRP) as its computational kernel. QRP is an HPMO. The redesigned algorithm maintains the same simulation stability but employs the standard QR decomposition without pivoting (QR), which is an EPMO. Different implementations of the redesigned algorithm on multicore and distributed architectures are investigated. Although some implementations of the redesigned method use about a factor of three more floating-point operations than the original algorithm, they are about 20% faster on a quadcore system and 2.5 times faster on a 1024-CPU massively parallel processing system. The broader impact of the redesign of higher-level matrix algorithms to avoid HPMOs in other computational science applications is also discussed.
  • Challenges of Scaling Algebraic Multigrid across Modern Multicore Architectures Authors: Allison Baker (Lawrence Livermore National Laboratory, USA); Todd Gamblin (Lawrence Livermore National Laboratory, USA); Martin
    Algebraic multigrid (AMG) is a popular solver for large-scale scientific computing and an essential component of many simulation codes. AMG has been shown to be extremely efficient on distributed-memory architectures. However, when executed on modern multicore architectures, we face new challenges that can significantly deteriorate AMG’s performance. We examine its performance and scalability on three disparate multicore architectures: a cluster with four AMD Opteron Quad-core processors per node (Hera), a Cray XT5 with two AMD Opteron Hex-core processors per node (Jaguar), and an IBM BlueGene/P system with a single Quad-core processor (Intrepid). We discuss our experiences on these platforms and present results using both an MPI-only and a hybrid MPI/OpenMP model. We also discuss a set of techniques that helped to overcome the associated problems, including thread and process pinning and correct memory associations.