TechTalks from event: IEEE IPDPS 2011

SESSION 15: Distributed Systems and Networks

  • Critical Bubble Scheme: An Efficient Implementation of Globally-aware Network Flow Control Authors: Lizhong Chen (University of Southern California, USA); Ruisheng Wang (University of Southern California, USA); Timothy M. Pinkst
    Network flow control mechanisms that are aware of global conditions potentially can achieve higher performance than flow control mechanisms that are only locally aware. Owing to high implementation overhead, globally-aware flow control mechanisms in their purest form are seldom adopted in practice, leading to less efficient simplified implementations. In this paper, we propose an efficient implementation of a globally-aware flow control mechanism, called Critical Bubble Scheme, and apply it successfully to k-ary n-cube networks for the general class of buffer occupancy-based network flow control techniques. Simulation results show that the proposed scheme can reduce the buffer access portion of packet latency by as much as 77%, leading to much lower average packet latency at medium and high network loads while sustaining 11% throughput improvement after network saturation.
  • A Scalable Reverse Lookup Scheme using Group-based Shifted Declustering Layout Authors: Junyao Zhang (University of Central Florida, USA); Pengju Shang (University of Central Florida, USA); Jun Wang (University of Ce
    Recent years have witnessed an increasing demand for super data clusters. Such clusters have reached petabyte scale and can consist of thousands or tens of thousands of storage nodes at a single site. For this architecture, reliability is becoming a great concern. To achieve high reliability, data recovery and node reconstruction are a must. Although extensive research has investigated how to sustain high performance and high reliability in case of node failures at large scale, the reverse lookup problem, namely finding the list of objects stored on a failed node, remains open. This is especially true for storage systems with high requirements for data integrity and availability, such as scientific research data clusters. Existing solutions are either time consuming or expensive. Replication-based block placement schemes can realize fast reverse lookup, but they are designed for centralized, small-scale storage architectures. In this paper, we propose a fast and efficient reverse lookup scheme named Group-based Shifted Declustering (G-SD) layout that is able to locate the whole content of the failed node. G-SD extends our previous shifted declustering layout and applies to large-scale file systems. Our mathematical proofs and real-life experiments show that G-SD is a scalable reverse lookup scheme that is up to one order of magnitude faster than existing schemes.
  • Deadlock-Free Oblivious Routing for Arbitrary Topologies Authors: Jens Domke (TU Dresden, Germany); Torsten Hoefler (University of Illinois at Urbana-Champaign, USA); Wolfgang E. Nagel (Technisc
    Efficient deadlock-free routing strategies are crucial to the performance of large-scale computing systems. There are many methods, but it remains a challenge to achieve the lowest latency and highest bandwidth for irregular or unstructured high-performance networks. We investigate a novel routing strategy based on the single-source-shortest-path routing algorithm and extend it to use virtual channels to guarantee deadlock-freedom. We show that this algorithm achieves minimal latency and high bandwidth with only a low number of virtual channels and can be implemented in practice. We demonstrate that the problem of finding the minimal number of virtual channels needed to route a general network deadlock-free is NP-complete, and we propose different heuristics to solve the problem. We implement all proposed algorithms in the Open Subnet Manager of InfiniBand and compare the number of needed virtual channels and the bandwidths of multiple real and artificial network topologies which are established in practice. Our approach makes it possible to use the existing virtual channels more effectively to guarantee deadlock-freedom and to increase the effective bandwidth by up to a factor of two. Application benchmarks show an improvement of up to 95%. Our routing scheme is not limited to InfiniBand but can be deployed on existing InfiniBand installations to increase network performance transparently without modifications to the user applications.
  • RDMA Capable iWARP over Datagrams Authors: Ryan E Grant (Queen's University, Canada); Mohammad J Rashti (Queen's University, Canada); Ahmad Afsahi (Queen's University, Can
    iWARP is a state-of-the-art, high-speed, connection-based RDMA networking technology that provides InfiniBand-like zero-copy and one-sided communication capabilities over Ethernet. Despite the benefits offered by iWARP, many datacenter and web-based applications, such as stock-market trading and media-streaming applications, that rely on datagram-based semantics (mostly through UDP/IP) cannot take advantage of it because the iWARP standard is only defined over reliable, connection-oriented transports. This paper presents an RDMA model that functions over reliable and unreliable datagrams. The ability to use datagrams significantly expands the application space serviced by iWARP and can bring the scalability advantages of a connectionless transport to iWARP. In our previous work, we developed an iWARP datagram solution using send/receive semantics, showing excellent memory scalability and performance benefits over the current TCP-based iWARP. In this paper, we demonstrate an improved iWARP design that provides true RDMA semantics over datagrams. Specifically, because traditional RDMA semantics do not map well to unreliable communication, we propose RDMA Write-Record, the first and only method capable of supporting RDMA Write over both unreliable and reliable datagrams. We demonstrate through a proof-of-concept software implementation that datagram-iWARP is feasible for real-world applications. Our proposed RDMA Write-Record method has been designed with data loss in mind and can provide superior performance under conditions of packet loss. Micro-benchmarks show that RDMA-capable datagram-iWARP can achieve up to a 256% increase in large-message bandwidth and up to a 24.4% improvement in small-message latency over traditional iWARP. For application results we focus on streaming applications, showing a 24% improvement in memory usage and up to a 74% improvement in performance, although the proposed approach is also applicable to the HPC domain.
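
Illustrative sketch for the "Deadlock-Free Oblivious Routing for Arbitrary Topologies" abstract above: the idea of using virtual channels to guarantee deadlock-freedom can be pictured as splitting routes across virtual layers so that the channel dependency graph of each layer stays acyclic. The Python below is a minimal first-fit sketch under stated assumptions, not the heuristics proposed in the paper and not Open Subnet Manager code. The function names (path_dependencies, has_cycle, assign_layers) and the four-node ring are hypothetical, and each route is assumed to be a simple path given as a list of node ids.

```python
from collections import defaultdict


def path_dependencies(path):
    """Return the channel dependencies created by one route.

    A channel is a directed link (u, v); a route that uses channel c1 and
    then channel c2 creates the dependency c1 -> c2.
    """
    channels = list(zip(path, path[1:]))
    return set(zip(channels, channels[1:]))


def has_cycle(edges):
    """Return True if the directed graph given by `edges` contains a cycle.

    Uses Kahn's algorithm: if repeatedly removing nodes with in-degree 0
    cannot remove every node, the remaining nodes lie on a cycle.
    """
    successors = defaultdict(set)
    indegree = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        nodes.update((src, dst))
        if dst not in successors[src]:
            successors[src].add(dst)
            indegree[dst] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return removed < len(nodes)


def assign_layers(paths):
    """Greedy first-fit (an illustrative assumption, not the paper's method).

    Each route goes to the first virtual layer whose dependency graph stays
    acyclic after adding the route; a new layer is opened if none fits.
    Returns the layer index chosen for each route, in input order.
    """
    layers = []        # one set of dependency edges per virtual layer
    assignment = []
    for path in paths:
        deps = path_dependencies(path)
        for index, layer in enumerate(layers):
            if not has_cycle(layer | deps):
                layer |= deps          # in-place update of the stored set
                assignment.append(index)
                break
        else:
            layers.append(set(deps))
            assignment.append(len(layers) - 1)
    return assignment


# Hypothetical example: a unidirectional 4-node ring where every route
# travels two hops clockwise. The fourth route would close a dependency
# cycle in layer 0, so it is moved to layer 1.
ring_routes = [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1]]
print(assign_layers(ring_routes))  # [0, 0, 0, 1]
```

First-fit is only one possible strategy. Since the abstract states that finding the minimal number of virtual channels is NP-complete, a greedy assignment like this one gives no guarantee of using the fewest layers.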