TechTalks from event: IEEE IPDPS 2011

Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without requiring log-in. For other talks, you will need to log in using the email address you registered with for IPDPS 2011. Note 2: Many of the talks (those without a thumbnail next to their description below) are yet to be uploaded. Some were not recorded because of technical problems. We are working with the corresponding authors to upload self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.

SESSION 13: Distributed Algorithms and Models

  • Adding a referee to an interconnection network: What can(not) be computed in one round. Authors: Florent Becker (LIFO, Universite d’Orleans, France); Martin Matamala (DIM, Universidad de Chile, Chile); Nicolas Nisse (I
    In this paper we ask which properties of a distributed network can be computed from a small amount of local information provided by its nodes. The distributed model we consider is a restriction of the classical CONGEST (distributed) model and is close to the simultaneous messages (communication complexity) model defined by Babai, Kimmel and Lokam. More precisely, each of the n nodes (which only knows its own ID and the IDs of its neighbors) is allowed to send a message of O(log n) bits to some central entity, called the referee. Is it possible for the referee to decide some basic structural properties of the network topology G? We show that simple questions like "does G contain a square?", "does G contain a triangle?" or "is the diameter of G at most 3?" cannot be solved in general. On the other hand, the referee can decode the messages in order to have full knowledge of G when G belongs to many graph classes, such as planar graphs, bounded-treewidth graphs and, more generally, bounded-degeneracy graphs. We leave open questions related to the connectivity of arbitrary graphs.
  • Improved Algorithms for the Distributed Trigger Counting Problem Authors: Venkatesan T Chakaravarthy (IBM Research (India), India); Anamitra Roy Choudhury (IBM Research - India, India); Yogish Sabharwa
    Consider a distributed system with n processors, in which each processor receives some triggers from an external source. The distributed trigger counting (DTC) problem is to raise an alert and report to a user when the number of triggers received by the system reaches w, where w is a user-specified input. The problem has applications in monitoring, global snapshots, synchronizers and other distributed settings. In this paper, we present two decentralized and randomized algorithms for the DTC problem. The first algorithm has message complexity O(n log w) and no processor receives more than O(log w) messages with high probability. It does not provide any bound on the messages sent per processor. This algorithm assumes complete connectivity between the processors. The second algorithm has message complexity O(n log n log w) and no processor exchanges more than O(log n log w) messages with high probability. However, there is a negligible failure probability in raising the alert on receiving w triggers. This algorithm only requires that a constant degree tree be embeddable in the underlying communication graph.
  • The Weighted Byzantine Agreement Problem Authors: Vijay Garg (The University of Texas at Austin, USA); John Bridgman (The University of Texas at Austin, USA)
    This paper presents a weighted version of the Byzantine Agreement Problem and its solution under various conditions. In this version, each machine is assigned a weight depending on the application. Instead of assuming that at most f out of N machines fail, the algorithm assumes that the total weight of the machines that fail is at most f/N. When each machine has weight 1/N, this problem reduces to the standard Byzantine Generals Agreement Problem. By choosing weights appropriately, the weighted Byzantine Agreement Problem can be applied to situations where a subset of processes are more trusted. By using weights, the system can reach consensus in the presence of Byzantine failures, even when more than N/3 processes fail, so long as the total weight of the failed processes is less than 1/3. Also, a method to update the weights of the processes after execution of the weighted Byzantine Agreement is given. The update method guarantees that the weight of any correct process is never reduced and the weight of any faulty process, suspected by correct processes whose total weight is at least 1/4, is reduced to 0 for future instances. A short discussion of some weight assignment strategies is also given.
  • Leveraging Social Networks to Combat Collusion in Reputation Systems for Peer-to-Peer Networks Authors: Ze Li (Clemson University, USA); Haiying Shen (Clemson University, USA); Karan Sapra (Clemson University, USA)
    In peer-to-peer networks (P2Ps), many autonomous peers without preexisting trust relationships share resources with each other. Because of this open environment, P2Ps usually employ reputation systems to provide guidance in selecting trustworthy resource providers for high reliability and security. However, node collusion impairs the effectiveness of reputation systems in trustworthy node selection. Although some reputation systems have mechanisms to counter collusion, their effectiveness is not sufficiently high. In this paper, we leverage social networks to enhance the capability of reputation systems in combating collusion. We first analyze a real trace of the reputation system of the Overstock online auction platform, which incorporates a social network. The analysis reveals the important impact of the social network on user purchasing and reputation-rating patterns. We then identify suspicious collusion behavior patterns and propose a social-network-based mechanism, named SocialTrust, to counter collusion. SocialTrust adaptively adjusts the weight of ratings based on the social distance and interest relationship between peers. Experimental results show that SocialTrust can significantly strengthen the capability of current reputation systems in combating collusion.
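The trigger-counting talk above is easy to illustrate with a simulation. The sketch below is not the paper's decentralized algorithms; it simulates a simpler, well-known centralized phase-based scheme that also achieves O(n log w) messages: each phase batches triggers with a per-processor threshold of roughly remaining/(2n), so the remaining count at least halves per phase. All names and constants are illustrative assumptions.

```python
import random

def simulate_dtc(n, w, seed=0):
    """Simulate a centralized, phase-based trigger-counting scheme.

    n processors receive w external triggers in total (delivered to
    uniformly random processors here). In each phase, a processor sends
    one "coin" message to the coordinator per tau local triggers; after
    n coins, at least n*tau ~ remaining/2 triggers have arrived, so the
    coordinator polls everyone for exact counts and halves the threshold.
    Total message count is O(n log w)."""
    rng = random.Random(seed)
    delivered, messages = 0, 0
    remaining = w
    while remaining > 2 * n:
        tau = max(1, remaining // (2 * n))
        pending = [0] * n
        coins = 0
        while coins < n and delivered < w:
            p = rng.randrange(n)        # external source picks a processor
            pending[p] += 1
            delivered += 1
            if pending[p] == tau:       # batch full: send one coin message
                pending[p] = 0
                coins += 1
                messages += 1
        messages += 2 * n               # coordinator polls exact residuals
        remaining = w - delivered
    # final phase: few triggers left, report each one individually
    messages += w - delivered
    delivered = w
    return messages                     # alert raised exactly at w triggers
```

With n = 10 and w = 10000 the simulation needs only a few hundred messages, in line with the O(n log w) bound; the paper's algorithms remove the central coordinator and additionally bound the per-processor message load.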
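The weighted Byzantine agreement abstract above describes a weight-update rule that is concrete enough to sketch: after an execution, any process suspected by correct processes of total weight at least 1/4 has its weight set to 0 for future instances, and correct processes never lose weight. The renormalization step below is my assumption, not necessarily the paper's exact method; all names are illustrative.

```python
def update_weights(weights, suspects):
    """Post-agreement weight update sketched in the abstract.

    weights: dict process -> weight (assumed to sum to 1).
    suspects: dict process -> set of processes that suspect it.
    A process accused by total weight >= 1/4 is zeroed out; remaining
    weights are renormalized (an assumption), so no surviving process's
    weight ever decreases."""
    zeroed = {p for p, accusers in suspects.items()
              if sum(weights[q] for q in accusers) >= 0.25}
    new = {p: (0.0 if p in zeroed else wt) for p, wt in weights.items()}
    total = sum(new.values())
    return {p: wt / total for p, wt in new.items()} if total else new
```

For example, with four processes of weight 1/4 each, a process suspected by two others (total weight 1/2 >= 1/4) drops to 0 and the rest renormalize to 1/3, so consensus can later tolerate that process failing outright.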

SESSION 14: Parallel Graph and Particle Algorithms

  • Computing Strongly Connected Components in Parallel on CUDA Authors: Jiri Barnat (Masaryk University, Czech Republic); Petr Bauch (Masaryk University, Czech Republic); Lubos Brim (Masaryk Universi
    The problem of decomposing a directed graph into its strongly connected components is a fundamental graph problem inherently present in many scientific and commercial applications. In this paper we show how some of the existing parallel algorithms can be reformulated in order to be accelerated by NVIDIA CUDA technology. In particular, we design a new CUDA-aware procedure for pivot selection and we adapt selected parallel algorithms for CUDA accelerated computation. We also experimentally demonstrate that with a single GTX 480 GPU card we can easily outperform the optimal serial CPU implementation by an order of magnitude in most cases, 40 times on some sufficiently big instances. This is an interesting result as unlike the serial CPU case, the asymptotic complexity of the parallel algorithms is not optimal.
  • On optimal tree traversals for sparse matrix factorization Authors: Mathias Jacquelin (ENS Lyon, France); Loris Marchal (CNRS, France); Yves Robert (ENS Lyon, France); Bora Ucar (CNRS, France)
    We study the complexity of traversing tree-shaped workflows whose tasks require large I/O files. Such workflows typically arise in the multifrontal method of sparse matrix factorization. We target a classical two-level memory system, where the main memory is faster but smaller than the secondary memory. A task in the workflow can be processed if all its predecessors have been processed, and if its input and output files fit in the currently available main memory. The amount of available memory at a given time depends upon the ordering in which the tasks are executed. What is the minimum amount of main memory, over all postorder schemes, or over all possible traversals, that is needed for an in-core execution? We establish several complexity results that answer these questions. We propose a new polynomial-time exact algorithm which runs faster than a reference algorithm. Next, we address the setting where the required memory renders a pure in-core solution infeasible. In this setting, we ask the following question: what is the minimum amount of I/O that must be performed between the main memory and the secondary memory? We show that this latter problem is NP-hard, and propose efficient heuristics. All algorithms and heuristics are thoroughly evaluated on assembly trees arising in the context of sparse matrix factorizations.
  • Fast Community Detection Algorithm With GPUs and Multi-core Architectures Authors: Jyothish Soman (IIIT-Hyderabad, India); Ankur Narang (IBM India Research Labs, New Delhi, India)
    In this paper, we present the design of a novel scalable parallel algorithm for community detection optimized for multi-core and GPU architectures. Our algorithm is based on label propagation, which works solely on local information, giving it a scalability advantage over conventional approaches. We also show that weighted label propagation can overcome typical quality issues in communities detected with label propagation. Experimental results on well-known massive-scale graphs such as Wikipedia (100M edges), as well as on R-MAT graphs with 10M-40M edges, demonstrate the superior performance and scalability of our algorithm compared to well-known approaches for community detection. On the hep-th graph (352K edges) and the Wikipedia graph (100M edges), using the Power 6 architecture with 32 cores, our algorithm achieves one to two orders of magnitude better performance than the best known prior results on parallel architectures with a similar number of CPUs. Further, our GPGPU-based algorithm achieves an 8x improvement over the Power 6 performance on the 40M-edge R-MAT graph. We also achieve high quality (modularity) in the detected communities, with experimental evidence from well-known graphs such as the Zachary karate club, Dolphin and Football networks, where we achieve modularity close to the best known alternatives. To the best of our knowledge, these are the best known results for community detection on massive graphs (100M edges) in terms of performance, as well as the quality vs. performance trade-off. This is also unique work on community detection on GPGPUs with scalable performance.
  • A Study of Parallel Particle Tracing for Steady-State and Time-Varying Flow Fields Authors: Tom Peterka (Argonne National Laboratory, USA); Robert Ross (Argonne National Laboratory, USA); Boonthanome Nouanesengsey (The
    Particle tracing for streamline and pathline generation is a common method of visualizing vector fields in scientific data, but it is difficult to parallelize efficiently because of demanding and widely varying computational and communication loads. In this paper we scale parallel particle tracing for visualizing steady and unsteady flow fields well beyond previously published results. We configure the 4D domain decomposition into spatial and temporal blocks that combine in-core and out-of-core execution in a flexible way that favors faster run time or smaller memory. We also compare static and dynamic partitioning approaches. Strong and weak scaling curves are presented for tests conducted on an IBM Blue Gene/P machine at up to 32K processes using a parallel flow visualization library that we are developing. Datasets are derived from computational fluid dynamics simulations of thermal hydraulics, liquid mixing, and combustion.
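The SCC talk above parallelizes existing divide-and-conquer algorithms. The best known of these is the Forward-Backward (FB) scheme of Fleischer, Hendrickson and Pinar, whose serial skeleton is short enough to sketch here (the CUDA versions parallelize the reachability sweeps and add smarter pivot selection; names below are illustrative):

```python
def scc_fb(vertices, succ, pred):
    """Forward-Backward (FB) SCC decomposition.

    succ/pred: dict vertex -> set of successors/predecessors.
    Pick a pivot, compute its forward and backward reachable sets within
    the current partition; their intersection is the pivot's SCC, and the
    three leftover regions can be processed independently (in parallel
    implementations, concurrently)."""
    def reach(start, edges, allowed):
        seen, stack = {start}, [start]
        while stack:
            v = stack.pop()
            for u in edges[v]:
                if u in allowed and u not in seen:
                    seen.add(u)
                    stack.append(u)
        return seen

    sccs = []
    work = [set(vertices)]
    while work:
        part = work.pop()
        if not part:
            continue
        pivot = next(iter(part))
        fwd = reach(pivot, succ, part)   # forward reachable within part
        bwd = reach(pivot, pred, part)   # backward reachable within part
        sccs.append(fwd & bwd)           # the pivot's SCC
        # every remaining SCC lies entirely inside one of these regions
        work += [fwd - bwd, bwd - fwd, part - fwd - bwd]
    return sccs
```

The independence of the three leftover regions is exactly what makes FB attractive on GPUs: each region spawns its own subproblem, and the reachability sweeps map to data-parallel breadth-first traversals.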
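The tree-traversal talk above generalizes a classical special case that fits in a few lines. The sketch below implements Liu's well-known postorder rule under a deliberately simplified memory model (a task needs its children's output files plus its own output resident at once; the paper's model and algorithms are more general): visiting children in decreasing order of (subtree peak minus output size) minimizes the memory peak over all postorder traversals. Function and variable names are illustrative.

```python
def min_postorder_peak(children, out, root):
    """Minimal main-memory peak over all postorder traversals of a task tree.

    children: dict task -> list of child tasks (leaves may be absent).
    out: dict task -> size of the task's output file.
    Simplified model: executing a task requires all of its children's
    outputs plus its own output in memory simultaneously."""
    def peak(i):
        kids = children.get(i, [])
        if not kids:
            return out[i]
        # (subtree peak, output size) per child, in Liu's optimal order:
        # decreasing peak_j - out_j
        stats = sorted(((peak(j), out[j]) for j in kids),
                       key=lambda pf: pf[0] - pf[1], reverse=True)
        held = 0                 # outputs of already-processed children
        best = 0
        for p, f in stats:
            best = max(best, held + p)   # while processing this subtree
            held += f                    # its output stays resident
        return max(best, held + out[i])  # finally execute task i itself
    return peak(root)
```

For a chain 0 <- 1 <- 2 with output sizes 1, 5, 2 the peak is 7 (task 2's output held while task 1 executes); the ordering rule only matters once a node has two or more children with unbalanced subtrees.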
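The community-detection talk above builds on label propagation, which is simple enough to sketch. The code below is a generic, unweighted, serial version for illustration only, not the authors' weighted multi-core/GPU algorithm; names and the tie-breaking choice are my assumptions.

```python
import random
from collections import Counter

def label_propagation(adj, max_iters=20, seed=0):
    """Detect communities by iterative label propagation.

    adj: dict node -> list of neighbor nodes. Each node starts in its
    own community; in each pass, every node adopts the label most common
    among its neighbors, using only local information. Stops when a full
    pass changes nothing (or after max_iters passes)."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}
    nodes = list(adj)
    for _ in range(max_iters):
        rng.shuffle(nodes)          # random update order helps break cycles
        changed = False
        for v in nodes:
            if not adj[v]:
                continue
            counts = Counter(labels[u] for u in adj[v])
            top = max(counts.values())
            # tie-break uniformly among the most frequent neighbor labels
            new = rng.choice([l for l, c in counts.items() if c == top])
            if new != labels[v]:
                labels[v] = new
                changed = True
        if not changed:
            break
    return labels
```

Because labels only travel along edges, disconnected regions can never merge; the weighted variant in the talk replaces the raw neighbor count with a weighted vote to improve community quality.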

SESSION 15: Distributed Systems and Networks

  • Critical Bubble Scheme: An Efficient Implementation of Globally-aware Network Flow Control Authors: Lizhong Chen (University of Southern California, USA); Ruisheng Wang (University of Southern California, USA); Timothy M. Pinkst
    Network flow control mechanisms that are aware of global conditions can potentially achieve higher performance than flow control mechanisms that are only locally aware. Owing to high implementation overhead, globally-aware flow control mechanisms in their purest form are seldom adopted in practice, leading to less efficient simplified implementations. In this paper, we propose an efficient implementation of a globally-aware flow control mechanism, called the Critical Bubble Scheme, and apply it successfully to k-ary n-cube networks for the general class of buffer-occupancy-based network flow control techniques. Simulation results show that the proposed scheme can reduce the buffer-access portion of packet latency by as much as 77%, leading to much lower average packet latency at medium and high network loads, while sustaining an 11% throughput improvement after network saturation.
  • A Scalable Reverse Lookup Scheme using Group-based Shifted Declustering Layout Authors: Junyao Zhang (University of Central Florida, USA); Pengju Shang (University of Central Florida, USA); Jun Wang (University of Ce
    Recent years have witnessed an increasing demand for super data clusters, which have reached the petabyte scale and can consist of thousands or tens of thousands of storage nodes at a single site. For this architecture, reliability is becoming a great concern: to achieve high reliability, data recovery and node reconstruction are a must. Although extensive research has investigated how to sustain high performance and high reliability in the face of node failures at large scale, the reverse lookup problem, namely finding the list of objects stored on a failed node, remains open. This is especially true for storage systems with high requirements for data integrity and availability, such as scientific research data clusters. Existing solutions are either time-consuming or expensive. Meanwhile, replication-based block placement can be used to realize fast reverse lookup, but such schemes are designed for centralized, small-scale storage architectures. In this paper, we propose a fast and efficient reverse lookup scheme named Group-based Shifted Declustering (G-SD) layout that is able to locate the whole content of a failed node. G-SD extends our previous shifted declustering layout and applies to large-scale file systems. Our mathematical proofs and real-life experiments show that G-SD is a scalable reverse lookup scheme that is up to one order of magnitude faster than existing schemes.
  • Deadlock-Free Oblivious Routing for Arbitrary Topologies Authors: Jens Domke (TU Dresden, Germany); Torsten Hoefler (University of Illinois at Urbana-Champaign, USA); Wolfgang E. Nagel (Technisc
    Efficient deadlock-free routing strategies are crucial to the performance of large-scale computing systems. Many methods exist, but it remains a challenge to achieve the lowest latency and highest bandwidth for irregular or unstructured high-performance networks. We investigate a novel routing strategy based on the single-source shortest-path routing algorithm and extend it to use virtual channels to guarantee deadlock freedom. We show that this algorithm achieves minimal latency and high bandwidth with only a small number of virtual channels and can be implemented in practice. We demonstrate that the problem of finding the minimal number of virtual channels needed to route a general network deadlock-free is NP-complete, and we propose different heuristics to solve the problem. We implement all proposed algorithms in the Open Subnet Manager of InfiniBand and compare the number of virtual channels needed and the bandwidths of multiple real and artificial network topologies established in practice. Our approach allows the existing virtual channels to be used more effectively to guarantee deadlock freedom, increasing the effective bandwidth by up to a factor of two. Application benchmarks show an improvement of up to 95%. Our routing scheme is not limited to InfiniBand; it can be deployed on existing InfiniBand installations to increase network performance transparently, without modifications to user applications.
  • RDMA Capable iWARP over Datagrams Authors: Ryan E Grant (Queen's University, Canada); Mohammad J Rashti (Queen's University, Canada); Ahmad Afsahi (Queen's University, Can
    iWARP is a state-of-the-art high-speed connection-based RDMA networking technology for Ethernet networks, providing InfiniBand-like zero-copy and one-sided communication capabilities over Ethernet. Despite the benefits offered by iWARP, many datacenter and web-based applications, such as stock-market trading and media-streaming applications, that rely on datagram-based semantics (mostly through UDP/IP) cannot take advantage of it, because the iWARP standard is only defined over reliable, connection-oriented transports. This paper presents an RDMA model that functions over reliable and unreliable datagrams. The ability to use datagrams significantly expands the application space serviced by iWARP and can bring the scalability advantages of a connectionless transport to iWARP. In our previous work, we developed an iWARP datagram solution using send/receive semantics, showing excellent memory scalability and performance benefits over the current TCP-based iWARP. In this paper, we demonstrate an improved iWARP design that provides true RDMA semantics over datagrams. Specifically, because traditional RDMA semantics do not map well to unreliable communication, we propose RDMA Write-Record, the first and only method capable of supporting RDMA Write over both unreliable and reliable datagrams. We demonstrate through a proof-of-concept software implementation that datagram-iWARP is feasible for real-world applications. Our proposed RDMA Write-Record method has been designed with data loss in mind and can provide superior performance under conditions of packet loss. Micro-benchmarks show that by using RDMA-capable datagram-iWARP, a maximum 256% increase in large-message bandwidth and a maximum 24.4% improvement in small-message latency can be achieved over traditional iWARP. For application results we focus on streaming applications, showing a 24% improvement in memory usage and up to a 74% improvement in performance, although the proposed approach is also applicable to the HPC domain.
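The deadlock-free routing talk in Session 15 builds on the classical Dally and Seitz condition: a routing function is deadlock-free if its channel dependency graph (CDG) is acyclic. A minimal sketch of that check follows; it is not the paper's virtual-channel assignment heuristics, and all names are illustrative.

```python
from collections import defaultdict

def has_deadlock(routes):
    """Check the Dally/Seitz condition for a set of routes.

    routes: list of paths, each a list of directed channels (u, v).
    A dependency c1 -> c2 exists when some route holds channel c1 while
    requesting c2 (i.e., uses c1 immediately before c2). A cycle in this
    channel dependency graph means deadlock is possible."""
    deps = defaultdict(set)
    channels = set()
    for path in routes:
        channels.update(path)
        for c1, c2 in zip(path, path[1:]):
            deps[c1].add(c2)
    # cycle detection via iterative DFS (0 = new, 1 = on stack, 2 = done)
    color = {c: 0 for c in channels}
    for start in channels:
        if color[start]:
            continue
        color[start] = 1
        stack = [(start, iter(deps[start]))]
        while stack:
            node, it = stack[-1]
            nxt = next(it, None)
            if nxt is None:
                color[node] = 2
                stack.pop()
            elif color[nxt] == 1:
                return True          # back edge: CDG cycle found
            elif color[nxt] == 0:
                color[nxt] = 1
                stack.append((nxt, iter(deps[nxt])))
    return False
```

For example, four routes chaining clockwise around a 4-node ring close a cycle in the CDG (deadlock possible), which is exactly the situation virtual channels are introduced to break: splitting one physical channel into several virtual ones lets the routing engine cut such cycles without changing the paths.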