TechTalks from event: IEEE IPDPS 2011

Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without requiring log-in. For other talks, you will need to log-in using the email you registered for IPDPS 2011. Note 2: Many of the talks (those without a thumbnail next to the their description below) are yet to be uploaded. Some of them were not recorded because of technical problems. We are working with the corresponding authors to upload the self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.

SESSION 12: Compilers

  • Variable Granularity Access Tracking Scheme for Improving the Performance of Software Transactional Memory Authors: Sandya Mannarswamy (Hewlett Packard India, India); Govindarajan Ramaswamy (Indian Institute of Science, India)
    Software transactional memory (STM) has been proposed as a promising programming paradigm for shared memory multi-threaded programs as an alternative to conventional lock based synchronization primitives. Typical STM implementations employ a con?ict detection scheme, which works with uniform access granularity, tracking shared data accesses either at word/cache line or at object level. It is well known that a single ?xed access tracking granularity cannot meet the con?icting goals of reducing false con?icts without impacting concurrency adversely. A ?ne grained granularity while improving concurrency can have an adverse impact on performance due to lock aliasing, lock validation overheads, and additional cache pressure. On the other hand, a coarse grained granularity can impact performance due to reduced concurrency. Thus, in general, a ?xed or uniform granularity access tracking (UGAT) scheme is application-unaware and rarely matches the access patterns of individual application or parts of an application, leading to sub-optimal performance for different parts of the application(s). In order to mitigate the disadvantages associated with UGAT scheme, we propose a Variable Granularity Access Tracking (VGAT) scheme in this paper. We propose a compiler based approach wherein the compiler uses inter-procedural whole program static analysis to select the access tracking granularity for different shared data structures of the application based on the application’s data access pattern. We describe our prototype VGAT scheme, using TL2 as our STM implementation. Our experimental results reveal that VGAT-STM scheme can improve the application performance of STAMP benchmarks from 1.87% to up to 21.2%.
  • Automated architecture-aware mapping of streaming applications onto GPUs Authors: Andrei Hagiescu (National University of Singapore, Singapore); Huynh Phung Huynh (A*STAR Institute of High Performance Computin
    Graphic Processing Units (GPUs) are made up of many streaming multiprocessors, each consisting of processing cores that interleave the execution of a large number of threads. Groups of threads - called warps and wavefronts, respectively, in nVidia and AMD literature - are selected by the hardware scheduler and executed in lockstep on the available cores. If threads in such a group access the slow off-chip global memory, the entire group has to be stalled, and another group is scheduled instead. The utilization of a given multiprocessor will remain high if there is a suf?cient number of alternative thread groups to select from. Many parallel general purpose applications have been ef?ciently mapped to GPUs. Unfortunately, many stream processing applications exhibit unfavorable data movement patterns and low computation-to-communication ratio that may lead to poor performance. In this paper, we describe an automated compilation ?ow that maps most stream processing applications onto GPUs by taking into consideration two important architectural features of nVidia GPUs, namely interleaved execution as well as the small amount of shared memory available in each streaming multiprocessors. In particular, we show that using a small number of compute threads such that the memory footprint is reduced, we can achieve high utilization of the GPU cores. Our scheme goes against the conventional wisdom of GPU programming which is to use a large number of homogeneous threads. Instead, it uses a mix of compute and memory access threads, together with a carefully crafted schedule that exploits parallelism in the streaming application, while maximizing the effectiveness of the unique memory hierarchy. % small on-chip memory located within each streaming multiprocessor. We have implemented our scheme in the compiler of the StreamIt programming language, and our results show a signi?cant speedup compared to the state-of-the-art solutions.
  • Automatic Loop Tiling for Direct Memory Access Authors: Haibo Lin (IBM Research - China, P.R. China); Tao Liu (IBM Research - China, P.R. China); Lakshminarayanan Renganarayana (IBM
    In heterogeneous multi-core systems, such as the Cell BE processor, each accelerator core has its own fast local memory without hardware supported coherence and the software is responsible to dynamically transfer data between the fast local and slow global memory. The data can be transferred through either a software controlled cache or a direct buffer. The software controlled cache maintains correctness for arbitrary access patterns, but introduces the extra overhead of cache lookup. Direct buffer is ef?cient for regular accesses, while requiring precise analysis, detailed modeling of execution, and signi?cant code generation. In this paper we present the design and implementation of DMATiler which combines compiler analysis and runtime management to optimize local memory performance via automatic loop tiling and buffer optimization techniques. The DMATiler chooses a data transfer friendly loop order and using a empirically validated DMA performance model, it formulates and solves a convex optimization problem to determine globally optimal tile sizes. Further, the DMATiler applies optimization techniques such compressed data transfers and DMA commands to achieve the best DMA performance for a given loop nest. We have implemented the DMATiler in the IBM XL Single Source Compiler (SSC), and have conducted experiments with a set of loop nest benchmarks. The results show that the DMATiler is much more ef?cient than software controlled cache (average speedup of 9.8x) and single level loop blocking (average speedup of 6.2x) on the Cell BE processor.
  • Tolerant Value Speculation in Coarse-Grain Streaming Computations Authors: Nathaniel Azuelos (Technion, Israel); Idit Keidar (Technion, Israel); Ayal Zaks (IBM Haifa Research Lab, Israel)
    Streaming applications are the subject of growing interest, as the need for fast access to data continues to grow. In this work, we present the design requirements and implementation of coarse-grain value speculation in streaming applications. We explain how this technique can be useful in cases where serial parts of applications constitute bottlenecks, and when slower I/O favors using available pre?xes of the data. Contrary to previous work, we show how allowing some tolerance can justify early predictions on a scale of a large window of values. We suggest a methodology for runtime support of speculation, along with the mechanisms required for rollback. We present resource management issues consequent to our technique. We study how validation and speculation frequencies impact the performance of the program. Finally, we present our implementation in the context of the Huffman encoder benchmark, running it in different con?gurations and on different architectures.

SESSION 13: Distributed Algorithms and Models

  • Adding a referee to an interconnection network: What can(not) be computed in one round. Authors: Florent Becker (LIFO, Universite d’Orleans, France); Martin Matamala (DIM, Universidad de Chile, Chile); Nicolas Nisse (I
    In this paper we ask which properties of a distributed network can be computed from a few amount of local information provided by its nodes. The distributed model we consider is a restriction of the classical CONGEST (distributed) model and it is close to the simultaneous messages (communication complexity) model de?ned by Babai, Kimmel and Lokam. More precisely, each of these n nodes -which only knows its own ID and the IDs of its neighbors- is allowed to send a message of O(log n) bits to some central entity, called the referee. Is it possible for the referee to decide some basic structural properties of the network topology G? We show that simple questions like, “does G contain a square?”, “does G contain a triangle?” or “Is the diameter of G at most 3?” cannot be solved in general. On the other hand, the referee can decode the messages in order to have full knowledge of G when G belongs to many graph classes such as planar graphs, bounded treewidth graphs and, more generally, bounded degeneracy graphs. We leave open questions related to the connectivity of arbitrary graphs.
  • Improved Algorithms for the Distributed Trigger Counting Problem Authors: Venkatesan T Chakaravarthy (IBM Research (India), India); Anamitra Roy Choudhury (IBM Research - India, India); Yogish Sabharwa
    Consider a distributed system with n processors, in which each processor receives some triggers from an external source. The distributed trigger counting (DTC) problem is to raise an alert and report to a user when the number of triggers received by the system reaches w, where w is a user-speci?ed input. The problem has applications in monitoring, global snapshots, synchronizers and other distributed settings. In this paper, we present two decentralized and randomized algorithms for the DTC problem. The ?rst algorithm has message complexity O(n log w) and no processor receives more than O(log w) messages with high probability. It does not provide any bound on the messages sent per processor. This algorithm assumes complete connectivity between the processors. The second algorithm has message complexity O(n log n log w) and no processor exchanges more than O(log n log w) messages with high probability. However, there is a negligible failure probability in raising the alert on receiving w triggers. This algorithm only requires that a constant degree tree be embeddable in the underlying communication graph.
  • The Weighted Byzantine Agreement Problem Authors: Vijay Garg (The University of Texas at Austin, USA); John Bridgman (The University of Texas at Austin, USA)
    This paper presents a weighted version of the Byzantine Agreement Problem and its solution under various conditions. In this version, each machine is assigned a weight depending on the application. Instead of assuming that at most f out of N machines fail, the algorithm assumes that the total weight of the machines that fail is at most f/N. When each machine has weight 1/N, this problem reduces to the standard Byzantine Generals Agreement Problem. By choosing weights appropriately, the weighted Byzantine Agreement Problem can be applied to situations where a subset of processes are more trusted. By using weights, the system can reach consensus in the presence of Byzantine failures, even when more than N/3 processes fail, so long as the total weight of the failed processes is less than 1/3. Also, a method to update the weights of the processes after execution of the weighted Byzantine Agreement is given. The update method guarantees that the weight of any correct process is never reduced and the weight of any faulty process, suspected by correct processes whose total weight is at least 1/4, is reduced to 0 for future instances. A short discussion of some weight assignment strategies is also given.
  • Leveraging Social Networks to Combat Collusion in Reputation Systems for Peer-to-Peer Networks Authors: Ze Li (Clemson University, USA); Haiying Shen (Clemson University, USA): Karan Sapra (Clemson University, USA)
    In peer-to-peer networks (P2Ps), many autonomous peers without preexisting trust relationships share resources with each other. Due to their open environment, the P2Ps usually employ reputation systems to provide guidance in selecting trustworthy resource providers for high reliability and security. However, node collusion impairs the effectiveness of reputation systems in trustworthy node selection. Although some reputation systems have certain mechanisms to counter collusion, the effectiveness of the mechanisms is not suf?ciently high. In this paper, we leverage social networks to enhance the capability of reputation systems in combating collusion. We ?rst analyzed real trace of the reputation system in the Overstock online auction platform which incorporates a social network. The analysis reveals the important impact of the social network on user purchasing and reputation rating patterns. We thus identi?ed suspicious collusion behavior patterns and propose a social network based mechanism, namely SocialTrust, to counter collusion. SocialTrust adaptively adjusts the weight of ratings based on the social distance and interest relationship between peers. Experimental results show that SocialTrust can signi?cantly strengthen the capability of current reputation systems in combating collusion.

SESSION 14: Parallel Graph and Particle Algorithms

  • Computing Strongly Connected Components in Parallel on CUDA Authors: Jiri Barnat (Masaryk University, Czech Republic); Petr Bauch (Masaryk University, Czech Republic); Lubos Brim (Masaryk Universi
    The problem of decomposing a directed graph into its strongly connected components is a fundamental graph problem inherently present in many scienti?c and commercial applications. In this paper we show how some of the existing parallel algorithms can be reformulated in order to be accelerated by NVIDIA CUDA technology. In particular, we design a new CUDA-aware procedure for pivot selection and we adapt selected parallel algorithms for CUDA accelerated computation. We also experimentally demonstrate that with a single GTX 480 GPU card we can easily outperform the optimal serial CPU implementation by an order of magnitude in most cases, 40 times on some suf?ciently big instances. This is an interesting result as unlike the serial CPU case, the asymptotic complexity of the parallel algorithms is not optimal.
  • On optimal tree traversals for sparse matrix factorization Authors: Mathias Jacquelin (ENS Lyon, France); Loris Marchal (CNRS, France); Yves Robert (ENS Lyon, France); Bora Ucar (CNRS, France)
    We study the complexity of traversing tree-shaped work?ows whose tasks require large I/O ?les. Such work?ows typically arise in the multifrontal method of sparse matrix factorization. We target a classical two-level memory system, where the main memory is faster but smaller than the secondary memory. A task in the work?ow can be processed if all its predecessors have been processed, and if its input and output ?les ?t in the currently available main memory. The amount of available memory at a given time depends upon the ordering in which the tasks are executed. What is the minimum amount of main memory, over all postorder schemes, or over all possible traversals, that is needed for an in-core execution? We establish several complexity results that answer these questions. We propose a new, polynomial time, exact algorithm which runs faster than a reference algorithm. Next, we address the setting where the required memory renders a pure in-core solution unfeasible. In this setting, we ask the following question: what is the minimum amount of I/O that must be performed between the main memory and the secondary memory? We show that this latter problem is NP-hard, and propose ef?cient heuristics. All algorithms and heuristics are thoroughly evaluated on assembly trees arising in the context of sparse matrix factorizations.
  • Fast Community Detection Algorithm With GPUs and Multi-core Architectures Authors: Jyothish Soman (IIIT-Hyderabad, India); Ankur Narang (IBM India Research Labs, New Delhi, India)
    In this paper, we present the design of a novel scalable parallel algorithm for community detection optimized for multi-core and GPU architectures. Our algorithm is based on label propagation, which works solely on local information, thus giving it the scalability advantage over conventional approaches. We also show that weighted label propagation can overcome typical quality issues in communities detected with label propagation. Experimental results on well known massive scale graphs such as Wikipedia (100M edges) and also on RMAT graphs with 10M - 40M edges, demonstrate the superior performance and scalability of our algorithm compared to the well known approaches for community detection. On the hep-th graph (352K edges) and the wikipedia graph (100M edges), using Power 6 architecture with 32 cores, our algorithm achieves one to two orders of magnitude better performance compared to the best known prior results on parallel architectures with similar number of CPUs. Further, our GPGPU based algorithm achieves 8 improvement over the Power 6 performance on 40M edge R-MAT graph. Alongside, we achieve high quality (modularity) of communities detected, with experimental evidence from well-known graphs such as Zachary karate club, Dolphin network and Football club, where we achieve modularity that is close to the best known alternatives. To the best of our knowledge these are best known results for community detection on massive graphs (100M edges) in terms of performance and also quality vs. performance trade-off. This is also a unique work on community detection on GPGPUs with scalable performance.
  • A Study of Parallel Particle Tracing for Steady-State and Time-Varying Flow Fields Authors: Tom Peterka (Argonne National Laboratory, USA); Robert Ross (Argonne National Laboratory, USA); Boonthanome Nouanesengsey (The
    Particle tracing for streamline and pathline generation is a common method of visualizing vector ?elds in scienti?c data, but it is dif?cult to parallelize ef?ciently because of demanding and widely varying computational and communication loads. In this paper we scale parallel particle tracing for visualizing steady and unsteady ?ow ?elds well beyond previously published results. We con?gure the 4D domain decomposition into spatial and temporal blocks that combine in-core and out-of-core execution in a ?exible way that favors faster run time or smaller memory. We also compare static and dynamic partitioning approaches. Strong and weak scaling curves are presented for tests conducted on an IBM Blue Gene/P machine at up to 32 K processes using a parallel ?ow visualization library that we are developing. Datasets are derived from computational ?uid dynamics simulations of thermal hydraulics, liquid mixing, and combustion.