TechTalks from event: IEEE IPDPS 2011

Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without requiring log-in. For other talks, you will need to log-in using the email you registered for IPDPS 2011. Note 2: Many of the talks (those without a thumbnail next to the their description below) are yet to be uploaded. Some of them were not recorded because of technical problems. We are working with the corresponding authors to upload the self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.

SESSION 17: Parallel Algorithms

  • A New Data Layout For Set Intersection on GPUs Authors: Rasmus Amossen (IT University of Copenhagen, Denmark); Rasmus Pagh (University of Copenhagen, Denmark)
    Set intersection is the core in a variety of problems, e.g. frequent itemset mining and sparse boolean matrix multiplication. It is well-known that large speed gains can, for some computational problems, be obtained by using a graphics processing unit (GPU) as a massively parallel computing device. However, GPUs require highly regular control ?ow and memory access patterns, and for this reason previous GPU methods for intersecting sets have used a simple bitmap representation. This representation requires excessive space on sparse data sets. In this paper we present a novel data layout, ”BatMap”, that is particularly well suited for parallel processing, and is compact even for sparse data. Frequent itemset mining is one of the most important applications of set intersection. As a case-study on the potential of BatMaps we focus on frequent pair mining, which is a core special case of frequent itemset mining. The main ?nding is that our method is able to achieve speedups over both Apriori and FP-growth when the number of distinct items is large, and the density of the problem instance is above 0.01. Previous implementations of frequent itemset mining on GPU have not been able to show speedups over the best single-threaded implementations.
  • Partitioning Spatially Located Computations using Rectangles Authors: Erik Saule (The Ohio State University, USA); Erdeniz O. Bas (The Ohio State University, USA); Umit V. Catalyurek (The Ohio Stat
    The ideal distribution of spatially located heterogeneous workloads is an important problem to address in parallel scienti?c computing. We investigate the problem of partitioning such workloads (represented as a matrix of positive integers) into rectangles, such that the load of the most loaded rectangle (processor) is minimized. Since ?nding the optimal arbitrary rectangle-based partition is an NP-hard problem, we investigate particular classes of solutions, namely, rectilinear partitions, jagged partitions and hierarchical partitions. We present a new class of solutions called m-way jagged partitions, propose new optimal algorithms for m-way jagged partitions and hierarchical partitions, propose new heuristic algorithms, and provide worst case performance analyses for some existing and new heuristics. Moreover, the algorithms are tested in simulation on a wide set of instances. Results show that two of the algorithms we introduce lead to a much better load balance than the state-of-the-art algorithms.
  • Reduced-Bandwidth Multithreaded Algorithms for Sparse-Matrix Vector Multiplication Authors: Aydin Buluc (Lawrence Berkeley National Laboratory, USA); Samuel W. Williams (Lawrence Berkeley National Laboratory, USA); Leon
    On multicore architectures, the ratio of peak memory bandwidth to peak ?oating-point performance (byte:?op ratio) is decreasing as core counts increase, further limiting the performance of bandwidth limited applications. Multiplying a sparse matrix (as well as its transpose in the unsymmetric case) with a dense vector is the core of sparse iterative methods. In this paper, we present a new multithreaded algorithm for the symmetric case which potentially cuts the bandwidth requirements in half while exposing lots of parallelism in practice. We also give a new data structure transformation, called bitmasked register blocks, which promises signi?cant reductions on bandwidth requirements by reducing the number of indexing elements without introducing additional ?ll-in zeros. Our work shows how to incorporate this transformation into existing parallel algorithms (both symmetric and unsymmetric) without limiting their parallel scalability. Experimental results indicate that the combined bene?ts of bitmasked register blocks and the new symmetric algorithm can be as high as a factor of 3.5x in multicore performance over an already scalable parallel approach. We also provide a model that accurately predicts the performance of the new methods, showing that even larger performance gains are expected in future multicore systems as current trends (decreasing byte:?op ratio and larger sparse matrices) continue.

SESSION 1: Resource Management

  • Power-aware replica placement and update strategies in tree networks Authors: Anne Benoit (ENS Lyon, France); Paul Renaud-Goud (LIP, ENS Lyon, France); Yves Robert (ENS Lyon, France)
    This paper deals with optimal strategies to place replicas in tree networks, with the double objective to minimize the total cost of the servers, and/or to optimize power consumption. The client requests are known beforehand, and some servers are assumed to pre-exist in the tree. Without power consumption constraints, the total cost is an arbitrary function of the number of existing servers that are reused, and of the number of new servers. Whenever creating and operating a new server has higher cost than reusing an existing one (which is a very natural assumption), cost optimal strategies have to trade-off between reusing resources and load-balancing requests on new servers. We provide an optimal dynamic programming algorithm that returns the optimal cost, thereby extending known results without pre-existing servers. With power consumption constraints, we assume that servers operate under a set of M different modes depending upon the number of requests that they have to process. In practice M is a small number, typically 2 or 3, depending upon the number of allowed voltages. Power consumption includes a static part, proportional to the total number of servers, and a dynamic part, proportional to a constant exponent of the server mode, which depends upon the model for power. The cost function becomes a more complicated function that takes into account reuse and creation as before, but also upgrading or downgrading an existing server from one mode to another. We show that with an arbitrary number of modes, the power minimization problem is NP-complete, even without cost constraint, and without static power. Still, we provide an optimal dynamic programming algorithm that returns the minimal power, given a threshold value on the total cost; it has exponential complexity in the number of modes M, and its practical usefulness is limited to small values of M. Still, experiments conducted with this algorithm show that it can process large trees in reasonable time, despite its worst-case complexity.
  • Minimum Cost Resource Allocation for meeting job requirements Authors: Venkatesan T Chakaravarthy (IBM Research (India), India); Sambuddha Roy (IBM Research - India, India); Yogish Sabharwal (IBM Re
    We consider the problem of allocating resources for completing a collection of jobs. Each resource is speci?ed by a start-time, ?nish-time and the capacity of resource available and has an associated cost; and each job is speci?ed by a starttime, ?nish-time and the amount of the resource required (demand) during this interval. A feasible solution is a multiset of resources (i.e., multiple units of each resource may be picked) such that at any point of time, the sum of the capacities offered by the resources is at least the total demand of the jobs active at that point of time. The cost of the solution is the sum of the costs of the resources included in the solution (taking into account the units of the resources). The goal is to ?nd a feasible solution of minimum cost. This problem arises naturally in many scenarios. For example, given a set of jobs, we would like to allocate some resource such as machines, memory or bandwidth in order to complete all the jobs. This problem generalizes a covering version of the knapsack problem which is known to be NP-hard. We present a constant factor approximation algorithm for this problem based on a Primal-Dual approach.
  • Power and Performance Management in Priority-type Cluster Computing Systems Authors: Kaiqi Xiong (North Carolina State University, USA)
    Cluster computing not only improves performance but also increase power consumption. It is a challenge to increase the performance of a cluster computing system and reduce its power consumption simultaneously. In this paper, we consider a collection of cluster computing resources owned by a service provider to host an enterprise application for multiple class business customers where customer requests are distinguished, with different request characteristics and service requirements. We start with a development of computing an average end-to-end delay and an average energy consumption for multiple class customers in such an application. Then, we present approaches for optimizing the average end-to-end delay subject to the constraint of an average energy consumption and optimizing the average end-to-end energy consumption subject to the constraints of an average end-to-end delay for all class and each class customer requests respectively. Moreover, a service provider processes the service requests of customers according to a service level agreement (SLA), which is a contract agreed between a customer and a service provider. It becomes important and commonplace to prioritize multiple customer services in favor of customers who are willing to pay higher fees. We propose an approach for minimizing the total cost of cluster computing resources allocated to ensure multiple priority customer service guarantees by the service provider. It is demonstrated through our simulation that the proposed approaches are ef?cient and accurate for power management and performance guarantees in priority-type cluster computing systems
  • Willow: A Control System For Energy And Thermal Adaptive Computing Authors: Krishna Kant (National Science Foundation, USA); Muthukumar Murugan (University of Minnesota, USA); David Du (University of Min
    The increasing energy demand coupled with emerging sustainability concerns requires a re-examination of power/thermal issues in data centers from the perspective of short term energy de?ciencies. Such energy de?cient scenarios arise for a variety of reasons including variable energy supply from renewable sources and inadequate power, thermal and cooling capacities. In this paper we propose a hierarchical control scheme to adapt assignments of tasks to servers in a way that can cope with the varying energy limitations and still provide necessary QoS . The rescheduling of tasks on different servers has direct (migration related) and indirect (changed traf?c patterns) network energy impacts that we also consider. We show the stability of our scheme and evaluate its performance via detailed simulations and experiments.

SESSION 18: Distributed Systems

  • GRAL: A Grouping Algorithm to Optimize Application Placement in Wireless Embedded Systems Authors: Nikos Tziritas (University of Thessaly, Greece); Thanasis Loukopoulos (Technological Educational Institute of Lamia, Greece); S
    Recent embedded middleware initiatives enable the structuring of an application as a set of collaborating agents deployed in the various sensing/actuating entities of the system. Of particular importance is the incurred cost due to agent communication which in terms depends on agent positions in the system. In this paper we present GRAL a grouping algorithm that migrates groups of agents with the aim of minimizing communication. The algorithm works in a distributed fashion based on knowledge available locally at each node and can be used both for one-shot initial application deployment and for the continuous updating of agent placement. Through simulation experiments under various scenarios we evaluate the algorithm, comparing the solution quality reached against the optimal obtained from exhaustive search.
  • Vitis: A Gossip-based Hybrid Overlay for Internet-scale Publish/Subscribe Enabling Rendezvous Routing in Unstructured Overlay Networks Authors: Fatemeh Rahimian (KTH - Royal Institute of Technology, Sweden); Sarunas Girdzijauskas (Swedish Institute of Computer Science (S
    Peer-to-peer overlay networks are attractive solutions for building Internet-scale publish/subscribe systems. However, scalability comes with a cost: a message published on a certain topic often needs to traverse a large number of uninterested (unsubscribed) nodes before reaching all its subscribers. This might sharply increase resource consumption for such relay nodes (in terms of bandwidth transmission cost, CPU, etc) and could ultimately lead to rapid deterioration of the system’s performance once the relay nodes start dropping the messages or choose to permanently abandon the system. In this paper, we introduce Vitis, a gossip-based publish/subscribe system that signi?cantly decreases the number of relay messages, and scales to an unbounded number of nodes and topics. This is achieved by the novel approach of enabling rendezvous routing on unstructured overlays. We construct a hybrid system by injecting structure into an otherwise unstructured network. The resulting structure resembles a navigable small-world network, which spans along clusters of nodes that have similar subscriptions. The properties of such an overlay make it an ideal platform for ef?cient data dissemination in large-scale systems. We perform extensive simulations and evaluate Vitis by comparing its performance against two base-line publish/subscribe systems: one that is oblivious to node subscriptions, and another that exploits the subscription similarities. Our measurements show that Vitis signi?cantly outperforms the base-line solutions on various subscription and churn scenarios, from both synthetic models and real-world traces.
  • Moving the Code to the Data - Dynamic Code Deployment using ActiveSpaces Authors: Ciprian Docan (Rutgers, The State University of New Jersey, USA); Manish Parashar (Rutgers, The State University of New Jersey,
    Managing the large volumes of data produced by emerging scienti?c and engineering simulations running on leadership-class resources has become a critical challenge. The data has to be extracted off the computing nodes and transported to consumer nodes so that it can be processed, analyzed, visualized, archived, etc. Several recent research efforts have addressed datarelated challenges at different levels. One attractive approach is to of?oad expensive I/O operations to a smaller set of dedicated computing nodes known as a staging area. However, even using this approach, the data still has to be moved from the staging area to consumer nodes for processing, which continues to be a bottleneck. In this paper, we investigate an alternate approach, namely moving the data-processing code to the staging area rather than moving the data. Speci?cally, we present the ActiveSpaces framework, which provides (1) programming support for de?ning the data-processing routines to be downloaded to the staging area, and (2) run-time mechanisms for transporting binary codes associated with these routines to the staging area, executing the routines on the nodes of the staging area, and returning the results. We also present an experimen- tal performance evaluation of ActiveSpaces using applications running on the Cray XT5 at Oak Ridge National Laboratory. Finally, we use a coupled fusion application work?ow to explore the trade-offs between transporting data and transporting the code required for data processing during coupling, and we characterize the sweet spots for each option.