TechTalks from event: IEEE IPDPS 2011

Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without logging in. For other talks, you will need to log in using the email address you registered with for IPDPS 2011.
Note 2: Many of the talks (those without a thumbnail next to their description below) have yet to be uploaded. Some were not recorded because of technical problems; we are working with the corresponding authors to upload self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.

SESSION 26: Scheduling

  • Efficient Parallel Scheduling of Malleable Tasks Authors: Peter Sanders (University of Karlsruhe, Germany); Jochen Speck (KIT, Germany)
    We give an O(n + min{n, m} log m) work algorithm for scheduling n tasks with a flexible amount of parallelism on m processors, provided the speedup functions of the tasks are concave. We give efficient parallelizations of the algorithm that run in polylogarithmic time. Previous algorithms were sequential and required quadratic work. This is in some sense a best-possible result, since the problem is NP-hard for more general speedup functions.
  • Offline Scheduling of Multi-threaded Request Streams on a Caching Server Authors: Veronika Rehn-Sonigo (University of Franche-Comté, France); Denis Trystram (University of Grenoble, France); Frédéric Wagner (University of Grenoble, France)
    In this work, we are interested in the problem of satisfying multiple concurrent requests submitted to a computing server. Informally, there are users each sending a sequence of requests to the server. The requests consist of tasks linked by precedence constraints. Tasks may occur several times in the same sequence as well as in a request sequence of another user. The computing server has to execute tasks with variable processing times. The server owns a cache of limited size where intermediate results of the processing may be stored. If an intermediate result for a task is stored in the cache, no processing cost has to be paid and the result can be fetched directly from the cache. The goal of this work is to determine a schedule of the tasks such that an optimization function is minimized (the only objective studied up to now is the makespan). This problem is a variant of caching, which considers only one sequence of requests. We then extend the study to the minimization of the mean completion time of the request sequences. Two models are considered. In the first model, caching is forced, whereas in the second model caching is optional and one can choose whether an intermediate result is stored in the cache or not. All combinations turn out to be NP-hard for fixed cache sizes, and we provide a dynamic programming formulation as well as inapproximability bounds. We propose polynomial-time approximation algorithms for some variants and analyze their approximation ratios. Finally, we also devise some heuristics and present experimental results. (An illustrative sketch of the caching cost model follows this session's listing.)
  • Tight Analysis of Relaxed Multi-Organization Scheduling Algorithms Authors: Daniel Cordeiro (Grenoble University, France); Pierre-François Dutot (Grenoble University, France); Gregory Mounié (Grenoble University, France)
    The goal of this paper is to study how limited cooperation can impact the quality of the schedule obtained by multiple independent organizations in a typical grid computing platform. This relaxed version of the problem, known as the Multi-Organization Scheduling Problem (MOSP), models an environment where organizations providing both resources and jobs tolerate a bounded degradation on the makespan of their own jobs in order to minimize the makespan over the entire platform. More precisely, the technical contributions are the following. First, we improve the existing inapproximability bounds for this problem, proving that what was previously thought not polynomially approximable (unless P = NP) is actually not approximable at all. We achieve this using two families of instances whose Pareto optimal solutions are on par with the previous inapproximability bounds. Then, we present two algorithms that solve the problem with approximation ratios of (2, 3/2) and (3, 4/3), respectively. This means that when using the first (second) algorithm, if an organization tolerates that the completion time of its last job may be up to twice (three times) the time it would have obtained by itself, then the algorithm provides a solution that is a 3/2-approximation (4/3-approximation) for the optimal global makespan. Both algorithms are efficient, since their performance ratios correspond to the Pareto optimal solutions of the previously defined instances. (The two guarantees are restated in symbols after this session's listing.)
  • Scheduling Functional Heterogeneous Systems with Utilization Balancing Authors: Yuxiong He (Microsoft Research, USA); Jie Liu (Microsoft Research, USA); Hongyang Sun (Nanyang Technological University, Singapore)
    Heterogeneous systems are becoming popular in both the client and the cloud. A parallel program can incur operations on multiple processing resources such as CPUs, GPUs, and vector processor units. This paper investigates scheduling problems on functionally heterogeneous systems with the objective of minimizing the completion time of parallel jobs. We first present performance bounds for online scheduling and show that any online algorithm is at best around (K + 1)-competitive with respect to job completion time, where K is the total number of resource types. There exist “bad” jobs that prevent any online algorithm from obtaining a good interleaving of heterogeneous tasks. This lower bound suggests that the relative performance of online algorithms versus an offline optimal could degrade linearly as the number of heterogeneous resource types increases. The limitation of online scheduling motivates our study of how additional offline or lookahead information can help improve scheduling performance. We propose a Multi-Queue Balancing algorithm (MQB) that effectively transforms the problem of minimizing completion time into one of maximizing utilization of heterogeneous resources. It promotes interleaving of heterogeneous tasks by balancing the task queues of different types. Our simulation results suggest that MQB reduces the execution time of online greedy algorithms by up to 40% over various workloads and outperforms other offline schemes in most cases. Furthermore, MQB can use limited and approximate offline information to improve scheduling decisions. (A toy queue-balancing sketch follows this session's listing.)
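
Illustrative sketch for "Offline Scheduling of Multi-threaded Request Streams on a Caching Server": the snippet below simulates only the forced-caching cost model from the abstract, with an LRU eviction policy chosen purely for illustration; the paper optimizes the schedule (and, in its second model, the caching decisions) directly, and all names here are ours.

    from collections import OrderedDict

    def forced_caching_cost(schedule, proc_time, cache_size):
        # Cost of serving `schedule` (a list of task ids) when every
        # computed result must be cached (the paper's first model) and
        # a cache hit is free. LRU eviction is an illustrative choice.
        cache = OrderedDict()
        cost = 0
        for task in schedule:
            if task in cache:
                cache.move_to_end(task)      # hit: result fetched for free
                continue
            cost += proc_time[task]          # miss: pay the processing time
            if len(cache) >= cache_size:
                cache.popitem(last=False)    # evict the least recently used
            cache[task] = True               # forced caching: always store
        return cost

    # Two users' request sequences interleaved into one server schedule.
    times = {"a": 3, "b": 5, "c": 2}
    print(forced_caching_cost(["a", "b", "a", "c", "b", "a"], times, 2))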
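
For "Tight Analysis of Relaxed Multi-Organization Scheduling Algorithms", the two guarantees can be restated as a bicriteria bound; the notation below is ours, not the paper's.

    \[
      C_{\max}^{(k)} \le \rho\, C_{\max}^{(k),\mathrm{local}}
      \quad \text{for every organization } k,
      \qquad
      C_{\max} \le \sigma\, C_{\max}^{\mathrm{OPT}},
      \qquad
      (\rho,\sigma) \in \{(2,\tfrac{3}{2}),\ (3,\tfrac{4}{3})\}.
    \]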
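
For "Scheduling Functional Heterogeneous Systems with Utilization Balancing": a toy queue-balancing dispatcher in the spirit of MQB. It sketches only the balancing idea (the real MQB also exploits offline/lookahead information), and every identifier is hypothetical.

    import heapq
    from collections import deque

    def mqb_dispatch(tasks, speed):
        # Toy multi-queue balancer: one queue per resource type, and the
        # next task is always drawn for the least-loaded type, so
        # heterogeneous tasks stay interleaved instead of serialized.
        queues = {}
        for rtype, dur in tasks:
            queues.setdefault(rtype, deque()).append(dur)
        load = [(0.0, rtype) for rtype in queues]  # (accumulated load, type)
        heapq.heapify(load)
        order = []
        while load:
            busy, rtype = heapq.heappop(load)
            if not queues[rtype]:
                continue                  # this type is drained; drop it
            dur = queues[rtype].popleft()
            order.append((rtype, dur))
            heapq.heappush(load, (busy + dur / speed.get(rtype, 1), rtype))
        return order

    tasks = [("cpu", 4), ("gpu", 2), ("cpu", 1), ("vec", 3), ("gpu", 5)]
    print(mqb_dispatch(tasks, {"cpu": 2, "gpu": 1, "vec": 1}))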

SESSION 27: Computational Biology and Simulations

  • Smith-Waterman Alignment of Huge Sequences with GPU in Linear Space Authors: Edans Flavius de Oliveira Sandes (University of Brasilia, Brazil); Alba Cristina Magalhaes Alves de Melo (University of Brasilia, Brazil)
    Cross-species chromosome alignments can reveal ancestral relationships and may be used to identify the peculiarities of the species. It is thus an important problem in bioinformatics. So far, aligning huge sequences, such as whole chromosomes, with exact methods has been regarded as unfeasible due to huge computing and memory requirements. However, high performance computing platforms such as GPUs are changing this scenario, making it possible to obtain the exact result for huge sequences in reasonable time. In this paper, we propose and evaluate a parallel algorithm that uses the GPU to align huge sequences, executing the Smith-Waterman algorithm combined with Myers-Miller, with linear space complexity. In order to achieve that, we propose optimizations that significantly reduce the amount of data processed and that enforce full parallelism most of the time. Using a GTX 285 board, our algorithm was able to produce the optimal alignment between sequences of 33 million base pairs (MBP) and 47 MBP in 18.5 hours. (A linear-space scoring sketch follows this session's listing.)
  • Accelerating Protein Sequence Search in a Heterogeneous Computing System Authors: Shucai Xiao (Virginia Tech, USA); Heshan Lin (Virginia Tech, USA); Wu-chun Feng (Virginia Tech, USA)
    The “Basic Local Alignment Search Tool” (BLAST) is arguably the most widely used computational tool in bioinformatics. However, the computational power required for routine BLAST analysis has been outstripping Moore’s Law due to the exponential growth in the size of the genomic sequence databases that BLAST searches. To address this issue, we propose the design and optimization of the BLAST algorithm for searching protein sequences (i.e., BLASTP) in a heterogeneous computing system. The end result is a BLASTP implementation that delivers a seven-fold speedup over sequential BLASTP for the most computationally intensive phase (i.e., hit detection and ungapped extension) on an NVIDIA Fermi C2050 GPU. In addition, when pipelining the processing on a dual-core CPU and the NVIDIA Fermi GPU, our implementation can achieve a six-fold speedup for the overall program execution. (A toy pipelining sketch follows this session's listing.)
  • Parallel Metagenomic Sequence Clustering via Sketching and Quasi-clique Enumeration on Map-reduce Clouds Authors: Xiao Yang (Iowa State University, USA); Jaroslaw Zola (Iowa State University, USA); Srinivas Aluru (Iowa State University, USA)
    Taxonomic clustering of species is an important and frequently arising problem in metagenomics. High-throughput next-generation sequencing is facilitating the creation of large metagenomic samples, while at the same time making the clustering problem harder due to the short sequence lengths supported and the unknown species sampled. In this paper, we present a parallel algorithm for hierarchical taxonomic clustering of large metagenomic samples with support for overlapping clusters. We adapt the sketching techniques originally developed for web document clustering to deduce significant similarities between pairs of sequences without resorting to expensive all-vs-all alignments. We formulate the metagenomics classification problem as that of maximal quasi-clique enumeration in the resulting similarity graph, at multiple levels of the hierarchy as prescribed by different similarity thresholds. We cast execution of the underlying algorithmic steps as applications of the map-reduce framework to achieve a cloud-based implementation. Apart from solving an important problem in metagenomics, this work demonstrates the applicability of the map-reduce framework in relatively complicated algorithmic settings. (A minhash sketching example follows this session's listing.)
  • Large-scale lattice gas Monte Carlo simulations for the generalized Ising model Authors: Tobias C. Kerscher (Technische Universität Hamburg-Harburg, Germany); Stefan Müller (Technische Universität Hamburg-Harburg, Germany)
    We present an efficient parallel algorithm for lattice gas Monte Carlo simulations in the framework of an Ising model that allows arbitrary interactions on any lattice, a model often called a cluster expansion. Thermodynamic Monte Carlo simulations strive for the equilibrium properties of a system by exchanging atoms over a long range while preserving detailed balance. This long-range exchange of atoms renders other common parallelization techniques, like domain decomposition, unfavorable due to excessive communication costs. Our ansatz, based on the Metropolis algorithm, minimizes communication between parallel processes. We present this new “partial sequence preserving” (PSP) algorithm, as well as benchmark data for a physical alloy system (NiAl) comprising one billion atoms. (A serial Metropolis-exchange sketch follows this session's listing.)
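
For "Smith-Waterman Alignment of Huge Sequences with GPU in Linear Space": a minimal, sequential sketch of the linear-space ingredient. It computes only the optimal local-alignment score using two DP rows; the Myers-Miller divide-and-conquer that the paper combines with Smith-Waterman is what also recovers the alignment itself in linear space. Scoring parameters are illustrative.

    def sw_score_linear_space(a, b, match=2, mismatch=-1, gap=-2):
        # Optimal local-alignment score of a vs. b using two DP rows,
        # i.e. O(len(b)) memory instead of a full O(len(a)*len(b)) table.
        prev = [0] * (len(b) + 1)
        best = 0
        for ca in a:
            curr = [0]
            for j, cb in enumerate(b, start=1):
                s = max(0,
                        prev[j - 1] + (match if ca == cb else mismatch),
                        prev[j] + gap,        # gap in b
                        curr[j - 1] + gap)    # gap in a
                curr.append(s)
                best = max(best, s)
            prev = curr
        return best

    print(sw_score_linear_space("GGTTGACTA", "TGTTACGG"))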
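
For "Accelerating Protein Sequence Search in a Heterogeneous Computing System": a toy two-stage pipeline showing the overlap idea behind pipelining the GPU phase with the CPU phase; the stages below are placeholders, not the paper's implementation.

    import queue, threading

    def pipeline(batches, stage1, stage2):
        # While one batch is in stage2, the next is already in stage1 --
        # the same overlap as pipelining the GPU phase with the CPU phase.
        handoff = queue.Queue(maxsize=2)    # bounded: applies backpressure

        def producer():
            for b in batches:
                handoff.put(stage1(b))
            handoff.put(None)               # sentinel: no more work

        t = threading.Thread(target=producer)
        t.start()
        results = []
        while (item := handoff.get()) is not None:
            results.append(stage2(item))
        t.join()
        return results

    print(pipeline(range(5), lambda x: x * x, lambda y: y + 1))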
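
For "Parallel Metagenomic Sequence Clustering via Sketching and Quasi-clique Enumeration on Map-reduce Clouds": a minimal minhash sketching example, assuming k-mer shingles and a salted md5 hash family (both our choices). Agreement between sketch coordinates estimates the Jaccard similarity of the k-mer sets, which is how candidate edges of the similarity graph can be found without all-vs-all alignment.

    import hashlib

    def kmers(seq, k=8):
        return [seq[i:i + k] for i in range(len(seq) - k + 1)]

    def minhash_sketch(seq, k=8, num_hashes=32):
        # One sketch coordinate per salted hash function: the minimum
        # hash value over all k-mers of the sequence.
        return [min(int(hashlib.md5(f"{salt}:{km}".encode()).hexdigest(), 16)
                    for km in kmers(seq, k))
                for salt in range(num_hashes)]

    def estimated_similarity(s1, s2):
        # Fraction of agreeing coordinates estimates k-mer Jaccard similarity.
        return sum(x == y for x, y in zip(s1, s2)) / len(s1)

    a = "ACGTACGTGGTCATACGTAAACGTTGCA"
    b = "ACGTACGTGGTCAAACGTAAACGTTGCA"
    print(estimated_similarity(minhash_sketch(a), minhash_sketch(b)))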
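
For "Large-scale lattice gas Monte Carlo simulations for the generalized Ising model": a serial sketch of the long-range exchange kernel that PSP parallelizes. The toy energy function stands in for the cluster expansion, and the PSP bookkeeping itself is not shown (the abstract does not detail it).

    import math, random

    def metropolis_exchange(spins, energy_fn, beta, steps):
        # Long-range atom exchange with Metropolis acceptance: swapping
        # two sites preserves composition; accepting with probability
        # min(1, exp(-beta * dE)) preserves detailed balance.
        n = len(spins)
        for _ in range(steps):
            i, j = random.sample(range(n), 2)
            if spins[i] == spins[j]:
                continue                      # swap would change nothing
            e_old = energy_fn(spins)
            spins[i], spins[j] = spins[j], spins[i]
            d_e = energy_fn(spins) - e_old
            if d_e > 0 and random.random() >= math.exp(-beta * d_e):
                spins[i], spins[j] = spins[j], spins[i]   # reject: undo
        return spins

    # Toy 1-D "alloy": each unlike nearest-neighbor pair costs one unit.
    energy = lambda s: sum(x != y for x, y in zip(s, s[1:]))
    print(metropolis_exchange([0, 1] * 8, energy, beta=2.0, steps=500))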

SESSION 28: Cloud Computing

  • CATCH: A Cloud-based Adaptive Data Transfer Service for HPC Authors: Henry Monti (Virginia Tech, USA); Ali R. Butt (Virginia Tech, USA); Sudharshan S. Vazhkudai (Oak Ridge National Laboratory, USA)
    Modern High Performance Computing (HPC) applications process very large amounts of data. A critical research challenge lies in transporting input data to the HPC center from a number of distributed sources, e.g., scientific experiments and web repositories, and in offloading the result data to geographically distributed, intermittently available end-users, often over under-provisioned connections. Such end-user data services are typically performed using point-to-point transfers that are designed for well-endowed sites: they are unable to reconcile the center’s resource usage with users’ delivery deadlines, unable to adapt to changing dynamics in the end-to-end data path, and are not fault-tolerant. To overcome these inefficiencies, decentralized HPC data services are emerging as viable alternatives. In this paper, we develop and enhance such distributed data services by designing CATCH, a Cloud-based Adaptive data Transfer serviCe for HPC. CATCH leverages a bevy of cloud storage resources to orchestrate a decentralized data transport with fail-over capabilities. Our results demonstrate that CATCH is a feasible approach and can improve data transfer times at the HPC center by as much as 81.1% for typical HPC workloads.
  • A Scalable and Elastic Publish/Subscribe Service Authors: Ming Li (IBM T.J. Watson Research Center, USA); Fan Ye (IBM T.J. Watson Research Center, USA); Minkyong Kim (IBM T.J. Watson Research Center, USA)
    The rapid growth of sense-and-respond applications and the emerging cloud computing model present a new challenge: providing publish/subscribe as a scalable and elastic cloud service. This paper presents BlueDove, an attribute-based publish/subscribe service that seeks to address this challenge. BlueDove uses a gossip-based one-hop overlay to organize servers into a scalable cluster. It proactively exploits skewness in data distribution to achieve high performance. By assigning each subscription to multiple servers through a multidimensional subscription space partitioning technique, it provides multiple candidate servers for each publication message. A message can be matched on any of its candidate servers with one-hop forwarding. The performance-aware forwarding in BlueDove ensures that the message is sent to the least loaded candidate server for processing, leading to low latency and high throughput. The evaluation shows that BlueDove has a linear capacity increase as the system scales up, adapts to sudden workload changes within tens of seconds, and achieves multifold higher throughput than the techniques used in existing enterprise and peer-to-peer pub/sub systems. (A toy candidate-server routing sketch follows this session's listing.)
  • CABdedupe: A Causality-based Deduplication Performance Booster for Cloud Backup Services Authors: Yujuan Tan (Huazhong University of Science and Technology, P.R. China); Hong Jiang (University of Nebraska-Lincoln, USA); Dan Feng (Huazhong University of Science and Technology, P.R. China)
    Due to the relatively low bandwidth of the WAN (Wide Area Network) connections that support cloud backup services, both the backup time and the restore time in the cloud backup environment are in desperate need of reduction to make cloud backup a practical and affordable service for small businesses and telecommuters alike. Existing solutions that employ deduplication technology for cloud backup services focus only on removing redundant data from transmission during backup operations to reduce the backup time, while paying little attention to the restore time, which we argue is an important aspect that affects the overall quality of service of cloud backup. In this paper, we propose a CAusality-Based deduplication performance booster for both cloud backup and restore operations, called CABdedupe, which captures the causal relationship among chronological versions of datasets that are processed in multiple backups/restores, to remove unmodified data from transmission during not only backup operations but also restore operations, thus improving both backup and restore performance. CABdedupe is a middleware that is orthogonal to, and can be integrated into, any existing backup system. Our extensive experiments, in which we integrate CABdedupe into two existing backup systems and feed in real-world datasets, show that both the backup time and the restore time are significantly reduced, with a reduction ratio of up to 103:1. (A miniature causality-based delta sketch follows this session's listing.)
  • DryadOpt: Branch-and-Bound on Distributed Data-Parallel Execution Engines Authors: Mihai Budiu (Microsoft Research Silicon Valley, USA); Daniel Delling (Microsoft Research Silicon Valley, USA); Renato Werneck (Microsoft Research Silicon Valley, USA)
    We introduce DryadOpt, a library that enables massively parallel and distributed execution of optimization algorithms for solving hard problems. DryadOpt performs an exhaustive search of the solution space using branch-and-bound, recursively splitting the original problem into many simpler subproblems. It uses both parallelism (at the core level) and distributed execution (at the machine level). DryadOpt provides a simple yet powerful interface to its users, who only need to implement sequential code to process individual subproblems (either by solving them in full or by generating new subproblems). The parallelism and distribution are handled automatically by DryadOpt and are invisible to the user. The distinctive feature of our system is that it is implemented on top of DryadLINQ, a distributed data-parallel execution engine similar to Hadoop and Map-Reduce. Despite the fact that these engines offer a constrained application model with restricted communication patterns, our experiments show that careful design choices allow DryadOpt to scale linearly with the number of machines, with very little overhead. (A serial skeleton of this user contract follows this session's listing.)
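
For "A Scalable and Elastic Publish/Subscribe Service": one plausible reading of candidate-server selection, sketched with hypothetical names. Each event gets one candidate per attribute dimension, and performance-aware forwarding picks the least loaded one.

    def bin_server(dim, value, num_servers, bins=16):
        # Owner of `value`'s bin in dimension `dim` (values in [0, 1)).
        return hash((dim, int(value * bins))) % num_servers

    def candidates(event, num_servers):
        # One candidate per attribute dimension; a subscription replicated
        # to the owners of its range in every dimension is reachable via
        # any one of them, so a single hop suffices for matching.
        return {bin_server(d, v, num_servers) for d, v in event.items()}

    def route(event, loads):
        # Performance-aware forwarding: the least-loaded candidate wins.
        return min(candidates(event, len(loads)), key=lambda s: loads[s])

    loads = [3, 0, 7, 1]    # hypothetical per-server queue lengths
    print(route({"temp": 0.72, "humidity": 0.31}, loads))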
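
For "CABdedupe": causality in miniature, under loudly illustrative assumptions (fixed-size chunking, md5 fingerprints). Knowing which already-transferred version the current one is causally derived from, only the chunks absent from that version need to cross the WAN, during a backup or a restore alike.

    import hashlib

    def chunks(data, size=8):
        return [data[i:i + size] for i in range(0, len(data), size)]

    def delta_to_send(base_version, new_version, size=8):
        # Ship only chunks whose fingerprints are absent from the
        # causally preceding version the other side already holds.
        known = {hashlib.md5(c).hexdigest()
                 for c in chunks(base_version, size)}
        return [c for c in chunks(new_version, size)
                if hashlib.md5(c).hexdigest() not in known]

    old = b"the quick brown fox jumps over the lazy dog"
    new = b"the quick brown cat jumps over the lazy dog"
    print(delta_to_send(old, new))    # only the chunk containing the edit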
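
For "DryadOpt": a serial skeleton of the user contract the abstract describes; the user writes only solve-or-branch logic, and the framework owns the search loop. DryadOpt distributes this loop across cores and machines; the names and the toy problem are ours.

    def branch_and_bound(root, solve_or_branch, lower_bound):
        # The framework owns the search; the user supplies sequential
        # per-subproblem logic only. Here the "framework" is a plain
        # serial work stack with pruning against the incumbent.
        best = float("inf")
        frontier = [root]
        while frontier:
            sub = frontier.pop()
            if lower_bound(sub) >= best:
                continue                  # prune: cannot beat incumbent
            kind, payload = solve_or_branch(sub)
            if kind == "solved":
                best = min(best, payload)
            else:
                frontier.extend(payload)
        return best

    # Toy user code: minimize x*x over the integers in [lo, hi].
    def step(interval):
        lo, hi = interval
        if hi - lo <= 1:
            return "solved", min(lo * lo, hi * hi)
        mid = (lo + hi) // 2
        return "branch", [(lo, mid), (mid + 1, hi)]

    def lb(interval):
        lo, hi = interval
        return 0 if lo <= 0 <= hi else min(lo * lo, hi * hi)

    print(branch_and_bound((-7, 5), step, lb))    # -> 0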