TechTalks from event: IEEE IPDPS 2011

Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without logging in. For other talks, you will need to log in using the email address you registered with for IPDPS 2011. Note 2: Many of the talks (those without a thumbnail next to their description below) have yet to be uploaded. Some of them were not recorded because of technical problems. We are working with the corresponding authors to upload the self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.

SESSION 23: Resource Utilization

  • Shared Resource Monitoring and Throughput Optimization in Cloud-Computing Datacenters Authors: Jaideep Moses (Intel Corp., USA); Ravishankar Iyer (Intel Corp, USA); Ramesh Illikkal (Intel Corporation, USA); Sadagopan Srin
    Many datacenters employ server consolidation to maximize the efficiency of platform resource usage. As a result, multiple virtual machines (VMs) simultaneously run on each datacenter platform. Contention for shared resources between these virtual machines has an undesirable and non-deterministic impact on their performance behavior in such platforms. This paper proposes the use of shared resource monitoring to (a) understand the resource usage of each virtual machine on each platform, (b) collect resource usage and performance across different platforms to correlate implications of usage to performance, and (c) migrate VMs that are resource-constrained to improve overall datacenter throughput and improve Quality of Service (QoS). We focus our efforts on monitoring and addressing shared cache contention and propose a new optimization metric that captures the priority of the VM and the overall weighted throughput of the datacenter. We conduct detailed experiments emulating datacenter scenarios, including on-line transaction processing workloads (based on TPC-C), middle-tier workloads (based on SPECjbb and SPECjAppServer), and financial workloads (based on PARSEC). We show that monitoring shared resource contention (such as shared cache) is highly beneficial to better manage throughput and QoS in a cloud-computing datacenter environment. (A toy sketch of the migration idea appears after this session's listing.)
  • The Impact of Soft Resource Allocation on n-Tier Application Scalability Authors: Qingyang Wang (Georgia Institute of Technology, USA); Simon Malkowski (Georgia Institute of Technology, USA); Deepal Jayasinghe
    Good performance and efficiency, in terms of high quality of service and resource utilization for example, are important goals in a cloud environment. Through extensive measurements of an n-tier application benchmark (RUBBoS), we show that overall system performance is surprisingly sensitive to appropriate allocation of soft resources (e.g., server thread pool size). Inappropriate soft resource allocation can quickly degrade overall application performance significantly. Concretely, both under-allocation and over-allocation of the thread pool can lead to bottlenecks in other resources because of non-trivial dependencies. We have observed some non-obvious phenomena due to these correlated bottlenecks. For instance, the number of threads in the Apache web server can limit the total useful throughput, causing the CPU utilization of the C-JDBC clustering middleware to decrease as the workload increases. We provide a practical iterative solution approach to this challenge through an algorithmic combination of operational queuing laws and measurement data. Our results show that soft resource allocation plays a central role in the performance scalability of complex systems such as n-tier applications in cloud environments. (A toy queueing-law sketch appears after this session's listing.)
  • Profiling Directed NUMA Optimisation on Linux Systems: A Case Study of the Gaussian Computational Chemistry Code Authors: Rui Yang (University of Wollongong, Australia); Joseph Antony (Australian National University, Australia); Alistair P Rendell (A
    The parallel performance of applications running on Non-Uniform Memory Access (NUMA) platforms is strongly influenced by the relative placement of memory pages to the threads that access them. As a consequence there are Linux application programmer interfaces (APIs) to control this. For large parallel codes it can, however, be difficult to determine how and when to use these APIs. In this paper we introduce the NUMAgrind profiling tool which can be used to simplify this process. It extends the Valgrind binary translation framework to include a model which incorporates cache coherency, memory locality domains and interconnect traffic for arbitrary NUMA topologies. Using NUMAgrind, cache misses can be mapped to memory locality domains, page access modes determined, and pages that are referenced by multiple threads quickly determined. We show how the NUMAgrind tool can be used to guide the use of Linux memory and thread placement APIs in the Gaussian computational chemistry code. The performance of the code before and after use of these APIs is also presented for three different commodity NUMA platforms. (A toy page-classification sketch appears after this session's listing.)
  • Model-Driven SIMD Code Generation for a Multi-Resolution Tensor Kernel Authors: Kevin Stock (The Ohio State University, USA); Thomas Henretty (The Ohio State University, USA); Iyyappa Murugandi (The Ohio Stat
    In this paper, we describe a model-driven compile-time code generator that transforms a class of tensor contraction expressions into highly optimized short-vector SIMD code. We use as a case study a multi-resolution tensor kernel from the MADNESS quantum chemistry application. Performance of a C-based implementation is low, and because the dimensions of the tensors are small, performance using vendor-optimized BLAS libraries is also suboptimal. We develop a model-driven code generator that determines the optimal loop permutation and placement of vector load/store, transpose, and splat operations in the generated code, enabling portable performance on short-vector SIMD architectures. Experimental results on an SSE-based platform demonstrate the efficiency of the vector-code synthesizer. (A toy loop-order cost model appears after this session's listing.)
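
Illustrative sketch for "Shared Resource Monitoring and Throughput Optimization in Cloud-Computing Datacenters": a minimal, hypothetical migration heuristic in Python, not the paper's system. The VM fields, the weights, and the scoring rule below are assumptions made up for illustration; the paper's actual metric and monitoring infrastructure are more involved.

```python
# Toy sketch (not the authors' system): pick a cache-starved, high-priority VM
# and a lightly contended host as a migration candidate. All names and numbers
# here are invented for illustration.

from dataclasses import dataclass

@dataclass
class VM:
    name: str
    weight: float          # priority weight assigned by the operator
    throughput: float      # measured application throughput (e.g., tx/s)
    cache_occupancy: float # monitored share of the last-level cache (0..1)

def weighted_throughput(hosts):
    """Datacenter-wide score: sum of priority-weighted VM throughputs."""
    return sum(vm.weight * vm.throughput for vms in hosts.values() for vm in vms)

def most_constrained_vm(hosts):
    """Heuristic: the high-priority VM squeezed into the smallest cache share."""
    candidates = [(vm, host) for host, vms in hosts.items() for vm in vms]
    return min(candidates, key=lambda p: p[0].cache_occupancy / p[0].weight)

def least_contended_host(hosts, exclude):
    """Host whose VMs currently occupy the least cache in total."""
    return min((h for h in hosts if h != exclude),
               key=lambda h: sum(vm.cache_occupancy for vm in hosts[h]))

if __name__ == "__main__":
    hosts = {
        "hostA": [VM("oltp", 2.0, 800.0, 0.15), VM("batch", 1.0, 300.0, 0.70)],
        "hostB": [VM("web", 1.0, 500.0, 0.30)],
    }
    vm, src = most_constrained_vm(hosts)
    dst = least_contended_host(hosts, exclude=src)
    print(f"candidate migration: {vm.name} from {src} to {dst}")
    print("current weighted throughput:", weighted_throughput(hosts))
```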
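
Illustrative sketch for "The Impact of Soft Resource Allocation on n-Tier Application Scalability": a minimal use of two operational laws (Little's law and the utilization law) to reason about thread pool sizing, not the paper's iterative algorithm. All measurements and the headroom factor are invented.

```python
# Toy sketch: size a tier's thread pool from measured data using Little's law,
# N = X * R, and flag hardware bottlenecks with the utilization law, U = X * S.

import math

def suggested_pool_size(throughput_rps, residence_time_s, headroom=1.2):
    """
    Little's law: the average number of requests concurrently inside a tier
    equals throughput * residence time. A pool much smaller than this queues
    requests; a much larger one mostly adds contention in other resources.
    """
    return math.ceil(throughput_rps * residence_time_s * headroom)

def utilization(throughput_rps, service_demand_s):
    """Utilization law: U = X * S. Values near 1.0 indicate a hardware bottleneck."""
    return throughput_rps * service_demand_s

if __name__ == "__main__":
    # Hypothetical measurements for an app-server tier of an n-tier system.
    X = 400.0   # requests/s observed at the tier
    R = 0.050   # seconds spent in the tier per request (queueing + service)
    S = 0.002   # seconds of CPU demand per request at the tier
    print("suggested thread pool size:", suggested_pool_size(X, R))
    print("tier CPU utilization:", utilization(X, S))
```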
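
Illustrative sketch for "Profiling Directed NUMA Optimisation on Linux Systems": a toy classification of pages by access mode and preferred locality domain from per-thread access counts. This is not NUMAgrind; the data layout is an assumption, and actual placement would then go through the Linux NUMA APIs discussed in the paper.

```python
# Toy sketch of the kind of analysis a NUMA profiler enables: given per-thread
# access counts for each page and a thread->domain mapping, classify each page
# and suggest a home locality domain.

from collections import defaultdict

def classify_pages(page_accesses, thread_domain):
    """
    page_accesses: {page_id: {thread_id: access_count}}
    thread_domain: {thread_id: numa_domain}
    Returns {page_id: (mode, preferred_domain)} where mode is
    'private' (one thread), 'domain-local' (one domain), or 'shared'.
    """
    result = {}
    for page, counts in page_accesses.items():
        per_domain = defaultdict(int)
        for tid, n in counts.items():
            per_domain[thread_domain[tid]] += n
        preferred = max(per_domain, key=per_domain.get)
        if len(counts) == 1:
            mode = "private"
        elif len(per_domain) == 1:
            mode = "domain-local"
        else:
            mode = "shared"
        result[page] = (mode, preferred)
    return result

if __name__ == "__main__":
    accesses = {0x1000: {0: 500},             # touched by one thread only
                0x2000: {0: 300, 1: 280},     # threads on the same domain
                0x3000: {0: 100, 2: 900}}     # crosses domains
    domains = {0: 0, 1: 0, 2: 1}
    for page, (mode, dom) in classify_pages(accesses, domains).items():
        print(f"page {page:#x}: {mode}, place on domain {dom}")
```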
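
Illustrative sketch for "Model-Driven SIMD Code Generation for a Multi-Resolution Tensor Kernel": a toy cost model that ranks loop permutations of a small contraction by how vector-friendly the innermost loop is. The cost weights are invented; the paper's model also decides the placement of explicit load/store, transpose, and splat operations.

```python
# Toy model-driven loop-order selection for C[i,j] += A[i,k] * B[k,j]:
# score each loop permutation by the access pattern the innermost loop gives
# each array (contiguous, strided, or loop-invariant) and pick the cheapest.

from itertools import permutations

# Subscripts of each array in row-major order: the last subscript is stride-1.
ARRAYS = {"A": ("i", "k"), "B": ("k", "j"), "C": ("i", "j")}

def cost(order):
    inner = order[-1]                      # the loop carried by the vector lanes
    total = 0
    for subs in ARRAYS.values():
        if inner == subs[-1]:
            total += 1                     # contiguous vector load/store
        elif inner in subs:
            total += 4                     # strided access: transpose needed
        else:
            total += 2                     # invariant operand: splat/broadcast
    return total

if __name__ == "__main__":
    ranked = sorted(permutations(("i", "j", "k")), key=cost)
    for order in ranked:
        print(order, "cost", cost(order))
    print("chosen loop order:", ranked[0])   # an order with j innermost wins here
```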

SESSION 24: Parallel Programming Models and Languages

  • Multi-GPU MapReduce on GPU Clusters Authors: Jeffery Stuart (University of California, Davis, USA); John D. Owens (University of California, Davis, USA)
    We present GPMR, our stand-alone MapReduce library that leverages the power of GPU clusters for large-scale computing. To better utilize the GPU, we modify MapReduce by combining large amounts of map and reduce items into chunks and using partial reductions and accumulation. We use persistent map and reduce tasks and stress aspects of GPMR with a set of standard MapReduce benchmarks. We run these benchmarks on a GPU cluster and achieve desirable speedup and efficiency for all benchmarks. We compare our implementation to the current-best GPU-MapReduce library (which runs only on a single GPU) and a highly optimized multi-core MapReduce to show the power of GPMR. We demonstrate how typical MapReduce tasks are easily modified to fit into GPMR and leverage a GPU cluster. We highlight how total and relative amounts of communication affect GPMR. We conclude with an exposition on the types of MapReduce tasks well-suited to GPMR, and why some tasks need more modifications than others to work well with GPMR. (A toy sketch of the chunking idea appears after this session's listing.)
  • X10 as a parallel language for scientific computation: practice and experience Authors: Josh Milthorpe (Australian National University, Australia); V. Ganesh (Australian National University, Australia); Alistair P Re
    X10 is an emerging Partitioned Global Address Space (PGAS) language intended to increase significantly the productivity of developing scalable HPC applications. The language has now matured to a point where it is meaningful to consider writing large scale scientific application codes in X10. This paper reports our experiences writing three codes from the chemistry/material science domain: Fast Multipole Method (FMM), Particle Mesh Ewald (PME) and Hartree-Fock (HF), entirely in X10. Performance results are presented for up to 256 places on a Blue Gene/P system. During the course of this work our experiences have been shared with the X10 development team, so that application requirements could inform language design discussions as the language capabilities influenced algorithm design. This resulted in improvements in the language implementation and standard class libraries, including the design of the array API and support for complex math. Data constructs in X10 such as places and distributed arrays, and parallel constructs such as finish and async, simplify implementation of the applications in comparison with MPI. However, current implementation limitations in X10 2.1.2 make it difficult to achieve scalable performance using the most natural expressions of the algorithms. The most serious limitation is the use of point-to-point communication patterns, rather than collectives, to implement parallel constructs and array operations. This issue will be addressed in future releases of X10.
  • Implementation and Performance Evaluation of the HPC Challenge Benchmarks in Coarray Fortran 2.0 Authors: Guohua Jin (Rice University, USA); John Mellor-Crummey (Rice University, USA); Laksono Adhianto (Rice University, USA); William
    Today’s largest supercomputers have over two hundred thousand CPU cores and even larger systems are under development. Typically, these systems are programmed using message passing. Over the past decade, there has been considerable interest in developing simpler and more expressive programming models for them. Partitioned global address space (PGAS) languages are viewed as perhaps the most promising alternative. In this paper, we report on our experience developing a set of PGAS extensions to Fortran that we call Coarray Fortran 2.0 (CAF 2.0). Our design for CAF 2.0 goes well beyond the original 1998 design of Coarray Fortran (CAF) by Numrich and Reid. CAF 2.0 includes language support for many features including teams, collective communication, asynchronous communication, function shipping, and synchronization. We describe the implementation of these features and our experiences using them to implement the High Performance Computing Challenge (HPCC) benchmarks, including High Performance Linpack (HPL), RandomAccess, Fast Fourier Transform (FFT), and STREAM triad. On 4096 CPU cores of a Cray XT with 2.3 GHz single socket quad-core Opteron processors, we achieved 18.3 TFLOP/s with HPL, 2.01 GUP/s with RandomAccess, 125 GFLOP/s with FFT, and a bandwidth of 8.73 TByte/s with STREAM triad.
  • Communication Optimizations for Distributed-Memory X10 Programs Authors: Rajkishore Barik (Rice University, USA); Jisheng Zhao (Rice University, USA); David Grove (IBM Research, USA); Igor Peshansky (
    X10 is a new object-oriented PGAS (Partitioned Global Address Space) programming language with support for distributed asynchronous dynamic parallelism that goes beyond past SPMD message-passing models such as MPI and SPMD PGAS models such as UPC and Co-Array Fortran. The concurrency constructs in X10 make it possible to express complex computation and communication structures with higher productivity than other distributed-memory programming models. However, this productivity often comes at the cost of high performance overhead when the language is used in its full generality. This paper introduces high-level compiler optimizations and transformations to reduce communication and synchronization overheads in distributed-memory implementations of X10 programs. Specifically, we focus on locality optimizations such as scalar replacement and task localization, combined with supporting transformations such as loop distribution, scalar expansion, loop tiling, and loop splitting. We have completed a prototype implementation of these high-level optimizations, and performed a performance evaluation that shows significant improvements in performance, scalability, communication volume and number of tasks. We evaluated the communication optimizations on three platforms: a 128-node BlueGene/P cluster, a 32-node Nehalem cluster, and a 16-node Power7 cluster. On the BlueGene/P cluster, we observed a maximum performance improvement of 31.46x relative to the unoptimized case (for the MolDyn benchmark). On the Nehalem cluster, we observed a maximum performance improvement of 3.01x (for the NQueens benchmark), and on the Power7 cluster, we observed a maximum performance improvement of 2.73x (for the MolDyn benchmark). In addition, there was no case in which the optimized code was slower than the unoptimized case. We also believe that the optimizations presented in this paper will be necessary for any high-productivity PGAS language based on modern object-oriented principles that is designed for execution on future Extreme Scale systems, which place a high premium on locality improvement for performance and energy efficiency. (A toy illustration of scalar replacement appears after this session's listing.)
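
Illustrative sketch for "Multi-GPU MapReduce on GPU Clusters": the chunking and partial-reduction dataflow in plain Python, not GPU code and not the GPMR library itself. Word count is used only as a familiar stand-in workload.

```python
# Toy sketch of chunked MapReduce with partial reductions: map items chunk by
# chunk, reduce each chunk's key/value pairs locally, then fold the partial
# results together at the end.

from collections import Counter

def map_words(line):
    return [(w, 1) for w in line.split()]

def partial_reduce(pairs):
    """Per-chunk reduction: collapses duplicate keys before any 'communication'."""
    acc = Counter()
    for key, val in pairs:
        acc[key] += val
    return acc

def accumulate(partials):
    """Final accumulation of the per-chunk partial results."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

def chunked_mapreduce(lines, chunk_size=2):
    partials = []
    for start in range(0, len(lines), chunk_size):
        chunk = lines[start:start + chunk_size]
        pairs = [kv for line in chunk for kv in map_words(line)]
        partials.append(partial_reduce(pairs))   # shrinks the data early
    return accumulate(partials)

if __name__ == "__main__":
    text = ["the quick brown fox", "the lazy dog", "the fox"]
    print(chunked_mapreduce(text))
```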
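
Illustrative sketch for "Communication Optimizations for Distributed-Memory X10 Programs": scalar replacement applied to a loop-invariant remote read, shown with a stand-in remote_get() counter rather than real PGAS communication. Only the shape of the transformation is shown; the paper's compiler passes are not reproduced.

```python
# Toy illustration of scalar replacement for communication: a value re-fetched
# from a remote place on every iteration is hoisted into a local scalar, so a
# single "remote" access remains.

REMOTE_READS = 0

def remote_get(store, key):
    """Stand-in for a remote (cross-place) read; the counter tracks communication."""
    global REMOTE_READS
    REMOTE_READS += 1
    return store[key]

def unoptimized(store, xs):
    # Re-reads the same remote scale factor on every iteration.
    return [x * remote_get(store, "scale") for x in xs]

def scalar_replaced(store, xs):
    # Scalar replacement: the loop-invariant remote read is hoisted out of the loop.
    scale = remote_get(store, "scale")
    return [x * scale for x in xs]

if __name__ == "__main__":
    store, xs = {"scale": 3}, list(range(10))
    assert unoptimized(store, xs) == scalar_replaced(store, xs)
    print("remote reads for both versions combined:", REMOTE_READS)  # 10 + 1 = 11
```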

SESSION 25: Algorithms for Distributed Computing

  • I/O-optimal Algorithms for Orthogonal Problems for Private-Cache Chip Multiprocessors Authors: Deepak Ajwani (MADALGO, University of Aarhus, Denmark); Nodari Sitchinava (MADALGO, University of Aarhus, Denmark); Norbert Zeh
    The parallel external memory (PEM) model has been used as a basis for the design and analysis of a wide range of algorithms for private-cache multi-core architectures. As a tool for developing geometric algorithms in this model, a parallel version of the I/O-efficient distribution sweeping framework was introduced recently, and a number of algorithms for problems on axis-aligned objects were obtained using this framework. The obtained algorithms were efficient but not optimal. In this paper, we improve the framework to obtain algorithms with the optimal I/O complexity of O(sort_P(N) + K/(PB)) for a number of problems on axis-aligned objects; P denotes the number of cores/processors, B denotes the number of elements that fit in a cache line, N and K denote the sizes of the input and output, respectively, and sort_P(N) denotes the I/O complexity of sorting N items using P processors in the PEM model. To obtain the above improvement, we present a new one-dimensional batched range counting algorithm on a sorted list of ranges and points that achieves an I/O complexity of O((N + K)/(PB)), where K is the sum of the counts of all the ranges. The key to achieving efficient load balancing among the processors in this algorithm is a new method to count the output without enumerating it, which might be of independent interest. (A sequential toy version of batched range counting appears after this session's listing.)
  • A Fast Algorithm for Constructing Inverted Files on Heterogeneous Platforms Authors: Zheng Wei (University of Maryland, USA); Joseph JaJa (University of Maryland, College Park, USA)
    Given a collection of documents residing on a disk, we develop a new strategy for processing these documents and building the inverted files extremely fast. Our approach is tailored for a heterogeneous platform consisting of a multicore CPU and a highly multithreaded GPU. Our algorithm is based on a number of novel techniques including: (i) a high-throughput pipelined strategy that produces parallel parsed streams that are consumed at the same rate by parallel indexers; (ii) a hybrid trie and B-tree dictionary data structure in which the trie is represented by a table for fast look-up and each B-tree node contains string caches; (iii) allocation of parsed streams with frequent terms to CPU threads and the rest to GPU threads so as to match the throughput of parsed streams; and (iv) an optimized CUDA indexer implementation that ensures coalesced memory accesses and effective use of shared memory. We have performed extensive tests of our algorithm on a single node (two quad-core Intel Xeon X5560 CPUs) with two NVIDIA Tesla C1060 GPUs attached to it, and were able to achieve a throughput of more than 262 MB/s on the ClueWeb09 dataset. Similar results were obtained for widely different datasets. The throughput of our algorithm is superior to the best known algorithms reported in the literature, even when compared to those run on large clusters. (A toy parse-and-index sketch appears after this session's listing.)
  • Graph Partitioning with Natural Cuts Authors: Daniel Delling (Microsoft Research Silicon Valley, USA); Andrew Goldberg (Microsoft Research Silicon Valley, USA); Ilya Razensh
    We present a novel approach to graph partitioning based on the notion of natural cuts. Our algorithm, called PUNCH, has two phases. The first phase performs a series of minimum-cut computations to identify and contract dense regions of the graph. This reduces the graph size, but preserves its general structure. The second phase uses a combination of greedy and local search heuristics to assemble the final partition. The algorithm performs especially well on road networks, which have an abundance of natural cuts (such as bridges, mountain passes, and ferries). In a few minutes, it obtains the best known partitions for continental-sized networks, significantly improving on previous results. (A toy two-phase sketch appears after this session's listing.)
  • Reader Activation Scheduling in Multi-Reader RFID Systems: A Study of General Case Authors: Shao-Jie Tang (Illinois Institute of Technology, USA); Cheng Wang (Tongji University, Shanghai, P.R. China); Xiang-Yang Li (Illi
    Radio frequency identification (RFID) is a technology where a reader device can “sense” the presence of a nearby object by reading a tag device attached to the object. To guarantee coverage quality, multiple RFID readers can be deployed in a given region. In this paper, we consider the problem of activation scheduling for readers in a multi-reader environment. In particular, we try to design a schedule for readers that maximizes the number of served tags per time-slot while avoiding various interferences. We first develop a centralized algorithm under the assumption that different readers may have different interference and interrogation radii. Next, we propose a novel algorithm which does not need any location information of the readers. Finally, we extend the previous algorithm in a distributed manner to suit the case where no central entity exists. We conduct extensive simulations to study the performance of our proposed algorithms, and our evaluation results corroborate our theoretical analysis. (A toy greedy single-slot schedule appears after this session's listing.)
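
Illustrative sketch for "I/O-optimal Algorithms for Orthogonal Problems for Private-Cache Chip Multiprocessors": a sequential toy version of one-dimensional batched range counting on sorted input, counting without enumerating. The PEM-model parallel, I/O-efficient version in the paper is not reproduced.

```python
# Sequential toy version of batched range counting: given sorted points and a
# batch of ranges, report each range's count without enumerating the points it
# contains (two binary searches per range).

from bisect import bisect_left, bisect_right

def batched_range_count(points, ranges):
    """points must be sorted; each range is an inclusive (lo, hi) pair."""
    return [bisect_right(points, hi) - bisect_left(points, lo)
            for lo, hi in ranges]

if __name__ == "__main__":
    pts = [1, 3, 4, 4, 7, 9, 12]
    queries = [(2, 8), (4, 4), (10, 20)]
    print(batched_range_count(pts, queries))  # [4, 2, 1]
```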
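
Illustrative sketch for "A Fast Algorithm for Constructing Inverted Files on Heterogeneous Platforms": a toy single-threaded parse-and-index pipeline; nothing like the paper's CPU+GPU design in scale or structure, just the basic inverted-file dataflow.

```python
# Toy inverted-file builder: documents are parsed into terms, and each term
# maps to a postings list of (doc_id, term_frequency) pairs.

from collections import Counter, defaultdict

def parse(doc_text):
    """Parser stage: lowercase, strip punctuation, emit term frequencies."""
    terms = [w.strip(".,!?;:'\"()") for w in doc_text.lower().split()]
    return Counter(t for t in terms if t)

def build_inverted_index(docs):
    """Indexer stage: consume parsed streams and append to postings lists."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term, freq in parse(text).items():
            postings[term].append((doc_id, freq))
    return postings

if __name__ == "__main__":
    corpus = ["The cat sat on the mat.", "The dog chased the cat!"]
    index = build_inverted_index(corpus)
    print(index["the"])   # [(0, 2), (1, 2)]
    print(index["cat"])   # [(0, 1), (1, 1)]
```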
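
Illustrative sketch for "Graph Partitioning with Natural Cuts": the two-phase contract-then-assemble structure only. Here "dense" edges are simply those above a weight threshold and assembly is greedy packing; PUNCH's minimum-cut detection of natural cuts and its local search are far more sophisticated and are not reproduced.

```python
# Toy two-phase partitioner: phase 1 contracts "dense" regions into super-nodes,
# phase 2 greedily packs super-nodes into cells of bounded size.

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def contract_dense_regions(n, edges, threshold):
    """Phase 1: merge endpoints of heavy ('internal') edges into super-nodes."""
    uf = UnionFind(n)
    for u, v, w in edges:
        if w >= threshold:
            uf.union(u, v)
    groups = {}
    for v in range(n):
        groups.setdefault(uf.find(v), []).append(v)
    return list(groups.values())

def assemble_partition(groups, max_cell_size):
    """Phase 2: greedily pack super-nodes into cells of bounded size."""
    cells, current = [], []
    for g in sorted(groups, key=len, reverse=True):
        if current and len(current) + len(g) > max_cell_size:
            cells.append(current)
            current = []
        current.extend(g)
    if current:
        cells.append(current)
    return cells

if __name__ == "__main__":
    # Vertices 0-1-2 form a dense triangle, 3-4 a dense pair; edge 2-3 is a weak "cut".
    edges = [(0, 1, 5), (1, 2, 5), (0, 2, 5), (3, 4, 5), (2, 3, 1)]
    groups = contract_dense_regions(5, edges, threshold=3)
    print("super-nodes:", groups)
    print("cells:", assemble_partition(groups, max_cell_size=3))
```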
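
Illustrative sketch for "Reader Activation Scheduling in Multi-Reader RFID Systems": a toy greedy schedule for a single time slot under an invented disk-based interference rule, not the paper's centralized, location-free, or distributed algorithms.

```python
# Toy greedy activation schedule for one time slot: repeatedly activate the
# reader covering the most not-yet-served tags, skipping readers that would
# interfere with an already active one. Interference here is simply "centers
# closer than the sum of the two interference radii".

from math import dist

def covered(reader, tags):
    x, y, interrogation_r, _ = reader
    return {t for t in tags if dist((x, y), t) <= interrogation_r}

def interferes(a, b):
    return dist((a[0], a[1]), (b[0], b[1])) <= a[3] + b[3]

def greedy_schedule(readers, tags):
    active, served = [], set()
    remaining = list(readers)
    while remaining:
        best = max(remaining, key=lambda r: len(covered(r, tags) - served))
        if not covered(best, tags) - served:
            break                                    # no reader adds new tags
        active.append(best)
        served |= covered(best, tags)
        remaining = [r for r in remaining
                     if r is not best and not interferes(r, best)]
    return active, served

if __name__ == "__main__":
    # reader = (x, y, interrogation radius, interference radius)
    readers = [(0, 0, 2, 3), (1, 0, 2, 3), (10, 0, 2, 3)]
    tags = [(0, 1), (1, 1), (9, 0), (11, 1)]
    active, served = greedy_schedule(readers, tags)
    print("active readers:", active)
    print("tags served this slot:", served)
```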