TechTalks from event: IEEE IPDPS 2011

Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without logging in. For other talks, you will need to log in using the email address you registered with for IPDPS 2011. Note 2: Many of the talks (those without a thumbnail next to their description below) are yet to be uploaded. Some of them were not recorded because of technical problems; we are working with the corresponding authors to upload self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.

SESSION 10: GPU Acceleration

  • Design of MILC lattice QCD application for GPU clusters Authors: Guochun Shi (University of Illinois at Urbana-Champaign, USA); Steven Gottlieb (Indiana University, USA); Aaron Torok (Indiana University, USA)
    We present an implementation of the improved staggered quark action lattice QCD computation designed for execution on a GPU cluster. The parallelization strategy is based on dividing the space-time lattice along the time dimension and distributing the sub-lattices among the GPU cluster nodes. We provide a mixed-precision floating-point GPU implementation of the multi-mass conjugate gradient solver. Our single GPU implementation of the conjugate gradient solver achieves a 9x performance improvement over the highly optimized code executed on a state-of-the-art eight-core CPU node. The overall application executes almost six times faster on a GPU-enabled cluster vs. a conventional multi-core cluster. The developed code is currently used for running production QCD calculations with electromagnetic corrections.
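    As an illustration of the mixed-precision strategy described above (our sketch, not the authors' code), the following NumPy fragment runs conjugate gradient iterations in float32, the precision a fast GPU kernel would favor, inside a float64 defect-correction loop; the multi-mass (shifted) variant of the solver is omitted.

    import numpy as np

    def cg32(A, b, tol):
        # Plain conjugate gradient for an SPD matrix, run entirely in float32.
        A, b = A.astype(np.float32), b.astype(np.float32)
        x = np.zeros_like(b)
        r = b - A @ x
        p = r.copy()
        rs = r @ r
        while np.sqrt(rs) > tol:
            Ap = A @ p
            alpha = rs / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            p = r + (rs_new / rs) * p
            rs = rs_new
        return x

    def mixed_precision_cg(A, b, tol=1e-10):
        # Outer loop in float64: the cheap float32 solve plays the role of
        # the GPU kernel, and the float64 residual restores full accuracy.
        x = np.zeros_like(b)
        r = b - A @ x
        while np.linalg.norm(r) > tol:
            d = cg32(A, r, 1e-4 * np.linalg.norm(r))
            x += d.astype(np.float64)
            r = b - A @ x
        return x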
  • Multifrontal Factorization of Sparse SPD Matrices on GPUs Authors: Thomas George (IBM Research India, India); Vaibhav Saxena (IBM Research - India, New Delhi, India); Anshul Gupta (IBM T.J. Watson Research Center, USA)
    Solving large sparse linear systems is often the most computationally intensive component of many scientific computing applications. In the past, sparse multifrontal direct factorization has been shown to scale to thousands of processors on dedicated supercomputers resulting in a substantial reduction in computational time. In recent years, an alternative computing paradigm based on GPUs has gained prominence, primarily due to its affordability, power-efficiency, and the potential to achieve significant speedup relative to desktop performance on regular and structured parallel applications. However, sparse matrix factorization on GPUs has not been explored sufficiently due to the complexity involved in an efficient implementation and concerns of low GPU utilization. In this paper, we present an adaptive hybrid approach for accelerating sparse multifrontal factorization based on a judicious exploitation of the processing power of the host CPU and GPU. We present four different policies for distributing and scheduling the workload between the host CPU and the GPU, and propose a mechanism for a runtime selection of the appropriate policy for each step of sparse Cholesky factorization. This mechanism relies on auto-tuning based on modeling the best policy predictor as a parametric classifier. We estimate the classifier parameters from the available empirical computation time data such that the expected computation time is minimized. This approach is readily adaptable for using the current or an extended set of policies for different CPU-GPU combinations as well as for different combinations of dense kernels for both the CPU and the GPU.
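    The runtime policy selection can be pictured with the following sketch (ours; the policy names and the cubic cost model are hypothetical stand-ins for the paper's four policies and its parametric classifier). The model coefficients are fit from measured timings, and each factorization step then picks the policy with the smallest predicted time.

    import numpy as np

    POLICIES = ["cpu_only", "gpu_only", "hybrid_split_a", "hybrid_split_b"]

    def predict_time(theta, policy, n):
        # Parametric per-policy cost model for a frontal matrix of size n.
        a, b, c = theta[policy]
        return a * n**3 + b * n**2 + c

    def fit_theta(samples):
        # Least-squares fit from empirical (policy, n, seconds) samples,
        # standing in for the paper's expected-time-minimizing estimation.
        theta = {}
        for p in POLICIES:
            rows = [(n, t) for (q, n, t) in samples if q == p]
            X = np.array([[n**3, n**2, 1.0] for n, _ in rows])
            y = np.array([t for _, t in rows])
            theta[p] = np.linalg.lstsq(X, y, rcond=None)[0]
        return theta

    def choose_policy(theta, n):
        # Runtime selection for one step of the sparse Cholesky factorization.
        return min(POLICIES, key=lambda p: predict_time(theta, p, n))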
  • Large-Scale Semantic Concept Detection on Manycore Platforms for Multimedia Mining Authors: Mamadou Diao (Georgia Institute of Technology, USA); Chrysostomos Nicopoulos (University of Cyprus, Cyprus); Jongman Kim (Georgia Institute of Technology, USA)
    Media mining, the extraction of meaningful knowledge from multimedia content, has become a major application and poses significant computational challenges in today’s platforms. Media mining applications contain many sophisticated algorithms that include data-intensive analysis, classification, and learning. This paper explores the use of Graphics Processing Units (GPUs) in media mining. We are particularly focused on large-scale semantic concept detection, a state-of-the-art approach that maps media content to high-level semantic concepts and a building block in many media mining applications. We present a fast, parallel, large-scale, high-level semantic concept detector that leverages the GPU for image/video retrieval and content analysis. Through efficient data partitioning and movement, we parallelize feature extraction routines. By interleaving feature extraction routines of different types, we increase the computational intensity and mitigate the negative effects of histogram-like reduction operations. To cope with the very large number of semantic concepts, we propose a data layout of concept models on a multi-GPU hybrid architecture for high-throughput semantic concept detection. We achieve one to two orders of magnitude speedup compared to serial implementations, and our experiments show that we can detect 374 semantic concepts at a rate of over 100 frames/sec. This is over 100 times faster than a LibSVM-based semantic concept detector.
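    To make the multi-GPU concept-model layout concrete, here is a toy NumPy sketch (ours, with hypothetical shapes): the per-concept linear models are sharded across devices, and each frame's feature vector is scored against every shard.

    import numpy as np

    def shard_models(weights, n_gpus):
        # weights: (n_concepts, feat_dim) matrix, one row per concept model.
        return np.array_split(weights, n_gpus, axis=0)

    def detect_concepts(features, shards):
        # features: (n_frames, feat_dim). On real hardware each shard's
        # product would run on its own GPU; here we loop sequentially.
        scores = [features @ s.T for s in shards]
        return np.concatenate(scores, axis=1)   # (n_frames, n_concepts)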
  • Efficient GPU implementation for Particle in Cell Algorithm Authors: Rejith Joseph (University of Florida, USA); Girish Ravunnikutty (University of Florida, USA); Sanjay Ranka (University of Florida, USA)
    The particle-in-cell (PIC) algorithm is a widely used method in plasma physics for studying the trajectories of charged particles under electromagnetic fields. The PIC algorithm is computationally intensive, and its time requirements are proportional to the number of charged particles involved in the simulation. The focus of this paper is parallelizing the PIC algorithm on the Graphics Processing Unit (GPU). We present several performance trade-offs, related to the GPU's small shared memory and atomic operations, involved in achieving high performance.
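    The atomics trade-off mentioned above arises in the scatter (charge deposition) step, where many particles update the same grid cell. A minimal Python sketch of the privatization alternative to atomic adds (our illustration, not the authors' kernel):

    import numpy as np

    def deposit_charge(positions, charges, grid_size, n_threads=4):
        # Each "thread" accumulates into a private copy of the grid, and
        # the copies are reduced at the end; on a GPU the private copies
        # would live in shared memory and the reduction replaces atomics.
        private = np.zeros((n_threads, grid_size))
        for t, (pos, q) in enumerate(zip(positions, charges)):
            cell = int(pos * grid_size) % grid_size
            private[t % n_threads, cell] += q
        return private.sum(axis=0)   # conflict-free reduction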

SESSION 11: Multiprocessing and Concurrency

  • Hardware-based Job Queue Management for Manycore Architectures and OpenMP Environments Authors: Junghee Lee (Georgia Institute of Technology, USA); Chrysostomos Nicopoulos (University of Cyprus, Cyprus); Yongjae Lee (Georgia Institute of Technology, USA)
    The seemingly interminable shrinking of technology feature sizes well into the nano-scale regime has afforded computer architects an abundance of computational resources on a single chip. The Chip Multi-Processor (CMP) paradigm is now seen as the de facto architecture for years to come. However, in order to efficiently exploit the increasing number of on-chip processing cores, it is imperative to achieve and maintain efficient utilization of the resources at run time. Uneven and skewed distribution of workloads wastes CMP resources and may even lead to such undesired effects as traffic and temperature hotspots. While existing techniques rely mostly on software for load balancing and exploit hardware mainly for synchronization, we demonstrate that there are wider opportunities for hardware support of load balancing in CMP systems. Based on this observation, this paper proposes IsoNet, a conflict-free dynamic load distribution engine that exploits hardware aggressively to reinforce massively parallel computation in manycore settings. Moreover, the proposed architecture provides extensive fault-tolerance against both CPU faults and intra-IsoNet faults. The hardware takes charge of both (1) the management of the list of jobs to be executed, and (2) the transfer of jobs between processing elements to maintain load balance. Experimental results show that, unlike the existing popular techniques of blocking and job stealing, IsoNet is scalable to as many as 1024 processing cores.
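    A software analogue of the job-queue management and job-transfer duties described above might look like the following sketch (ours; IsoNet itself is a hardware engine, and all names here are hypothetical):

    from collections import deque

    class JobQueues:
        # Per-core job lists plus an explicit rebalancing step, mirroring
        # the two duties above: managing the job list and transferring jobs.
        def __init__(self, n_cores):
            self.queues = [deque() for _ in range(n_cores)]

        def push(self, core, job):
            self.queues[core].append(job)

        def rebalance(self):
            # Move one job per round from the most- to the least-loaded
            # core; a single transfer at a time keeps the step conflict-free.
            loads = [len(q) for q in self.queues]
            src, dst = loads.index(max(loads)), loads.index(min(loads))
            if loads[src] - loads[dst] > 1:
                self.queues[dst].append(self.queues[src].popleft())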
  • HK-NUCA: Boosting Data Searches in Dynamic Non-Uniform Cache Architectures for Chip Multiprocessors Authors: Javier Lira (Universitat Politècnica de Catalunya, Spain); Carlos Molina (Universitat Rovira i Virgili, Spain); Antonio González (Universitat Politècnica de Catalunya, Spain)
    The exponential increase in the cache sizes of multicore processors (CMPs), accompanied by growing on-chip wire delays, makes it difficult to implement traditional caches with single and uniform access latencies. Non-Uniform Cache Architecture (NUCA) designs have been proposed to address this problem. A NUCA divides the whole cache memory into smaller banks and allows nearer cache banks to have lower access latencies than farther banks, thus mitigating the effects of the cache’s internal wires. Traditionally, NUCA organizations have been classified as static (S-NUCA) and dynamic (D-NUCA). While in S-NUCA a data block is mapped to a unique bank in the NUCA cache, D-NUCA allows a data block to be mapped to multiple banks. Moreover, D-NUCA designs are dynamic in the sense that data blocks may migrate towards the cores that access them most frequently. Recent works consider D-NUCA a promising design; however, in order to obtain significant performance benefits, they have relied on an unaffordably expensive access scheme to find data in the NUCA cache. In this paper, we propose a novel and implementable data search algorithm for D-NUCA designs in CMP architectures, called HK-NUCA (Home Knows where to find data within the NUCA cache). It exploits migration features by providing fast and power-efficient access to data located close to the requesting core. Moreover, HK-NUCA implements an efficient and cost-effective search mechanism to reduce miss latency and on-chip network contention. We show that using HK-NUCA as the data search mechanism in a D-NUCA design reduces the energy consumed per memory request by about 40% and achieves an average performance improvement of 6%.
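    The central idea, that every block address has a statically known home bank which tracks where migrating copies of the block currently live, can be sketched as follows (our illustration; the structures and names are hypothetical):

    class HKNucaDirectory:
        def __init__(self, n_banks):
            self.n_banks = n_banks
            self.holders = {}              # home-bank records: addr -> set of banks

        def home(self, addr):
            return addr % self.n_banks     # static home mapping

        def migrate(self, addr, bank):
            self.holders.setdefault(addr, set()).add(bank)

        def lookup(self, addr, local_bank):
            # Check the local bank first; on a miss, consult the home bank's
            # record instead of broadcasting to every bank in the NUCA.
            banks = self.holders.get(addr, set())
            if local_bank in banks:
                return local_bank          # fast local hit
            return min(banks) if banks else None   # holder via home, or cache miss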
  • Power Token Balancing: Adapting CMPs to Power Constraints for Parallel Multithreaded Workloads Authors: Juan M. Cebrián (University of Murcia, Spain); Juan L. Aragón (University of Murcia, Spain); Stefanos Kaxiras (Un
    In recent years virtually all processor architectures have moved to multiple cores per chip (CMPs). It is possible to use legacy (i.e., single-core) power saving techniques in CMPs that run either sequential applications or independent multithreaded workloads. However, new challenges arise when running parallel shared-memory applications. In the latter case, sacrificing some performance in a single core (thread) in order to be more energy-efficient might unintentionally delay the rest of the cores (threads) due to synchronization points (locks/barriers), thereby harming the performance of the whole application. CMPs increasingly face thermal and power-related problems during their typical use, problems that can be addressed by setting a power budget for the processor/core. This paper initially studies the behavior of different techniques for matching a predefined power budget in a CMP processor. While legacy techniques work properly for thread-independent/multi-programmed workloads, parallel workloads exhibit the problem of independently adapting the power of each core in a thread-dependent scenario. To solve this problem we propose a novel mechanism, Power Token Balancing (PTB), aimed at accurately matching an external power constraint by balancing the power consumed among the different cores using a power token-based approach while optimizing energy efficiency. We can use power (seen as tokens or coupons) from non-critical threads for the benefit of critical threads. PTB is transparent for thread-independent/multi-programmed workloads and can also be used as a spinlock detector based on power patterns. Results show that PTB matches a predefined power budget more accurately (total energy consumed over the budget is reduced to 8% for a 16-core CMP) than DVFS, with only a 3% energy increase. Finally, we can trade accuracy in matching the power budget for energy efficiency, reducing energy by a further 4% at 20% accuracy.
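    A toy rendition of the token idea (ours, not the paper's mechanism): each core's demand is measured against an equal share of the budget, and the surplus of non-critical cores is donated to critical ones so the chip-wide total still meets the constraint.

    def balance_tokens(demand, budget, critical):
        # demand: per-core power demand in watts; critical: indices of
        # cores running critical threads; returns per-core power limits.
        share = budget / len(demand)
        surplus = sum(max(0.0, share - p)
                      for i, p in enumerate(demand) if i not in critical)
        limits = []
        for i, p in enumerate(demand):
            if i in critical and p > share:
                extra = min(surplus, p - share)   # donated tokens
                surplus -= extra
                limits.append(share + extra)
            else:
                limits.append(min(p, share))      # capped, e.g. via DVFS
        return limits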
  • A Very Fast Simulator For Exploring The Many-Core Future Authors: Olivier Certner (INRIA, France); Zheng Li (INRIA, France); Arun Raman (Princeton University, USA); Olivier Temam (INRIA Futurs, France)
    Although multi-core architectures with a large number of cores (“many-cores”) are considered the future of computing systems, there are currently few practical tools to quickly explore both their design and general program scalability. In this paper, we present SiMany, a discrete-event-based many-core simulator able to support more than a thousand cores while being orders of magnitude faster than existing flexible approaches. One of the difficult challenges for a reasonably realistic many-core simulation is to faithfully model the potentially high concurrency a program can exhibit. SiMany uses a novel virtual time synchronization technique, called spatial synchronization, to achieve this goal in a completely local and distributed fashion, which diminishes interactions and preserves locality. Compared to previous simulators, it raises the level of abstraction by focusing on modeling concurrent interactions between cores, which enables fast coarse comparisons of high-level architecture design choices and parallel program performance. Sequential pieces of code are executed natively for maximal speed. We exercise the simulator with a set of dwarf-like task-based benchmarks with dynamic control flow and irregular data structures. Scalability results are validated through comparison with a cycle-level simulator for up to 64 cores. They are also shown to be consistent with well-known benchmark characteristics. We finally demonstrate how SiMany can be used to efficiently compare the benchmarks’ behavior over a wide range of architectural organizations, such as polymorphic architectures and networks of clusters.
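    The flavor of spatial synchronization can be conveyed with a small sketch (ours; the real simulator is event-driven and far more refined): each simulated core keeps its own virtual clock and coordinates only with the cores it actually communicates with, never through a global barrier.

    class SimCore:
        def __init__(self):
            self.clock = 0          # this core's virtual time
            self.neighbors = []     # cores it exchanges messages with

        def can_advance(self, horizon):
            # A core may run ahead of any neighbor only by a bounded
            # amount, which preserves the causality of their messages.
            return all(self.clock - n.clock < horizon for n in self.neighbors)

    def simulate(cores, steps, horizon=10):
        for _ in range(steps):
            for c in cores:
                if c.can_advance(horizon):
                    c.clock += 1    # a sequential chunk, executed natively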

SESSION 12: Compilers

  • Variable Granularity Access Tracking Scheme for Improving the Performance of Software Transactional Memory Authors: Sandya Mannarswamy (Hewlett Packard India, India); Govindarajan Ramaswamy (Indian Institute of Science, India)
    Software transactional memory (STM) has been proposed as a promising programming paradigm for shared-memory multi-threaded programs, as an alternative to conventional lock-based synchronization primitives. Typical STM implementations employ a conflict detection scheme that works with uniform access granularity, tracking shared data accesses either at the word/cache-line level or at the object level. It is well known that a single fixed access-tracking granularity cannot meet the conflicting goals of reducing false conflicts without adversely impacting concurrency. A fine granularity, while improving concurrency, can have an adverse impact on performance due to lock aliasing, lock validation overheads, and additional cache pressure. On the other hand, a coarse granularity can impact performance due to reduced concurrency. Thus, in general, a fixed or uniform granularity access tracking (UGAT) scheme is application-unaware and rarely matches the access patterns of an individual application or parts of an application, leading to sub-optimal performance for different parts of the application(s). To mitigate the disadvantages of UGAT schemes, we propose a Variable Granularity Access Tracking (VGAT) scheme in this paper. We propose a compiler-based approach wherein the compiler uses inter-procedural whole-program static analysis to select the access-tracking granularity for different shared data structures of the application, based on the application’s data access patterns. We describe our prototype VGAT scheme, using TL2 as our STM implementation. Our experimental results reveal that our VGAT-STM scheme can improve the application performance of STAMP benchmarks by from 1.87% up to 21.2%.
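    The essence of variable-granularity tracking can be sketched as follows (our illustration; names are hypothetical): the compiler's whole-program analysis assigns each shared data structure a granularity, and the STM then maps each access to an ownership record accordingly.

    import threading

    WORD, OBJECT = "word", "object"

    class VGATLocks:
        def __init__(self, granularity_of):
            self.granularity_of = granularity_of   # struct name -> WORD | OBJECT
            self.orecs = {}

        def orec_for(self, struct, base_addr, offset):
            # Object granularity: one record covers the whole structure.
            # Word granularity: one record per word, finer but more records.
            if self.granularity_of.get(struct, WORD) == OBJECT:
                key = (struct, base_addr)
            else:
                key = (struct, base_addr + offset)
            return self.orecs.setdefault(key, threading.Lock())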
  • Automated architecture-aware mapping of streaming applications onto GPUs Authors: Andrei Hagiescu (National University of Singapore, Singapore); Huynh Phung Huynh (A*STAR Institute of High Performance Computing, Singapore)
    Graphics Processing Units (GPUs) are made up of many streaming multiprocessors, each consisting of processing cores that interleave the execution of a large number of threads. Groups of threads, called warps and wavefronts in nVidia and AMD literature respectively, are selected by the hardware scheduler and executed in lockstep on the available cores. If threads in such a group access the slow off-chip global memory, the entire group has to be stalled, and another group is scheduled instead. The utilization of a given multiprocessor will remain high if there is a sufficient number of alternative thread groups to select from. Many parallel general-purpose applications have been efficiently mapped to GPUs. Unfortunately, many stream processing applications exhibit unfavorable data movement patterns and a low computation-to-communication ratio that may lead to poor performance. In this paper, we describe an automated compilation flow that maps most stream processing applications onto GPUs by taking into consideration two important architectural features of nVidia GPUs, namely interleaved execution and the small amount of shared memory available in each streaming multiprocessor. In particular, we show that by using a small number of compute threads, so that the memory footprint is reduced, we can achieve high utilization of the GPU cores. Our scheme goes against the conventional wisdom of GPU programming, which is to use a large number of homogeneous threads. Instead, it uses a mix of compute and memory-access threads, together with a carefully crafted schedule that exploits the parallelism in the streaming application while maximizing the effectiveness of the unique memory hierarchy. We have implemented our scheme in the compiler of the StreamIt programming language, and our results show a significant speedup compared to the state-of-the-art solutions.
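    The compute/memory thread split can be illustrated with ordinary CPU threads (our sketch; on the GPU the "memory" role hides global-memory latency, and the bounded queue stands in for the small shared memory):

    import queue
    import threading

    def memory_thread(source, ready):
        # Streams input buffers toward the compute side.
        for buf in source:
            ready.put(buf)
        ready.put(None)                      # end-of-stream marker

    def compute_thread(ready, results, kernel):
        # Works only on buffers that are already resident.
        while (buf := ready.get()) is not None:
            results.append(kernel(buf))

    def run_stream(source, kernel):
        ready, results = queue.Queue(maxsize=4), []   # small memory footprint
        t_mem = threading.Thread(target=memory_thread, args=(source, ready))
        t_cmp = threading.Thread(target=compute_thread, args=(ready, results, kernel))
        t_mem.start(); t_cmp.start()
        t_mem.join(); t_cmp.join()
        return results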
  • Automatic Loop Tiling for Direct Memory Access Authors: Haibo Lin (IBM Research - China, P.R. China); Tao Liu (IBM Research - China, P.R. China); Lakshminarayanan Renganarayana (IBM T.J. Watson Research Center, USA)
    In heterogeneous multi-core systems, such as the Cell BE processor, each accelerator core has its own fast local memory without hardware-supported coherence, and the software is responsible for dynamically transferring data between the fast local and slow global memory. The data can be transferred through either a software-controlled cache or a direct buffer. The software-controlled cache maintains correctness for arbitrary access patterns, but introduces the extra overhead of cache lookup. A direct buffer is efficient for regular accesses, but requires precise analysis, detailed modeling of execution, and significant code generation. In this paper we present the design and implementation of DMATiler, which combines compiler analysis and runtime management to optimize local memory performance via automatic loop tiling and buffer optimization techniques. The DMATiler chooses a data-transfer-friendly loop order and, using an empirically validated DMA performance model, formulates and solves a convex optimization problem to determine globally optimal tile sizes. Further, the DMATiler applies optimization techniques such as compressed data transfers and DMA commands to achieve the best DMA performance for a given loop nest. We have implemented the DMATiler in the IBM XL Single Source Compiler (SSC), and have conducted experiments with a set of loop nest benchmarks. The results show that the DMATiler is much more efficient than a software-controlled cache (average speedup of 9.8x) and single-level loop blocking (average speedup of 6.2x) on the Cell BE processor.
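    The tile-size decision can be pictured with a brute-force stand-in for the convex solve (our sketch; the cost coefficients below are made up, whereas the paper fits an empirically validated DMA model):

    def dma_cost(rows, cols, elem_bytes=8, setup=1e-6, bw=25e9):
        # Per-transfer setup latency plus bytes over bandwidth.
        return setup + (rows * cols * elem_bytes) / bw

    def best_tile(n_rows, n_cols, local_bytes=256 * 1024, elem_bytes=8):
        # Minimize total DMA time subject to the tile fitting in the
        # accelerator's local store; larger tiles amortize setup cost.
        best, best_t = None, float("inf")
        for tr in range(1, n_rows + 1):
            for tc in range(1, n_cols + 1):
                if tr * tc * elem_bytes > local_bytes:
                    continue
                n_tiles = -(-n_rows // tr) * -(-n_cols // tc)   # ceil division
                total = n_tiles * dma_cost(tr, tc, elem_bytes)
                if total < best_t:
                    best, best_t = (tr, tc), total
        return best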
  • Tolerant Value Speculation in Coarse-Grain Streaming Computations Authors: Nathaniel Azuelos (Technion, Israel); Idit Keidar (Technion, Israel); Ayal Zaks (IBM Haifa Research Lab, Israel)
    Streaming applications are the subject of growing interest, as the need for fast access to data continues to grow. In this work, we present the design requirements and implementation of coarse-grain value speculation in streaming applications. We explain how this technique can be useful in cases where serial parts of applications constitute bottlenecks, and when slower I/O favors using available prefixes of the data. Contrary to previous work, we show how allowing some tolerance can justify early predictions on the scale of a large window of values. We suggest a methodology for runtime support of speculation, along with the mechanisms required for rollback, and discuss the resource management issues that follow from our technique. We study how validation and speculation frequencies impact the performance of the program. Finally, we present our implementation in the context of the Huffman encoder benchmark, running it in different configurations and on different architectures.
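    The tolerant speculation loop can be sketched as follows (ours; predict and within_tolerance are hypothetical application-supplied hooks): a whole window of values is predicted from the prefix seen so far, downstream work starts on the prediction immediately, and a rollback is triggered only when the real values fall outside the tolerance.

    def speculate(stream, predict, within_tolerance):
        # stream yields chunks of real values (slowly, e.g. I/O bound);
        # predict(prefix, n) guesses the next n values from the prefix.
        prefix, out = [], []
        for chunk in stream:
            guess = predict(prefix, len(chunk))   # early, speculative values
            out.append(guess)                     # downstream work starts here
            if not within_tolerance(guess, chunk):
                out[-1] = chunk                   # rollback: redo with real data
            prefix.extend(chunk)
        return out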