IEEE IPDPS 2011
TechTalks from event: IEEE IPDPS 2011
SESSION 10: GPU Acceleration
Design of MILC lattice QCD application for GPU clusters
We present an implementation of the improved staggered quark action lattice QCD computation designed for execution on a GPU cluster. The parallelization strategy is based on dividing the space-time lattice along the time dimension and distributing the sub-lattices among the GPU cluster nodes. We provide a mixed-precision floating-point GPU implementation of the multi-mass conjugate gradient solver. Our single-GPU implementation of the conjugate gradient solver achieves a 9x performance improvement over the highly optimized code executed on a state-of-the-art eight-core CPU node. The overall application executes almost six times faster on a GPU-enabled cluster than on a conventional multi-core cluster. The developed code is currently used for running production QCD calculations with electromagnetic corrections.
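As a rough illustration of the time-dimension decomposition mentioned in the abstract above, the sketch below splits the time extent of a lattice into contiguous slabs, one per node. The function name `split_time_dimension` and the rule for distributing leftover time slices are assumptions made for this example; the abstract does not describe how the authors' code handles uneven splits or boundary exchange.

```python
from typing import List, Tuple


def split_time_dimension(nt: int, num_nodes: int) -> List[Tuple[int, int]]:
    """Split time slices 0..nt-1 into contiguous half-open ranges, one per node.

    Hypothetical helper: leftover slices (when nt is not divisible by
    num_nodes) are given to the lowest-ranked nodes, one each.
    """
    if num_nodes <= 0 or nt < num_nodes:
        raise ValueError("need at least one time slice per node")
    base, extra = divmod(nt, num_nodes)
    ranges = []
    start = 0
    for rank in range(num_nodes):
        length = base + (1 if rank < extra else 0)
        ranges.append((start, start + length))  # [start, end) owned by this rank
        start += length
    return ranges


# Example: a lattice with 64 time slices distributed over 4 nodes.
slabs = split_time_dimension(64, 4)
```

Each node would then hold the full spatial volume for its own range of time slices, so only the two slab boundaries in the time direction involve neighbouring nodes.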
Multifrontal Factorization of Sparse SPD Matrices on GPUs
Solving large sparse linear systems is often the most computationally intensive component of many scientific computing applications. In the past, sparse multifrontal direct factorization has been shown to scale to thousands of processors on dedicated supercomputers, resulting in a substantial reduction in computational time. In recent years, an alternative computing paradigm based on GPUs has gained prominence, primarily due to its affordability, power efficiency, and the potential to achieve significant speedup relative to desktop performance on regular and structured parallel applications. However, sparse matrix factorization on GPUs has not been explored sufficiently due to the complexity involved in an efficient implementation and concerns of low GPU utilization. In this paper, we present an adaptive hybrid approach for accelerating sparse multifrontal factorization based on a judicious exploitation of the processing power of the host CPU and GPU. We present four different policies for distributing and scheduling the workload between the host CPU and the GPU, and propose a mechanism for a runtime selection of the appropriate policy for each step of sparse Cholesky factorization. This mechanism relies on auto-tuning based on modeling the best policy predictor as a parametric classifier. We estimate the classifier parameters from the available empirical computation time data such that the expected computation time is minimized. This approach is readily adaptable for using the current or an extended set of policies for different CPU-GPU combinations as well as for different combinations of dense kernels for both the CPU and the GPU.
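The idea of fitting a policy predictor to timing data can be sketched in a heavily simplified form. The example below assumes only two policies (run a frontal matrix on the CPU or on the GPU) and a classifier with a single parameter, a size threshold, chosen to minimise the total measured time. The paper describes four policies and does not specify the classifier form in the abstract, so the names `fit_threshold` and `choose_policy` and all timing values here are hypothetical.

```python
from typing import List, Tuple

# (front_size, cpu_seconds, gpu_seconds); the numbers are invented for illustration.
Sample = Tuple[int, float, float]


def total_time(samples: List[Sample], threshold: int) -> float:
    """Total time if fronts smaller than `threshold` run on the CPU, the rest on the GPU."""
    return sum(cpu if size < threshold else gpu for size, cpu, gpu in samples)


def fit_threshold(samples: List[Sample]) -> int:
    """Choose the threshold (the classifier's only parameter) that minimises total time."""
    candidates = sorted({size for size, _, _ in samples})
    candidates.append(candidates[-1] + 1)  # allows "everything on the CPU"
    return min(candidates, key=lambda t: total_time(samples, t))


def choose_policy(front_size: int, threshold: int) -> str:
    """Runtime selection for one factorization step."""
    return "cpu" if front_size < threshold else "gpu"


samples = [
    (64, 0.001, 0.010),
    (256, 0.020, 0.015),
    (1024, 0.900, 0.120),
    (4096, 55.0, 4.0),
]
threshold = fit_threshold(samples)
```

With these made-up timings the fitted threshold sends only the smallest fronts to the CPU, which mirrors the general intuition that small dense blocks do not amortise GPU transfer and launch overheads.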
Large-Scale Semantic Concept Detection on Manycore Platforms for Multimedia Mining
Media mining, the extraction of meaningful knowledge from multimedia content, has become a major application and poses significant computational challenges on today's platforms. Media mining applications contain many sophisticated algorithms that include data-intensive analysis, classification, and learning. This paper explores the use of Graphics Processing Units (GPUs) in media mining. We focus on large-scale semantic concept detection, a state-of-the-art approach that maps media content to high-level semantic concepts and a building block in many media mining applications. We present a fast, parallel, large-scale, high-level semantic concept detector that leverages the GPU for image/video retrieval and content analysis. Through efficient data partitioning and movement, we parallelize feature extraction routines. By interleaving feature extraction routines of different types, we increase the computational intensity and mitigate the negative effects of histogram-like reduction operations. To cope with the very large number of semantic concepts, we propose a data layout of concept models on a multi-GPU hybrid architecture for high-throughput semantic concept detection. We achieve one to two orders of magnitude speedups compared to serial implementations, and our experiments show that we can detect 374 semantic concepts at a rate of over 100 frames/sec. This is over 100 times faster than LibSVM-based semantic concept detection.
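To make the phrase "histogram-like reduction" concrete, here is a minimal serial sketch of the usual pattern: the input is partitioned into chunks, each chunk produces a private partial histogram, and the partial results are merged at the end. This is a generic illustration, not the authors' feature extraction code; the function names, the 8-bit intensity range, and the chunk size are assumptions, and the interleaving of different feature routines described in the abstract is not shown.

```python
from typing import List


def partial_histogram(chunk: List[int], bins: int) -> List[int]:
    """Histogram of one chunk of 8-bit intensities (values 0..255)."""
    hist = [0] * bins
    for value in chunk:
        hist[value * bins // 256] += 1
    return hist


def merge_histograms(partials: List[List[int]]) -> List[int]:
    """Reduction step: sum the per-chunk histograms bin by bin."""
    return [sum(column) for column in zip(*partials)]


def chunked_histogram(pixels: List[int], bins: int = 4, chunk_size: int = 3) -> List[int]:
    """Partition the data, histogram each chunk independently, then merge."""
    chunks = [pixels[i:i + chunk_size] for i in range(0, len(pixels), chunk_size)]
    return merge_histograms([partial_histogram(c, bins) for c in chunks])


pixels = [0, 10, 70, 130, 200, 255, 64, 63]
hist = chunked_histogram(pixels)
```

On a GPU each chunk would typically map to a thread block with its own partial histogram, and the merge is the step whose cost the abstract refers to as a negative effect to be mitigated.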
Efficient GPU implementation for Particle in Cell Algorithm
The particle-in-cell (PIC) algorithm is a widely used method in plasma physics to study the trajectories of charged particles under electromagnetic fields. The PIC algorithm is computationally intensive, and its time requirements are proportional to the number of charged particles involved in the simulation. The focus of the paper is to parallelize the PIC algorithm on a Graphics Processing Unit (GPU). We present several performance trade-offs related to small shared memory and atomic operations on the GPU to achieve high performance.
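One step of a typical PIC cycle where atomic operations commonly arise is charge deposition, in which every particle adds its charge to nearby grid points. The serial sketch below shows that step on a 1D periodic grid with linear weighting. It is a textbook-style illustration under stated assumptions (1D, periodic boundaries, non-negative positions, the hypothetical name `deposit_charge`), and the abstract does not say which parts of the authors' implementation rely on atomics.

```python
from typing import List


def deposit_charge(positions: List[float], charges: List[float],
                   num_cells: int, dx: float) -> List[float]:
    """Scatter particle charges onto a 1D periodic grid with linear weighting.

    Assumes all positions are non-negative. The loop visits each particle
    once, so the cost grows linearly with the number of particles.
    """
    rho = [0.0] * num_cells
    for x, q in zip(positions, charges):
        cell = int(x / dx)          # index of the grid point to the left
        frac = x / dx - cell        # fractional distance to that grid point
        # In a parallel version several particles may update the same grid
        # point at once, so these two updates would need to be atomic adds.
        rho[cell % num_cells] += q * (1.0 - frac)
        rho[(cell + 1) % num_cells] += q * frac
    return rho


rho = deposit_charge([0.5, 1.25, 3.75], [1.0, 1.0, 1.0], num_cells=4, dx=1.0)
```

The total deposited charge equals the sum of the particle charges, which is a convenient sanity check for any parallel variant.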