IEEE IPDPS 2011
TechTalks from event: IEEE IPDPS 2011
Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without requiring log-in. For other talks, you will need to log-in using the email you registered for IPDPS 2011. Note 2: Many of the talks (those without a thumbnail next to the their description below) are yet to be uploaded. Some of them were not recorded because of technical problems. We are working with the corresponding authors to upload the self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.
SESSION 2: Communication & I/O Optimization
Communication-Avoiding QR Decomposition for GPUsThe increasing energy demand coupled with emerging sustainability concerns requires a re-examination of power/thermal issues in data centers from the perspective of short term energy de?ciencies. Such energy de?cient scenarios arise for a variety of reasons including variable energy supply from renewable sources and inadequate power, thermal and cooling capacities. In this paper we propose a hierarchical control scheme to adapt assignments of tasks to servers in a way that can cope with the varying energy limitations and still provide necessary QoS . The rescheduling of tasks on different servers has direct (migration related) and indirect (changed traf?c patterns) network energy impacts that we also consider. We show the stability of our scheme and evaluate its performance via detailed simulations and experiments.
Overlapping Computation and Communication for Advection on Hybrid Parallel ComputersWe describe computational experiments exploring the performance improvements from overlapping computation and communication on hybrid parallel computers. Our test case is explicit time integration of linear advection with constant uniform velocity in a three-dimensional periodic domain. The test systems include a Cray XT5, a Cray XE6, and two multicore In?niband clusters with different generations of NVIDIA graphics processing units (GPUs). We describe results for Fortran implementations using various combinations of MPI, OpenMP, and CUDA, with and without overlap of computation and communication. We ?nd that overlapping CPU computation, GPU computation, parallel communication, and CPU-GPU communication can provide performance improvements of more than a factor of two.
VisIO: Enabling Interactive Visualization of Ultra-Scale, Time Series Data via High-Bandwidth Distributed I/O SystemsPetascale simulations compute at resolutions ranging into billions of cells and write terabytes of data for visualization and analysis. Interactive visualization of this time series is a desired step before starting a new run. The I/O subsystem and associated network often are a signi?cant impediment to interactive visualization of time-varying data; as they are not con?gured or provisioned to provide necessary I/O read rates. In this paper, we propose a new I/O library for visualization applications: VisIO. Visualization applications commonly use N-to-N reads within their parallel enabled readers which provides an incentive for a shared-nothing approach to I/O, similar to other data-intensive approaches such as Hadoop. However, unlike other data-intensive applications, visualization requires: (1) interactive performance for large data volumes, (2) compatibility with MPI and POSIX ?le system semantics for compatibility with existing infrastructure, and (3) use of existing ?le formats and their stipulated data partitioning rules. VisIO, provides a mechanism for using a non-POSIX distributed ?le system to provide linear scaling of I/O bandwidth. In addition, we introduce a novel scheduling algorithm that helps to co-locate visualization processes on nodes with the requested data. Testing using VisIO integrated into ParaView was conducted using the Hadoop Distributed File System (HDFS) on TACCâ€™s Longhorn cluster. A representative dataset, VPIC, across 128 nodes showed a 64.4% read performance improvement compared to the provided Lustre installation. Also tested, was a dataset representing a global ocean salinity simulation that showed a 51.4% improvement in read performance over Lustre when using our VisIO system. VisIO, provides powerful high-performance I/O services to visualization applications, allowing for interactive performance with ultra-scale, time-series data.
Architectural constraints to attain 1 Exaflop/s on three scientific application classesThe first Teraflop/s computer, the ASCI Red, became operational in 1997, and it took more than 11 years for a Petaflop/s performance machine, the IBM Roadrunner, to appear on the Top500 list. Efforts have begun to study the hardware and software challenges for building an exascale machine. It is important to understand and meet these challenges in order to attain Exa?op/s performance. This paper presents a feasibility study of three important application classes to formulate the constraints that these classes will impose on the machine architecture for achieving a sustained performance of 1 Exaflop/s. The application classes being considered in this paper are â€“ classical molecular dynamics, cosmological simulations and unstructured grid computations (?nite element solvers). We analyze the problem sizes required for representative algorithms in each class to achieve 1 Exaflop/s and the hardware requirements in terms of the network and memory. Based on the analysis for achieving an Exaflop/s, we also discuss the performance of these algorithms for much smaller problem sizes.