TechTalks from event: IEEE IPDPS 2011

Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without logging in. For other talks, you will need to log in using the email address you registered with for IPDPS 2011. Note 2: Many of the talks (those without a thumbnail next to their description below) are yet to be uploaded. Some of them were not recorded because of technical problems. We are working with the corresponding authors to upload the self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.

SESSION 2: Communication & I/O Optimization

  • Communication-Avoiding QR Decomposition for GPUs Authors: Michael Anderson (University of California, Berkeley, USA); Grey Ballard (UC Berkeley, USA); James Demmel (University of Califo
    The increasing energy demand coupled with emerging sustainability concerns requires a re-examination of power/thermal issues in data centers from the perspective of short term energy deficiencies. Such energy deficient scenarios arise for a variety of reasons including variable energy supply from renewable sources and inadequate power, thermal and cooling capacities. In this paper we propose a hierarchical control scheme to adapt assignments of tasks to servers in a way that can cope with the varying energy limitations and still provide necessary QoS. The rescheduling of tasks on different servers has direct (migration related) and indirect (changed traffic patterns) network energy impacts that we also consider. We show the stability of our scheme and evaluate its performance via detailed simulations and experiments.
  • Overlapping Computation and Communication for Advection on Hybrid Parallel Computers Authors: James B White (National Center for Atmospheric Research, USA); Jack Dongarra (University of Tennessee, Knoxville, USA)
    We describe computational experiments exploring the performance improvements from overlapping computation and communication on hybrid parallel computers. Our test case is explicit time integration of linear advection with constant uniform velocity in a three-dimensional periodic domain. The test systems include a Cray XT5, a Cray XE6, and two multicore InfiniBand clusters with different generations of NVIDIA graphics processing units (GPUs). We describe results for Fortran implementations using various combinations of MPI, OpenMP, and CUDA, with and without overlap of computation and communication. We find that overlapping CPU computation, GPU computation, parallel communication, and CPU-GPU communication can provide performance improvements of more than a factor of two.
  • VisIO: Enabling Interactive Visualization of Ultra-Scale, Time Series Data via High-Bandwidth Distributed I/O Systems Authors: Christopher Mitchell (University of Central Florida, USA); James Ahrens (Los Alamos National Laboratory, USA); Jun Wang (Univer
    Petascale simulations compute at resolutions ranging into billions of cells and write terabytes of data for visualization and analysis. Interactive visualization of this time series is a desired step before starting a new run. The I/O subsystem and associated network are often a significant impediment to interactive visualization of time-varying data, as they are not configured or provisioned to provide the necessary I/O read rates. In this paper, we propose a new I/O library for visualization applications: VisIO. Visualization applications commonly use N-to-N reads within their parallel-enabled readers, which provides an incentive for a shared-nothing approach to I/O, similar to other data-intensive approaches such as Hadoop. However, unlike other data-intensive applications, visualization requires: (1) interactive performance for large data volumes, (2) compatibility with MPI and POSIX file system semantics for compatibility with existing infrastructure, and (3) use of existing file formats and their stipulated data partitioning rules. VisIO provides a mechanism for using a non-POSIX distributed file system to provide linear scaling of I/O bandwidth. In addition, we introduce a novel scheduling algorithm that helps to co-locate visualization processes on nodes with the requested data. Testing using VisIO integrated into ParaView was conducted using the Hadoop Distributed File System (HDFS) on TACC’s Longhorn cluster. A representative dataset, VPIC, across 128 nodes showed a 64.4% read performance improvement compared to the provided Lustre installation. Also tested was a dataset representing a global ocean salinity simulation, which showed a 51.4% improvement in read performance over Lustre when using our VisIO system. VisIO provides powerful high-performance I/O services to visualization applications, allowing for interactive performance with ultra-scale, time-series data.
  • Architectural constraints to attain 1 Exaflop/s on three scientific application classes Authors: Abhinav Bhatele (University of Illinois at Urbana-Champaign, USA); Pritish Jetley (University of Illinois at Urbana-Champaign,
    The first Teraflop/s computer, the ASCI Red, became operational in 1997, and it took more than 11 years for a Petaflop/s performance machine, the IBM Roadrunner, to appear on the Top500 list. Efforts have begun to study the hardware and software challenges for building an exascale machine. It is important to understand and meet these challenges in order to attain Exaflop/s performance. This paper presents a feasibility study of three important application classes to formulate the constraints that these classes will impose on the machine architecture for achieving a sustained performance of 1 Exaflop/s. The application classes being considered in this paper are – classical molecular dynamics, cosmological simulations and unstructured grid computations (finite element solvers). We analyze the problem sizes required for representative algorithms in each class to achieve 1 Exaflop/s and the hardware requirements in terms of the network and memory. Based on the analysis for achieving an Exaflop/s, we also discuss the performance of these algorithms for much smaller problem sizes.
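
The overlap pattern described in the advection abstract above can be illustrated with a small sketch. This is not the authors' Fortran/MPI/CUDA code: it is a one-dimensional, pure-Python toy that assumes a first-order upwind scheme, and the names `upwind_step`, `fetch_halo`, and `overlapped_step` are hypothetical. A thread pool stands in for non-blocking halo communication. The point it shows is the ordering: post the halo exchange, update the interior cells that need no remote data, then finish the boundary cells once the halo arrives.

```python
from concurrent.futures import ThreadPoolExecutor


def upwind_step(u, c):
    """Reference: one upwind step for linear advection on a periodic 1-D grid.

    c is the Courant number (velocity * dt / dx), assumed positive.
    """
    return [u[i] - c * (u[i] - u[i - 1]) for i in range(len(u))]


def fetch_halo(neighbour):
    """Stand-in for a halo receive: the last cell of the left neighbour."""
    return neighbour[-1]


def overlapped_step(parts, c, pool):
    """One upwind step over a list of subdomains, overlapping halo fetch and compute."""
    # 1. Post the "communication" first; parts[k - 1] wraps around periodically.
    halos = [pool.submit(fetch_halo, parts[k - 1]) for k in range(len(parts))]
    # 2. Update interior cells, which depend only on local data.
    new_parts = []
    for p in parts:
        interior = [p[i] - c * (p[i] - p[i - 1]) for i in range(1, len(p))]
        new_parts.append([None] + interior)
    # 3. Wait for each halo and finish the single boundary cell.
    for k, p in enumerate(parts):
        new_parts[k][0] = p[0] - c * (p[0] - halos[k].result())
    return new_parts


def flatten(parts):
    """Concatenate subdomains back into one global array."""
    return [x for p in parts for x in p]
```

In a real MPI code, step 1 would typically correspond to posting non-blocking sends and receives and step 3 to waiting on them; the sketch only mirrors that structure and says nothing about the speedups reported in the talk.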

SESSION 18: Distributed Systems

  • GRAL: A Grouping Algorithm to Optimize Application Placement in Wireless Embedded Systems Authors: Nikos Tziritas (University of Thessaly, Greece); Thanasis Loukopoulos (Technological Educational Institute of Lamia, Greece); S
    Recent embedded middleware initiatives enable the structuring of an application as a set of collaborating agents deployed in the various sensing/actuating entities of the system. Of particular importance is the incurred cost due to agent communication, which in turn depends on agent positions in the system. In this paper we present GRAL, a grouping algorithm that migrates groups of agents with the aim of minimizing communication. The algorithm works in a distributed fashion based on knowledge available locally at each node and can be used both for one-shot initial application deployment and for the continuous updating of agent placement. Through simulation experiments under various scenarios we evaluate the algorithm, comparing the solution quality reached against the optimal obtained from exhaustive search.
  • Vitis: A Gossip-based Hybrid Overlay for Internet-scale Publish/Subscribe Enabling Rendezvous Routing in Unstructured Overlay Networks Authors: Fatemeh Rahimian (KTH - Royal Institute of Technology, Sweden); Sarunas Girdzijauskas (Swedish Institute of Computer Science (S
    Peer-to-peer overlay networks are attractive solutions for building Internet-scale publish/subscribe systems. However, scalability comes with a cost: a message published on a certain topic often needs to traverse a large number of uninterested (unsubscribed) nodes before reaching all its subscribers. This might sharply increase resource consumption for such relay nodes (in terms of bandwidth transmission cost, CPU, etc.) and could ultimately lead to rapid deterioration of the system’s performance once the relay nodes start dropping the messages or choose to permanently abandon the system. In this paper, we introduce Vitis, a gossip-based publish/subscribe system that significantly decreases the number of relay messages, and scales to an unbounded number of nodes and topics. This is achieved by the novel approach of enabling rendezvous routing on unstructured overlays. We construct a hybrid system by injecting structure into an otherwise unstructured network. The resulting structure resembles a navigable small-world network, which spans along clusters of nodes that have similar subscriptions. The properties of such an overlay make it an ideal platform for efficient data dissemination in large-scale systems. We perform extensive simulations and evaluate Vitis by comparing its performance against two base-line publish/subscribe systems: one that is oblivious to node subscriptions, and another that exploits the subscription similarities. Our measurements show that Vitis significantly outperforms the base-line solutions on various subscription and churn scenarios, from both synthetic models and real-world traces.
  • Moving the Code to the Data - Dynamic Code Deployment using ActiveSpaces Authors: Ciprian Docan (Rutgers, The State University of New Jersey, USA); Manish Parashar (Rutgers, The State University of New Jersey,
    Managing the large volumes of data produced by emerging scientific and engineering simulations running on leadership-class resources has become a critical challenge. The data has to be extracted off the computing nodes and transported to consumer nodes so that it can be processed, analyzed, visualized, archived, etc. Several recent research efforts have addressed data-related challenges at different levels. One attractive approach is to offload expensive I/O operations to a smaller set of dedicated computing nodes known as a staging area. However, even using this approach, the data still has to be moved from the staging area to consumer nodes for processing, which continues to be a bottleneck. In this paper, we investigate an alternate approach, namely moving the data-processing code to the staging area rather than moving the data. Specifically, we present the ActiveSpaces framework, which provides (1) programming support for defining the data-processing routines to be downloaded to the staging area, and (2) run-time mechanisms for transporting binary codes associated with these routines to the staging area, executing the routines on the nodes of the staging area, and returning the results. We also present an experimental performance evaluation of ActiveSpaces using applications running on the Cray XT5 at Oak Ridge National Laboratory. Finally, we use a coupled fusion application workflow to explore the trade-offs between transporting data and transporting the code required for data processing during coupling, and we characterize the sweet spots for each option.
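
The contrast between moving the data and moving the code, as described in the ActiveSpaces abstract above, can be sketched in a few lines. This is a toy illustration only and does not reflect the ActiveSpaces API: the class `StagingArea` and its methods `put`, `get`, and `run` are hypothetical names, the "staging area" is an in-process dictionary, and a Python callable stands in for the binary code that the real framework transports.

```python
class StagingArea:
    """Toy stand-in for a staging area holding simulation output in memory."""

    def __init__(self):
        self._store = {}

    def put(self, name, values):
        """Store a named array of values produced by a simulation."""
        self._store[name] = list(values)

    def get(self, name):
        """Move-the-data path: ship a full copy of the array to the consumer."""
        return list(self._store[name])

    def run(self, name, routine):
        """Move-the-code path: run the routine where the data lives.

        Only the routine's (usually much smaller) result leaves the staging area.
        """
        return routine(self._store[name])


def mean(values):
    """Example data-processing routine supplied by the consumer."""
    return sum(values) / len(values)
```

With `get`, the consumer receives every element and reduces locally; with `run`, it receives a single number. Which path is cheaper depends on the relative sizes of the data, the code, and the result, which is the trade-off the abstract says the authors characterize.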

SESSION 3: Hardware-Software Interaction

  • A Novel Power management for CMP Systems in Data-intensive Environment Authors: Pengju Shang (University of Central Florida, USA); Jun Wang (University of Central Florida, USA)
    The emerging data-intensive applications of today are comprised of non-uniform CPU and I/O intensive workloads, thus imposing a requirement to consider both CPU and I/O effects in the power management strategies. Only scaling down the processor’s frequency based on its busy/idle ratio cannot fully exploit opportunities of saving power. Our experiments show that besides the busy and idle status, each processor may also have I/O wait phases waiting for I/O operations to complete. During this period, the completion time is decided by the I/O subsystem rather than the CPU, thus scaling the processor to a lower frequency will not affect the performance but save more power. In addition, the CPU’s reaction to the I/O operations may be significantly affected by several factors, such as I/O type (sync or unsync), instruction/job level parallelism; it cannot be accurately modeled via physics laws like mechanical or chemical systems. In this paper, we propose a novel power management scheme called MAR (modeless, adaptive, rule-based) in multiprocessor systems to minimize the CPU power consumption under performance constraints. By using richer feedback factors, e.g. the I/O wait, MAR is able to accurately describe the relationships among core frequencies, performance and power consumption. We adopt a modeless control model to reduce the complexity of system modeling. MAR is designed for CMP (Chip Multi Processor) systems by employing multi-input/multi-output (MIMO) theory and per-core level DVFS (Dynamic Voltage and Frequency Scaling). Our extensive experiments on a physical test bed demonstrate that, for the SPEC benchmark and data-intensive (TPC-C) benchmark, the efficiency of MAR is 93.6-96.2% accurate to the ideal power saving strategy calculated off-line. Compared with baseline solutions, MAR could save 22.5-32.5% more power while keeping the comparable performance loss of about 1.8-2.9%. In addition, simulation results show the efficiency of our design for various CMP configurations.
  • Characterization of System Services and Their Performance Impact in Multicore Nodes Authors: Seetharami R Seelam (IBM Research, USA); Liana L Fong (IBM TJ Watson Research Center, USA); John Divirgilio (IBM, USA); Brian F
    The performance of parallel applications on large scale systems is shown to disproportionately degrade due to interference from system services. This interference from system services is also known as jitter. However, there is limited understanding of sources and patterns of jitter on multi-core systems. In this paper, we identify and characterize jitter sources in terms of their amplitude and execution interval distributions on multi-core IBM Power systems with UNIX-based general purpose operating systems: AIX and Linux. Our analysis shows that there are various kinds of jitter sources and their execution varies drastically between different cores and between hardware threads within each core for practical reasons. This in-depth knowledge of jitter events is leveraged to devise effective approaches to mitigate the jitter impact on application performance in large scale systems. Moreover, such knowledge would provide useful insights to a new generation of operating system designs such as multikernel or satellite kernel for multi-core systems.
  • Automatic Recognition of Performance Idioms in Scientific Applications Authors: Jiahua He (University of California, San Diego, USA); Allan Snavely (University of California, San Diego, USA); Rob F Van der W
    Basic data flow patterns that we call performance idioms, such as stream, transpose, reduction, random access and stencil, are common in scientific numerical applications. We hypothesize that a small number of idioms can cover most programming constructs that dominate the execution time of scientific codes and can be used to approximate the application performance. To check these hypotheses, we proposed an automatic idioms recognition method and implemented the method, based on the open source compiler Open64. With the NAS Parallel Benchmark (NPB) as a case study, the prototype system is about 90% accurate compared with idiom classification by a human expert. Our results showed that the above five idioms suffice to cover 100% of the six NPB codes (MG, CG, FT, BT, SP and LU). We also compared the performance of our idiom benchmarks with their corresponding instances in the NPB codes on two different platforms with different methods. The approximation accuracy is up to 96.6%. The contribution is to show that a small set of idioms can cover more complex codes, that idioms can be recognized automatically, and that suitably defined idioms may approximate application performance.
  • Iso-energy-efficiency: An approach to power-constrained parallel computation Authors: Shuaiwen Song (Virginia Tech, USA); Chun-Yi Su (Virginia Tech, USA); Rong Ge (Marquette University, USA); Abhinav Vishnu (Pacif
    Future large scale high performance supercomputer systems require high energy efficiency to achieve exaflops computational power and beyond. Despite the need to understand energy efficiency in high-performance systems, there are few techniques to evaluate energy efficiency at scale. In this paper, we propose a system-level iso-energy-efficiency model to analyze, evaluate and predict energy-performance of data intensive parallel applications with various execution patterns running on large scale power-aware clusters. Our analytical model can help users explore the effects of machine and application dependent characteristics on system energy efficiency and isolate efficient ways to scale system parameters (e.g. processor count, CPU power/frequency, workload size and network bandwidth) to balance energy use and performance. We derive our iso-energy-efficiency model and apply it to the NAS Parallel Benchmarks on two power-aware clusters. Our results indicate that the model accurately predicts total system energy consumption within 5% error on average for parallel applications with various execution and communication patterns. We demonstrate effective use of the model for various application contexts and in scalability decision-making.
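
For readers unfamiliar with the five performance idioms named in the idiom-recognition abstract above (stream, transpose, reduction, random access, and stencil), the following sketch gives one minimal loop for each. The abstract does not define the idioms, so these are assumptions based on the common meaning of the terms; the function names and the specific kernels (a scaled vector add, a three-point average) are illustrative choices, not the paper's idiom benchmarks.

```python
def stream(a, b, scale):
    """Stream: unit-stride pass over arrays, c[i] = a[i] + scale * b[i]."""
    return [x + scale * y for x, y in zip(a, b)]


def transpose(matrix):
    """Transpose: read rows, write columns (strided access on one side)."""
    return [list(column) for column in zip(*matrix)]


def reduction(a):
    """Reduction: fold a whole array into a single value."""
    total = 0
    for x in a:
        total += x
    return total


def random_access(table, indices):
    """Random access: gather through an index array, table[indices[i]]."""
    return [table[i] for i in indices]


def stencil(a):
    """Stencil: each interior output depends on a fixed neighbourhood of inputs."""
    return [(a[i - 1] + a[i] + a[i + 1]) / 3 for i in range(1, len(a) - 1)]
```

The idioms differ mainly in their memory access pattern, which is why a small benchmark per idiom might plausibly approximate the performance of larger loops with the same pattern.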