TechTalks from event: IEEE IPDPS 2011

Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without requiring log-in. For other talks, you will need to log-in using the email you registered for IPDPS 2011. Note 2: Many of the talks (those without a thumbnail next to the their description below) are yet to be uploaded. Some of them were not recorded because of technical problems. We are working with the corresponding authors to upload the self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.

SESSION 7: Numerical Algorithms

  • Automatic Library Generation for BLAS3 on GPUs Authors: Huimin Cui (Institute of Computing Technology, P.R. China); Lei Wang (Institute of Computing Technology, Chinese Academy of Sci
    High-performance libraries, the performance-critical building blocks for high-level applications, will assume greater importance on modern processors as they become more complex and diverse. However, automatic library generators are still immature, forcing library developers to manually tune library to meet their performance objectives. We are developing a new script-controlled compilation framework to help domain experts reduce much of the tedious and error-prone nature of manual tuning, by enabling them to leverage their expertise and reuse past optimization experiences. We focus on demonstrating improved performance and productivity obtained through using our framework to tune BLAS3 routines on three GPU platforms: up to 5.4x speedups over the CUBLAS achieved on NVIDIA GeForce 9800, 2.8x on GTX285, and 3.4x on Fermi Tesla C2050. Our results highlight the potential bene?ts of exploiting domain expertise and the relations between different routines (in terms of their algorithms and data structures).
  • Redesign of Higher-Level Matrix Algorithms for Multicore and Distributed Architectures and Applications in Quantum Monte Carlo Simulation Authors: Che-Rung Lee (National Tsing Hua University, Taiwan); Zhaojun Bai (University of California, Davis, USA)
    A matrix operation is referred to as a hard-to-parallel matrix operation (HPMO) if it has serial bottlenecks that are hardly parallelizable. Otherwise, it is referred to as an easy-to-parallel matrix operation (EPMO). Empirical evidences showed the performance scalability of an HPMO is signi?cantly poorer than an EPMO on multicore and distributed architectures. As the result, the design of higher-level algorithms for applications, for the performance considerations on multicore and distributed architectures, should avoid the use of HPMOs as the computational kernels. In this paper, as a case study, we present an HPMO-avoiding algorithm for the Green’s function calculation in quantum Monte Carlo simulation. The original algorithm utilizes the QR-decomposition with column pivoting (QRP) as its computational kernel. QRP is an HPMO. The redesigned algorithm maintains the same simulation stability but employs the standard QR decomposition without pivoting (QR), which is an EPMO. Different implementations of the redesigned algorithm on multicore and distributed architectures are investigated. Although some implementations of the redesigned method use about a factor of three more ?oating-point operations than the original algorithm, they are about 20% faster on a quadcore system and 2.5 times faster on a 1024-CPU massively parallel processing system. The broader impact of the redesign of higher-level matrix algorithms to avoid HPMOs in other computational science applications is also discussed.
  • Challenges of Scaling Algebraic Multigrid across Modern Multicore Architectures Authors: Allison Baker (Lawrence Livermore National Laboratory, USA); Todd Gamblin (Lawrence Livermore National Laboratory, USA); Martin
    Algebraic multigrid (AMG) is a popular solver for large-scale scienti?c computing and an essential component of many simulation codes. AMG has shown to be extremely ef?cient on distributed-memory architectures. However, when executed on modern multicore architectures, we face new challenges that can signi?cantly deteriorate AMG’s performance. We examine its performance and scalability on three disparate multicore architectures: a cluster with four AMD Opteron Quad-core processors per node (Hera), a Cray XT5 with two AMD Opteron Hex-core processors per node (Jaguar), and an IBM BlueGene/P system with a single Quad-core processor (Intrepid). We discuss our experiences on these platforms and present results using both an MPI-only and a hybrid MPI/OpenMP model. We also discuss a set of techniques that helped to overcome the associated problems, including thread and process pinning and correct memory associations.

SESSION 8: Reliability and Security

  • Hauberk: Lightweight Silent Data Corruption Error Detector for GPGPU Authors: Keun Soo Yim (University of Illinois at Urbana-Champaign, USA); Cuong Pham (University of Illinois at Urbana-Champaign, USA); M
    High performance and relatively low cost of GPU-based platforms provide an attractive alternative for general purpose high performance computing (HPC). However, the emerging HPC applications have usually stricter output correctness requirements than typical GPU applications (i.e., 3D graphics). This paper ?rst analyzes the error resiliency of GPGPU platforms using a fault injection tool we have developed for commodity GPU devices. On average, 16-33% of injected faults cause silent data corruption (SDC) errors in the HPC programs executing on GPU. This SDC ratio is signi?-cantly higher than that measured in CPU programs (<2.3%). In order to tolerate SDC errors, customized error detectors are strategically placed in the source code of target GPU programs so as to minimize performance impact and error propagation and maximize recoverability. The presented Hauberk technique is deployed in seven HPC benchmark programs and evaluated using a fault injection. The results show a high average error detection coverage (87%) with a small performance overhead (15%).
  • A Performance and Area Efficient Architecture for Intrusion Detection Systems Authors: Govind Sreekar Shenoy (Universitat Politecnica de Catalunya, Spain); Jordi Tubella (Universitat Politecnica de Catalunya, Spain)
    Intrusion Detection Systems (IDS) have emerged as one of the most promising ways to secure systems in network. An IDS operates by scanning packet-data for known signatures and accordingly takes requisite action. However, scanning bytes in the packet payload and checking for more than 20,000 signatures becomes a computationally intensive task. Additionally, with signatures doubling almost every 30 months, this complexity will aggravate further. IDS commonly uses the Aho-Corasick state machine based search to scan packets for signatures. However, the huge size of the state machine negatively impacts the performance and area ef?ciency of the underlying hardware. In this work, we propose novel mechanisms to compactly store the state machine thereby improving the area ef?ciency. We observe over 2X reduction in area for storing the state machine in comparison to BS-FSM [19]. We investigate various approaches to improve the performance ef?ciency. We pipeline the processing of consecutive bytes accessing the upper-most level, the frequently accessed level, of the state machine. In order to further enhance the performance ef?ciency, we use a dedicated hardware unit speci?cally tuned for traversal using our proposed storage mechanism. We observe that our proposed architecture outperforms BS-FSM based approaches [13, 14, 19]
  • Time-Ordered Event Traces: A New Debugging Primitive for Concurrency Bugs Authors: Martin Dimitrov (University of Central Florida, USA); Huiyang Zhou (North Carolina State University, USA)
    Non-determinism makes concurrent bugs extremely dif?cult to reproduce and to debug. In this work, we propose a new debugging primitive to facilitate the debugging process by exposing this non-deterministic behavior to the programmer. The key idea is to generate a time-ordered trace of events such as function calls/returns and memory accesses across different threads. The architectural support for this primitive is lightweight, including a high-precision, frequency-invariant timestamp counter and an event trace buffer in each processor core. The proposed primitive continuously records and timestamps the last N function calls/returns per core by default, and can also be con?gured to watch speci?c memory addresses or code regions through a ?exible software interface. To examine the effectiveness of the proposed primitive, we studied a variety of concurrent bugs in large commercial software and our results show that exposing the time-ordered information, function calls/returns in particular, to the programmer is highly bene?cial for diagnosing the root causes of these bugs.

SESSION 9: Wireless and Sensor Networks

  • Singlehop Collaborative Feedback Primitives for Threshold Querying in Wireless Sensor Networks Authors: Murat Demirbas (University at Buffalo, SUNY, USA); Serafettin Tasci (SUNY Buffalo, USA); Hanifi Gunes (SUNY Buffalo, USA); Atri
    In wireless sensor network (WSN) deployments, Receiver-side Collision Detection (RCD) has been proposed for speeding up collaborative feedback collection from a singlehop neighborhood. Using RCD, an initiator node can query the existence of a predicate P in its neighborhood in constant time by making all P-positive nodes answer simultaneously. Despite the collisions, the initiator is still able to infer useful information from a broadcast using RCD: an activity in the network means the predicate P holds for at least one node while silence indicates that P does not hold at any queried node in the network. In this study we investigate the threshold querying problem, where the initiator has to learn whether P holds in the network for at least threshold t number of nodes in singlehop of the initiator. To answer the threshold queries in an ef?cient fashion, we present a number of adaptive RCD-based querying mechanisms that dynamically re-groups the queried nodes in the network. We evaluate our algorithms on a real sensor network implementation and also carry out several simulations to contrast our approach with the traditional techniques. The experiments reveal that our algorithms achieve signi?cant time improvements in threshold queries over traditional techniques.
  • Completely Distributed Particle Filters for Target Tracking in Sensor Networks Authors: Bo Jiang (Virginia Tech, USA); Binoy Ravindran (Virginia Tech, USA)
    Particle ?lters (or PFs) are widely used for the tracking problem in dynamic systems. Despite their remarkable tracking performance and ?exibility, PFs require intensive computation and communication, which are strictly constrained in wireless sensor networks (or WSNs). Thus, distributed particle ?lters (or DPFs) have been studied to distribute the computational workload onto multiple nodes while minimizing the communication among them. However, weight normalization and resampling in generic PFs cause signi?cant challenges in the distributed implementation. Few existing efforts on DPF could be implemented in a completely distributed manner. In this paper, we design a completely distributed particle ?lter (or CDPF) for target tracking in sensor networks, and further improve it with neighborhood estimation toward minimizing the communication cost. First, we describe the particle maintenance and propagation mechanism, by which particles are maintained on different sensor nodes and propagated along the target trajectory. Then, we design the CDPF algorithm by adjusting the order of PFs’ four steps and leveraging the data aggregation during particle propagation. Finally, we develop a neighborhood estimation method to replace the measurement broadcasting and the calculation of likelihood functions. With this approximate estimation, the communication cost of DPFs can be minimized. Our experimental evaluations show that although CDPF incurs about 50% more estimation error than semi-distributed particle ?lter (or SDPF), its communication cost is lower than that of SDPF by as much as 90%.
  • Maintaining Connectivity in 3D Wireless Sensor Networks using Directional Antennae Authors: Evangelos Kranakis (Carleton University, Canada); Danny Krizanc (Wesleyan University, USA); Ashish Modi (IIT, India); Oscar Mora
    We consider a 3D antenna orientation problem for maintaining connectivity of a wireless network in 3D space using only directional antennae. Sensors are located at points in 3D space and are equipped with directional antennae. The strong connectivity antenna orientation problem is concerned with deciding whether or not for given solid angle and range r it is possible to orient the antennae so as to ensure that the sensor network resulting from the induced transmissions is strongly connected. In this paper we 1) present an algorithm ensuring optimal antenna range for the case when ...
  • Distributed Fine-grained Access Control in Wireless Sensor Networks Authors: Sushmita Ruj (University of Ottawa, Canada); Amiya Nayak (SITE, University of Ottawa, Canada); Ivan Stojmenovic (University of
    In mission-critical activities, each user is allowed to access some speci?c, but not all, data gathered by wireless sensor networks. Yu et al [?] recently proposed a centralized ?ne grained data access control mechanism for sensor networks, which exploits a cryptographic primitive called attribute based encryption (ABE). There is only one trusted authority to distribute keys to the sensor nodes and the users. Compromising the single authority can undermine the whole network. We propose a fully distributed access control method, which has several authorities instead of one. Each sensor has a set of attributes and each user has an access structure of attributes. A message from a sensor is encrypted such that only a user with matching set of attributes can decrypt. Compared to [?], our schemes need simpler access structure which make secret key distribution more computation ef?cient, when user rights are modi?ed. We prove that our scheme can tolerate compromising all but one distribution centers, which independently distribute their contributions to a single user key. Our scheme do not increase the computation and communication costs of the sensors, making it highly desirable for ?ne grained access control.