TechTalks from event: IEEE IPDPS 2011

Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without requiring log-in. For other talks, you will need to log-in using the email you registered for IPDPS 2011. Note 2: Many of the talks (those without a thumbnail next to the their description below) are yet to be uploaded. Some of them were not recorded because of technical problems. We are working with the corresponding authors to upload the self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.

SESSION 8: Reliability and Security

  • Hauberk: Lightweight Silent Data Corruption Error Detector for GPGPU Authors: Keun Soo Yim (University of Illinois at Urbana-Champaign, USA); Cuong Pham (University of Illinois at Urbana-Champaign, USA); M
    High performance and relatively low cost of GPU-based platforms provide an attractive alternative for general purpose high performance computing (HPC). However, the emerging HPC applications have usually stricter output correctness requirements than typical GPU applications (i.e., 3D graphics). This paper ?rst analyzes the error resiliency of GPGPU platforms using a fault injection tool we have developed for commodity GPU devices. On average, 16-33% of injected faults cause silent data corruption (SDC) errors in the HPC programs executing on GPU. This SDC ratio is signi?-cantly higher than that measured in CPU programs (<2.3%). In order to tolerate SDC errors, customized error detectors are strategically placed in the source code of target GPU programs so as to minimize performance impact and error propagation and maximize recoverability. The presented Hauberk technique is deployed in seven HPC benchmark programs and evaluated using a fault injection. The results show a high average error detection coverage (87%) with a small performance overhead (15%).
  • A Performance and Area Efficient Architecture for Intrusion Detection Systems Authors: Govind Sreekar Shenoy (Universitat Politecnica de Catalunya, Spain); Jordi Tubella (Universitat Politecnica de Catalunya, Spain)
    Intrusion Detection Systems (IDS) have emerged as one of the most promising ways to secure systems in network. An IDS operates by scanning packet-data for known signatures and accordingly takes requisite action. However, scanning bytes in the packet payload and checking for more than 20,000 signatures becomes a computationally intensive task. Additionally, with signatures doubling almost every 30 months, this complexity will aggravate further. IDS commonly uses the Aho-Corasick state machine based search to scan packets for signatures. However, the huge size of the state machine negatively impacts the performance and area ef?ciency of the underlying hardware. In this work, we propose novel mechanisms to compactly store the state machine thereby improving the area ef?ciency. We observe over 2X reduction in area for storing the state machine in comparison to BS-FSM [19]. We investigate various approaches to improve the performance ef?ciency. We pipeline the processing of consecutive bytes accessing the upper-most level, the frequently accessed level, of the state machine. In order to further enhance the performance ef?ciency, we use a dedicated hardware unit speci?cally tuned for traversal using our proposed storage mechanism. We observe that our proposed architecture outperforms BS-FSM based approaches [13, 14, 19]
  • Time-Ordered Event Traces: A New Debugging Primitive for Concurrency Bugs Authors: Martin Dimitrov (University of Central Florida, USA); Huiyang Zhou (North Carolina State University, USA)
    Non-determinism makes concurrent bugs extremely dif?cult to reproduce and to debug. In this work, we propose a new debugging primitive to facilitate the debugging process by exposing this non-deterministic behavior to the programmer. The key idea is to generate a time-ordered trace of events such as function calls/returns and memory accesses across different threads. The architectural support for this primitive is lightweight, including a high-precision, frequency-invariant timestamp counter and an event trace buffer in each processor core. The proposed primitive continuously records and timestamps the last N function calls/returns per core by default, and can also be con?gured to watch speci?c memory addresses or code regions through a ?exible software interface. To examine the effectiveness of the proposed primitive, we studied a variety of concurrent bugs in large commercial software and our results show that exposing the time-ordered information, function calls/returns in particular, to the programmer is highly bene?cial for diagnosing the root causes of these bugs.

SESSION 9: Wireless and Sensor Networks

  • Singlehop Collaborative Feedback Primitives for Threshold Querying in Wireless Sensor Networks Authors: Murat Demirbas (University at Buffalo, SUNY, USA); Serafettin Tasci (SUNY Buffalo, USA); Hanifi Gunes (SUNY Buffalo, USA); Atri
    In wireless sensor network (WSN) deployments, Receiver-side Collision Detection (RCD) has been proposed for speeding up collaborative feedback collection from a singlehop neighborhood. Using RCD, an initiator node can query the existence of a predicate P in its neighborhood in constant time by making all P-positive nodes answer simultaneously. Despite the collisions, the initiator is still able to infer useful information from a broadcast using RCD: an activity in the network means the predicate P holds for at least one node while silence indicates that P does not hold at any queried node in the network. In this study we investigate the threshold querying problem, where the initiator has to learn whether P holds in the network for at least threshold t number of nodes in singlehop of the initiator. To answer the threshold queries in an ef?cient fashion, we present a number of adaptive RCD-based querying mechanisms that dynamically re-groups the queried nodes in the network. We evaluate our algorithms on a real sensor network implementation and also carry out several simulations to contrast our approach with the traditional techniques. The experiments reveal that our algorithms achieve signi?cant time improvements in threshold queries over traditional techniques.
  • Completely Distributed Particle Filters for Target Tracking in Sensor Networks Authors: Bo Jiang (Virginia Tech, USA); Binoy Ravindran (Virginia Tech, USA)
    Particle ?lters (or PFs) are widely used for the tracking problem in dynamic systems. Despite their remarkable tracking performance and ?exibility, PFs require intensive computation and communication, which are strictly constrained in wireless sensor networks (or WSNs). Thus, distributed particle ?lters (or DPFs) have been studied to distribute the computational workload onto multiple nodes while minimizing the communication among them. However, weight normalization and resampling in generic PFs cause signi?cant challenges in the distributed implementation. Few existing efforts on DPF could be implemented in a completely distributed manner. In this paper, we design a completely distributed particle ?lter (or CDPF) for target tracking in sensor networks, and further improve it with neighborhood estimation toward minimizing the communication cost. First, we describe the particle maintenance and propagation mechanism, by which particles are maintained on different sensor nodes and propagated along the target trajectory. Then, we design the CDPF algorithm by adjusting the order of PFs’ four steps and leveraging the data aggregation during particle propagation. Finally, we develop a neighborhood estimation method to replace the measurement broadcasting and the calculation of likelihood functions. With this approximate estimation, the communication cost of DPFs can be minimized. Our experimental evaluations show that although CDPF incurs about 50% more estimation error than semi-distributed particle ?lter (or SDPF), its communication cost is lower than that of SDPF by as much as 90%.
  • Maintaining Connectivity in 3D Wireless Sensor Networks using Directional Antennae Authors: Evangelos Kranakis (Carleton University, Canada); Danny Krizanc (Wesleyan University, USA); Ashish Modi (IIT, India); Oscar Mora
    We consider a 3D antenna orientation problem for maintaining connectivity of a wireless network in 3D space using only directional antennae. Sensors are located at points in 3D space and are equipped with directional antennae. The strong connectivity antenna orientation problem is concerned with deciding whether or not for given solid angle and range r it is possible to orient the antennae so as to ensure that the sensor network resulting from the induced transmissions is strongly connected. In this paper we 1) present an algorithm ensuring optimal antenna range for the case when ...
  • Distributed Fine-grained Access Control in Wireless Sensor Networks Authors: Sushmita Ruj (University of Ottawa, Canada); Amiya Nayak (SITE, University of Ottawa, Canada); Ivan Stojmenovic (University of
    In mission-critical activities, each user is allowed to access some speci?c, but not all, data gathered by wireless sensor networks. Yu et al [?] recently proposed a centralized ?ne grained data access control mechanism for sensor networks, which exploits a cryptographic primitive called attribute based encryption (ABE). There is only one trusted authority to distribute keys to the sensor nodes and the users. Compromising the single authority can undermine the whole network. We propose a fully distributed access control method, which has several authorities instead of one. Each sensor has a set of attributes and each user has an access structure of attributes. A message from a sensor is encrypted such that only a user with matching set of attributes can decrypt. Compared to [?], our schemes need simpler access structure which make secret key distribution more computation ef?cient, when user rights are modi?ed. We prove that our scheme can tolerate compromising all but one distribution centers, which independently distribute their contributions to a single user key. Our scheme do not increase the computation and communication costs of the sensors, making it highly desirable for ?ne grained access control.

SESSION 10: GPU Acceleration

  • Design of MILC lattice QCD application for GPU clusters Authors: Guochun Shi (University of Illinois at Urbana-Champaign, USA); Steven Gottlieb (Indiana University, USA); Aaron Torok (Indiana U
    We present an implementation of the improved staggered quark action lattice QCD computation designed for execution on a GPU cluster. The parallelization strategy is based on dividing the space-time lattice along the time dimension and distributing the sub-lattices among the GPU cluster nodes. We provide a mixed-precision ?oating-point GPU implementation of the multi-mass conjugate gradient solver. Our single GPU implementation of the conjugate gradient solver achieves a 9x performance improvement over the highly optimized code executed on a state-of-the-art eight-core CPU node. The overall application executes almost six times faster on a GPU-enabled cluster vs. a conventional multi-core cluster. The developed code is currently used for running production QCD calculations with electromagnetic corrections.
  • Multifrontal Factorization of Sparse SPD Matrices on GPUs Authors: Thomas George (IBM Research India, India); Vaibhav Saxena (IBM Research - India, New Delhi, India); Anshul Gupta (IBM T.J. Watso
    Solving large sparse linear systems is often the most computationally intensive component of many scienti?c computing applications. In the past, sparse multifrontal direct factorization has been shown to scale to thousands of processors on dedicated supercomputers resulting in a substantial reduction in computational time. In recent years, an alternative computing paradigm based on GPUs has gained prominence, primarily due to its affordability, power-ef?ciency, and the potential to achieve signi?cant speedup relative to desktop performance on regular and structured parallel applications. However, sparse matrix factorization on GPUs has not been explored suf?ciently due to the complexity involved in an ef?cient implementation and concerns of low GPU utilization. In this paper, we present an adaptive hybrid approach for accelerating sparse multifrontal factorization based on a judicious exploitation of the processing power of the host CPU and GPU. We present four different policies for distributing and scheduling the workload between the host CPU and the GPU, and propose a mechanism for a runtime selection of the appropriate policy for each step of sparse Cholesky factorization. This mechanism relies on auto-tuning based on modeling the best policy predictor as a parametric classi?er. We estimate the classi?er parameters from the available empirical computation time data such that the expected computation time is minimized. This approach is readily adaptable for using the current or an extended set of policies for different CPU-GPU combinations as well as for different combinations of dense kernels for both the CPU and the GPU.
  • Large-Scale Semantic Concept Detection on Manycore Platforms for Multimedia Mining Authors: Mamadou Diao (Georgia Institute of Technology, USA); Chrysostomos Nicopoulos (University of Cyprus, Cyprus); Jongman Kim (Georgi
    Media mining, the extraction of meaningful knowledge from multimedia content has become a major application and poses signi?cant computational challenges in today’s platforms. Media mining applications contain many sophisticated algorithms that include data-intensive analysis, classi?cation, and learning. This paper explores the use of Graphics Processing Units (GPU) in media mining. We are particularly focused on large-scale semantic concept detection, a state-of-the-art approach that maps media content to hight-level semantic concepts, and a building block in many Media mining applications. We present a fast, parallel, large-scale, high-level semantic concept detector that leverages the GPU for image/video retrieval and content analysis. Through ef?cient data partitioning and movement, we parallelize feature extraction routines. By interleaving feature extraction routines of different types, we increase the computational intensity and mitigate the negative effects of histogram-like reduction operations. To cope with the very large number of semantic concepts, we propose a data layout of concept models on a multi-GPU hybrid architecture for high throughput semantic concept detection. We achieve one to two orders of magnitude speedups compared to serial implementations and our experiments show that we can detect 374 semantic concepts at a rate of over 100 frames/sec. This is over 100 times faster than a LibSVM-based semantic concept detection.
  • Efficient GPU implementation for Particle in Cell Algorithm Authors: Rejith Joseph (University of Florida, USA); Girish Ravunnikutty (University of Florida, USA); Sanjay Ranka (University of Florid
    Particle in cell (PIC) algorithm is a widely used method in plasma physics to study the trajectories of charged particles under electromagnetic ?elds. The PIC algorithm is computationally intensive and its time requirements are proportional to the number of charged particles involved in the simulation. The focus of the paper is to parallelize the PIC algorithm on Graphics Processing Unit (GPU). We present several performance trade-offs related to small shared memory and atomic operations on the GPU to achieve high performance.