TechTalks from event: IEEE IPDPS 2011

Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without requiring log-in. For other talks, you will need to log-in using the email you registered for IPDPS 2011. Note 2: Many of the talks (those without a thumbnail next to the their description below) are yet to be uploaded. Some of them were not recorded because of technical problems. We are working with the corresponding authors to upload the self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.

SESSION 9: Wireless and Sensor Networks

  • Singlehop Collaborative Feedback Primitives for Threshold Querying in Wireless Sensor Networks Authors: Murat Demirbas (University at Buffalo, SUNY, USA); Serafettin Tasci (SUNY Buffalo, USA); Hanifi Gunes (SUNY Buffalo, USA); Atri
    In wireless sensor network (WSN) deployments, Receiver-side Collision Detection (RCD) has been proposed for speeding up collaborative feedback collection from a singlehop neighborhood. Using RCD, an initiator node can query the existence of a predicate P in its neighborhood in constant time by making all P-positive nodes answer simultaneously. Despite the collisions, the initiator is still able to infer useful information from a broadcast using RCD: an activity in the network means the predicate P holds for at least one node while silence indicates that P does not hold at any queried node in the network. In this study we investigate the threshold querying problem, where the initiator has to learn whether P holds in the network for at least threshold t number of nodes in singlehop of the initiator. To answer the threshold queries in an ef?cient fashion, we present a number of adaptive RCD-based querying mechanisms that dynamically re-groups the queried nodes in the network. We evaluate our algorithms on a real sensor network implementation and also carry out several simulations to contrast our approach with the traditional techniques. The experiments reveal that our algorithms achieve signi?cant time improvements in threshold queries over traditional techniques.
  • Completely Distributed Particle Filters for Target Tracking in Sensor Networks Authors: Bo Jiang (Virginia Tech, USA); Binoy Ravindran (Virginia Tech, USA)
    Particle ?lters (or PFs) are widely used for the tracking problem in dynamic systems. Despite their remarkable tracking performance and ?exibility, PFs require intensive computation and communication, which are strictly constrained in wireless sensor networks (or WSNs). Thus, distributed particle ?lters (or DPFs) have been studied to distribute the computational workload onto multiple nodes while minimizing the communication among them. However, weight normalization and resampling in generic PFs cause signi?cant challenges in the distributed implementation. Few existing efforts on DPF could be implemented in a completely distributed manner. In this paper, we design a completely distributed particle ?lter (or CDPF) for target tracking in sensor networks, and further improve it with neighborhood estimation toward minimizing the communication cost. First, we describe the particle maintenance and propagation mechanism, by which particles are maintained on different sensor nodes and propagated along the target trajectory. Then, we design the CDPF algorithm by adjusting the order of PFs’ four steps and leveraging the data aggregation during particle propagation. Finally, we develop a neighborhood estimation method to replace the measurement broadcasting and the calculation of likelihood functions. With this approximate estimation, the communication cost of DPFs can be minimized. Our experimental evaluations show that although CDPF incurs about 50% more estimation error than semi-distributed particle ?lter (or SDPF), its communication cost is lower than that of SDPF by as much as 90%.
  • Maintaining Connectivity in 3D Wireless Sensor Networks using Directional Antennae Authors: Evangelos Kranakis (Carleton University, Canada); Danny Krizanc (Wesleyan University, USA); Ashish Modi (IIT, India); Oscar Mora
    We consider a 3D antenna orientation problem for maintaining connectivity of a wireless network in 3D space using only directional antennae. Sensors are located at points in 3D space and are equipped with directional antennae. The strong connectivity antenna orientation problem is concerned with deciding whether or not for given solid angle and range r it is possible to orient the antennae so as to ensure that the sensor network resulting from the induced transmissions is strongly connected. In this paper we 1) present an algorithm ensuring optimal antenna range for the case when ...
  • Distributed Fine-grained Access Control in Wireless Sensor Networks Authors: Sushmita Ruj (University of Ottawa, Canada); Amiya Nayak (SITE, University of Ottawa, Canada); Ivan Stojmenovic (University of
    In mission-critical activities, each user is allowed to access some speci?c, but not all, data gathered by wireless sensor networks. Yu et al [?] recently proposed a centralized ?ne grained data access control mechanism for sensor networks, which exploits a cryptographic primitive called attribute based encryption (ABE). There is only one trusted authority to distribute keys to the sensor nodes and the users. Compromising the single authority can undermine the whole network. We propose a fully distributed access control method, which has several authorities instead of one. Each sensor has a set of attributes and each user has an access structure of attributes. A message from a sensor is encrypted such that only a user with matching set of attributes can decrypt. Compared to [?], our schemes need simpler access structure which make secret key distribution more computation ef?cient, when user rights are modi?ed. We prove that our scheme can tolerate compromising all but one distribution centers, which independently distribute their contributions to a single user key. Our scheme do not increase the computation and communication costs of the sensors, making it highly desirable for ?ne grained access control.

SESSION 10: GPU Acceleration

  • Design of MILC lattice QCD application for GPU clusters Authors: Guochun Shi (University of Illinois at Urbana-Champaign, USA); Steven Gottlieb (Indiana University, USA); Aaron Torok (Indiana U
    We present an implementation of the improved staggered quark action lattice QCD computation designed for execution on a GPU cluster. The parallelization strategy is based on dividing the space-time lattice along the time dimension and distributing the sub-lattices among the GPU cluster nodes. We provide a mixed-precision ?oating-point GPU implementation of the multi-mass conjugate gradient solver. Our single GPU implementation of the conjugate gradient solver achieves a 9x performance improvement over the highly optimized code executed on a state-of-the-art eight-core CPU node. The overall application executes almost six times faster on a GPU-enabled cluster vs. a conventional multi-core cluster. The developed code is currently used for running production QCD calculations with electromagnetic corrections.
  • Multifrontal Factorization of Sparse SPD Matrices on GPUs Authors: Thomas George (IBM Research India, India); Vaibhav Saxena (IBM Research - India, New Delhi, India); Anshul Gupta (IBM T.J. Watso
    Solving large sparse linear systems is often the most computationally intensive component of many scienti?c computing applications. In the past, sparse multifrontal direct factorization has been shown to scale to thousands of processors on dedicated supercomputers resulting in a substantial reduction in computational time. In recent years, an alternative computing paradigm based on GPUs has gained prominence, primarily due to its affordability, power-ef?ciency, and the potential to achieve signi?cant speedup relative to desktop performance on regular and structured parallel applications. However, sparse matrix factorization on GPUs has not been explored suf?ciently due to the complexity involved in an ef?cient implementation and concerns of low GPU utilization. In this paper, we present an adaptive hybrid approach for accelerating sparse multifrontal factorization based on a judicious exploitation of the processing power of the host CPU and GPU. We present four different policies for distributing and scheduling the workload between the host CPU and the GPU, and propose a mechanism for a runtime selection of the appropriate policy for each step of sparse Cholesky factorization. This mechanism relies on auto-tuning based on modeling the best policy predictor as a parametric classi?er. We estimate the classi?er parameters from the available empirical computation time data such that the expected computation time is minimized. This approach is readily adaptable for using the current or an extended set of policies for different CPU-GPU combinations as well as for different combinations of dense kernels for both the CPU and the GPU.
  • Large-Scale Semantic Concept Detection on Manycore Platforms for Multimedia Mining Authors: Mamadou Diao (Georgia Institute of Technology, USA); Chrysostomos Nicopoulos (University of Cyprus, Cyprus); Jongman Kim (Georgi
    Media mining, the extraction of meaningful knowledge from multimedia content has become a major application and poses signi?cant computational challenges in today’s platforms. Media mining applications contain many sophisticated algorithms that include data-intensive analysis, classi?cation, and learning. This paper explores the use of Graphics Processing Units (GPU) in media mining. We are particularly focused on large-scale semantic concept detection, a state-of-the-art approach that maps media content to hight-level semantic concepts, and a building block in many Media mining applications. We present a fast, parallel, large-scale, high-level semantic concept detector that leverages the GPU for image/video retrieval and content analysis. Through ef?cient data partitioning and movement, we parallelize feature extraction routines. By interleaving feature extraction routines of different types, we increase the computational intensity and mitigate the negative effects of histogram-like reduction operations. To cope with the very large number of semantic concepts, we propose a data layout of concept models on a multi-GPU hybrid architecture for high throughput semantic concept detection. We achieve one to two orders of magnitude speedups compared to serial implementations and our experiments show that we can detect 374 semantic concepts at a rate of over 100 frames/sec. This is over 100 times faster than a LibSVM-based semantic concept detection.
  • Efficient GPU implementation for Particle in Cell Algorithm Authors: Rejith Joseph (University of Florida, USA); Girish Ravunnikutty (University of Florida, USA); Sanjay Ranka (University of Florid
    Particle in cell (PIC) algorithm is a widely used method in plasma physics to study the trajectories of charged particles under electromagnetic ?elds. The PIC algorithm is computationally intensive and its time requirements are proportional to the number of charged particles involved in the simulation. The focus of the paper is to parallelize the PIC algorithm on Graphics Processing Unit (GPU). We present several performance trade-offs related to small shared memory and atomic operations on the GPU to achieve high performance.

SESSION 11: Multiprocessing and Concurrency

  • Hardware-based Job Queue Management for Manycore Architectures and OpenMP Environments Authors: Junghee Lee (Georgia Institute of Technology, USA); Chrysostomos Nicopoulos (University of Cyprus, Cyprus); Yongjae Lee (Georgi
    The seemingly interminable dwindle of technology feature sizes well into the nano-scale regime has afforded computer architects with an abundance of computational resources on a single chip. The Chip Multi-Processor (CMP) paradigm is now seen as the de facto architecture for years to come. However, in order to ef?ciently exploit the increasing number of on-chip processing cores, it is imperative to achieve and maintain ef?cient utilization of the resources at run time. Uneven and skewed distribution of workloads misuses the CMP resources and may even lead to such undesired effects as traf?c and temperature hotspots. While existing techniques rely mostly on software for the undertaking of load balancing duties and exploit hardware mainly for synchronization, we will demonstrate that there are wider opportunities for hardware support of load balancing in CMP systems. Based on this fact, this paper proposes IsoNet, a con?ict-free dynamic load distribution engine that exploits hardware aggressively to reinforce massively parallel computation in manycore settings. Moreover, the proposed architecture provides extensive fault-tolerance against both CPU faults and intra-IsoNet faults. The hardware takes charge of both (1) the management of the list of jobs to be executed, and (2) the transfer of jobs between processing elements to maintain load balance. Experimental results show that, unlike the existing popular techniques of blocking and job stealing, IsoNet is scalable with as many as 1024 processing cores.
  • HK-NUCA: Boosting Data Searches in Dynamic Non-Uniform Cache Architectures for Chip Multiprocessors Authors: Javier Lira (Universitat Politècnica de Catalunya, Spain); Carlos Molina (Universitat Rovira i Virgili, Spain); Antonio
    The exponential increase in the cache sizes of multicore processors (CMPs) accompanied by growing on-chip wire delays make it dif?cult to implement traditional caches with single and uniform access latencies. Non-Uniform Cache Architecture (NUCA) designs have been proposed to address this problem. NUCA divides the whole cache memory into smaller banks and allows nearer cache banks to have lower access latencies than farther banks, thus mitigating the effects of the cache’s internal wires. Traditionally, NUCA organizations have been classi?ed as static (S-NUCA) and dynamic (D-NUCA). While in S-NUCA a data block is mapped to a unique bank in the NUCA cache, D-NUCA allows a data block to be mapped in multiple banks. Besides, D-NUCA designs are dynamic in the sense that data blocks may migrate towards the cores that access them most frequently. Recent works consider D-NUCA as a promising design, however, in order to obtain signi?cant performance bene?ts, they used a non-affordable access scheme mechanism to ?nd data in the NUCA cache. In this paper, we propose a novel and implementable data search algorithm for D-NUCA designs in CMP architectures, which is called HK-NUCA (Home Knows where to ?nd data within the NUCA cache). It exploits migration features by providing fast and power ef?cient accesses to data which is located close to the requesting core. Moreover, HK-NUCA implements an ef?cient and cost-effective search mechanism to reduce miss latency and on-chip network contention. We show that using HK-NUCA as data search mechanism in a D-NUCA design reduces about 40% energy consumed per each memory request, and achieves an average performance improvement of 6%.
  • Power Token Balancing: Adapting CMPs to Power Constraints for Parallel Multithreaded Workloads Authors: Juan M. Cebrián (University of Murcia, Spain); Juan L. Aragón (University of Murcia, Spain); Stefanos Kaxiras (Un
    In the recent years virtually all processor architectures employ multiple cores per chip (CMPs). It is possible to use legacy (i.e., single-core) power saving techniques in CMPs which run either sequential applications or independent multithreaded workloads. However, new challenges arise when running parallel shared-memory applications. In the later case, sacri?cing some performance in a single core (thread) in order to be more energy-ef?cient might unintentionally delay the rest of cores (threads) due to synchronization points (locks/barriers), therefore, harming the performance of the whole application. CMPs increasingly face thermal and power-related problems during their typical use. Such problems can be solved by setting a power budget to the processor/core. This paper initially studies the behavior of different techniques to match a prede?ned power budget in a CMP processor. While legacy techniques properly work for thread independent/multi-programmed workloads, parallel workloads exhibit the problem of independently adapting the power of each core in a thread dependent scenario. In order to solve this problem we propose a novel mechanism, Power Token Balancing (PTB), aimed at accurately matching an external power constraint by balancing the power consumed among the different cores using a power tokenbased approach while optimizing the energy ef?ciency. We can use power (seen as tokens or coupons) from non-critical threads for the bene?t of critical threads. PTB runs transparent for thread independent / multiprogrammed workloads and can be also used as a spinlock detector based on power patterns. Results show that PTB matches more accurately a prede?ned power budget (total energy consumed over the budget is reduced to 8% for a 16-core CMP) than DVFS with only a 3% energy increase. Finally, we can trade accuracy on matching the power budget for energy-ef?ciency reducing the energy a 4% with a 20% of accuracy.
  • A Very Fast Simulator For Exploring The Many-Core Future Authors: Olivier Certner (INRIA, France); Zheng Li (INRIA, France); Arun Raman (Princeton University, USA); Olivier Temam (INRIA Futurs,
    Although multi-core architectures with a large number of cores (“many-cores”) are considered the future of computing systems, there are currently few practical tools to quickly explore both their design and general program scalability. In this paper, we present SiMany, a discrete-event-based many-core simulator able to support more than a thousand cores while being orders of magnitude faster than existing ?exible approaches. One of the dif?cult challenges for a reasonably realistic many-core simulation is to model faithfully the potentially high concurrency a program can exhibit. SiMany uses a novel virtual time synchronization technique, called spatial synchronization, to achieve this goal in a completely local and distributed fashion, which diminishes interactions and preserves locality. Compared to previous simulators, it raises the level of abstraction by focusing on modeling concurrent interactions between cores, which enables fast coarse comparisons of high-level architecture design choices and parallel programs performance. Sequential pieces of code are executed natively for maximal speed. We exercise the simulator with a set of dwarf-like task-based benchmarks with dynamic control ?ow and irregular data structures. Scalability results are validated through comparison with a cycle-level simulator up to 64 cores. They are also shown consistent with well-known benchmark characteristics. We ?nally demonstrate how SiMany can be used to ef?ciently compare the benchmarks’ behavior over a wide range of architectural organizations, such as polymorphic architectures and network of clusters