TechTalks from event: IEEE IPDPS 2011

Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without requiring log-in. For other talks, you will need to log-in using the email you registered for IPDPS 2011. Note 2: Many of the talks (those without a thumbnail next to the their description below) are yet to be uploaded. Some of them were not recorded because of technical problems. We are working with the corresponding authors to upload the self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.

SESSION 11: Multiprocessing and Concurrency

  • Hardware-based Job Queue Management for Manycore Architectures and OpenMP Environments Authors: Junghee Lee (Georgia Institute of Technology, USA); Chrysostomos Nicopoulos (University of Cyprus, Cyprus); Yongjae Lee (Georgi
    The seemingly interminable dwindle of technology feature sizes well into the nano-scale regime has afforded computer architects with an abundance of computational resources on a single chip. The Chip Multi-Processor (CMP) paradigm is now seen as the de facto architecture for years to come. However, in order to ef?ciently exploit the increasing number of on-chip processing cores, it is imperative to achieve and maintain ef?cient utilization of the resources at run time. Uneven and skewed distribution of workloads misuses the CMP resources and may even lead to such undesired effects as traf?c and temperature hotspots. While existing techniques rely mostly on software for the undertaking of load balancing duties and exploit hardware mainly for synchronization, we will demonstrate that there are wider opportunities for hardware support of load balancing in CMP systems. Based on this fact, this paper proposes IsoNet, a con?ict-free dynamic load distribution engine that exploits hardware aggressively to reinforce massively parallel computation in manycore settings. Moreover, the proposed architecture provides extensive fault-tolerance against both CPU faults and intra-IsoNet faults. The hardware takes charge of both (1) the management of the list of jobs to be executed, and (2) the transfer of jobs between processing elements to maintain load balance. Experimental results show that, unlike the existing popular techniques of blocking and job stealing, IsoNet is scalable with as many as 1024 processing cores.
  • HK-NUCA: Boosting Data Searches in Dynamic Non-Uniform Cache Architectures for Chip Multiprocessors Authors: Javier Lira (Universitat Politècnica de Catalunya, Spain); Carlos Molina (Universitat Rovira i Virgili, Spain); Antonio
    The exponential increase in the cache sizes of multicore processors (CMPs) accompanied by growing on-chip wire delays make it dif?cult to implement traditional caches with single and uniform access latencies. Non-Uniform Cache Architecture (NUCA) designs have been proposed to address this problem. NUCA divides the whole cache memory into smaller banks and allows nearer cache banks to have lower access latencies than farther banks, thus mitigating the effects of the cache’s internal wires. Traditionally, NUCA organizations have been classi?ed as static (S-NUCA) and dynamic (D-NUCA). While in S-NUCA a data block is mapped to a unique bank in the NUCA cache, D-NUCA allows a data block to be mapped in multiple banks. Besides, D-NUCA designs are dynamic in the sense that data blocks may migrate towards the cores that access them most frequently. Recent works consider D-NUCA as a promising design, however, in order to obtain signi?cant performance bene?ts, they used a non-affordable access scheme mechanism to ?nd data in the NUCA cache. In this paper, we propose a novel and implementable data search algorithm for D-NUCA designs in CMP architectures, which is called HK-NUCA (Home Knows where to ?nd data within the NUCA cache). It exploits migration features by providing fast and power ef?cient accesses to data which is located close to the requesting core. Moreover, HK-NUCA implements an ef?cient and cost-effective search mechanism to reduce miss latency and on-chip network contention. We show that using HK-NUCA as data search mechanism in a D-NUCA design reduces about 40% energy consumed per each memory request, and achieves an average performance improvement of 6%.
  • Power Token Balancing: Adapting CMPs to Power Constraints for Parallel Multithreaded Workloads Authors: Juan M. Cebrián (University of Murcia, Spain); Juan L. Aragón (University of Murcia, Spain); Stefanos Kaxiras (Un
    In the recent years virtually all processor architectures employ multiple cores per chip (CMPs). It is possible to use legacy (i.e., single-core) power saving techniques in CMPs which run either sequential applications or independent multithreaded workloads. However, new challenges arise when running parallel shared-memory applications. In the later case, sacri?cing some performance in a single core (thread) in order to be more energy-ef?cient might unintentionally delay the rest of cores (threads) due to synchronization points (locks/barriers), therefore, harming the performance of the whole application. CMPs increasingly face thermal and power-related problems during their typical use. Such problems can be solved by setting a power budget to the processor/core. This paper initially studies the behavior of different techniques to match a prede?ned power budget in a CMP processor. While legacy techniques properly work for thread independent/multi-programmed workloads, parallel workloads exhibit the problem of independently adapting the power of each core in a thread dependent scenario. In order to solve this problem we propose a novel mechanism, Power Token Balancing (PTB), aimed at accurately matching an external power constraint by balancing the power consumed among the different cores using a power tokenbased approach while optimizing the energy ef?ciency. We can use power (seen as tokens or coupons) from non-critical threads for the bene?t of critical threads. PTB runs transparent for thread independent / multiprogrammed workloads and can be also used as a spinlock detector based on power patterns. Results show that PTB matches more accurately a prede?ned power budget (total energy consumed over the budget is reduced to 8% for a 16-core CMP) than DVFS with only a 3% energy increase. Finally, we can trade accuracy on matching the power budget for energy-ef?ciency reducing the energy a 4% with a 20% of accuracy.
  • A Very Fast Simulator For Exploring The Many-Core Future Authors: Olivier Certner (INRIA, France); Zheng Li (INRIA, France); Arun Raman (Princeton University, USA); Olivier Temam (INRIA Futurs,
    Although multi-core architectures with a large number of cores (“many-cores”) are considered the future of computing systems, there are currently few practical tools to quickly explore both their design and general program scalability. In this paper, we present SiMany, a discrete-event-based many-core simulator able to support more than a thousand cores while being orders of magnitude faster than existing ?exible approaches. One of the dif?cult challenges for a reasonably realistic many-core simulation is to model faithfully the potentially high concurrency a program can exhibit. SiMany uses a novel virtual time synchronization technique, called spatial synchronization, to achieve this goal in a completely local and distributed fashion, which diminishes interactions and preserves locality. Compared to previous simulators, it raises the level of abstraction by focusing on modeling concurrent interactions between cores, which enables fast coarse comparisons of high-level architecture design choices and parallel programs performance. Sequential pieces of code are executed natively for maximal speed. We exercise the simulator with a set of dwarf-like task-based benchmarks with dynamic control ?ow and irregular data structures. Scalability results are validated through comparison with a cycle-level simulator up to 64 cores. They are also shown consistent with well-known benchmark characteristics. We ?nally demonstrate how SiMany can be used to ef?ciently compare the benchmarks’ behavior over a wide range of architectural organizations, such as polymorphic architectures and network of clusters

SESSION 12: Compilers

  • Variable Granularity Access Tracking Scheme for Improving the Performance of Software Transactional Memory Authors: Sandya Mannarswamy (Hewlett Packard India, India); Govindarajan Ramaswamy (Indian Institute of Science, India)
    Software transactional memory (STM) has been proposed as a promising programming paradigm for shared memory multi-threaded programs as an alternative to conventional lock based synchronization primitives. Typical STM implementations employ a con?ict detection scheme, which works with uniform access granularity, tracking shared data accesses either at word/cache line or at object level. It is well known that a single ?xed access tracking granularity cannot meet the con?icting goals of reducing false con?icts without impacting concurrency adversely. A ?ne grained granularity while improving concurrency can have an adverse impact on performance due to lock aliasing, lock validation overheads, and additional cache pressure. On the other hand, a coarse grained granularity can impact performance due to reduced concurrency. Thus, in general, a ?xed or uniform granularity access tracking (UGAT) scheme is application-unaware and rarely matches the access patterns of individual application or parts of an application, leading to sub-optimal performance for different parts of the application(s). In order to mitigate the disadvantages associated with UGAT scheme, we propose a Variable Granularity Access Tracking (VGAT) scheme in this paper. We propose a compiler based approach wherein the compiler uses inter-procedural whole program static analysis to select the access tracking granularity for different shared data structures of the application based on the application’s data access pattern. We describe our prototype VGAT scheme, using TL2 as our STM implementation. Our experimental results reveal that VGAT-STM scheme can improve the application performance of STAMP benchmarks from 1.87% to up to 21.2%.
  • Automated architecture-aware mapping of streaming applications onto GPUs Authors: Andrei Hagiescu (National University of Singapore, Singapore); Huynh Phung Huynh (A*STAR Institute of High Performance Computin
    Graphic Processing Units (GPUs) are made up of many streaming multiprocessors, each consisting of processing cores that interleave the execution of a large number of threads. Groups of threads - called warps and wavefronts, respectively, in nVidia and AMD literature - are selected by the hardware scheduler and executed in lockstep on the available cores. If threads in such a group access the slow off-chip global memory, the entire group has to be stalled, and another group is scheduled instead. The utilization of a given multiprocessor will remain high if there is a suf?cient number of alternative thread groups to select from. Many parallel general purpose applications have been ef?ciently mapped to GPUs. Unfortunately, many stream processing applications exhibit unfavorable data movement patterns and low computation-to-communication ratio that may lead to poor performance. In this paper, we describe an automated compilation ?ow that maps most stream processing applications onto GPUs by taking into consideration two important architectural features of nVidia GPUs, namely interleaved execution as well as the small amount of shared memory available in each streaming multiprocessors. In particular, we show that using a small number of compute threads such that the memory footprint is reduced, we can achieve high utilization of the GPU cores. Our scheme goes against the conventional wisdom of GPU programming which is to use a large number of homogeneous threads. Instead, it uses a mix of compute and memory access threads, together with a carefully crafted schedule that exploits parallelism in the streaming application, while maximizing the effectiveness of the unique memory hierarchy. % small on-chip memory located within each streaming multiprocessor. We have implemented our scheme in the compiler of the StreamIt programming language, and our results show a signi?cant speedup compared to the state-of-the-art solutions.
  • Automatic Loop Tiling for Direct Memory Access Authors: Haibo Lin (IBM Research - China, P.R. China); Tao Liu (IBM Research - China, P.R. China); Lakshminarayanan Renganarayana (IBM
    In heterogeneous multi-core systems, such as the Cell BE processor, each accelerator core has its own fast local memory without hardware supported coherence and the software is responsible to dynamically transfer data between the fast local and slow global memory. The data can be transferred through either a software controlled cache or a direct buffer. The software controlled cache maintains correctness for arbitrary access patterns, but introduces the extra overhead of cache lookup. Direct buffer is ef?cient for regular accesses, while requiring precise analysis, detailed modeling of execution, and signi?cant code generation. In this paper we present the design and implementation of DMATiler which combines compiler analysis and runtime management to optimize local memory performance via automatic loop tiling and buffer optimization techniques. The DMATiler chooses a data transfer friendly loop order and using a empirically validated DMA performance model, it formulates and solves a convex optimization problem to determine globally optimal tile sizes. Further, the DMATiler applies optimization techniques such compressed data transfers and DMA commands to achieve the best DMA performance for a given loop nest. We have implemented the DMATiler in the IBM XL Single Source Compiler (SSC), and have conducted experiments with a set of loop nest benchmarks. The results show that the DMATiler is much more ef?cient than software controlled cache (average speedup of 9.8x) and single level loop blocking (average speedup of 6.2x) on the Cell BE processor.
  • Tolerant Value Speculation in Coarse-Grain Streaming Computations Authors: Nathaniel Azuelos (Technion, Israel); Idit Keidar (Technion, Israel); Ayal Zaks (IBM Haifa Research Lab, Israel)
    Streaming applications are the subject of growing interest, as the need for fast access to data continues to grow. In this work, we present the design requirements and implementation of coarse-grain value speculation in streaming applications. We explain how this technique can be useful in cases where serial parts of applications constitute bottlenecks, and when slower I/O favors using available pre?xes of the data. Contrary to previous work, we show how allowing some tolerance can justify early predictions on a scale of a large window of values. We suggest a methodology for runtime support of speculation, along with the mechanisms required for rollback. We present resource management issues consequent to our technique. We study how validation and speculation frequencies impact the performance of the program. Finally, we present our implementation in the context of the Huffman encoder benchmark, running it in different con?gurations and on different architectures.

SESSION 13: Distributed Algorithms and Models

  • Adding a referee to an interconnection network: What can(not) be computed in one round. Authors: Florent Becker (LIFO, Universite d’Orleans, France); Martin Matamala (DIM, Universidad de Chile, Chile); Nicolas Nisse (I
    In this paper we ask which properties of a distributed network can be computed from a few amount of local information provided by its nodes. The distributed model we consider is a restriction of the classical CONGEST (distributed) model and it is close to the simultaneous messages (communication complexity) model de?ned by Babai, Kimmel and Lokam. More precisely, each of these n nodes -which only knows its own ID and the IDs of its neighbors- is allowed to send a message of O(log n) bits to some central entity, called the referee. Is it possible for the referee to decide some basic structural properties of the network topology G? We show that simple questions like, “does G contain a square?”, “does G contain a triangle?” or “Is the diameter of G at most 3?” cannot be solved in general. On the other hand, the referee can decode the messages in order to have full knowledge of G when G belongs to many graph classes such as planar graphs, bounded treewidth graphs and, more generally, bounded degeneracy graphs. We leave open questions related to the connectivity of arbitrary graphs.
  • Improved Algorithms for the Distributed Trigger Counting Problem Authors: Venkatesan T Chakaravarthy (IBM Research (India), India); Anamitra Roy Choudhury (IBM Research - India, India); Yogish Sabharwa
    Consider a distributed system with n processors, in which each processor receives some triggers from an external source. The distributed trigger counting (DTC) problem is to raise an alert and report to a user when the number of triggers received by the system reaches w, where w is a user-speci?ed input. The problem has applications in monitoring, global snapshots, synchronizers and other distributed settings. In this paper, we present two decentralized and randomized algorithms for the DTC problem. The ?rst algorithm has message complexity O(n log w) and no processor receives more than O(log w) messages with high probability. It does not provide any bound on the messages sent per processor. This algorithm assumes complete connectivity between the processors. The second algorithm has message complexity O(n log n log w) and no processor exchanges more than O(log n log w) messages with high probability. However, there is a negligible failure probability in raising the alert on receiving w triggers. This algorithm only requires that a constant degree tree be embeddable in the underlying communication graph.
  • The Weighted Byzantine Agreement Problem Authors: Vijay Garg (The University of Texas at Austin, USA); John Bridgman (The University of Texas at Austin, USA)
    This paper presents a weighted version of the Byzantine Agreement Problem and its solution under various conditions. In this version, each machine is assigned a weight depending on the application. Instead of assuming that at most f out of N machines fail, the algorithm assumes that the total weight of the machines that fail is at most f/N. When each machine has weight 1/N, this problem reduces to the standard Byzantine Generals Agreement Problem. By choosing weights appropriately, the weighted Byzantine Agreement Problem can be applied to situations where a subset of processes are more trusted. By using weights, the system can reach consensus in the presence of Byzantine failures, even when more than N/3 processes fail, so long as the total weight of the failed processes is less than 1/3. Also, a method to update the weights of the processes after execution of the weighted Byzantine Agreement is given. The update method guarantees that the weight of any correct process is never reduced and the weight of any faulty process, suspected by correct processes whose total weight is at least 1/4, is reduced to 0 for future instances. A short discussion of some weight assignment strategies is also given.
  • Leveraging Social Networks to Combat Collusion in Reputation Systems for Peer-to-Peer Networks Authors: Ze Li (Clemson University, USA); Haiying Shen (Clemson University, USA): Karan Sapra (Clemson University, USA)
    In peer-to-peer networks (P2Ps), many autonomous peers without preexisting trust relationships share resources with each other. Due to their open environment, the P2Ps usually employ reputation systems to provide guidance in selecting trustworthy resource providers for high reliability and security. However, node collusion impairs the effectiveness of reputation systems in trustworthy node selection. Although some reputation systems have certain mechanisms to counter collusion, the effectiveness of the mechanisms is not suf?ciently high. In this paper, we leverage social networks to enhance the capability of reputation systems in combating collusion. We ?rst analyzed real trace of the reputation system in the Overstock online auction platform which incorporates a social network. The analysis reveals the important impact of the social network on user purchasing and reputation rating patterns. We thus identi?ed suspicious collusion behavior patterns and propose a social network based mechanism, namely SocialTrust, to counter collusion. SocialTrust adaptively adjusts the weight of ratings based on the social distance and interest relationship between peers. Experimental results show that SocialTrust can signi?cantly strengthen the capability of current reputation systems in combating collusion.