IEEE IPDPS 2011
TechTalks from event: IEEE IPDPS 2011
Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without requiring login. For other talks, you will need to login using the email you registered for IPDPS 2011. Note 2: Many of the talks (those without a thumbnail next to the their description below) are yet to be uploaded. Some of them were not recorded because of technical problems. We are working with the corresponding authors to upload the selfrecorded versions here. We sincerely thank all authors for their efforts in making their videos available.
SESSION 13: Distributed Algorithms and Models

Adding a referee to an interconnection network: What can(not) be computed in one round.In this paper we ask which properties of a distributed network can be computed from a few amount of local information provided by its nodes. The distributed model we consider is a restriction of the classical CONGEST (distributed) model and it is close to the simultaneous messages (communication complexity) model de?ned by Babai, Kimmel and Lokam. More precisely, each of these n nodes which only knows its own ID and the IDs of its neighbors is allowed to send a message of O(log n) bits to some central entity, called the referee. Is it possible for the referee to decide some basic structural properties of the network topology G? We show that simple questions like, â€œdoes G contain a square?â€, â€œdoes G contain a triangle?â€ or â€œIs the diameter of G at most 3?â€ cannot be solved in general. On the other hand, the referee can decode the messages in order to have full knowledge of G when G belongs to many graph classes such as planar graphs, bounded treewidth graphs and, more generally, bounded degeneracy graphs. We leave open questions related to the connectivity of arbitrary graphs.

Improved Algorithms for the Distributed Trigger Counting ProblemConsider a distributed system with n processors, in which each processor receives some triggers from an external source. The distributed trigger counting (DTC) problem is to raise an alert and report to a user when the number of triggers received by the system reaches w, where w is a userspeci?ed input. The problem has applications in monitoring, global snapshots, synchronizers and other distributed settings. In this paper, we present two decentralized and randomized algorithms for the DTC problem. The ?rst algorithm has message complexity O(n log w) and no processor receives more than O(log w) messages with high probability. It does not provide any bound on the messages sent per processor. This algorithm assumes complete connectivity between the processors. The second algorithm has message complexity O(n log n log w) and no processor exchanges more than O(log n log w) messages with high probability. However, there is a negligible failure probability in raising the alert on receiving w triggers. This algorithm only requires that a constant degree tree be embeddable in the underlying communication graph.

The Weighted Byzantine Agreement ProblemThis paper presents a weighted version of the Byzantine Agreement Problem and its solution under various conditions. In this version, each machine is assigned a weight depending on the application. Instead of assuming that at most f out of N machines fail, the algorithm assumes that the total weight of the machines that fail is at most f/N. When each machine has weight 1/N, this problem reduces to the standard Byzantine Generals Agreement Problem. By choosing weights appropriately, the weighted Byzantine Agreement Problem can be applied to situations where a subset of processes are more trusted. By using weights, the system can reach consensus in the presence of Byzantine failures, even when more than N/3 processes fail, so long as the total weight of the failed processes is less than 1/3. Also, a method to update the weights of the processes after execution of the weighted Byzantine Agreement is given. The update method guarantees that the weight of any correct process is never reduced and the weight of any faulty process, suspected by correct processes whose total weight is at least 1/4, is reduced to 0 for future instances. A short discussion of some weight assignment strategies is also given.

Leveraging Social Networks to Combat Collusion in Reputation Systems for PeertoPeer NetworksIn peertopeer networks (P2Ps), many autonomous peers without preexisting trust relationships share resources with each other. Due to their open environment, the P2Ps usually employ reputation systems to provide guidance in selecting trustworthy resource providers for high reliability and security. However, node collusion impairs the effectiveness of reputation systems in trustworthy node selection. Although some reputation systems have certain mechanisms to counter collusion, the effectiveness of the mechanisms is not suf?ciently high. In this paper, we leverage social networks to enhance the capability of reputation systems in combating collusion. We ?rst analyzed real trace of the reputation system in the Overstock online auction platform which incorporates a social network. The analysis reveals the important impact of the social network on user purchasing and reputation rating patterns. We thus identi?ed suspicious collusion behavior patterns and propose a social network based mechanism, namely SocialTrust, to counter collusion. SocialTrust adaptively adjusts the weight of ratings based on the social distance and interest relationship between peers. Experimental results show that SocialTrust can signi?cantly strengthen the capability of current reputation systems in combating collusion.
SESSION 21: Numerical Algorithms

QR Factorization on a Multicore Node Enhanced with Multiple GPU AcceleratorsOne of the major trends in the design of exascale architectures is the use of multicore nodes enhanced with GPU accelerators. Exploiting all resources of a hybrid acceleratorsbased node at their maximum potential is thus a fundamental step towards exascale computing. In this article, we present the design of a highly ef?cient QR factorization for such a node. Our method is in three steps. The ?rst step consists of expressing the QR factorization as a sequence of tasks of well chosen granularity that will aim at being executed on a CPU core or a GPU. We show that we can ef?ciently adapt highlevel algorithms from the literature that were initially designed for homogeneous multicore architectures. The second step consists of designing the kernels that implement each individual task. We use CPU kernels from previous work and present new kernels for GPUs that complement kernels already available in the MAGMA library. We show the impact on performance of these GPU kernels. In particular, we present the bene?ts of new hybrid CPU/GPU kernels. The last step consists of scheduling these tasks on the computational units. We present two alternative approaches, respectively based on static and dynamic scheduling. In the case of static scheduling, we exploit the a priori knowledge of the schedule to perform successive optimizations leading to very high performance. We, however, highlight the lack of portability of this approach and its limitations to relatively simple algorithms on relatively homogeneous nodes. Alternatively, by relying on an ef?cient runtime system, StarPU, in charge of ensuring data availability and coherency, we can schedule more complex algorithms on complex heterogeneous nodes with much higher productivity. In this latter case, we show that we can achieve high performance in a portable way thanks to a ?ne interaction between the application and the runtime system. We demonstrate that the obtained performance is very close to the theoretical upper bounds that we obtained using Linear Programming.

TwoStage Tridiagonal Reduction for Dense Symmetric Matrices using Tile Algorithms on Multicore ArchitecturesWhile successful implementations have already been written for onesided transformations (e.g., QR, LU and Cholesky factorizations) on multicore architecture, getting high performance for twosided reductions (e.g., Hessenberg, tridiagonal and bidiagonal reductions) is still an open and dif?cult research problem due to expensive memorybound operations occurring during the panel factorization. The processormemory speed gap continues to widen, which has even further exacerbated the problem. This paper focuses on an ef?cient implementation of the tridiagonal reduction, which is the ?rst algorithmic step toward computing the spectral decomposition of a dense symmetric matrix. The original matrix is translated into a tile layout i.e., a high performance data representation, which substantially enhances data locality. Following a twostage approach, the tile matrix is then transformed into band tridiagonal form using compute intensive kernels. The band form is further reduced to the required tridiagonal form using a leftlooking bulge chasing technique to reduce memory traf?c and memory contention. A dependence translation layer associated with a dynamic runtime system allows for scheduling and overlapping tasks generated from both stages. The obtained tile tridiagonal reduction signi?cantly outperforms the stateoftheart numerical libraries (10X against multithreaded LAPACK with optimized MKL BLAS and 2.5X against the commercial numerical software Intel MKL) from medium to large matrix sizes.

An Autotuned Method for Solving Large Tridiagonal Systems on the GPUWe present a multistage method for solving large tridiagonal systems on the GPU. Previously large tridiagonal systems cannot be ef?ciently solved due to the limitation of onchip shared memory size. We tackle this problem by splitting the systems into smaller ones and then solving them onchip. The multistage characteristic of our method, together with various workloads and GPUs of different capabilities, obligates an autotuning strategy to carefully select the switch points between computation stages. In particular, we show two ways to effectively prune the tuning space and thus avoid an impractical exhaustive search: (1) apply algorithmic knowledge to decouple tuning parameters, and (2) estimate search starting points based on GPU architecture parameters. We demonstrate that autotuning is a powerful tool that improves the performance by up to 5x, saves 17% and 32% of execution time on average respectively over static and dynamic tuning, and enables our multistage solver to outperform the Intel MKL tridiagonal solver on many parallel tridiagonal systems by 611x.

A communicationavoiding, hybridparallel, rankrevealing orthogonalization methodOrthogonalization consumes much of the run time of many iterative methods for solving sparse linear systems and eigenvalue problems. Commonly used algorithms, such as variants of GramSchmidt or Householder QR, have performance dominated by communication. Here, â€communicationâ€ includes both data movement between the CPU and memory, and messages between processors in parallel. Our Tall Skinny QR (TSQR) family of algorithms requires asymptotically fewer messages between processors and data movement between CPU and memory than typical orthogonalization methods, yet achieves the same accuracy as Householder QR factorization. Furthermore, in block orthogonalizations, TSQR is faster and more accurate than existing approaches for orthogonalizing the vectors within each block (â€normalizationâ€). TSQRâ€™s rankrevealing capability also makes it useful for detecting de?ation in block iterative methods, for which existing approaches sacri?ce performance, accuracy, or both. We have implemented a version of TSQR that exploits both distributedmemory and sharedmemory parallelism, and supports real and complex arithmetic. Our implementation is optimized for the case of orthogonalizing a small number (5â€“ 20) of very long vectors. The sharedmemory parallel component uses Intelâ€™s Threading Building Blocks, though its modular design supports other sharedmemory programming models as well, including computation on the GPU. Our implementation achieves speedups of 2 times or more over competing orthogonalizations. It is available now in the development branch of the Trilinos software package, and will be included in the 10.8 release.
SESSION 22: Fault Tolerance

Flease  Lease Coordination Without a Lock ServerLargescale distributed systems often require scalable and faulttolerant mechanisms to coordinate exclusive access to shared resources such as ?les, replicas or the primary role. The best known algorithms to implement distributed mutual exclusion with leases, such as Multipaxos, are complex, dif?cult to implement, and rely on stable storage to persist lease information. In this paper we present FLEASE, an algorithm for faulttolerant lease coordination in distributed systems that is simpler than Multipaxos and does not rely on stable storage. The evaluation shows that FLEASE can be used to implement scalable, decentralized lease coordination that outperforms a central lock service implementation by an order of magnitude.

Uncoordinated Checkpointing Without Domino Effect for SendDeterministic MPI ApplicationsAs reported by many recent studies, the mean time between failures of future postpetascale supercomputers is likely to reduce, compared to the current situation. The most popular fault tolerance approach for MPI applications on HPC Platforms relies on coordinated checkpointing which raises two major issues: a) global restart wastes energy since all processes are forced to rollback even in the case of a single failure; b) checkpoint coordination may slow down the application execution because of congestions on I/O resources. Alternative approaches based on uncoordinated checkpointing and message logging require logging all messages, imposing a high memory/storage occupation and a signi?cant overhead on communications. It has recently been observed that many MPI HPC applications are senddeterministic, allowing to design new fault tolerance protocols. In this paper, we propose an uncoordinated checkpointing protocol for senddeterministic MPI HPC applications that (i) logs only a subset of the application messages and (ii) does not require to restart systematically all processes when a failure occurs. We ?rst describe our protocol and prove its correctness. Through experimental evaluations, we show that its implementation in MPICH2 has a negligible overhead on application performance. Then we perform a quantitative evaluation of the properties of our protocol using the NAS Benchmarks. Using a clustering approach, we demonstrate that this protocol actually succeeds to combine the two expected properties: a) it logs only a small fraction of the messages and b) it reduces by a factor approaching 2 the average number of processes to rollback compared to coordinated checkpointing.

Minimal Obstructions for the Coordinated Attack Problem and BeyondWe consider the well known Coordinated Attack Problem, where two generals have to decide on a common attack, when their messengers can be captured by the enemy. Informally, this problem represents the dif?culties to agree in the present of communication faults. We consider here only omission faults (loss of message), but contrary to previous studies, we do not to restrict the way messages can be lost, ie. we use no speci?c failure metric. Our contribution is threefold. First, we introduce the study of arbitrary patterns of failure (â€omission schemesâ€), proposing notions and notations that revealed very convenient to handle. In the large subclass of omission schemes where the double simultaneous omission can never happen, we characterize which one are obstructions for the Coordinated Attack Problem. We present then some interesting applications. We show for the ?rst time that the well studied omission scheme, where at most one message can be lost at each round, is a kind of least worst case environment for the Coordinated Attack Problem. We also extend our study to networks of arbitrary size. In particular, we address an open question of Santoro and Widmayer about the Consensus Problem in communication networks with omission faults.

Scheduling Parallel Iterative Applications on Volatile ResourcesIn this paper we study the execution of iterative applications on volatile processors such as those found on desktop grids. We develop masterworker scheduling schemes that attempt to achieve good tradeoffs between worker speed and worker availability. A key feature of our approach is that we consider a communication model where the bandwidth capacity of the master for sending application data to workers is limited. This limitation makes the scheduling problem more dif?cult both in a theoretical sense and in a practical sense. Furthermore, we consider that a processor can be in one of three states: available, down, or temporarily preempted by its owner. This preempted state also complicates the scheduling problem. In practical settings, e.g., desktop grids, master bandwidth is limited and processors are temporarily reclaimed. Consequently, addressing the aforementioned dif?culties is necessary for successfully deploying masterworker applications on volatile platforms. Our ?rst contribution is to determine the complexity of the scheduling problem in its offline version, i.e., when processor availability behaviors are known in advance. Even with this knowledge, the problem is NPhard, and cannot be approximated within a factor 8=7. Our second contribution is a closedform formula for the expectation of the time needed by a worker to complete a set of tasks. This formula relies on a Markovian assumption for the temporal availability of processors, and is at the heart of some heuristics that aim at favoring â€œreliableâ€ processors in a sensible manner. Our third contribution is a set of heuristics, which we evaluate in simulation. Our results provide guidance to selecting the best strategy as a function of processor state availability versus average task duration.
 All Sessions
 IPDPS 2011 Keynotes and Panels
 SESSION: Best Papers
 Special NSFSEES Presentation
 Intel Platinum Patron Night
 25th Year IPDPS Celebration
 SESSION 14: Parallel Graph and Particle Algorithms
 SESSION 15: Distributed Systems and Networks
 SESSION 16: Programming Environments and Tools
 SESSION 1: Resource Management
 SESSION 17: Parallel Algorithms
 SESSION 2: Communication & I/O Optimization
 SESSION 18: Distributed Systems
 SESSION 3: HardwareSoftware Interaction
 SESSION 19: Storage Systems and Memory
 SESSION 4: Runtime Systems
 SESSION 20: Operating Systems and Resource Management
 SESSION 5: Routing and Communication
 SESSION 6: Self Stabilization and Security
 SESSION 7: Numerical Algorithms
 SESSION 8: Reliability and Security
 SESSION 9: Wireless and Sensor Networks
 SESSION 10: GPU Acceleration
 SESSION 11: Multiprocessing and Concurrency
 SESSION 12: Compilers
 SESSION 13: Distributed Algorithms and Models
 SESSION 21: Numerical Algorithms
 SESSION 22: Fault Tolerance
 SESSION 23: Resource Utilization
 SESSION 24: Parallel Programming Models and Languages
 SESSION 25: Algorithms for Distributed Computing
 SESSION 26: Scheduling
 SESSION 27: Computational Biology and Simulations
 SESSION 28: Cloud Computing