IEEE IPDPS 2011
TechTalks from event: IEEE IPDPS 2011
Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without requiring log-in. For other talks, you will need to log-in using the email you registered for IPDPS 2011. Note 2: Many of the talks (those without a thumbnail next to the their description below) are yet to be uploaded. Some of them were not recorded because of technical problems. We are working with the corresponding authors to upload the self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.
SESSION 19: Storage Systems and Memory
H-Code: A Hybrid MDS Array Code to Optimize Partial Stripe Writes in RAID-6RAID-6 is widely used to tolerate concurrent failures of any two disks to provide a higher level of reliability with the support of erasure codes. Among many implementations, one class of codes called Maximum Distance Separable (MDS) codes aims to offer data protection against disk failures with optimal storage ef?ciency. Typical MDS codes contain horizontal and vertical codes. Due to the horizontal parity, in the case of partial stripe write (refers to I/O operations that write new data or update data to a subset of disks in an array) in a row, horizontal codes may get less I/O operations in most cases, but suffer from unbalanced I/O distribution. They also have limitation on high single write complexity. Vertical codes improve single write complexity compared to horizontal codes, while they still suffer from poor performance in partial stripe writes. In this paper, we propose a new XOR-based MDS array code, named Hybrid Code (H-Code), which optimizes partial stripe writes for RAID-6 by taking advantages of both horizontal and vertical codes. H-Code is a solution for an array of (p + 1) disks, where p is a prime number. Unlike other codes taking a dedicated anti-diagonal parity strip, H-Code uses a special anti-diagonal parity layout and distributes the anti-diagonal parity elements among disks in the array, which achieves a more balanced I/O distribution. On the other hand, the horizontal parity of H-Code ensures a partial stripe write to continuous data elements in a row share the same row parity chain, which can achieve optimal partial stripe write performance. Not only within a row but also within a stripe, H-Code offers optimal partial stripe write complexity to two continuous data elements and optimal partial stripe write performance among all MDS codes to the best of our knowledge. Speci?cally, compared to RDP and EVENODD codes, H-Code reduces I/O cost by up to 15:54% and 22:17%. Overall, H-code has optimal storage ef?ciency, optimal encoding/decoding computational complexity, optimal complexity of both single write and partial stripe write.
LACIO: A New Collective I/O Strategy for Parallel I/O SystemsParallel applications bene?t considerably from the rapid advance of processor architectures and the available massive computational capability, but their performance suffers from large latency of I/O accesses. The poor I/O performance has been attributed as a critical cause of the low sustained performance of parallel systems. Collective I/O is widely considered a critical solution that exploits the correlation among I/O accesses from multiple processes of a parallel application and optimizes the I/O performance. However, the conventional collective I/O strategy makes the optimization decision based on the logical ?le layout to avoid multiple ?le system calls and does not take the physical data layout into consideration. On the other hand, the physical data layout in fact decides the actual I/O access locality and concurrency. In this study, we propose a new collective I/O strategy that is aware of the underlying physical data layout. We con?rm that the new Layout-Aware Collective I/O (LACIO) improves the performance of current parallel I/O systems effectively with the help of noncontiguous ?le system calls. It holds promise in improving the I/O performance for parallel systems.
Using Shared Memory to Accelerate MapReduce on Graphics Processing UnitsModern General Purpose Graphics Processing Units (GPGPUs) provide high degrees of parallelism in computation and memory access, making them suitable for data parallel applications such as those using the elastic MapReduce model. Yet designing a MapReduce framework for GPUs faces signi?cant challenges brought by their multi-level memory hierarchy. Due to the absence of atomic operations in the earlier generations of GPUs, existing GPU MapReduce frameworks have problems in handling input/output data with varied or unpredictable sizes. Also, existing frameworks utilize mostly a single level of memory, i.e., the relatively spacious yet slow global memory. In this work, we attempt to explore the potential bene?t of enabling a GPU MapReduce framework to use multiple levels of the GPU memory hierarchy. We propose a novel GPU data staging scheme for MapReduce workloads, tailored toward the GPU memory hierarchy. Centering around the ef?cient utilization of the fast but very small shared memory, we designed and implemented a GPU MapReduce framework, whose key techniques include (1) shared memory staging area management, (2) thread-role partitioning, and (3) intra-block thread synchronization. We carried out evaluation with ?ve popular MapReduce workloads and studied their performance under different GPU memory usage choices. Our results reveal that exploiting GPU shared memory is highly promising for the Map phase (with an average 2.85x speedup over using global memory only), while in the Reduce phase the bene?t of using shared memory is much less pronounced, due to the high input-to-output ratio. In addition, when compared to Mars, an existing GPU MapReduce framework, our system is shown to bring a signi?cant speedup in Map/Reduce phases.
Unified Signatures for Improving Performance in Transactional MemoryTransactional Memory (TM) promises to increase programmer productivity by making it easier to write correct parallel programs. In ful?lling this goal, a TM system should maximize its performance with limited hardware resources. Con?ict detection is an essential element for maintaining correctness among concurrent transactions in a TM system. Hardware signatures have been proposed as an area-ef?cient method for detecting con?icts. However, signatures can degrade TM performance by falsely declaring con?icts. Hence, increasing the quality of signatures within a given hardware budget is a crucial issue for TM to be adopted as a mainstream programming model. In this paper, we propose a simple and effective signature design, uni?ed signature. Instead of using separate read- and write-signatures, as is often done in TM systems, we implement a single signature to track all read- and write-accesses. By merging read- and write-signatures, a uni?ed signature can effectively enlarge the signature size without additional overhead. Within the constraints of a given hardware budget, a TM system with a uni?ed signature outperforms a baseline system with the same hardware budget by reducing the number of falsely detected con?icts. Even though the uni?ed signature scheme incurs read-after-read dependencies, we show that these false dependencies do not negate the bene?t of uni?ed signatures for practical signature sizes. A TM system with 2K-bit uni?ed signatures achieves average speedups of 22% over baseline TM systems.