IEEE IPDPS 2011
TechTalks from event: IEEE IPDPS 2011
Note 1: Only plenary sessions (keynotes, panels, and best papers) are accessible without requiring log-in. For other talks, you will need to log-in using the email you registered for IPDPS 2011. Note 2: Many of the talks (those without a thumbnail next to the their description below) are yet to be uploaded. Some of them were not recorded because of technical problems. We are working with the corresponding authors to upload the self-recorded versions here. We sincerely thank all authors for their efforts in making their videos available.
SESSION 23: Resource Utilization
Shared Resource Monitoring and Throughput Optimization in Cloud-Computing DatacentersMany datacenters employ server consolidation to maximize the efficiency of platform resource usage. As a result, multiple virtual machines (VMs) simultaneously run on each datacenter platform. Contention for shared resources between these virtual machines has an undesirable and non-deterministic impact on their performance behavior in such platforms. This paper proposes the use of shared resource monitoring to (a) understand the resource usage of each virtual machine on each platform, (b) collect resource usage and performance across different platforms to correlate implications of usage to performance, and (c) migrate VMs that are resource-constrained to improve overall datacenter throughput and improve Quality of Service (QoS). We focus our efforts on monitoring and addressing shared cache contention and propose a new optimization metric that captures the priority of the VM and the overall weighted throughput of the datacenter. We conduct detailed experiments emulating datacenter scenarios including on-line transaction processing workloads (based on TPC-C) middle-tier workloads (based on SPECjbb and SPECjAppServer) and financial workloads (based on PARSEC). We show that monitoring shared resource contention (such as shared cache) is highly beneficial to better manage throughput and QoS in a cloud-computing datacenter environment.
The Impact of Soft Resource Allocation on n-Tier Application ScalabilityGood performance and ef?ciency, in terms of high quality of service and resource utilization for example, are important goals in a cloud environment. Through extensive measurements of an n-tier application benchmark (RUBBoS), we show that overall system performance is surprisingly sensitive to appropriate allocation of soft resources (e.g., server thread pool size). Inappropriate soft resource allocation can quickly degrade overall application performance signi?cantly. Concretely, both under-allocation and over-allocation of thread pool can lead to bottlenecks in other resources because of non-trivial dependencies. We have observed some non-obvious phenomena due to these correlated bottlenecks. For instance, the number of threads in the Apache web server can limit the total useful throughput, causing the CPU utilization of the C-JDBC clustering middleware to decrease as the workload increases. We provide a practical iterative solution approach to this challenge through an algorithmic combination of operational queuing laws and measurement data. Our results show that soft resource allocation plays a central role in the performance scalability of complex systems such as n-tier applications in cloud environments.
Profiling Directed NUMA Optimisation on Linux Systems: A Case Study of the Gaussian Computational Chemistry CodeThe parallel performance of applications running on Non-Uniform Memory Access (NUMA) platforms is strongly in- ?uenced by the relative placement of memory pages to the threads that access them. As a consequence there are Linux application programmer interfaces (APIs) to control this. For large parallel codes it can, however, be dif?cult to determine how and when to use these APIs. In this paper we introduce the NUMAgrind pro?ling tool which can be used to simplify this process. It extends the Valgrind binary translation framework to include a model which incorporates cache coherency, memory locality domains and interconnect traf?c for arbitrary NUMA topologies. Using NUMAgrind, cache misses can be mapped to memory locality domains, page access modes determined, and pages that are referenced by multiple threads quickly determined. We show how the NUMAgrind tool can be used to guide the use of Linux memory and thread placement APIs in the Gaussian computational chemistry code. The performance of the code before and after use of these APIs is also presented for three different commodity NUMA platforms.
Model-Driven SIMD Code Generation for a Multi-Resolution Tensor KernelIn this paper, we describe a model-driven compile-time code generator that transforms a class of tensor contraction expressions into highly optimized short-vector SIMD code. We use as a case study a multi-resolution tensor kernel from the MADNESS quantum chemistry application. Performance of a C-based implementation is low, and because the dimensions of the tensors are small, performance using vendor optimized BLAS libraries is also suboptimal. We develop a model-driven code generator that determines the optimal loop permutation and placement of vector load/store, transpose, and splat operations in the generated code, enabling portable performance on short-vector SIMD architectures. Experimental results on an SSE-based platform demonstrate the ef?ciency of the vector-code synthesizer.