Please help transcribe this video using our simple transcription tool. You need to be logged in to do so.
On multicore architectures, the ratio of peak memory bandwidth to peak ?oating-point performance (byte:?op ratio) is decreasing as core counts increase, further limiting the performance of bandwidth limited applications. Multiplying a sparse matrix (as well as its transpose in the unsymmetric case) with a dense vector is the core of sparse iterative methods. In this paper, we present a new multithreaded algorithm for the symmetric case which potentially cuts the bandwidth requirements in half while exposing lots of parallelism in practice. We also give a new data structure transformation, called bitmasked register blocks, which promises signi?cant reductions on bandwidth requirements by reducing the number of indexing elements without introducing additional ?ll-in zeros. Our work shows how to incorporate this transformation into existing parallel algorithms (both symmetric and unsymmetric) without limiting their parallel scalability. Experimental results indicate that the combined bene?ts of bitmasked register blocks and the new symmetric algorithm can be as high as a factor of 3.5x in multicore performance over an already scalable parallel approach. We also provide a model that accurately predicts the performance of the new methods, showing that even larger performance gains are expected in future multicore systems as current trends (decreasing byte:?op ratio and larger sparse matrices) continue.
Questions and AnswersYou need to be logged in to be able to post here.