Performance optimization, modeling and analysis of sparse matrix-matrix products on multi-core and many-core processors

Bibliographic Details
Published in:Parallel Computing Vol. 90; no. C; p. 102545
Main Authors: Nagasaka, Yusuke, Matsuoka, Satoshi, Azad, Ariful, Buluç, Aydın
Format: Journal Article
Language:English
Published: Elsevier B.V, United States, 01.12.2019
ISSN:0167-8191, 1872-7336
Description
Summary:Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware-specific optimizations for multi- and many-core processors are lacking, and a detailed analysis of their performance under various use cases and matrices is not available. We first identify and mitigate multiple bottlenecks with memory management and thread scheduling on Intel Xeon Phi (Knights Landing, or KNL). Specifically targeting many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. We examine their performance together with other publicly available codes. Unlike prior work, our evaluation also includes use cases that are representative of real graph algorithms, such as multi-source breadth-first search and triangle counting. Our hash-table- and heap-based algorithms achieve significant speedups over existing libraries in the majority of cases, while different algorithms dominate other scenarios depending on matrix size, sparsity, compression factor, and operation type. We distill these in-depth evaluation results into a recipe for choosing the best SpGEMM algorithm for a target scenario, and we build performance models for the hash-table- and heap-based algorithms that support the recipe. A critical finding is that hash-table-based SpGEMM gets a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix. Finally, we integrate our implementations into a large-scale protein clustering code named HipMCL, accelerating its SpGEMM kernel by up to 10X and boosting the overall performance of the whole HipMCL application by 2.6X.
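To illustrate the hash-table-based approach the abstract describes, the following is a minimal, hypothetical sketch of row-wise SpGEMM (Gustavson's algorithm) with a per-row hash accumulator, assuming CSR-format inputs represented as `(indptr, indices, data)` triples. It is not the paper's optimized many-core implementation; note that the columns within each output row are left unsorted, which is the case the abstract identifies as giving hash-table-based SpGEMM a significant performance boost.

```python
def spgemm_hash(A, B):
    """Compute C = A @ B for sparse matrices in CSR form.

    A: (a_ptr, a_idx, a_val), shape m x k
    B: (b_ptr, b_idx, b_val), shape k x n
    Returns (c_ptr, c_idx, c_val); column indices within each row
    of C are unsorted (no post-accumulation sorting step).
    """
    a_ptr, a_idx, a_val = A
    b_ptr, b_idx, b_val = B
    m = len(a_ptr) - 1
    c_ptr, c_idx, c_val = [0], [], []
    for i in range(m):
        acc = {}  # hash table: output column index -> accumulated value
        # For each nonzero A[i, j], scale row j of B and merge into acc.
        for jj in range(a_ptr[i], a_ptr[i + 1]):
            j, v = a_idx[jj], a_val[jj]
            for kk in range(b_ptr[j], b_ptr[j + 1]):
                col = b_idx[kk]
                acc[col] = acc.get(col, 0.0) + v * b_val[kk]
        c_idx.extend(acc.keys())
        c_val.extend(acc.values())
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val
```

In an optimized many-core version, the dictionary would be replaced by a preallocated open-addressing hash table per thread, with rows of A distributed across threads.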
Bibliography:USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
Japan Science and Technology Agency (JST)
USDOE National Nuclear Security Administration (NNSA)
AC02-05CH11231; JPMJCR1303; JPMJCR1687
DOI:10.1016/j.parco.2019.102545