Fast multiplication of random dense matrices with sparse matrices

Bibliographic Details
Published in:	Proceedings - IEEE International Parallel and Distributed Processing Symposium, pp. 52-62
Main Authors:	Liang, Tianyu; Murray, Riley; Buluc, Aydin; Demmel, James
Format:	Conference Proceeding
Language:	English
Published:	IEEE 27.05.2024
Subjects:	Distributed processing; HPC; Instruction sets; Libraries; Memory architecture; Memory management; Numerical Linear Algebra; Scalability; Sketching algorithm; Sparse matrices
ISSN:	1530-2075

Description
Summary:	This work focuses on accelerating the multiplication of a dense random matrix with a (fixed) sparse matrix, an operation frequently used in sketching algorithms. We develop a novel scheme that takes advantage of blocking and recomputation (on-the-fly random number generation) to accelerate this operation. The techniques we propose decrease memory movement, thereby increasing the algorithm's parallel scalability in shared memory architectures. On the Intel Frontera architecture, our algorithm can achieve 2x speedups over libraries such as Eigen and Intel MKL on some examples. In addition, with 32 threads, we can obtain a parallel efficiency of up to 45%. We also present a theoretical analysis for the memory movement lower bound of our algorithm, showing that under mild assumptions, it is possible to beat the data movement lower bound of general matrix-matrix multiply (GEMM) by a factor of M, where M is the cache size. Finally, we incorporate our sketching method into a randomized algorithm for overdetermined least squares with sparse data matrices. Our results are competitive with SuiteSparse for highly overdetermined problems; in some cases, we obtain a speedup of 10x over SuiteSparse.
DOI:	10.1109/IPDPS57955.2024.00014
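
Illustrative sketch (not from the paper): the summary describes computing B = S A, with S a dense random matrix and A a fixed sparse matrix, by blocking and by regenerating the entries of S on the fly instead of storing them. The record does not give the authors' kernel, random number generator, or blocking layout, so the pure-Python code below is only a minimal sketch under assumptions. The names `splitmix64`, `sketch_entry`, and `sketch_sparse`, the CSC storage of A, the uniform [-1, 1) distribution, and the row-block size are all hypothetical choices made for illustration.

```python
MASK64 = (1 << 64) - 1


def splitmix64(x):
    """SplitMix64 mixer: maps a 64-bit counter to 64 pseudo-random bits."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


def sketch_entry(seed, i, k, m):
    """Entry S[i, k] of a d-by-m sketching matrix, uniform in [-1, 1).

    The value depends only on (seed, i, k), so it can be recomputed
    whenever it is needed and S never has to be stored in memory.
    (Assumption: the distribution and generator are illustrative.)
    """
    bits = splitmix64(seed ^ (i * m + k))
    return (bits >> 11) / (1 << 52) - 1.0


def sketch_sparse(d, m, n, col_ptr, row_idx, vals, seed=0, block_rows=4):
    """Compute B = S @ A, where A is m-by-n in CSC form and S is d-by-m.

    Rows of B are processed in blocks of `block_rows` so that the active
    block of B is small; within a block, each nonzero A[k, j] is combined
    with freshly regenerated entries S[i, k].
    """
    B = [[0.0] * n for _ in range(d)]
    for i0 in range(0, d, block_rows):
        i1 = min(i0 + block_rows, d)
        for j in range(n):
            for p in range(col_ptr[j], col_ptr[j + 1]):
                k, a_kj = row_idx[p], vals[p]
                for i in range(i0, i1):
                    # Recompute S[i, k] instead of loading it from memory.
                    B[i][j] += sketch_entry(seed, i, k, m) * a_kj
    return B


# Usage: a 4-by-3 sparse matrix with 5 nonzeros, sketched down to 2 rows.
col_ptr = [0, 2, 3, 5]
row_idx = [0, 2, 1, 0, 3]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
B = sketch_sparse(2, 4, 3, col_ptr, row_idx, vals, seed=7, block_rows=1)
```

Because every entry of S is a deterministic function of its index and the seed, the result does not depend on the block size; the block size only changes the order in which memory is touched. A tuned implementation would pick it to match the cache, which this sketch does not attempt.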