Communication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication

Bibliographic details
Published in:	Proceedings - IEEE International Parallel and Distributed Processing Symposium, pp. 842-853
Main authors:	Koanantakool, Penporn; Azad, Ariful; Buluc, Aydin; Morozov, Dmitriy; Oh, Sang-Yun; Oliker, Leonid; Yelick, Katherine
Format:	Conference paper
Language:	English
Published:	IEEE, 1 May 2016
Subjects:	Algorithm design and analysis; communication-avoiding algorithms; Covariance matrices; linear algebra; Machine learning algorithms; Parallel algorithms; Partitioning algorithms; Sparse matrices; sparse-dense matrix-matrix multiplication; Three-dimensional displays
ISSN:	1530-2075

Description
Summary:	Multiplication of a sparse matrix with a dense matrix is a building block of an increasing number of applications in areas such as machine learning and graph algorithms. However, most previous work on parallel matrix multiplication considered only the cases where both operands are dense or both are sparse. This paper analyzes the communication lower bounds and compares the communication costs of various classic parallel algorithms in the context of sparse-dense matrix-matrix multiplication. We also present new communication-avoiding algorithms based on a 1D decomposition, called 1.5D, which, while suboptimal in the dense-dense and sparse-sparse cases, outperform the 2D and 3D variants both theoretically and in practice for sparse-dense multiplication. Our analysis separates one-time costs from per-iteration costs in an iterative machine learning context. Experiments demonstrate speedups of up to 100x over a baseline 3D SUMMA implementation and show parallel scaling over 10 thousand cores.
DOI:	10.1109/IPDPS.2016.117