A Communication-Avoiding 3D LU Factorization Algorithm for Sparse Matrices

Bibliographic details
Published in:	2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 908-919
Main authors:	Sao, Piyush; Li, Xiaoye Sherry; Vuduc, Richard
Format:	Conference paper
Language:	English
Publication details:	IEEE 01.05.2018
Subjects:	communication avoiding algorithm; Matrices; nested dissection; Parallel processing; Particle separators; sparse direct solver; sparse gaussian elimination; Sparse matrices; Three-dimensional displays; Transmission line matrix methods; Two dimensional displays
ISSN:	1530-2075

Description
Summary:	We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed memory systems. Our 3D sparse LU algorithm uses a three-dimensional MPI process grid, aggressively exploits elimination tree parallelism and trades off increased memory for reduced per-process communication. We also analyze the asymptotic improvements for planar graphs (e.g., from 2D grid or mesh domains) and certain non-planar graphs (specifically for 3D grids and meshes). For planar graphs with n vertices, our algorithm reduces communication volume asymptotically in n by a factor of O(log n) and latency by a factor of O(log n). For non-planar cases, our algorithm can reduce the per-process communication volume by 3× and latency by O(n^{1/3}) times. In all cases, the memory needed to achieve these gains is a constant factor. We implemented our algorithm by extending the 2D data structure used in SuperLU. Our new 3D code achieves speedups up to 27× for planar graphs and up to 3.3× for non-planar graphs over the baseline 2D SuperLU when run on 24,000 cores of a Cray XC30.
DOI:	10.1109/IPDPS.2018.00100
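
Illustrative note: the summary mentions a three-dimensional MPI process grid and elimination tree parallelism but does not describe the layout. The sketch below is one plausible reading, not the authors' implementation. It assumes the grid is a stack of pz two-dimensional px × py grids, that pz is a power of two, and that the top of the elimination tree is a complete binary tree with one leaf subtree per 2D grid. The function names `rank_to_coords` and `grids_sharing_ancestor` are hypothetical.

```python
def rank_to_coords(rank, px, py, pz):
    """Map a linear rank to (x, y, z) in a px * py * pz grid.

    Assumed layout: x varies fastest, z slowest, so each value of z
    selects one 2D px * py grid.
    """
    if not 0 <= rank < px * py * pz:
        raise ValueError("rank out of range")
    z, rem = divmod(rank, px * py)
    y, x = divmod(rem, px)
    return x, y, z


def grids_sharing_ancestor(z, level, pz):
    """List the 2D grids (z-layers) under the same ancestor node.

    `level` counts steps above the leaf subtrees: level 0 is the grid's
    own subtree, level log2(pz) is the root. Assumes pz is a power of two.
    """
    width = 1 << level  # number of leaf subtrees below an ancestor at this level
    if pz & (pz - 1) or not 0 <= z < pz or width > pz:
        raise ValueError("invalid z, level, or pz")
    start = (z // width) * width
    return list(range(start, start + width))
```

For example, under these assumptions `rank_to_coords(5, 2, 2, 2)` gives `(1, 0, 1)`, and `grids_sharing_ancestor(5, 1, 8)` gives `[4, 5]`: grids 4 and 5 would need to combine their updates to the ancestor they share one level up, which is where the extra memory for replicated ancestor data would be spent.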