A parallel sparse triangular solve algorithm based on dependency elimination of the solution vector.

Gespeichert in:
Bibliographische Detailangaben
Titel: A parallel sparse triangular solve algorithm based on dependency elimination of the solution vector.
Autoren: Jin, Song, Pei, Songwei, Wang, Yu, Qi, Yincheng
Quelle: Cluster Computing; Jun2021, Vol. 24 Issue 2, p1317-1330, 14p
Schlagwörter: PARALLEL algorithms, GRAPHICS processing units, ALGORITHMS, SCIENTIFIC computing, MATRICES (Mathematics), SYNCHRONIZATION
Abstract: Sparse triangular solve (SpTRSV) is an important kernel in many scientific computing applications. In traditional viewpoints, accelerating SpTRSV by parallelizing the solution process is a challenging task. Dependencies among the variables that exist in the solution process not only restrict the parallelism that can be achieved, but also introduce large synchronization overhead among the parallel tasks. Moreover, a time-consuming pre-processing phase is commonly required to identify calculations that can be parallelized. However, we have observed that a large number of dependencies among the variables can be eliminated if we only calculate partial values of the variables first and add them together to obtain the final values later. By using such a strategy, starting to solve a variable does not need to wait for all of its prerequisite variables having been solved. In consequence, parallelism of the SpTRSV can be increased significantly. In this paper, we transform above mentioned observations into a subtree-based parallel algorithm to accelerate SpTRSV. The proposed algorithm calculates partial values of the variable along with an implicit subtree traversal and utilizes hardware atomic operation to implement accumulation of the partial values. This not only introduces no pre-processing overhead, but also avoids any barrier synchronization among the parallel threads. We evaluate the proposed algorithm on 2135 matrices from SuiteSparse Matrix Collection based on a generic GPU platform. Experimental results demonstrate that our scheme outperforms the state-of-the-art GPU and CPU vendor libraries in 1949 and 1782 matrices, respectively. Compared with the latest synchronization-free method, our scheme outperforms in 1779 matrices. [ABSTRACT FROM AUTHOR]
Copyright of Cluster Computing is the property of Springer Nature and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Datenbank: Complementary Index
Beschreibung
Abstract:Sparse triangular solve (SpTRSV) is an important kernel in many scientific computing applications. In traditional viewpoints, accelerating SpTRSV by parallelizing the solution process is a challenging task. Dependencies among the variables that exist in the solution process not only restrict the parallelism that can be achieved, but also introduce large synchronization overhead among the parallel tasks. Moreover, a time-consuming pre-processing phase is commonly required to identify calculations that can be parallelized. However, we have observed that a large number of dependencies among the variables can be eliminated if we only calculate partial values of the variables first and add them together to obtain the final values later. By using such a strategy, starting to solve a variable does not need to wait for all of its prerequisite variables having been solved. In consequence, parallelism of the SpTRSV can be increased significantly. In this paper, we transform above mentioned observations into a subtree-based parallel algorithm to accelerate SpTRSV. The proposed algorithm calculates partial values of the variable along with an implicit subtree traversal and utilizes hardware atomic operation to implement accumulation of the partial values. This not only introduces no pre-processing overhead, but also avoids any barrier synchronization among the parallel threads. We evaluate the proposed algorithm on 2135 matrices from SuiteSparse Matrix Collection based on a generic GPU platform. Experimental results demonstrate that our scheme outperforms the state-of-the-art GPU and CPU vendor libraries in 1949 and 1782 matrices, respectively. Compared with the latest synchronization-free method, our scheme outperforms in 1779 matrices. [ABSTRACT FROM AUTHOR]
ISSN:13867857
DOI:10.1007/s10586-020-03188-x