sLASs: A fully automatic auto-tuned linear algebra library based on OpenMP extensions implemented in OmpSs (LASs Library)

Published in: Journal of Parallel and Distributed Computing, Volume 138, pp. 153-171
Main authors: Valero-Lara, Pedro; Catalán, Sandra; Martorell, Xavier; Usui, Tetsuzo; Labarta, Jesús
Format: Journal Article
Language: English
Published: Elsevier Inc., 01.04.2020
ISSN: 0743-7315, 1096-0848
Description
Summary: In this work we have implemented a novel linear algebra library on top of the task-based runtime OmpSs-2. We have used some of the most advanced OmpSs-2 features, namely weak dependencies and regions, together with the final clause, to implement auto-tunable code for the BLAS-3 trsm routine and the LAPACK routines npgetrf and npgesv. All these implementations are part of the first prototype of the sLASs library, a novel library of auto-tunable codes for linear algebra operations based on the LASs library. In all these cases, the use of the OmpSs-2 features improves execution time with respect to other reference libraries such as the original LASs library, PLASMA, ATLAS and Intel MKL. These codes reduce execution time by about 18% on big matrices, by increasing the IPC of gemm and reducing the time spent on task instantiation. Benefits are also seen for a few medium matrices. For small matrices and a subset of medium matrices, specific optimizations that increase the degree of parallelism in both the gemm and trsm tasks are applied; this strategy achieves a performance improvement of up to 40%.
Highlights:
• Development of a highly optimized auto-tuned library for BLAS-3 and LAPACK operations.
• Use of OmpSs-2 and OpenMP features to implement auto-tunable codes.
• Different strategies depending on the size of the matrices to be computed.
• Close to 20% better performance with respect to our references on big matrices.
• Close to 40% better performance with respect to our references on small matrices.
DOI: 10.1016/j.jpdc.2019.12.002
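
Illustration: the summary above describes decomposing the BLAS-3 trsm routine (and the LAPACK factorization/solve routines) into tasks and using the final clause to control task granularity for auto-tuning. As a rough sketch only, not the sLASs source or its API, the code below expresses a blocked trsm, solving L*X = B in place, with standard OpenMP task dependencies and a final clause. The block layout, the cutoff parameter, the kernel bodies and all function names are assumptions introduced for this example; OmpSs-2 additionally offers weak dependencies and dependency regions that plain OpenMP tasks do not provide.

/*
 * Minimal sketch (not sLASs code): blocked forward substitution solving
 * L * X = B in place, one OpenMP task per block operation. The block size bs,
 * the cutoff final_cutoff and the serial kernels are illustrative assumptions.
 */

/* Solve Lkk * Xk = Bk in place; Lkk is a bs x bs lower-triangular block and
   Bk a bs x n block row, both row-major. Stands in for a tuned trsm kernel. */
static void trsm_block(const double *Lkk, double *Bk, int bs, int n) {
    for (int i = 0; i < bs; ++i)
        for (int j = 0; j < n; ++j) {
            double s = Bk[i * n + j];
            for (int p = 0; p < i; ++p)
                s -= Lkk[i * bs + p] * Bk[p * n + j];
            Bk[i * n + j] = s / Lkk[i * bs + i];
        }
}

/* Bi -= Lik * Bk; stands in for a tuned gemm kernel. */
static void gemm_block(const double *Lik, const double *Bk, double *Bi,
                       int bs, int n) {
    for (int i = 0; i < bs; ++i)
        for (int p = 0; p < bs; ++p) {
            double l = Lik[i * bs + p];
            for (int j = 0; j < n; ++j)
                Bi[i * n + j] -= l * Bk[p * n + j];
        }
}

/*
 * Lblk: nb x nb pointers to bs x bs blocks of L (row-major block indexing).
 * Bblk: nb pointers to bs x n block rows of B; on exit they hold X.
 * The depend clauses order the trsm on block row k before every update that
 * reads it; final() marks small-block tasks as final, so any tasks they would
 * create internally run immediately instead of being instantiated.
 */
void trsm_tasks(double **Lblk, double **Bblk, int nb, int bs, int n,
                int final_cutoff) {
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < nb; ++k) {
        double *Bk = Bblk[k];

        #pragma omp task depend(inout: Bk[0]) final(bs <= final_cutoff)
        trsm_block(Lblk[k * nb + k], Bk, bs, n);

        for (int i = k + 1; i < nb; ++i) {
            double *Bi = Bblk[i];

            #pragma omp task depend(in: Bk[0]) depend(inout: Bi[0]) final(bs <= final_cutoff)
            gemm_block(Lblk[i * nb + k], Bk, Bi, bs, n);
        }
    }   /* implicit barrier at the end of single/parallel completes all tasks */
}

The role of final() in such a scheme is granularity control: below the cutoff, tasks that the block kernels would otherwise spawn are executed immediately, reducing task-instantiation overhead on small blocks. The cutoff is the kind of parameter that, per the summary, an auto-tuner would presumably select depending on matrix size.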