A Fine-grained Prefetching Scheme for DGEMM Kernels on GPU with Auto-tuning Compatibility

Bibliographic Details
Published in:	Proceedings - IEEE International Parallel and Distributed Processing Symposium, pp. 863-874
Main Authors:	Li, Jialin; Ye, Huang; Tian, Shaobo; Li, Xinyuan; Zhang, Jian
Format:	Conference Proceeding
Language:	English
Published:	IEEE, 1 May 2022
Subjects:	AMD GCN Architecture; DGEMM; Graphics processing units; High performance computing; Libraries; Mathematical models; Parallel processing; Performance gain; Prefetching; Register; TLP; Workgroup Parallelism
ISSN:	1530-2075
Online Access:	Full text

Description
Summary:	General Matrix Multiplication (GEMM) is one of the fundamental kernels for scientific and high-performance computing. When optimizing the performance of GEMM on GPU, the matrix is usually partitioned into a hierarchy of tiles to fit the thread hierarchy. In practice, the thread-level parallelism is affected not only by the tiling scheme but also by the resources that each tile consumes, such as registers and local data share memory. This paper presents a fine-grained prefetching scheme that improves the thread-level parallelism by balancing the usage of such resources. The gain and loss in instruction-level and thread-level parallelism are analyzed, and a mathematical model is developed to estimate the overall performance gain. Moreover, the proposed scheme is integrated into the open-source tool Tensile to automatically generate assembly and tune a collection of kernels to maximize the performance of DGEMM for a family of problem sizes. Experiments show about 1.10X performance speedup on a wide range of matrix sizes for both single and batched matrix-matrix multiplication.
DOI:	10.1109/IPDPS53621.2022.00089