A Fine-grained Prefetching Scheme for DGEMM Kernels on GPU with Auto-tuning Compatibility

Bibliographic details
Published in:	Proceedings - IEEE International Parallel and Distributed Processing Symposium, pp. 863-874
Main authors:	Li, Jialin; Ye, Huang; Tian, Shaobo; Li, Xinyuan; Zhang, Jian
Format:	Conference paper
Language:	English
Published:	IEEE 01.05.2022
Subjects:	AMD GCN Architecture; DGEMM; Graphics processing units; High performance computing; Libraries; Mathematical models; Parallel processing; Performance gain; Prefetching; Register; TLP; Workgroup Parallelism
ISSN:	1530-2075

Description
Summary:	General Matrix Multiplication (GEMM) is one of the fundamental kernels for scientific and high-performance computing. When optimizing the performance of GEMM on GPU, the matrix is usually partitioned into a hierarchy of tiles to fit the thread hierarchy. In practice, the thread-level parallelism is affected not only by the tiling scheme but also by the resources that each tile consumes, such as registers and local data share memory. This paper presents a fine-grained prefetching scheme that improves the thread-level parallelism by balancing the usage of such resources. The gain and loss in instruction-level and thread-level parallelism are analyzed, and a mathematical model is developed to estimate the overall performance gain. Moreover, the proposed scheme is integrated into the open-source tool Tensile to automatically generate assembly and tune a collection of kernels to maximize the performance of DGEMM for a family of problem sizes. Experiments show a speedup of about 1.10x on a wide range of matrix sizes for both single and batched matrix-matrix multiplication.
DOI:	10.1109/IPDPS53621.2022.00089