GoPTX: Fine-grained GPU Kernel Fusion by PTX-level Instruction Flow Weaving

Bibliographic Details
Published in:	2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7
Main Authors:	Wu, Kan; Lin, Zejia; Xi, Mengyue; Zheng, Zhongchun; Pan, Wenxuan; Zhang, Xianwei; Lu, Yutong
Format:	Conference Paper
Language:	English
Published:	IEEE, 22 June 2025
Subjects:	Data Hazard; Design automation; Flow graphs; GPU; Graphics processing units; Hardware; Hazards; ILP; Kernel; Kernel Fusion; Parallel processing; Pipelines; Resource management; Warp Stall; Weaving

Description
Summary:	GPUs have been heavily utilized in diverse applications, and numerous approaches, including kernel fusion, have been proposed to boost GPU efficiency through concurrent kernel execution. However, these approaches generally overlook opportunities to mitigate warp stalls and improve instruction-level parallelism (ILP) in inter-kernel resource sharing. To address this issue, we introduce GoPTX, a novel kernel fusion design that improves ILP by deliberately weaving instructions at the PTX level. GoPTX builds a merged control flow graph (CFG) from the original kernels, which enables the interleaving of instructions that would otherwise execute sequentially and minimizes pipeline stalls caused by data hazards. We further propose a latency-aware instruction weaving algorithm for more efficient instruction scheduling and an adaptive code slicing method to enlarge the scheduling space. Experimental evaluation demonstrates that GoPTX achieves an average speedup of 11.2% over the baseline concurrent execution, with a maximum improvement of 23%. Hardware resource utilization statistics show significant improvements in eligible warps per cycle and resource use.
DOI:	10.1109/DAC63849.2025.11132627
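
Illustrative note: the summary describes interleaving instructions from two kernels so that one kernel's independent work fills the stall cycles caused by the other's data hazards. The toy sketch below shows that general idea only. It is not the GoPTX algorithm, and it ignores control flow, code slicing, and real PTX semantics. The `Instr` type, the latency values, and the greedy `weave` heuristic are all assumptions made for illustration, on a hypothetical single-issue, in-order pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str      # label, for display only
    dst: str       # register written by this instruction
    srcs: tuple    # registers read by this instruction
    latency: int   # assumed cycles until dst becomes available


def issue_cycle(ins, cycle, ready):
    """Earliest cycle at which `ins` can issue, given operand readiness."""
    return max([cycle] + [ready.get(s, 0) for s in ins.srcs])


def run_cycles(stream):
    """Cycles needed to issue `stream` in order, stalling on data hazards."""
    ready, cycle = {}, 0
    for ins in stream:
        start = issue_cycle(ins, cycle, ready)
        ready[ins.dst] = start + ins.latency
        cycle = start + 1
    return cycle


def weave(a, b):
    """Greedy merge: always take the head instruction (of a or b) that can
    issue earliest. Per-kernel order is preserved; ties go to kernel a."""
    ready, cycle, i, j, out = {}, 0, 0, 0, []
    while i < len(a) or j < len(b):
        heads = []
        if i < len(a):
            heads.append((issue_cycle(a[i], cycle, ready), 0))
        if j < len(b):
            heads.append((issue_cycle(b[j], cycle, ready), 1))
        start, which = min(heads)
        if which == 0:
            ins, i = a[i], i + 1
        else:
            ins, j = b[j], j + 1
        ready[ins.dst] = start + ins.latency
        cycle = start + 1
        out.append(ins)
    return out


# Two hypothetical kernels: a long-latency load followed by dependent ALU ops.
# Register names are disjoint, as they would be after renaming in a fused kernel.
kernel_a = [
    Instr("ld.a", "a0", (), 4),
    Instr("add.a", "a1", ("a0",), 1),
    Instr("mul.a", "a2", ("a1",), 1),
]
kernel_b = [
    Instr("ld.b", "b0", (), 4),
    Instr("add.b", "b1", ("b0",), 1),
    Instr("mul.b", "b2", ("b1",), 1),
]

sequential = run_cycles(kernel_a + kernel_b)
woven = weave(kernel_a, kernel_b)
interleaved = run_cycles(woven)
```

Under these assumed latencies, running the two kernels back to back takes 12 issue cycles, while the woven order (`ld.a, ld.b, add.a, mul.a, add.b, mul.b`) takes 8, because the second load issues during the stall that follows the first.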