GoPTX: Fine-grained GPU Kernel Fusion by PTX-level Instruction Flow Weaving

Bibliographic Details
Published in:	2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7
Main Authors:	Wu, Kan; Lin, Zejia; Xi, Mengyue; Zheng, Zhongchun; Pan, Wenxuan; Zhang, Xianwei; Lu, Yutong
Format:	Conference paper
Language:	English
Published:	IEEE, 22 June 2025
Subjects:	Data Hazard; Design automation; Flow graphs; GPU; Graphics processing units; Hardware; Hazards; ILP; Kernel; Kernel Fusion; Parallel processing; Pipelines; Resource management; Warp Stall; Weaving

Description
Summary:	GPUs have been heavily utilized in diverse applications, and numerous approaches, including kernel fusion, have been proposed to boost GPU efficiency through concurrent kernel execution. However, these approaches generally overlook opportunities to mitigate warp stalls and improve instruction-level parallelism (ILP) in inter-kernel resource sharing. To address this issue, we introduce GoPTX, a novel design for kernel fusion that improves ILP by deliberately weaving instructions at the PTX level. GoPTX builds a merged control flow graph (CFG) from the original kernels, which enables the interleaving of instructions that would otherwise execute sequentially and minimizes pipeline stalls caused by data hazards. We further propose a latency-aware instruction weaving algorithm for more efficient instruction scheduling and an adaptive code slicing method to enlarge the scheduling space. Experimental evaluation demonstrates that GoPTX achieves an average speedup of 11.2% over the baseline concurrent execution, with a maximum improvement of 23%. The hardware resource utilization statistics show significant enhancements in eligible warps per cycle and resource use.
DOI:	10.1109/DAC63849.2025.11132627
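
Illustrative sketch:	The summary describes interleaving instructions from two kernels so that one kernel's independent work fills the stall cycles caused by the other's data hazards. The Python toy model below is only a sketch of that general idea and is not the GoPTX algorithm from the paper. It assumes an in-order, single-issue pipeline, straight-line code with no control flow, and disjoint register names across kernels. The names `Instr`, `simulate`, and `weave`, and the 4-cycle latencies, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    dest: str       # register written by the instruction
    srcs: tuple     # registers read by the instruction
    latency: int    # cycles until dest becomes available (made-up value)


def ready_cycle(instr, reg_ready):
    """Earliest cycle at which all source operands are available."""
    return max((reg_ready.get(r, 0) for r in instr.srcs), default=0)


def simulate(order):
    """Toy in-order, single-issue model.

    Returns the cycle at which the last result becomes available.
    """
    reg_ready = {}
    cycle = 0
    finish = 0
    for ins in order:
        # Stall until the operands are ready (read-after-write hazard).
        issue = max(cycle, ready_cycle(ins, reg_ready))
        reg_ready[ins.dest] = issue + ins.latency
        finish = max(finish, issue + ins.latency)
        cycle = issue + 1
    return finish


def weave(a, b):
    """Greedy latency-aware merge of two instruction streams.

    Each stream keeps its internal order. At every step the head
    instruction that can issue earliest is chosen (ties go to stream a).
    """
    reg_ready = {}
    cycle = 0
    i = j = 0
    out = []
    while i < len(a) or j < len(b):
        candidates = []
        if i < len(a):
            candidates.append((max(cycle, ready_cycle(a[i], reg_ready)), 0))
        if j < len(b):
            candidates.append((max(cycle, ready_cycle(b[j], reg_ready)), 1))
        issue, which = min(candidates)
        if which == 0:
            ins = a[i]
            i += 1
        else:
            ins = b[j]
            j += 1
        reg_ready[ins.dest] = issue + ins.latency
        cycle = issue + 1
        out.append(ins)
    return out


# Two hypothetical kernels, each a dependent chain of 4-cycle instructions.
kernel_a = [Instr("a0", (), 4), Instr("a1", ("a0",), 4), Instr("a2", ("a1",), 4)]
kernel_b = [Instr("b0", (), 4), Instr("b1", ("b0",), 4), Instr("b2", ("b1",), 4)]

sequential_cycles = simulate(kernel_a + kernel_b)
woven_cycles = simulate(weave(kernel_a, kernel_b))
print(sequential_cycles, woven_cycles)  # prints "21 13"
```

In this toy model the woven order alternates between the two chains, so each instruction issues while the other chain is waiting on its operand. A real PTX-level tool would also have to handle control flow, register pressure, and the GPU's warp scheduler, none of which are modeled here.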