GoPTX: Fine-grained GPU Kernel Fusion by PTX-level Instruction Flow Weaving

Bibliographic Details
Published in:	2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7
Main Authors:	Wu, Kan; Lin, Zejia; Xi, Mengyue; Zheng, Zhongchun; Pan, Wenxuan; Zhang, Xianwei; Lu, Yutong
Format:	Conference Proceeding
Language:	English
Published:	IEEE 22.06.2025
Subjects:	Data Hazard; Design automation; Flow graphs; GPU; Graphics processing units; Hardware; Hazards; ILP; Kernel; Kernel Fusion; Parallel processing; Pipelines; Resource management; Warp Stall; Weaving

Description
Summary:	GPUs have been heavily utilized in diverse applications, and numerous approaches, including kernel fusion, have been proposed to boost GPU efficiency through concurrent kernel execution. However, these approaches generally overlook opportunities to mitigate warp stalls and improve instruction-level parallelism (ILP) in inter-kernel resource sharing. To address this issue, we introduce GoPTX, a novel design for kernel fusion that improves ILP by deliberately weaving instructions at the PTX level. GoPTX builds a merged control flow graph (CFG) from the original kernels, which enables the interleaving of instructions that would otherwise execute sequentially and minimizes pipeline stalls caused by data hazards. We further propose a latency-aware instruction weaving algorithm for more efficient instruction scheduling and an adaptive code slicing method to enlarge the scheduling space. Experimental evaluation demonstrates that GoPTX achieves an average speedup of 11.2% over the baseline concurrent execution, with a maximum improvement of 23%. Hardware resource utilization statistics show significant improvements in eligible warps per cycle and resource use.
DOI:	10.1109/DAC63849.2025.11132627
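
Illustration: the summary's core idea, interleaving instructions from two kernels so that one kernel's independent work fills the stall slots created by the other's data hazards, can be sketched with a toy model. The code below is not GoPTX's algorithm and does not operate on real PTX. It assumes a single-issue, in-order pipeline, and the register names, latencies, and the greedy "issue whichever head instruction is ready first" rule are all hypothetical choices made for the example.

```python
from typing import Dict, List, Tuple

# Toy instruction: (destination register, source registers, latency in cycles).
# Register names and latencies are hypothetical, chosen only for illustration.
Instr = Tuple[str, Tuple[str, ...], int]


def earliest_issue(ins: Instr, cycle: int, ready: Dict[str, int]) -> int:
    """First cycle at which `ins` can issue: issue slot free and all sources ready."""
    _, srcs, _ = ins
    return max([cycle] + [ready.get(s, 0) for s in srcs])


def simulate(stream: List[Instr]) -> int:
    """Cycle at which the last result is ready on an in-order, single-issue pipeline."""
    ready: Dict[str, int] = {}
    cycle = 0
    finish = 0
    for ins in stream:
        dst, _, latency = ins
        issue = earliest_issue(ins, cycle, ready)  # stalls here on a data hazard
        ready[dst] = issue + latency
        finish = max(finish, ready[dst])
        cycle = issue + 1
    return finish


def weave(a: List[Instr], b: List[Instr]) -> List[Instr]:
    """Greedy latency-aware merge that preserves each stream's internal order."""
    ready: Dict[str, int] = {}
    cycle = 0
    i = j = 0
    out: List[Instr] = []
    while i < len(a) or j < len(b):
        heads = []
        if i < len(a):
            heads.append((earliest_issue(a[i], cycle, ready), 0))
        if j < len(b):
            heads.append((earliest_issue(b[j], cycle, ready), 1))
        issue, which = min(heads)  # pick the head that can issue soonest
        if which == 0:
            ins = a[i]
            i += 1
        else:
            ins = b[j]
            j += 1
        ready[ins[0]] = issue + ins[2]
        cycle = issue + 1
        out.append(ins)
    return out


# Two independent toy "kernels": a long-latency load followed by dependent ops.
KERNEL_A: List[Instr] = [("a0", (), 4), ("a1", ("a0",), 1), ("a2", ("a1",), 1)]
KERNEL_B: List[Instr] = [("b0", (), 4), ("b1", ("b0",), 1), ("b2", ("b1",), 1)]

sequential_cycles = simulate(KERNEL_A + KERNEL_B)
woven = weave(KERNEL_A, KERNEL_B)
woven_cycles = simulate(woven)
```

In this toy setting the woven order issues the second kernel's load while the first kernel's load is still in flight, so `woven_cycles` comes out lower than `sequential_cycles`. Real GPUs add warp scheduling, multiple pipelines, and control flow, which is where the paper's merged CFG and code slicing come in.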