SynGPU: Synergizing CUDA and Bit-Serial Tensor Cores for Vision Transformer Acceleration on GPU


Detailed Bibliography
Published in: 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7
Main Authors: Yao, Yuanzheng; Zhang, Chen; Qi, Chunyu; Chen, Ruiyang; Wang, Jun; Fu, Zhihui; Jing, Naifeng; Liang, Xiaoyao; Song, Zhuoran
Format: Conference Proceeding
Language: English
Published: IEEE, 22 June 2025
Subjects:
Online Access: Get full text
Abstract: Vision Transformers (ViTs) have demonstrated remarkable performance in computer vision tasks by effectively extracting global features. However, their self-attention mechanism suffers from quadratic time and memory complexity as image resolution or video duration increases, leading to inefficiency on GPUs. To accelerate ViTs, existing works mainly focus on pruning tokens based on value-level sparsity. However, they miss the chance to achieve peak performance because they overlook bit-level sparsity. Instead, we propose the Inter-token Bit-sparsity Awareness (IBA) algorithm to accelerate ViTs by exploiting bit-sparsity across similar tokens. Next, we implement IBA on GPUs that synergize CUDA and Tensor Cores by addressing two issues: first, bandwidth congestion in the Register File hinders the parallel operation of CUDA and Tensor Cores; second, because the exponents of floating-point vectors vary, it is hard to accelerate bit-sparse matrix multiplication and accumulation (MMA) in the Tensor Core with fixed-point-based bit-level circuits. Therefore, we present SynGPU, an algorithm-hardware co-design framework to accelerate ViTs. SynGPU enhances data reuse through a novel data mapping that enables full parallelism of CUDA and Tensor Cores. Moreover, it introduces a Bit-Serial Tensor Core (BSTC) that supports fixed- and floating-point MMA by combining a fixed-point Bit-Serial Dot Product (BSDP) with exponent-alignment techniques. Extensive experiments show that SynGPU achieves an average of 2.15×-3.95× speedup and 2.49×-3.81× compute density over an A100 GPU.
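The abstract's core mechanism, a fixed-point Bit-Serial Dot Product combined with exponent alignment so that floating-point values can be accumulated in bit-level circuits, can be sketched in software. This is a hypothetical illustration of the general technique, not the paper's BSTC hardware: the function name, the 8-bit mantissa width, and the truncating quantization are all assumptions.

```python
# Sketch of a bit-serial dot product (BSDP) with exponent alignment.
# A zero bit in an activation's mantissa costs no work, so bit-level
# sparsity directly reduces the number of add operations.
import math

def bit_serial_dot(acts, weights, frac_bits=8):
    """Approximate dot(acts, weights) via bit-serial accumulation.

    Activations are aligned to a shared (maximum) exponent so their
    mantissas can be accumulated with fixed-point arithmetic, then the
    shared scale is undone at the end.
    """
    # Exponent alignment: pick one exponent for the whole vector.
    max_exp = max(math.frexp(a)[1] for a in acts)
    acc = 0
    for a, w in zip(acts, weights):
        m, e = math.frexp(a)  # a = m * 2**e, with 0.5 <= |m| < 1
        # Quantize the mantissa and shift it down to the shared exponent.
        q = int(abs(m) * (1 << frac_bits)) >> (max_exp - e)
        sign = -1 if a < 0 else 1
        bit_pos = 0
        while q:  # iterate only while set bits remain
            if q & 1:
                acc += sign * w * (1 << bit_pos)
            q >>= 1
            bit_pos += 1
    # Undo the shared fixed-point scaling of 2**(frac_bits - max_exp).
    return acc * 2.0 ** (max_exp - frac_bits)
```

With exactly representable inputs the result is exact, e.g. `bit_serial_dot([0.5, 0.25, -0.125], [2.0, 4.0, 8.0])` returns `1.0`; in general the truncating quantization introduces a small error controlled by `frac_bits`, mirroring the precision/serial-cycle trade-off inherent to bit-serial arithmetic.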
Authors:
1. Yao, Yuanzheng (yzh_yao@sjtu.edu.cn), Shanghai Jiao Tong University, Shanghai, China
2. Zhang, Chen, Shanghai Jiao Tong University, Shanghai, China
3. Qi, Chunyu, Shanghai Jiao Tong University, Shanghai, China
4. Chen, Ruiyang, Shanghai Jiao Tong University, Shanghai, China
5. Wang, Jun, OPPO Research Institute
6. Fu, Zhihui, OPPO Research Institute
7. Jing, Naifeng, Shanghai Jiao Tong University, Shanghai, China
8. Liang, Xiaoyao, Shanghai Jiao Tong University, Shanghai, China
9. Song, Zhuoran (songzhuoran@sjtu.edu.cn), Shanghai Jiao Tong University, Shanghai, China
ContentType Conference Proceeding
DOI 10.1109/DAC63849.2025.11132753
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
EISBN 9798331503048
EndPage 7
ExternalDocumentID 11132753
Genre orig-research
GrantInformation_xml – fundername: National Natural Science Foundation of China
  funderid: 10.13039/501100001809
IsPeerReviewed false
IsScholarly true
Language English
PageCount 7
ParticipantIDs ieee_primary_11132753
PublicationCentury 2000
PublicationDate 2025-June-22
PublicationDateYYYYMMDD 2025-06-22
PublicationDate_xml – month: 06
  year: 2025
  text: 2025-June-22
  day: 22
PublicationDecade 2020
PublicationTitle 2025 62nd ACM/IEEE Design Automation Conference (DAC)
PublicationTitleAbbrev DAC
PublicationYear 2025
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 1
SubjectTerms Computer vision
Feature extraction
Graphics processing units
Parallel processing
Registers
Tensors
Transformer cores
Transformers
Vectors
Videos
Title SynGPU: Synergizing CUDA and Bit-Serial Tensor Cores for Vision Transformer Acceleration on GPU
URI https://ieeexplore.ieee.org/document/11132753