SynGPU: Synergizing CUDA and Bit-Serial Tensor Cores for Vision Transformer Acceleration on GPU


Detailed Bibliography
Published in: 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7
Main Authors: Yao, Yuanzheng; Zhang, Chen; Qi, Chunyu; Chen, Ruiyang; Wang, Jun; Fu, Zhihui; Jing, Naifeng; Liang, Xiaoyao; Song, Zhuoran
Format: Conference Proceeding
Language: English
Published: IEEE, 22 June 2025
Subjects:
Online Access: Get full text
Abstract: Vision Transformers (ViTs) have demonstrated remarkable performance in computer vision tasks by effectively extracting global features. However, their self-attention mechanism suffers from quadratic time and memory complexity as image resolution or video duration increases, leading to inefficiency on GPUs. To accelerate ViTs, existing works mainly focus on pruning tokens based on value-level sparsity. However, they miss the chance to achieve peak performance because they overlook bit-level sparsity. Instead, we propose the Inter-token Bit-sparsity Awareness (IBA) algorithm to accelerate ViTs by exploiting bit-sparsity across similar tokens. Next, we implement IBA on GPUs that synergize CUDA and Tensor Cores by addressing two issues: first, bandwidth congestion in the Register File hinders the parallel operation of CUDA and Tensor Cores; second, because the exponents of floating-point vectors vary, it is hard to accelerate bit-sparse matrix multiplication and accumulation (MMA) in the Tensor Core with fixed-point-based bit-level circuits. Therefore, we present SynGPU, an algorithm-hardware co-design framework to accelerate ViTs. SynGPU enhances data reuse through a novel data mapping that enables full parallelism of CUDA and Tensor Cores. Moreover, it introduces a Bit-Serial Tensor Core (BSTC) that supports fixed- and floating-point MMA by combining a fixed-point Bit-Serial Dot Product (BSDP) with exponent-alignment techniques. Extensive experiments show that SynGPU achieves an average of 2.15×-3.95× speedup and 2.49×-3.81× compute density over an A100 GPU.
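The abstract's core mechanism, a fixed-point Bit-Serial Dot Product combined with exponent alignment so that floating-point values can be accumulated in bit-level circuits, can be sketched in software. This is a hypothetical illustration of the general technique, not the paper's BSTC hardware: the function name, the 8-bit mantissa width, and the truncating quantization are all assumptions.

```python
# Sketch of a bit-serial dot product (BSDP) with exponent alignment.
# A zero bit in an activation's mantissa costs no work, so bit-level
# sparsity directly reduces the number of add operations.
import math

def bit_serial_dot(acts, weights, frac_bits=8):
    """Approximate dot(acts, weights) via bit-serial accumulation.

    Activations are aligned to a shared (maximum) exponent so their
    mantissas can be accumulated with fixed-point arithmetic, then the
    shared scale is undone at the end.
    """
    # Exponent alignment: pick one exponent for the whole vector.
    max_exp = max(math.frexp(a)[1] for a in acts)
    acc = 0
    for a, w in zip(acts, weights):
        m, e = math.frexp(a)  # a = m * 2**e, with 0.5 <= |m| < 1
        # Quantize the mantissa and shift it down to the shared exponent.
        q = int(abs(m) * (1 << frac_bits)) >> (max_exp - e)
        sign = -1 if a < 0 else 1
        bit_pos = 0
        while q:  # iterate only while set bits remain
            if q & 1:
                acc += sign * w * (1 << bit_pos)
            q >>= 1
            bit_pos += 1
    # Undo the shared fixed-point scaling of 2**(frac_bits - max_exp).
    return acc * 2.0 ** (max_exp - frac_bits)
```

With exactly representable inputs the result is exact, e.g. `bit_serial_dot([0.5, 0.25, -0.125], [2.0, 4.0, 8.0])` returns `1.0`; in general the truncating quantization introduces a small error controlled by `frac_bits`, mirroring the precision/serial-cycle trade-off inherent to bit-serial arithmetic.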
Authors:
1. Yao, Yuanzheng (yzh_yao@sjtu.edu.cn), Shanghai Jiao Tong University, Shanghai, China
2. Zhang, Chen, Shanghai Jiao Tong University, Shanghai, China
3. Qi, Chunyu, Shanghai Jiao Tong University, Shanghai, China
4. Chen, Ruiyang, Shanghai Jiao Tong University, Shanghai, China
5. Wang, Jun, OPPO Research Institute
6. Fu, Zhihui, OPPO Research Institute
7. Jing, Naifeng, Shanghai Jiao Tong University, Shanghai, China
8. Liang, Xiaoyao, Shanghai Jiao Tong University, Shanghai, China
9. Song, Zhuoran (songzhuoran@sjtu.edu.cn), Shanghai Jiao Tong University, Shanghai, China
ContentType Conference Proceeding
DOI 10.1109/DAC63849.2025.11132753
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
EISBN 9798331503048
EndPage 7
ExternalDocumentID 11132753
Genre orig-research
GrantInformation_xml – fundername: National Natural Science Foundation of China
  funderid: 10.13039/501100001809
IsPeerReviewed false
IsScholarly true
Language English
PageCount 7
ParticipantIDs ieee_primary_11132753
PublicationCentury 2000
PublicationDate 2025-June-22
PublicationDateYYYYMMDD 2025-06-22
PublicationDate_xml – month: 06
  year: 2025
  text: 2025-June-22
  day: 22
PublicationDecade 2020
PublicationTitle 2025 62nd ACM/IEEE Design Automation Conference (DAC)
PublicationTitleAbbrev DAC
PublicationYear 2025
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 1
SubjectTerms Computer vision
Feature extraction
Graphics processing units
Parallel processing
Registers
Tensors
Transformer cores
Transformers
Vectors
Videos
Title SynGPU: Synergizing CUDA and Bit-Serial Tensor Cores for Vision Transformer Acceleration on GPU
URI https://ieeexplore.ieee.org/document/11132753