ACRS: Adjacent Computation Resource Sharing among Partitioned GPU Sub-Cores

Bibliographic details
Published in:	2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7
Main authors:	Song, Penghao; Wang, Chongxi; Han, Chenji; Zhao, Haoyu; Zhang, Tingting; Liu, Tianyi; Wang, Jian
Format:	Conference paper
Language:	English
Published:	IEEE, 22 June 2025
Subjects:	Computational efficiency; Computer architecture; Graphics processing units; Hardware; Matched filters; Monitoring; Performance gain; Power demand; Resource management; Wiring

Abstract	Modern GPUs typically segment Streaming Multiprocessors (SMs) into sub-cores (e.g. 4 sub-cores) to reduce power consumption and chip area. However, this partitioned design prevents tasks from being distributed across sub-cores, impairing overall execution efficiency. In this paper, we explore the performance benefit of sharing hardware resources among sub-cores and identify functional units (FUs) as the critical components for compute-intensive applications. Moreover, our observations reveal that instructions residing in operand collectors can be blocked by busy back-end FUs, yet there is a high probability that unoccupied FUs are available in adjacent sub-cores during such blockages. In response, we introduce the adjacent computation resource sharing (ACRS) framework to efficiently utilize these unoccupied units among sub-cores. ACRS has two key modules: Shared FU Issue (SF_ISSUE) and Shared FU Write Back (SF_WriteBack). SF_ISSUE monitors the status of operand collectors and functional units, and offloads instructions from blocked sub-cores to unoccupied resources. Meanwhile, SF_WriteBack routes results back to the original sub-core. To minimize wiring overhead, each sub-core is assigned a fixed target core for sharing. We design a series of matching policies and identify the sequential method as the most effective. Evaluation results show that ACRS improves performance by up to 46.4%, with an average of 14.1%, over the traditional partitioned architecture, while reducing energy consumption by 8.3%. In addition, ACRS achieves an additional 12.3% performance improvement compared with the state-of-the-art (SOTA) method.
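The issue and write-back flow described in the abstract can be illustrated with a small cycle-level sketch. This is not the authors' implementation: the function names (`sequential_target`, `sf_issue`, `sf_writeback`), the one-FU-per-sub-core model, the local-first priority, and the reading of the "sequential" policy as "sub-core i borrows only from sub-core i+1, wrapping around" are all assumptions made here for illustration.

```python
from typing import List, Optional, Tuple

NUM_SUBCORES = 4  # the abstract's example SM has 4 sub-cores


def sequential_target(subcore: int, num_subcores: int = NUM_SUBCORES) -> int:
    """Assumed 'sequential' matching: sub-core i may borrow only from sub-core i+1 (wrapping)."""
    return (subcore + 1) % num_subcores


def sf_issue(ready: List[bool], fu_free: List[bool]) -> List[Optional[int]]:
    """Decide, for one cycle, which sub-core's FU executes each ready instruction.

    ready[i]   -- sub-core i has an instruction waiting in its operand collector
    fu_free[i] -- sub-core i's functional unit can accept an instruction this cycle
    Entry i of the result is the sub-core whose FU runs sub-core i's instruction,
    or None if nothing issues for sub-core i.
    """
    n = len(ready)
    free = list(fu_free)  # copy so the caller's state is not mutated
    issued: List[Optional[int]] = [None] * n

    # Pass 1 (assumption): a sub-core's own instruction has priority on its own FU.
    for i in range(n):
        if ready[i] and free[i]:
            issued[i] = i
            free[i] = False

    # Pass 2: blocked sub-cores try the single fixed neighbour they are wired to.
    for i in range(n):
        if ready[i] and issued[i] is None:
            target = sequential_target(i, n)
            if free[target]:
                issued[i] = target
                free[target] = False
    return issued


def sf_writeback(issued: List[Optional[int]]) -> List[Tuple[int, int]]:
    """List (executing_subcore, home_subcore) pairs whose results must be routed back."""
    return [(fu, home) for home, fu in enumerate(issued)
            if fu is not None and fu != home]


# Sub-cores 0 and 3 are blocked; sub-core 1 is idle with a free FU.
issued = sf_issue(ready=[True, False, True, True],
                  fu_free=[False, True, True, False])
print(issued)                # [1, None, 2, None]
print(sf_writeback(issued))  # [(1, 0)]
```

In this example sub-core 0 offloads to its fixed neighbour, sub-core 1, while sub-core 3 stays blocked because its only permitted target (sub-core 0) is busy. That trade-off follows from the fixed-target wiring the abstract mentions: one candidate per sub-core keeps wiring cheap at the cost of some missed sharing opportunities.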
Author details
1	Song, Penghao (songpenghao16@mails.ucas.ac.cn), Institute of Computing Technology, CAS, State Key Lab of Processors, Beijing, China
2	Wang, Chongxi (wangzhongxi15@mails.ucas.ac.cn), Institute of Computing Technology, CAS, State Key Lab of Processors, Beijing, China
3	Han, Chenji (hanchenji16@mails.ucas.ac.cn), Institute of Computing Technology, CAS, State Key Lab of Processors, Beijing, China
4	Zhao, Haoyu (zhaohaoyu@loongson.cn), Loongson Technology Co. Ltd, Beijing, China
5	Zhang, Tingting (zhangtingting@loongson.cn), Loongson Technology Co. Ltd, Beijing, China
6	Liu, Tianyi (tianyi.liu@utsa.edu), University of Texas at San Antonio, United States
7	Wang, Jian (jw@ict.ac.cn), Institute of Computing Technology, CAS, State Key Lab of Processors, Beijing, China
DOI	10.1109/DAC63849.2025.11132550
EISBN	9798331503048
URI	https://ieeexplore.ieee.org/document/11132550