ACRS: Adjacent Computation Resource Sharing among Partitioned GPU Sub-Cores
Modern GPUs typically segment Streaming Multiprocessors (SMs) into sub-cores (e.g. 4 sub-cores) to reduce power consumption and chip area. However, this partitioned design prevents potential task distributions across sub-cores, impairing overall execution efficiency. In this paper, we explore the pe...

| Published in: | 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7 |
|---|---|
| Main authors: | Song, Penghao; Wang, Chongxi; Han, Chenji; Zhao, Haoyu; Zhang, Tingting; Liu, Tianyi; Wang, Jian |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 22 June 2025 |
| Online access: | Get full text |
| Abstract | Modern GPUs typically partition Streaming Multiprocessors (SMs) into sub-cores (e.g., 4 sub-cores) to reduce power consumption and chip area. However, this partitioned design prevents tasks from being distributed across sub-cores, which impairs overall execution efficiency. In this paper, we explore the performance benefit of sharing hardware resources among sub-cores and identify functional units (FUs) as the critical components for compute-intensive applications. Moreover, our observations reveal that instructions residing in operand collectors can be blocked by busy back-end FUs, while there is a high probability that unoccupied FUs are available in adjacent sub-cores during such blockages. In response, we introduce the adjacent computation resource sharing (ACRS) framework to efficiently utilize these unoccupied units among sub-cores. ACRS has two key modules: Shared FU Issue (SF_ISSUE) and Shared FU Write Back (SF_WriteBack). SF_ISSUE monitors the status of operand collectors and functional units, and offloads instructions from blocked sub-cores to unoccupied resources. Meanwhile, SF_WriteBack routes results back to the original sub-core. To minimize wiring overhead, each sub-core is assigned a fixed target sub-core for sharing. We design a series of matching policies and select the most effective one, the sequential method. Evaluation results show that ACRS improves performance by up to 46.4%, with an average of 14.1%, over the traditional partitioned architecture, while reducing energy consumption by 8.3%. In addition, ACRS achieves a further 12.3% performance improvement compared with the state-of-the-art (SOTA) method. |
|---|---|
| Author | Han, Chenji; Wang, Jian; Wang, Chongxi; Song, Penghao; Zhao, Haoyu; Zhang, Tingting; Liu, Tianyi |
| Author_xml | – sequence: 1 givenname: Penghao surname: Song fullname: Song, Penghao email: songpenghao16@mails.ucas.ac.cn organization: Institute of Computing Technology, CAS,State Key Lab of Processors,Beijing,China – sequence: 2 givenname: Chongxi surname: Wang fullname: Wang, Chongxi email: wangzhongxi15@mails.ucas.ac.cn organization: Institute of Computing Technology, CAS,State Key Lab of Processors,Beijing,China – sequence: 3 givenname: Chenji surname: Han fullname: Han, Chenji email: hanchenji16@mails.ucas.ac.cn organization: Institute of Computing Technology, CAS,State Key Lab of Processors,Beijing,China – sequence: 4 givenname: Haoyu surname: Zhao fullname: Zhao, Haoyu email: zhaohaoyu@loongson.cn organization: Loongson Technology Co. Ltd,Beijing,China – sequence: 5 givenname: Tingting surname: Zhang fullname: Zhang, Tingting email: zhangtingting@loongson.cn organization: Loongson Technology Co. Ltd,Beijing,China – sequence: 6 givenname: Tianyi surname: Liu fullname: Liu, Tianyi email: tianyi.liu@utsa.edu organization: University of Texas at San Antonio,United States – sequence: 7 givenname: Jian surname: Wang fullname: Wang, Jian email: jw@ict.ac.cn organization: Institute of Computing Technology, CAS,State Key Lab of Processors,Beijing,China |
| ContentType | Conference Proceeding |
| DBID | 6IE 6IH CBEJK RIE RIO |
| DOI | 10.1109/DAC63849.2025.11132550 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Proceedings Order Plan (POP) 1998-present by volume; IEEE Xplore All Conference Proceedings; IEEE Electronic Library (IEL); IEEE Proceedings Order Plans (POP) 1998-present |
| DatabaseTitleList | |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE Xplore url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| EISBN | 9798331503048 |
| EndPage | 7 |
| ExternalDocumentID | 11132550 |
| Genre | orig-research |
| GroupedDBID | 6IE 6IH CBEJK RIE RIO |
| ID | FETCH-LOGICAL-a179t-cdafb171e33c347c84115df79fa6acbd5f47290d3a9461a8b35a054ff9a070db3 |
| IEDL.DBID | RIE |
| IngestDate | Wed Oct 01 07:05:15 EDT 2025 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-a179t-cdafb171e33c347c84115df79fa6acbd5f47290d3a9461a8b35a054ff9a070db3 |
| PageCount | 7 |
| ParticipantIDs | ieee_primary_11132550 |
| PublicationCentury | 2000 |
| PublicationDate | 2025-June-22 |
| PublicationDateYYYYMMDD | 2025-06-22 |
| PublicationDate_xml | – month: 06 year: 2025 text: 2025-June-22 day: 22 |
| PublicationDecade | 2020 |
| PublicationTitle | 2025 62nd ACM/IEEE Design Automation Conference (DAC) |
| PublicationTitleAbbrev | DAC |
| PublicationYear | 2025 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| Score | 2.2953556 |
| Snippet | Modern GPUs typically segment Streaming Multiprocessors (SMs) into sub-cores (e.g. 4 sub-cores) to reduce power consumption and chip area. However, this... |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 1 |
| SubjectTerms | Computational efficiency; Computer architecture; Graphics processing units; Hardware; Matched filters; Monitoring; Performance gain; Power demand; Resource management; Wiring |
| Title | ACRS: Adjacent Computation Resource Sharing among Partitioned GPU Sub-Cores |
| URI | https://ieeexplore.ieee.org/document/11132550 |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| link | http://cvtisr.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwlV3LTgMhFCXauHClxjG-w8It7QzPwV0zWk1MmonapLsGuJDoopq-vl-grcaFC3eEQCCX1wHuuQehm1LaQDVoUlltCOeWEwu-JCpeh6hRVDjpstiEGg7r8Vi3G7J65sJ477Pzme-mZP7Lhw-3TE9lvSyLLtINfVcpuSZrbVi_Val7d_0mziae6CdUdLeFf8mm5FNjcPDP9g5R8cO_w-33yXKEdvz0GD31m-eXW9yHd5M8KvFajyEbFm8f4XGKvxyr4KwhhNs0L3IsIsAP7QjHXYI0sRPzAo0G96_NI9lIIRATV8yCODDBVqryjDnGlat5RHIQlA5GGmdBBB5RcgnMaC4rU1smTARjIWgT1zRYdoI609jaKcIMqNI2gLSqiugJLBcgIeIUbmUQyp6hIlli8rmOdjHZGuH8j_wLtJ_sndynKL1EncVs6a_Qnlst3uaz6zxGX_O4k40 |
| linkProvider | IEEE |
| linkToHtml | http://cvtisr.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwlV1LTwMhECammuhJjTW-5eCVdpflUbw1q7Wmtdlom_TWAAOJHqrpw98v0FbjwYO3DVkCGV7fwHzzIXSTCeOpAkVyozRhzDBiwGVEBneIakm5FTaJTcjBoDUeq2pNVk9cGOdcCj5zjfiZ3vLh3S7jVVkzyaLz6KFvcxYcnxVda837zTPVvGuXYT6xSEChvLH5_ZdwSjo3Ovv_bPEA1X8YeLj6PlsO0ZabHqFeu3x-ucVteNMxphKvFBmSafHmGh7HDMyhCk4qQriKMyNlIwL8UI1w2CdIGToxr6NR535YdslaDIHosGYWxIL2Jpe5KwpbMGlbLGA58FJ5LbQ1wD0LODmDQismct0yBdcBjnmvdFjVYIpjVJuG1k4QLoBKZTwII_OAn8AwDgICUmFGeC7NKapHS0w-VvkuJhsjnP1Rfo12u8On_qT_OOido71o-xhMRekFqi1mS3eJduzn4nU-u0rj9QXRlpbU |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=2025+62nd+ACM%2FIEEE+Design+Automation+Conference+%28DAC%29&rft.atitle=ACRS%3A+Adjacent+Computation+Resource+Sharing+among+Partitioned+GPU+Sub-Cores&rft.au=Song%2C+Penghao&rft.au=Wang%2C+Chongxi&rft.au=Han%2C+Chenji&rft.au=Zhao%2C+Haoyu&rft.date=2025-06-22&rft.pub=IEEE&rft.spage=1&rft.epage=7&rft_id=info:doi/10.1109%2FDAC63849.2025.11132550&rft.externalDocID=11132550 |
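
The sharing mechanism described in the abstract can be illustrated with a toy, single-cycle model. This is only a sketch under stated assumptions, not the authors' implementation: the function names `sf_issue`, `sf_write_back`, and `sequential_target` are hypothetical, each sub-core is assumed to have a single functional unit, local issue is assumed to take priority over borrowing, and the "sequential" pairing is assumed to mean that sub-core i borrows from sub-core i+1 (wrapping around). The record does not confirm any of these details.

```python
from __future__ import annotations

from dataclasses import dataclass

NUM_SUBCORES = 4  # the abstract's example partitioning of one SM


def sequential_target(subcore: int) -> int:
    """Assumed fixed pairing: sub-core i may borrow the FU of sub-core i+1."""
    return (subcore + 1) % NUM_SUBCORES


@dataclass(frozen=True)
class Issue:
    origin: int  # sub-core whose operand collector held the instruction
    fu: int      # sub-core whose functional unit executes it


def sf_issue(ready: list[bool], fu_busy: list[bool]) -> list[Issue]:
    """Model one issue cycle.

    ready[i]   -- sub-core i has an instruction with its operands collected
    fu_busy[i] -- sub-core i's FU is still occupied by earlier work
    """
    taken = list(fu_busy)  # FUs unavailable this cycle
    issued: list[Issue] = []

    # Pass 1: instructions issue to their own sub-core's FU when it is free.
    for i in range(NUM_SUBCORES):
        if ready[i] and not taken[i]:
            taken[i] = True
            issued.append(Issue(origin=i, fu=i))

    # Pass 2: blocked instructions try the single fixed target sub-core.
    issued_locally = {x.origin for x in issued}
    for i in range(NUM_SUBCORES):
        if ready[i] and i not in issued_locally:
            target = sequential_target(i)
            if not taken[target]:
                taken[target] = True
                issued.append(Issue(origin=i, fu=target))
            # Otherwise the instruction stalls in its operand collector.
    return issued


def sf_write_back(issued: list[Issue]) -> dict[int, int]:
    """Map each executing FU to the sub-core that must receive its result."""
    return {x.fu: x.origin for x in issued}


if __name__ == "__main__":
    ready = [True, False, True, True]
    fu_busy = [True, False, False, True]
    issued = sf_issue(ready, fu_busy)
    print(issued)
    print(sf_write_back(issued))
```

In this example, sub-core 0 is blocked by its own busy FU and borrows the idle FU of sub-core 1, sub-core 2 issues locally, and sub-core 3 stalls because its fixed target (sub-core 0) is also busy. Restricting each sub-core to one target mirrors the abstract's point about limiting wiring overhead, at the cost of missing idle FUs elsewhere in the SM.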