JITSPMM: Just-in-Time Instruction Generation for Accelerated Sparse Matrix-Matrix Multiplication
Saved in:
| Published in: | Proceedings / International Symposium on Code Generation and Optimization, pp. 448-459 |
|---|---|
| Main authors: | Fu, Qiang; Rolinger, Thomas B.; Huang, H. Howie |
| Medium: | Conference paper |
| Language: | English |
| Published: | IEEE, 02.03.2024 |
| ISSN: | 2643-2838 |
| Online access: | Get full text |
| Abstract | Achieving high performance for Sparse Matrix-Matrix Multiplication (SpMM) has received increasing research attention, especially on multi-core CPUs, due to the large input data size in applications such as graph neural networks (GNNs). Most existing solutions for SpMM computation follow the ahead-of-time (AOT) compilation approach, which compiles a program entirely before it is executed. AOT compilation for SpMM faces three key limitations: unnecessary memory access, additional branch overhead, and redundant instructions. These limitations stem from the fact that crucial information pertaining to SpMM is not known until runtime. In this paper, we propose JITSpMM, a just-in-time (JIT) assembly code generation framework to accelerate SpMM computation on multi-core CPUs with SIMD extensions. First, JITSpMM integrates the JIT assembly code generation technique into three widely used workload division methods for SpMM to achieve balanced workload distribution among CPU threads. Next, with the availability of runtime information, JITSpMM employs a novel technique, coarse-grain column merging, to maximize instruction-level parallelism by unrolling the performance-critical loop. Furthermore, JITSpMM intelligently allocates registers to cache frequently accessed data to minimize memory accesses, and employs selected SIMD instructions to enhance arithmetic throughput. We conduct a performance evaluation of JITSpMM and compare it with two AOT baselines. The first involves existing SpMM implementations compiled using the Intel icc compiler with auto-vectorization. The second utilizes the highly optimized SpMM routine provided by Intel MKL. Our results show that JITSpMM provides an average improvement of 3.8× and 1.4×, respectively. |
|---|---|
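For context, the computation the abstract refers to is SpMM with the sparse matrix in a compressed format such as CSR. The following is a minimal illustrative sketch (not the paper's JIT-generated kernel; all names are assumptions) showing why the inner loop's trip count is only known at runtime, which is the information an AOT compiler cannot exploit:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal CSR SpMM sketch: C = A * B, where A is m-by-? sparse (CSR)
   and B, C are dense with n columns (row-major).
   Illustrative only; JITSpMM instead emits specialized assembly at
   runtime once row lengths and n are known. */
static void spmm_csr(size_t m, size_t n,
                     const size_t *row_ptr, const size_t *col_idx,
                     const double *vals, const double *B, double *C) {
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++)
            C[i * n + j] = 0.0;
        /* Trip count row_ptr[i+1] - row_ptr[i] varies per row and is
           unknown until runtime, limiting AOT unrolling decisions. */
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            double a = vals[k];
            const double *brow = &B[col_idx[k] * n];
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += a * brow[j];
        }
    }
}
```

A JIT approach can specialize this kernel per input: unroll the nonzero loop to the actual row lengths, keep hot accumulators in registers, and select SIMD instructions for the known value of n.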
| Author | Fu, Qiang; Rolinger, Thomas B.; Huang, H. Howie |
| Author details | Fu, Qiang (Advanced Micro Devices Inc., Austin, TX, USA; charlifu@amd.com); Rolinger, Thomas B. (NVIDIA, Austin, TX, USA; trolinger@nvidia.com); Huang, H. Howie (George Washington University, Washington, DC, USA; howie@gwu.edu) |
| ContentType | Conference Proceeding |
| DOI | 10.1109/CGO57630.2024.10444827 |
| Discipline | Computer Science |
| EISBN | 9798350395099 |
| EISSN | 2643-2838 |
| EndPage | 459 |
| Genre | orig-research |
| Grant information | National Science Foundation, grant 2127207 |
| StartPage | 448 |
| Subject terms | Codes; Just-in-Time Instruction Generation; Optimization; Performance evaluation; Performance Optimization; Performance Profiling; Registers; Runtime; Sparse matrices; SpMM; Throughput |
| URI | https://ieeexplore.ieee.org/document/10444827 |