JITSPMM: Just-in-Time Instruction Generation for Accelerated Sparse Matrix-Matrix Multiplication


Detailed bibliography
Published in: Proceedings / International Symposium on Code Generation and Optimization, pp. 448 - 459
Main authors: Fu, Qiang; Rolinger, Thomas B.; Huang, H. Howie
Format: Conference paper
Language: English
Published: IEEE, 2 March 2024
ISSN: 2643-2838
Abstract Achieving high performance for Sparse Matrix-Matrix Multiplication (SpMM) has received increasing research attention, especially on multi-core CPUs, due to the large input data size in applications such as graph neural networks (GNNs). Most existing solutions for SpMM computation follow the ahead-of-time (AOT) compilation approach, which compiles a program entirely before it is executed. AOT compilation for SpMM faces three key limitations: unnecessary memory access, additional branch overhead, and redundant instructions. These limitations stem from the fact that crucial information pertaining to SpMM is not known until runtime. In this paper, we propose JITSpMM, a just-in-time (JIT) assembly code generation framework to accelerate SpMM computation on multi-core CPUs with SIMD extensions. First, JITSpMM integrates the JIT assembly code generation technique into three widely-used workload division methods for SpMM to achieve balanced workload distribution among CPU threads. Next, with the availability of runtime information, JITSpMM employs a novel technique, coarse-grain column merging, to maximize instruction-level parallelism by unrolling the performance-critical loop. Furthermore, JITSpMM intelligently allocates registers to cache frequently accessed data to minimize memory accesses, and employs selected SIMD instructions to enhance arithmetic throughput. We conduct a performance evaluation of JITSpMM and compare it to two AOT baselines. The first involves existing SpMM implementations compiled using the Intel icc compiler with auto-vectorization. The second utilizes the highly-optimized SpMM routine provided by Intel MKL. Our results show that JITSpMM provides an average improvement of 3.8× and 1.4× over these two baselines, respectively.
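For readers unfamiliar with the kernel under discussion: SpMM multiplies a sparse matrix A (commonly stored in CSR format) by a dense matrix B. A minimal AOT-style reference implementation in Python/NumPy, sketched here for illustration and not taken from the paper, looks like this:

```python
import numpy as np

def spmm_csr(indptr, indices, data, B):
    """Reference SpMM: C = A @ B with A stored in CSR form.

    indptr[i]:indptr[i+1] delimits row i's nonzeros; indices holds their
    column ids and data their values. This is the generic AOT-style loop
    whose index loads and per-row branches a JIT approach can specialize away.
    """
    n_rows = len(indptr) - 1
    C = np.zeros((n_rows, B.shape[1]))
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):
            C[i] += data[k] * B[indices[k]]
    return C
```

Note that `indices[k]` must be loaded from memory before the corresponding row of `B` can be touched, and the inner trip count varies per row; these are exactly the runtime-dependent quantities the abstract says an AOT compiler cannot see.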
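The JIT idea itself can be illustrated in spirit only (the paper emits specialized assembly, not Python) by generating, at runtime, a kernel tailored to one matrix's sparsity pattern, so that column indices and values become literals and the per-row loop disappears. The helper name `jit_row_kernel` is ours, not the paper's:

```python
import numpy as np

def jit_row_kernel(indptr, indices, data):
    """Toy JIT: emit Python source specialized to one CSR matrix.

    Each row becomes a straight-line sum with its column ids and values
    baked in as constants, so no index loads or loop branches remain --
    analogous in spirit to JITSpMM's runtime assembly generation.
    """
    lines = ["def kernel(B, C):"]
    for i in range(len(indptr) - 1):
        terms = [f"{float(data[k])} * B[{indices[k]}]"
                 for k in range(indptr[i], indptr[i + 1])]
        lines.append(f"    C[{i}] = " + (" + ".join(terms) or "0.0"))
    namespace = {}
    exec("\n".join(lines), namespace)  # compile the specialized source
    return namespace["kernel"]
```

Calling `jit_row_kernel` once per matrix amortizes the generation cost over the many SpMM invocations typical of GNN training, which is the usual argument for JIT specialization.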
Author Fu, Qiang (charlifu@amd.com), Advanced Micro Devices Inc., Austin, TX, USA
Rolinger, Thomas B. (trolinger@nvidia.com), NVIDIA, Austin, TX, USA
Huang, H. Howie (howie@gwu.edu), George Washington University, Washington, DC, USA
ContentType Conference Proceeding
DOI 10.1109/CGO57630.2024.10444827
Discipline Computer Science
EISBN 9798350395099
EISSN 2643-2838
EndPage 459
ExternalDocumentID 10444827
Genre orig-research
GrantInformation_xml – fundername: National Science Foundation
  grantid: 2127207
  funderid: 10.13039/100000001
ISICitedReferencesCount 3
IsPeerReviewed false
IsScholarly true
Language English
PageCount 12
PublicationCentury 2000
PublicationDate 2024-March-2
PublicationDateYYYYMMDD 2024-03-02
PublicationDecade 2020
PublicationTitle Proceedings / International Symposium on Code Generation and Optimization
PublicationTitleAbbrev CGO
PublicationYear 2024
Publisher IEEE
StartPage 448
SubjectTerms Codes
Just-in-Time Instruction Generation
Optimization
Performance evaluation
Performance Optimization
Performance Profiling
Registers
Runtime
Sparse matrices
SpMM
Throughput
Title JITSPMM: Just-in-Time Instruction Generation for Accelerated Sparse Matrix-Matrix Multiplication
URI https://ieeexplore.ieee.org/document/10444827