JITSPMM: Just-in-Time Instruction Generation for Accelerated Sparse Matrix-Matrix Multiplication

Detailed bibliography
Published in: Proceedings / International Symposium on Code Generation and Optimization, pp. 448–459
Main authors: Fu, Qiang; Rolinger, Thomas B.; Huang, H. Howie
Format: Conference paper
Language: English
Published: IEEE, March 2, 2024
Subjects:
ISSN: 2643-2838
Online access: Get full text
Abstract Achieving high performance for Sparse Matrix-Matrix Multiplication (SpMM) has received increasing research attention, especially on multi-core CPUs, due to the large input data sizes in applications such as graph neural networks (GNNs). Most existing solutions for SpMM computation follow the ahead-of-time (AOT) compilation approach, which compiles a program entirely before it is executed. AOT compilation for SpMM faces three key limitations: unnecessary memory accesses, additional branch overhead, and redundant instructions. These limitations stem from the fact that crucial information pertaining to SpMM is not known until runtime. In this paper, we propose JITSpMM, a just-in-time (JIT) assembly code generation framework to accelerate SpMM computation on multi-core CPUs with SIMD extensions. First, JITSpMM integrates the JIT assembly code generation technique into three widely used workload division methods for SpMM to achieve balanced workload distribution among CPU threads. Next, with the availability of runtime information, JITSpMM employs a novel technique, coarse-grain column merging, to maximize instruction-level parallelism by unrolling the performance-critical loop. Furthermore, JITSpMM intelligently allocates registers to cache frequently accessed data to minimize memory accesses, and employs selected SIMD instructions to enhance arithmetic throughput. We conduct a performance evaluation of JITSpMM and compare it to two AOT baselines. The first involves existing SpMM implementations compiled using the Intel icc compiler with auto-vectorization; the second uses the highly optimized SpMM routine provided by Intel MKL. Our results show that JITSpMM achieves average speedups of 3.8× and 1.4× over these baselines, respectively.
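For readers unfamiliar with the kernel the abstract discusses, the following is a minimal sketch of SpMM over the common CSR (compressed sparse row) storage format: a sparse matrix A times a dense matrix B. This is an illustration of the computation only, not the paper's JIT-generated code; the function name, toy data, and plain-Python formulation are the editor's assumptions.

```python
def spmm_csr(row_ptr, col_idx, vals, B, n_cols_b):
    """Compute C = A @ B, where A is sparse in CSR form
    (row_ptr, col_idx, vals) and B is a dense row-major matrix
    given as a list of lists. Returns C as a list of lists."""
    n_rows = len(row_ptr) - 1
    C = [[0.0] * n_cols_b for _ in range(n_rows)]
    for i in range(n_rows):
        # The loop over row i's nonzeros has a trip count known only at
        # runtime -- the kind of information a JIT code generator can
        # exploit (e.g., by unrolling) that an AOT compiler cannot.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a = vals[k]
            j = col_idx[k]
            for c in range(n_cols_b):
                C[i][c] += a * B[j][c]
    return C

# Toy example: A = [[1, 0], [0, 2]] in CSR, B = [[1, 2], [3, 4]]
C = spmm_csr([0, 1, 2], [0, 1], [1.0, 2.0], [[1.0, 2.0], [3.0, 4.0]], 2)
# C == [[1.0, 2.0], [6.0, 8.0]]
```

The innermost loop over B's columns is the performance-critical loop the abstract refers to; a JIT backend that knows the nonzero count and B's width at generation time can emit straight-line SIMD code for it instead of a generic loop with branches.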
Author Fu, Qiang
Rolinger, Thomas B.
Huang, H. Howie
Author_xml – sequence: 1
  givenname: Qiang
  surname: Fu
  fullname: Fu, Qiang
  email: charlifu@amd.com
  organization: Advanced Micro Devices Inc.,Austin,TX,USA
– sequence: 2
  givenname: Thomas B.
  surname: Rolinger
  fullname: Rolinger, Thomas B.
  email: trolinger@nvidia.com
  organization: NVIDIA,Austin,TX,USA
– sequence: 3
  givenname: H. Howie
  surname: Huang
  fullname: Huang, H. Howie
  email: howie@gwu.edu
  organization: George Washington University,Washington, DC,USA
BookMark eNo1kN1OAjEUhKvRREDewJi-QPH0v_WOEEUIG0zAa-wup0nNspDukujbq6BX32QyMxfTJ1fNvkFC7jmMOAf_MJkutTUSRgKEGnFQSjlhL8jQW--kBuk1eH9JesIoyYST7ob02_YDQFjFZY-8z2fr1WtRPNL5se1Yatg67ZDOmrbLx6pL-4ZOscEcTjLuMx1XFda_Bm7p6hByi7QIXU6f7AxaHOsuHepUnTq35DqGusXhHwfk7flpPXlhi-V0NhkvWBAeOqZEjE5zEU1QytpK2DLGslJGu6g0R8ASjBZb7kPw5U_QKeGNUxKMLXWIckDuzrsJETeHnHYhf23-H5HfcGFXQA
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/CGO57630.2024.10444827
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEL
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEL
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISBN 9798350395099
EISSN 2643-2838
EndPage 459
ExternalDocumentID 10444827
Genre orig-research
GrantInformation_xml – fundername: National Science Foundation
  grantid: 2127207
  funderid: 10.13039/100000001
GroupedDBID 29O
6IE
6IF
6IK
6IL
6IN
AAJGR
ABLEC
ADZIZ
ALMA_UNASSIGNED_HOLDINGS
BEFXN
BFFAM
BGNUA
BKEBE
BPEOZ
CBEJK
CHZPO
IEGSK
IPLJI
OCL
RIE
RIL
IEDL.DBID RIE
ISICitedReferencesCount 3
IngestDate Wed Aug 27 02:18:31 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
PageCount 12
ParticipantIDs ieee_primary_10444827
PublicationCentury 2000
PublicationDate 2024-March-2
PublicationDateYYYYMMDD 2024-03-02
PublicationDate_xml – month: 03
  year: 2024
  text: 2024-March-2
  day: 02
PublicationDecade 2020
PublicationTitle Proceedings / International Symposium on Code Generation and Optimization
PublicationTitleAbbrev CGO
PublicationYear 2024
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0027413
ssib057256076
Score 2.2826738
SourceID ieee
SourceType Publisher
StartPage 448
SubjectTerms Codes
Just-in-Time Instruction Generation
Optimization
Performance evaluation
Performance Optimization
Performance Profiling
Registers
Runtime
Sparse matrices
SpMM
Throughput
Title JITSPMM: Just-in-Time Instruction Generation for Accelerated Sparse Matrix-Matrix Multiplication
URI https://ieeexplore.ieee.org/document/10444827
hasFullText 1
inHoldings 1
isFullTextHit
isPrint