JITSPMM: Just-in-Time Instruction Generation for Accelerated Sparse Matrix-Matrix Multiplication

Detailed bibliography
Published in: Proceedings / International Symposium on Code Generation and Optimization, pp. 448–459
Main authors: Fu, Qiang; Rolinger, Thomas B.; Huang, H. Howie
Format: Conference paper
Language: English
Published: IEEE, March 2, 2024
Subjects:
ISSN: 2643-2838
Online access: Get full text
Abstract Achieving high performance for Sparse Matrix-Matrix Multiplication (SpMM) has received increasing research attention, especially on multi-core CPUs, due to the large input data sizes in applications such as graph neural networks (GNNs). Most existing solutions for SpMM computation follow the ahead-of-time (AOT) compilation approach, which compiles a program entirely before it is executed. AOT compilation for SpMM faces three key limitations: unnecessary memory accesses, additional branch overhead, and redundant instructions. These limitations stem from the fact that crucial information pertaining to SpMM is not known until runtime. In this paper, we propose JITSpMM, a just-in-time (JIT) assembly code generation framework to accelerate SpMM computation on multi-core CPUs with SIMD extensions. First, JITSpMM integrates the JIT assembly code generation technique into three widely used workload division methods for SpMM to achieve balanced workload distribution among CPU threads. Next, with the availability of runtime information, JITSpMM employs a novel technique, coarse-grain column merging, to maximize instruction-level parallelism by unrolling the performance-critical loop. Furthermore, JITSpMM intelligently allocates registers to cache frequently accessed data to minimize memory accesses, and employs selected SIMD instructions to enhance arithmetic throughput. We conduct a performance evaluation of JITSpMM and compare it to two AOT baselines. The first involves existing SpMM implementations compiled using the Intel icc compiler with auto-vectorization; the second uses the highly optimized SpMM routine provided by Intel MKL. Our results show that JITSpMM achieves average speedups of 3.8× and 1.4× over these baselines, respectively.
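For readers unfamiliar with the kernel the abstract discusses, the following is a minimal sketch of SpMM over the common CSR (compressed sparse row) storage format: a sparse matrix A times a dense matrix B. This is an illustration of the computation only, not the paper's JIT-generated code; the function name, toy data, and plain-Python formulation are the editor's assumptions.

```python
def spmm_csr(row_ptr, col_idx, vals, B, n_cols_b):
    """Compute C = A @ B, where A is sparse in CSR form
    (row_ptr, col_idx, vals) and B is a dense row-major matrix
    given as a list of lists. Returns C as a list of lists."""
    n_rows = len(row_ptr) - 1
    C = [[0.0] * n_cols_b for _ in range(n_rows)]
    for i in range(n_rows):
        # The loop over row i's nonzeros has a trip count known only at
        # runtime -- the kind of information a JIT code generator can
        # exploit (e.g., by unrolling) that an AOT compiler cannot.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a = vals[k]
            j = col_idx[k]
            for c in range(n_cols_b):
                C[i][c] += a * B[j][c]
    return C

# Toy example: A = [[1, 0], [0, 2]] in CSR, B = [[1, 2], [3, 4]]
C = spmm_csr([0, 1, 2], [0, 1], [1.0, 2.0], [[1.0, 2.0], [3.0, 4.0]], 2)
# C == [[1.0, 2.0], [6.0, 8.0]]
```

The innermost loop over B's columns is the performance-critical loop the abstract refers to; a JIT backend that knows the nonzero count and B's width at generation time can emit straight-line SIMD code for it instead of a generic loop with branches.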
Author Fu, Qiang
Rolinger, Thomas B.
Huang, H. Howie
Author_xml – sequence: 1
  givenname: Qiang
  surname: Fu
  fullname: Fu, Qiang
  email: charlifu@amd.com
  organization: Advanced Micro Devices Inc.,Austin,TX,USA
– sequence: 2
  givenname: Thomas B.
  surname: Rolinger
  fullname: Rolinger, Thomas B.
  email: trolinger@nvidia.com
  organization: NVIDIA,Austin,TX,USA
– sequence: 3
  givenname: H. Howie
  surname: Huang
  fullname: Huang, H. Howie
  email: howie@gwu.edu
  organization: George Washington University,Washington, DC,USA
BookMark eNo1kN1OAjEUhKvRREDewJi-QPH0v_WOEEUIG0zAa-wup0nNspDukujbq6BX32QyMxfTJ1fNvkFC7jmMOAf_MJkutTUSRgKEGnFQSjlhL8jQW--kBuk1eH9JesIoyYST7ob02_YDQFjFZY-8z2fr1WtRPNL5se1Yatg67ZDOmrbLx6pL-4ZOscEcTjLuMx1XFda_Bm7p6hByi7QIXU6f7AxaHOsuHepUnTq35DqGusXhHwfk7flpPXlhi-V0NhkvWBAeOqZEjE5zEU1QytpK2DLGslJGu6g0R8ASjBZb7kPw5U_QKeGNUxKMLXWIckDuzrsJETeHnHYhf23-H5HfcGFXQA
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/CGO57630.2024.10444827
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEL
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEL
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISBN 9798350395099
EISSN 2643-2838
EndPage 459
ExternalDocumentID 10444827
Genre orig-research
GrantInformation_xml – fundername: National Science Foundation
  grantid: 2127207
  funderid: 10.13039/100000001
GroupedDBID 29O
6IE
6IF
6IK
6IL
6IN
AAJGR
ABLEC
ADZIZ
ALMA_UNASSIGNED_HOLDINGS
BEFXN
BFFAM
BGNUA
BKEBE
BPEOZ
CBEJK
CHZPO
IEGSK
IPLJI
OCL
RIE
RIL
IEDL.DBID RIE
ISICitedReferencesCount 3
IngestDate Wed Aug 27 02:18:31 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
PageCount 12
ParticipantIDs ieee_primary_10444827
PublicationCentury 2000
PublicationDate 2024-March-2
PublicationDateYYYYMMDD 2024-03-02
PublicationDate_xml – month: 03
  year: 2024
  text: 2024-March-2
  day: 02
PublicationDecade 2020
PublicationTitle Proceedings / International Symposium on Code Generation and Optimization
PublicationTitleAbbrev CGO
PublicationYear 2024
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0027413
ssib057256076
Score 2.2826738
SourceID ieee
SourceType Publisher
StartPage 448
SubjectTerms Codes
Just-in-Time Instruction Generation
Optimization
Performance evaluation
Performance Optimization
Performance Profiling
Registers
Runtime
Sparse matrices
SpMM
Throughput
Title JITSPMM: Just-in-Time Instruction Generation for Accelerated Sparse Matrix-Matrix Multiplication
URI https://ieeexplore.ieee.org/document/10444827
hasFullText 1
inHoldings 1
isFullTextHit
isPrint