Distributed-Memory Sparse Kernels for Machine Learning



Detailed bibliography
Published in: Proceedings - IEEE International Parallel and Distributed Processing Symposium, pp. 47-58
Main authors: Bharadwaj, Vivek, Buluc, Aydin, Demmel, James
Medium: Conference paper
Language: English
Published: IEEE, 01.05.2022
ISSN: 1530-2075
Abstract Sampled Dense Times Dense Matrix Multiplication (SDDMM) and Sparse Times Dense Matrix Multiplication (SpMM) appear in diverse settings, such as collaborative filtering, document clustering, and graph embedding. Frequently, the SDDMM output becomes the input sparse matrix for a subsequent SpMM operation. Existing work has focused on shared-memory parallelization of these primitives. While there has been extensive analysis of communication-minimizing distributed 1.5D algorithms for SpMM, no such analysis exists for SDDMM or the back-to-back sequence of SDDMM and SpMM, termed FusedMM. We show that distributed-memory 1.5D and 2.5D algorithms for SpMM can be converted to algorithms for SDDMM with identical communication costs and input/output data layouts. Further, we give two communication-eliding strategies to reduce costs further for FusedMM kernels: either reusing the replication of an input dense matrix for the SDDMM and SpMM in sequence, or fusing the local SDDMM and SpMM kernels. We benchmark FusedMM algorithms on Cori, a Cray XC40 at LBNL, using Erdős-Rényi random matrices and large real-world sparse matrices. On 256 nodes with 68 cores each, 1.5D FusedMM algorithms using either communication-eliding approach can save at least 30% of the time spent exclusively in communication compared to executing a distributed-memory SpMM and SDDMM kernel in sequence. Our 2.5D communication-eliding algorithms save 21% of communication time compared to the unoptimized sequence. On real-world matrices with hundreds of millions of edges, all of our algorithms exhibit at least a 10x speedup over the SpMM algorithm in PETSc. On these matrices, our communication-eliding techniques exhibit runtimes up to 1.6 times faster than an unoptimized sequence of SDDMM and SpMM. We embed and test the scaling of our algorithms in real-world applications, including collaborative filtering via alternating least squares and inference for attention-based graph neural networks.
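The FusedMM kernel named in the abstract (an SDDMM whose sparse output becomes the sparse operand of a subsequent SpMM) can be illustrated with a minimal single-process Python/SciPy sketch. This is only an illustration of the kernel definitions, not the paper's distributed 1.5D/2.5D algorithms or communication-eliding strategies; the matrix names S, A, B and the toy sizes are assumptions for the example.

import numpy as np
import scipy.sparse as sp

def sddmm(S, A, B):
    # SDDMM: evaluate the dense product A @ B.T only at the nonzero positions
    # of the sparse sampling matrix S, scaled by S's stored values.
    S = S.tocoo()
    vals = np.einsum('ij,ij->i', A[S.row], B[S.col]) * S.data
    return sp.coo_matrix((vals, (S.row, S.col)), shape=S.shape).tocsr()

def fusedmm(S, A, B):
    # FusedMM: the SDDMM output feeds an SpMM with the same dense matrix B.
    M = sddmm(S, A, B)   # sparse result with the sparsity pattern of S
    return M @ B         # SpMM: sparse times dense

# Toy usage with an Erdős-Rényi-style random sparsity pattern (hypothetical sizes).
rng = np.random.default_rng(0)
S = sp.random(1000, 1000, density=0.01, format='csr', random_state=rng)
A = rng.standard_normal((1000, 32))
B = rng.standard_normal((1000, 32))
out = fusedmm(S, A, B)   # dense 1000 x 32 result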
Author Bharadwaj, Vivek
Buluc, Aydin
Demmel, James
Author_xml – sequence: 1
  givenname: Vivek
  surname: Bharadwaj
  fullname: Bharadwaj, Vivek
  organization: University of California, EECS Department, Berkeley, USA
– sequence: 2
  givenname: Aydin
  surname: Buluc
  fullname: Buluc, Aydin
  organization: University of California, EECS Department, Berkeley, USA
– sequence: 3
  givenname: James
  surname: Demmel
  fullname: Demmel, James
  organization: University of California, EECS Department, Berkeley, USA
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/IPDPS53621.2022.00014
Discipline Computer Science
EISBN 9781665481069
1665481064
EISSN 1530-2075
EndPage 58
ExternalDocumentID 9820738
Genre orig-research
GrantInformation_xml – fundername: Office of Science
  funderid: 10.13039/100006132
ISICitedReferencesCount 8
IsPeerReviewed false
IsScholarly false
Language English
PageCount 12
PublicationDate 2022-May
PublicationDateYYYYMMDD 2022-05-01
PublicationTitle Proceedings - IEEE International Parallel and Distributed Processing Symposium
PublicationTitleAbbrev IPDPS
PublicationYear 2022
Publisher IEEE
StartPage 47
SubjectTerms Collaborative filtering
Communication Avoiding Algorithms
Costs
FusedMM
Inference algorithms
Layout
Machine learning
Machine learning algorithms
Runtime
SDDMM
SpMM
Title Distributed-Memory Sparse Kernels for Machine Learning
URI https://ieeexplore.ieee.org/document/9820738