Distributed-Memory Sparse Kernels for Machine Learning



Detailed bibliography
Published in: Proceedings - IEEE International Parallel and Distributed Processing Symposium, pp. 47-58
Main authors: Bharadwaj, Vivek, Buluc, Aydin, Demmel, James
Medium: Conference paper
Language: English
Published: IEEE, 01.05.2022
ISSN: 1530-2075
Abstract Sampled Dense Times Dense Matrix Multiplication (SDDMM) and Sparse Times Dense Matrix Multiplication (SpMM) appear in diverse settings, such as collaborative filtering, document clustering, and graph embedding. Frequently, the SDDMM output becomes the input sparse matrix for a subsequent SpMM operation. Existing work has focused on shared-memory parallelization of these primitives. While there has been extensive analysis of communication-minimizing distributed 1.5D algorithms for SpMM, no such analysis exists for SDDMM or the back-to-back sequence of SDDMM and SpMM, termed FusedMM. We show that distributed-memory 1.5D and 2.5D algorithms for SpMM can be converted to algorithms for SDDMM with identical communication costs and input/output data layouts. Further, we give two communication-eliding strategies to reduce costs further for FusedMM kernels: either reusing the replication of an input dense matrix for the SDDMM and SpMM in sequence, or fusing the local SDDMM and SpMM kernels. We benchmark FusedMM algorithms on Cori, a Cray XC40 at LBNL, using Erdős-Rényi random matrices and large real-world sparse matrices. On 256 nodes with 68 cores each, 1.5D FusedMM algorithms using either communication-eliding approach can save at least 30% of the time spent exclusively in communication compared to executing a distributed-memory SpMM and SDDMM kernel in sequence. Our 2.5D communication-eliding algorithms save 21% of communication time compared to the unoptimized sequence. On real-world matrices with hundreds of millions of edges, all of our algorithms exhibit at least a 10x speedup over the SpMM algorithm in PETSc. On these matrices, our communication-eliding techniques exhibit runtimes up to 1.6 times faster than an unoptimized sequence of SDDMM and SpMM. We embed and test the scaling of our algorithms in real-world applications, including collaborative filtering via alternating least squares and inference for attention-based graph neural networks.
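The FusedMM kernel named in the abstract (an SDDMM whose sparse output becomes the sparse operand of a subsequent SpMM) can be illustrated with a minimal single-process Python/SciPy sketch. This is only an illustration of the kernel definitions, not the paper's distributed 1.5D/2.5D algorithms or communication-eliding strategies; the matrix names S, A, B and the toy sizes are assumptions for the example.

import numpy as np
import scipy.sparse as sp

def sddmm(S, A, B):
    # SDDMM: evaluate the dense product A @ B.T only at the nonzero positions
    # of the sparse sampling matrix S, scaled by S's stored values.
    S = S.tocoo()
    vals = np.einsum('ij,ij->i', A[S.row], B[S.col]) * S.data
    return sp.coo_matrix((vals, (S.row, S.col)), shape=S.shape).tocsr()

def fusedmm(S, A, B):
    # FusedMM: the SDDMM output feeds an SpMM with the same dense matrix B.
    M = sddmm(S, A, B)   # sparse result with the sparsity pattern of S
    return M @ B         # SpMM: sparse times dense

# Toy usage with an Erdős-Rényi-style random sparsity pattern (hypothetical sizes).
rng = np.random.default_rng(0)
S = sp.random(1000, 1000, density=0.01, format='csr', random_state=rng)
A = rng.standard_normal((1000, 32))
B = rng.standard_normal((1000, 32))
out = fusedmm(S, A, B)   # dense 1000 x 32 result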
Author Bharadwaj, Vivek
Buluc, Aydin
Demmel, James
Author_xml – sequence: 1
  givenname: Vivek
  surname: Bharadwaj
  fullname: Bharadwaj, Vivek
  organization: University of California, EECS Department, Berkeley, USA
– sequence: 2
  givenname: Aydin
  surname: Buluc
  fullname: Buluc, Aydin
  organization: University of California, EECS Department, Berkeley, USA
– sequence: 3
  givenname: James
  surname: Demmel
  fullname: Demmel, James
  organization: University of California, EECS Department, Berkeley, USA
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/IPDPS53621.2022.00014
Discipline Computer Science
EISBN 9781665481069
1665481064
EISSN 1530-2075
EndPage 58
ExternalDocumentID 9820738
Genre orig-research
GrantInformation_xml – fundername: Office of Science
  funderid: 10.13039/100006132
ISICitedReferencesCount 8
IsPeerReviewed false
IsScholarly false
Language English
PageCount 12
PublicationDate 2022-May
PublicationDateYYYYMMDD 2022-05-01
PublicationTitle Proceedings - IEEE International Parallel and Distributed Processing Symposium
PublicationTitleAbbrev IPDPS
PublicationYear 2022
Publisher IEEE
StartPage 47
SubjectTerms Collaborative filtering
Communication Avoiding Algorithms
Costs
FusedMM
Inference algorithms
Layout
Machine learning
Machine learning algorithms
Runtime
SDDMM
SpMM
Title Distributed-Memory Sparse Kernels for Machine Learning
URI https://ieeexplore.ieee.org/document/9820738