Distributed-Memory Sparse Kernels for Machine Learning
| Published in: | Proceedings - IEEE International Parallel and Distributed Processing Symposium, pp. 47-58 |
|---|---|
| Main Authors: | Bharadwaj, Vivek; Buluc, Aydin; Demmel, James (University of California, EECS Department, Berkeley, USA) |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, May 2022 |
| Subjects: | Collaborative filtering; Communication Avoiding Algorithms; Costs; FusedMM; Inference algorithms; Layout; Machine learning; Machine learning algorithms; Runtime; SDDMM; SpMM |
| ISSN: | 1530-2075 |
| EISBN: | 9781665481069; 1665481064 |
| DOI: | 10.1109/IPDPS53621.2022.00014 |
| Funding: | Office of Science (10.13039/100006132) |
| Online Access: | https://ieeexplore.ieee.org/document/9820738 |
| Abstract: | Sampled Dense Times Dense Matrix Multiplication (SDDMM) and Sparse Times Dense Matrix Multiplication (SpMM) appear in diverse settings, such as collaborative filtering, document clustering, and graph embedding. Frequently, the SDDMM output becomes the input sparse matrix for a subsequent SpMM operation. Existing work has focused on shared-memory parallelization of these primitives. While there has been extensive analysis of communication-minimizing distributed 1.5D algorithms for SpMM, no such analysis exists for SDDMM or the back-to-back sequence of SDDMM and SpMM, termed FusedMM. We show that distributed-memory 1.5D and 2.5D algorithms for SpMM can be converted to algorithms for SDDMM with identical communication costs and input/output data layouts. Further, we give two communication-eliding strategies that reduce costs for FusedMM kernels: either reusing the replication of an input dense matrix for the SDDMM and SpMM in sequence, or fusing the local SDDMM and SpMM kernels. We benchmark FusedMM algorithms on Cori, a Cray XC40 at LBNL, using Erdős-Rényi random matrices and large real-world sparse matrices. On 256 nodes with 68 cores each, 1.5D FusedMM algorithms using either communication-eliding approach save at least 30% of the time spent exclusively in communication compared to executing a distributed-memory SpMM and SDDMM kernel in sequence. Our 2.5D communication-eliding algorithms save 21% of communication time compared to the unoptimized sequence. On real-world matrices with hundreds of millions of edges, all of our algorithms exhibit at least a 10x speedup over the SpMM algorithm in PETSc. On these matrices, our communication-eliding techniques run up to 1.6 times faster than an unoptimized sequence of SDDMM and SpMM. We embed and test the scaling of our algorithms in real-world applications, including collaborative filtering via alternating least squares and inference for attention-based graph neural networks. |
|---|---|
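For orientation, the two kernels named in the abstract, and the FusedMM sequence that chains them, can be sketched on a single node with NumPy/SciPy. This is a minimal illustration of the kernel semantics only; the function names and matrix sizes are hypothetical, and the paper's contribution is the distributed-memory versions, not this local code.

```python
import numpy as np
from scipy.sparse import random as sparse_random

def sddmm(S, A, B):
    """SDDMM: evaluate A @ B.T only at the nonzero positions of the
    sparse matrix S, scaling each sample by the matching entry of S."""
    coo = S.tocoo()
    out = coo.copy()
    # Row-wise dot products between the rows of A and B selected by
    # the nonzero coordinates of S.
    out.data = coo.data * np.einsum('ij,ij->i', A[coo.row], B[coo.col])
    return out.tocsr()

def spmm(S, B):
    """SpMM: sparse matrix times a tall-skinny dense matrix."""
    return S @ B

def fusedmm(S, A, B):
    """FusedMM: the SDDMM output becomes the sparse input of an SpMM."""
    return spmm(sddmm(S, A, B), B)

# Small example with an Erdos-Renyi-style random sparsity pattern.
m, n, k = 1000, 800, 16                       # hypothetical sizes
S = sparse_random(m, n, density=0.01, format='csr', random_state=0)
A = np.random.default_rng(0).random((m, k))
B = np.random.default_rng(1).random((n, k))
C = fusedmm(S, A, B)                          # dense m-by-k result
```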
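The first communication-eliding strategy the abstract mentions, reusing one replication of an input dense matrix for both kernels, can be gestured at with an mpi4py sketch. This is a sketch under simplified assumptions: a 1D row partition of S and A with full replication of B via an allgather, not the paper's actual 1.5D or 2.5D layouts; all variable names and sizes here are illustrative.

```python
# Run with, e.g.: mpirun -n 4 python fusedmm_reuse.py
from mpi4py import MPI
import numpy as np
from scipy.sparse import random as sparse_random

comm = MPI.COMM_WORLD
p, rank = comm.Get_size(), comm.Get_rank()

# Hypothetical sizes; this sketch assumes p divides m and n evenly.
m, n, k = 512, 512, 16
S_local = sparse_random(m // p, n, density=0.01, format='csr',
                        random_state=rank)    # this rank's row block of S
A_local = np.random.default_rng(rank).random((m // p, k))
B_local = np.random.default_rng(p + rank).random((n // p, k))

# Replicate the dense matrix B once (here, an allgather of row blocks)
# and reuse the replica for both kernels, rather than paying the
# replication cost separately for the SDDMM and the SpMM.
B = np.vstack(comm.allgather(B_local))

# Local SDDMM: scale the nonzeros of S_local by A_local @ B.T sampled
# at the same coordinates.
coo = S_local.tocoo()
coo.data *= np.einsum('ij,ij->i', A_local[coo.row], B[coo.col])

# Local SpMM reuses the same replica of B; no second communication round.
C_local = coo.tocsr() @ B         # this rank's row block of the output
```

The second strategy, fusing the local SDDMM and SpMM kernels, would replace the two local steps above with a single pass over the nonzeros of S_local.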