DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks


Detailed Description

Bibliographic Details
Published in: SC21: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-14
Main authors: Md, Vasimuddin; Misra, Sanchit; Ma, Guixiang; Mohanty, Ramanarayan; Georganas, Evangelos; Heinecke, Alexander; Kalamkar, Dhiraj; Ahmed, Nesreen K.; Avancha, Sasikanth
Format: Conference proceeding
Language: English
Published: ACM, 14 November 2021
Subjects: Clustering algorithms; Deep Graph Library; Deep Learning; Distributed Algorithm; Graph Neural Networks; Graph Partition; High performance computing; Memory management; Proteins; Social networking (online); Sockets; Training
ISSN: 2167-4337
Online access: Full text
Abstract: Full-batch training on Graph Neural Networks (GNN) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible. It is challenging due to large memory capacity and bandwidth requirements on a single compute node and high communication volumes across multiple nodes. In this paper, we present DistGNN that optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters via an efficient shared memory implementation, communication reduction using a minimum vertex-cut graph partitioning algorithm and communication avoidance using a family of delayed-update algorithms. Our results on four common GNN benchmark datasets: Reddit, OGB-Products, OGB-Papers and Proteins, show up to 3.7× speed-up using a single CPU socket and up to 97× speed-up using 128 CPU sockets, respectively, over baseline DGL implementations running on a single CPU socket.
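The communication-avoidance idea mentioned in the abstract (delayed updates of data coming from other partitions) can be illustrated with a minimal, self-contained sketch. This is not DistGNN's or DGL's code; the names below (DelayedHaloCache, maybe_sync, fetch_remote, delay) are hypothetical and only show the general idea of reusing stale remote vertex features between periodic synchronizations.

```python
# Illustrative sketch only: NOT DistGNN's or DGL's implementation. It shows the
# general delayed-update idea: each partition keeps stale copies of features for
# vertices owned by other partitions and refreshes them only every `delay` epochs.
import numpy as np


class DelayedHaloCache:
    """Caches features of remote (halo) vertices and refreshes them every
    `delay` epochs instead of every iteration, trading staleness for less
    communication."""

    def __init__(self, local_feats, halo_ids, delay):
        self.local_feats = local_feats            # {vertex_id: np.ndarray} owned locally
        self.halo_ids = set(halo_ids)             # remote vertex ids this partition reads
        self.halo_cache = {}                      # possibly stale remote features
        self.delay = delay                        # refresh period in epochs

    def maybe_sync(self, epoch, fetch_remote):
        # Communicate only on refresh epochs; otherwise reuse stale values.
        if epoch % self.delay == 0:
            for v in self.halo_ids:
                self.halo_cache[v] = fetch_remote(v)   # an MPI/RPC fetch in practice

    def feature(self, v):
        # Return the freshest feature available for vertex v (local or cached).
        return self.local_feats[v] if v in self.local_feats else self.halo_cache.get(v)

    def mean_aggregate(self, neighbors):
        # Mean-aggregate neighbor features, skipping halo vertices never synced.
        rows = [f for f in (self.feature(u) for u in neighbors) if f is not None]
        dim = len(next(iter(self.local_feats.values())))
        return np.mean(rows, axis=0) if rows else np.zeros(dim)


# Tiny usage example: this partition owns vertices 0 and 1; vertex 7 lives elsewhere.
if __name__ == "__main__":
    cache = DelayedHaloCache(
        local_feats={0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])},
        halo_ids=[7],
        delay=4,
    )
    remote_store = {7: np.array([0.5, 0.5])}      # stand-in for another partition
    for epoch in range(8):
        cache.maybe_sync(epoch, remote_store.get)  # communicates only at epochs 0 and 4
        h0 = cache.mean_aggregate([1, 7])          # aggregate for vertex 0's neighbors
    print(h0)                                      # -> [0.25 0.75]
```

In this sketch the `delay` parameter stands in for whatever schedule governs how often partitions exchange partial results; larger values mean fewer synchronizations but staler neighbor information during training.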
Author details:
1. Vasimuddin Md (vasimuddin.md@intel.com), Intel Corporation
2. Sanchit Misra (sanchit.misra@intel.com), Intel Corporation
3. Guixiang Ma (guixiang.ma@intel.com), Intel Corporation
4. Ramanarayan Mohanty (Ramanarayan.Mohanty@intel.com), Intel Corporation
5. Evangelos Georganas (evangelos.georganas@intel.com), Intel Corporation
6. Alexander Heinecke (alexander.heinecke@intel.com), Intel Corporation
7. Dhiraj Kalamkar (dhiraj.d.kalamkar@intel.com), Intel Corporation
8. Nesreen K. Ahmed (nesreen.k.ahmed@intel.com), Intel Corporation
9. Sasikanth Avancha (sasikanth.avancha@intel.com), Intel Corporation
DOI: 10.1145/3458817.3480856
Discipline: Computer Science
EISBN: 9781450384421, 1450384420
URI: https://ieeexplore.ieee.org/document/9910067