Achieving Computation-Communication Overlap with Overdecomposition on GPU Systems

Bibliographic details
Published in: 2020 IEEE/ACM Fifth International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), pp. 1-10
Main authors: Choi, Jaemin; Richards, David F.; Kale, Laxmikant V.
Format: Conference proceeding
Language: English
Published: IEEE, 1 November 2020
Abstract: The landscape of high performance computing is shifting towards a collection of multi-GPU nodes, widening the gap between on-node compute and off-node communication capabilities. Consequently, the ability to tolerate communication latencies and maximize utilization of the compute hardware are becoming increasingly important in achieving high performance. Overdecomposition has been successfully adopted on traditional CPU-based systems to achieve computation-communication overlap, significantly reducing the impact of communication on application performance. However, it has been unclear whether overdecomposition can provide the same benefits on modern GPU systems. In this work, we address the challenges in achieving computation-communication overlap with overdecomposition on GPU systems using the Charm++ parallel programming system. By prioritizing communication with CUDA streams in the application and supporting asynchronous progress of GPU operations in the Charm++ runtime system, we obtain improvements in overall performance of up to 50% and 47% with proxy applications Jacobi3D and MiniMD, respectively.
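Illustrative note (not part of the catalogued record or of the paper): the overlap described in the abstract can be pictured with a toy timing model. The sketch below is only that, under stated assumptions: the work of one processing element splits evenly into chunks, compute is serialized, each chunk's message is sent as soon as its compute finishes, and messages are serialized on a single link. The names `iteration_time`, `compute`, `comm`, and `chunks` are hypothetical; the code uses neither Charm++ nor CUDA.

```python
def iteration_time(compute: float, comm: float, chunks: int) -> float:
    """Toy makespan for one iteration on one processing element.

    Assumptions (not taken from the paper): work splits evenly into
    `chunks` pieces, compute is serialized, each piece's message starts
    as soon as its compute finishes, and messages share one link.
    """
    if chunks < 1:
        raise ValueError("chunks must be at least 1")
    c = compute / chunks  # compute time per chunk
    m = comm / chunks     # communication time per chunk
    compute_done = 0.0    # time at which the latest chunk finished computing
    comm_done = 0.0       # time at which the link becomes free again
    for _ in range(chunks):
        compute_done += c
        # A message needs its data to be ready and the link to be free.
        comm_start = max(compute_done, comm_done)
        comm_done = comm_start + m
    return comm_done


if __name__ == "__main__":
    # Compare no overdecomposition (1 chunk) with finer decompositions.
    for k in (1, 2, 4, 8):
        print(k, iteration_time(8.0, 4.0, k))
```

With one chunk the model gives compute plus communication time; with more chunks, most communication hides behind the compute of later chunks. The model leaves out the per-chunk scheduling and launch overheads that finer decomposition would add on a real system, so it overstates the benefit at high chunk counts.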
Authors and affiliations:
– Choi, Jaemin (jchoi157@illinois.edu): University of Illinois at Urbana-Champaign, Department of Computer Science, Urbana, Illinois
– Richards, David F. (richards12@llnl.gov): Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, California
– Kale, Laxmikant V. (kale@illinois.edu): University of Illinois at Urbana-Champaign, Department of Computer Science, Urbana, Illinois
DOI 10.1109/ESPM251964.2020.00006
EISBN 9781665422840
166542284X
Funding:
– Office of Science (funder ID: 10.13039/100006132)
– U.S. DOE, grant DE-AC05-000R22725 (funder ID: 10.13039/100000015)
– Lawrence Livermore National Laboratory, grant DE-AC52-07NA27344 (LLNL-CONF-814558) (funder ID: 10.13039/100006227)
– U.S. Department of Energy (DOE) (funder ID: 10.13039/100000015)
Subject terms:
application program interfaces
asynchronous task-based runtime
Charm++ parallel programming system
communication latencies
computation-communication overlap
compute hardware
computer graphic equipment
coprocessors
Delays
GPU computing
graphics processing units
Hardware
high performance computing
Jacobian matrices
Kernel
message passing
modern GPU systems
multiGPU nodes
off-node communication capabilities
on-node compute
overdecomposition
parallel architectures
parallel processing
parallel programming
Runtime
Task analysis
traditional CPU-based systems
URI https://ieeexplore.ieee.org/document/9307123