Examining the Viability of Row-Scale Disaggregation for Production Applications

Published in: SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1613-1621
Main Authors: Shorts, Curtis; Grant, Ryan E.
Format: Conference Proceeding
Language: English
Published: IEEE, 17 November 2024
Abstract: Row-scale Composable Disaggregated Infrastructure (CDI) is a heterogeneous high performance computing (HPC) architecture that relocates the GPUs to a single chassis from which CPU nodes can then request compute resources. This architecture differs distinctly from rack-scaled CDI because the GPUs are accessed over a network rather than existing in the same PCIe domain as the CPUs. Row-scale CDI expands the benefits and flexibility of rack-scaled CDI while introducing new challenges. For example, with row-scale CDI, one must account for the effects of "slack", a latency in CPU-to-GPU communication times due to network delays. This work seeks to assess potential challenges with row-scale CDI to determine which factors are most important to consider when deploying a CDI system. Our strong scaling application analyses reveal that two types of HPC workloads may benefit from row-scale CDI: those that are CPU dominant and periodically call on the GPU to do highly parallel tasks, and those that are GPU dominant and primarily rely on the CPU to coordinate work. We compare the kernel and data transfer characteristics of each application with those of a slack proxy application, which allowed for the development of a mathematical model to predict the performance penalty different applications can face as a result of slack. To illustrate this, we profile two applications using our proposed method and find that, under pessimistic assumptions, they would see a performance penalty of less than 1% above the effects of crossing the network in an environment that induced 100 µs of slack, equivalent to a distance of 20 km at the speed of light in a fibre optic network cable. This demonstrates that both row-scale and cluster-scale CDI are viable technologies from an application performance perspective.
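The abstract equates 100 µs of slack with about 20 km of fibre. The arithmetic behind that figure can be checked with a minimal sketch, assuming a one-way trip and a typical fibre refractive index of about 1.5; that index value and the helper name `fibre_distance_km` are illustrative assumptions, not taken from the paper.

```python
# Rough check of the "100 µs of slack ~ 20 km of fibre" equivalence.
# Assumptions (not from the paper): one-way propagation, refractive index ~1.5.

C_VACUUM_M_PER_S = 299_792_458   # speed of light in vacuum (exact by SI definition)
FIBRE_REFRACTIVE_INDEX = 1.5     # assumed typical value for silica fibre


def fibre_distance_km(delay_us: float, n: float = FIBRE_REFRACTIVE_INDEX) -> float:
    """Return the one-way distance (km) light covers in fibre during delay_us microseconds."""
    speed_in_fibre = C_VACUUM_M_PER_S / n      # metres per second inside the fibre
    metres = speed_in_fibre * delay_us * 1e-6  # convert microseconds to seconds
    return metres / 1000.0


if __name__ == "__main__":
    # 100 µs of slack, as quoted in the abstract
    print(f"{fibre_distance_km(100.0):.1f} km")  # prints "20.0 km"
```

Under these assumptions light covers roughly 200 m per microsecond in fibre, so the quoted distance scales linearly with the slack value.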
Author_xml – sequence: 1
  givenname: Curtis
  surname: Shorts
  fullname: Shorts, Curtis
  email: curtis.shorts@queensu.ca
  organization: Queen's University, Electrical and Computer Engineering Department, Kingston, Canada
– sequence: 2
  givenname: Ryan E.
  surname: Grant
  fullname: Grant, Ryan E.
  email: ryan.grant@queensu.ca
  organization: Queen's University, Electrical and Computer Engineering Department, Kingston, Canada
CODEN IEEPAD
DOI 10.1109/SCW63240.2024.00201
EISBN 9798350355543
EndPage 1621
ExternalDocumentID 10820728
Genre orig-research
GrantInformation_xml – fundername: Natural Sciences and Engineering Research Council of Canada
  funderid: 10.13039/501100000038
SubjectTerms cdi
composable disaggregated infrastructure
Computer architecture
cuda
Graphics processing units
High performance computing
hpc
Kernel
Mathematical models
Optical fiber cables
Production
Resource management
row-scaled cdi
slack insertion
Software
Testing
URI https://ieeexplore.ieee.org/document/10820728