Achieving High-Performance Fault-Tolerant Routing in HyperX Interconnection Networks

Interconnection networks are key actors that condition the performance of current large datacenter and supercomputer systems. Both topology and routing are critical aspects that must be carefully considered for a competitive system network design. Moreover, when daily failures are expected, this tan...

Detailed bibliography
Published in: SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 472-483
Main authors: Camarero, Cristobal; Cano, Alejandro; Martinez, Carmen; Beivide, Ramon
Format: Conference paper
Language: English
Publication details: IEEE, 17 November 2024
Abstract Interconnection networks are key actors that condition the performance of current large datacenter and supercomputer systems. Both topology and routing are critical aspects that must be carefully considered for a competitive system network design. Moreover, when daily failures are expected, this tandem should exhibit resilience and robustness. Low-diameter direct networks, including HyperX, are cheaper than typical Fat Trees. But, to be really competitive, they have to employ evolved routing algorithms to both balance traffic and tolerate failures. In this paper, SurePath, an efficient fault-tolerant routing mechanism for HyperX topologies, is introduced and evaluated. SurePath leverages routes provided by standard routing algorithms and uses a deadlock avoidance mechanism based on an Up/Down escape subnetwork. This scheme not only prevents deadlock but also allows for a fault-tolerant solution for these networks. SurePath is thoroughly evaluated in the paper under different traffic patterns, showing no performance degradation under extremely faulty scenarios.
Author Camarero, Cristobal
Beivide, Ramon
Martinez, Carmen
Cano, Alejandro
Author_xml – sequence: 1
  givenname: Cristobal
  surname: Camarero
  fullname: Camarero, Cristobal
  email: cristobal.camarero@unican.es
  organization: Universidad de Cantabria, Spain
– sequence: 2
  givenname: Alejandro
  surname: Cano
  fullname: Cano, Alejandro
  email: alejandro.cano@unican.es
  organization: Universidad de Cantabria, Spain
– sequence: 3
  givenname: Carmen
  surname: Martinez
  fullname: Martinez, Carmen
  email: carmen.martinez@unican.es
  organization: Universidad de Cantabria, Spain
– sequence: 4
  givenname: Ramon
  surname: Beivide
  fullname: Beivide, Ramon
  email: ramon.beivide@unican.es
  organization: Universidad de Cantabria, Spain
CODEN IEEPAD
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/SCW63240.2024.00069
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 9798350355543
EndPage 483
ExternalDocumentID 10820629
Genre orig-research
GrantInformation_xml – fundername: Barcelona Supercomputing Center
  funderid: 10.13039/501100006433
GroupedDBID 6IE
6IL
ACM
ALMA_UNASSIGNED_HOLDINGS
CBEJK
RIE
RIL
ID FETCH-LOGICAL-a256t-29089e84df8416f68cc8520a1a50d7f19e3eb4c20893f28acecc397af516766e3
IEDL.DBID RIE
ISICitedReferencesCount 0
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001451792300052&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
IngestDate Wed Aug 27 02:01:54 EDT 2025
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-a256t-29089e84df8416f68cc8520a1a50d7f19e3eb4c20893f28acecc397af516766e3
PageCount 12
ParticipantIDs ieee_primary_10820629
PublicationCentury 2000
PublicationDate 2024-Nov.-17
PublicationDateYYYYMMDD 2024-11-17
PublicationDate_xml – month: 11
  year: 2024
  text: 2024-Nov.-17
  day: 17
PublicationDecade 2020
PublicationTitle SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis
PublicationTitleAbbrev SC-W
PublicationYear 2024
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssib060584085
Score 1.8894744
SourceID ieee
SourceType Publisher
StartPage 472
SubjectTerms adaptive routing
deadlock avoidance
Fault tolerance
Fault tolerant systems
hyperx network
Multiprocessor interconnection
Network topology
Routing
Supercomputers
System recovery
Topology
Traffic control
Title Achieving High-Performance Fault-Tolerant Routing in HyperX Interconnection Networks
URI https://ieeexplore.ieee.org/document/10820629
WOSCitedRecordID wos001451792300052&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
hasFullText 1
inHoldings 1
isFullTextHit
isPrint