Achieving High-Performance Fault-Tolerant Routing in HyperX Interconnection Networks
| Published in: | SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 472-483 |
|---|---|
| Main authors: | Camarero, Cristobal; Beivide, Ramon; Martinez, Carmen; Cano, Alejandro |
| Format: | Conference paper |
| Language: | English |
| Publication details: | IEEE, 17 November 2024 |
| Abstract | Interconnection networks are key actors that condition the performance of current large datacenter and supercomputer systems. Both topology and routing are critical aspects that must be carefully considered for a competitive system network design. Moreover, when daily failures are expected, this tandem should exhibit resilience and robustness. Low-diameter direct networks, including HyperX, are cheaper than typical Fat Trees. But, to be really competitive, they have to employ evolved routing algorithms to both balance traffic and tolerate failures. In this paper, SurePath, an efficient fault-tolerant routing mechanism for HyperX topologies, is introduced and evaluated. SurePath leverages routes provided by standard routing algorithms and uses a deadlock avoidance mechanism based on an Up/Down escape subnetwork. This scheme not only prevents deadlock but also allows for a fault-tolerant solution for these networks. SurePath is thoroughly evaluated in the paper under different traffic patterns, showing no performance degradation under extremely faulty scenarios. |
|---|---|
| Author | Camarero, Cristobal; Beivide, Ramon; Martinez, Carmen; Cano, Alejandro |
| Author_xml | – sequence: 1 givenname: Cristobal surname: Camarero fullname: Camarero, Cristobal email: cristobal.camarero@unican.es organization: Universidad de Cantabria, Spain – sequence: 2 givenname: Alejandro surname: Cano fullname: Cano, Alejandro email: alejandro.cano@unican.es organization: Universidad de Cantabria, Spain – sequence: 3 givenname: Carmen surname: Martinez fullname: Martinez, Carmen email: carmen.martinez@unican.es organization: Universidad de Cantabria, Spain – sequence: 4 givenname: Ramon surname: Beivide fullname: Beivide, Ramon email: ramon.beivide@unican.es organization: Universidad de Cantabria, Spain |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DBID | 6IE 6IL CBEJK RIE RIL |
| DOI | 10.1109/SCW63240.2024.00069 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Xplore POP ALL; IEEE Xplore All Conference Proceedings; IEEE Electronic Library (IEL); IEEE Proceedings Order Plans (POP All) 1998-Present |
| DatabaseTitleList | |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| EISBN | 9798350355543 |
| EndPage | 483 |
| ExternalDocumentID | 10820629 |
| Genre | orig-research |
| GrantInformation_xml | – fundername: Barcelona Supercomputing Center funderid: 10.13039/501100006433 |
| GroupedDBID | 6IE 6IL ACM ALMA_UNASSIGNED_HOLDINGS CBEJK RIE RIL |
| ID | FETCH-LOGICAL-a256t-29089e84df8416f68cc8520a1a50d7f19e3eb4c20893f28acecc397af516766e3 |
| IEDL.DBID | RIE |
| ISICitedReferencesCount | 0 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001451792300052&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| IngestDate | Wed Aug 27 02:01:54 EDT 2025 |
| IsPeerReviewed | false |
| IsScholarly | false |
| Language | English |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-a256t-29089e84df8416f68cc8520a1a50d7f19e3eb4c20893f28acecc397af516766e3 |
| PageCount | 12 |
| ParticipantIDs | ieee_primary_10820629 |
| PublicationCentury | 2000 |
| PublicationDate | 2024-Nov.-17 |
| PublicationDateYYYYMMDD | 2024-11-17 |
| PublicationDate_xml | – month: 11 year: 2024 text: 2024-Nov.-17 day: 17 |
| PublicationDecade | 2020 |
| PublicationTitle | SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis |
| PublicationTitleAbbrev | SC-W |
| PublicationYear | 2024 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| SSID | ssib060584085 |
| Score | 1.8894744 |
| Snippet | Interconnection networks are key actors that condition the performance of current large datacenter and supercomputer systems. Both topology and routing are... |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 472 |
| SubjectTerms | adaptive routing; deadlock avoidance; Fault tolerance; Fault tolerant systems; hyperx network; Multiprocessor interconnection; Network topology; Routing; Supercomputers; System recovery; Topology; Traffic control |
| Title | Achieving High-Performance Fault-Tolerant Routing in HyperX Interconnection Networks |
| URI | https://ieeexplore.ieee.org/document/10820629 |
| WOSCitedRecordID | wos001451792300052&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| linkProvider | IEEE |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=SC24-W%3A+Workshops+of+the+International+Conference+for+High+Performance+Computing%2C+Networking%2C+Storage+and+Analysis&rft.atitle=Achieving+High-Performance+Fault-Tolerant+Routing+in+HyperX+Interconnection+Networks&rft.au=Camarero%2C+Cristobal&rft.au=Cano%2C+Alejandro&rft.au=Martinez%2C+Carmen&rft.au=Beivide%2C+Ramon&rft.date=2024-11-17&rft.pub=IEEE&rft.spage=472&rft.epage=483&rft_id=info:doi/10.1109%2FSCW63240.2024.00069&rft.externalDocID=10820629 |
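
The abstract mentions a deadlock avoidance mechanism based on an Up/Down escape subnetwork in a HyperX with failures. The record does not describe how SurePath implements it, so the following is only a minimal sketch of the generic Up/Down routing idea, under stated assumptions: a HyperX is modeled as switches that are fully connected along each dimension, faulty links are simply removed, levels come from a breadth-first search rooted at an arbitrarily chosen switch, and a route is legal if it never takes an "up" hop after a "down" hop. All function names (`hyperx_links`, `bfs_levels`, `is_up`, `updown_path`) and the fault set are hypothetical and are not taken from the paper.

```python
from collections import deque
from itertools import product


def hyperx_links(sides):
    """Return (switches, links) of a HyperX with the given side lengths.

    Two switches are linked when they differ in exactly one coordinate.
    """
    nodes = list(product(*(range(s) for s in sides)))
    links = set()
    for u in nodes:
        for dim, side in enumerate(sides):
            for value in range(side):
                if value != u[dim]:
                    v = u[:dim] + (value,) + u[dim + 1:]
                    links.add(frozenset((u, v)))
    return nodes, links


def bfs_levels(nodes, links, root):
    """Build adjacency lists and BFS distances from the root switch."""
    adj = {n: [] for n in nodes}
    for link in links:
        a, b = tuple(link)
        adj[a].append(b)
        adj[b].append(a)
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return adj, level


def is_up(u, v, level):
    """A hop u -> v is 'up' if v is closer to the root (ties broken by id)."""
    return (level[v], v) < (level[u], u)


def updown_path(adj, level, src, dst):
    """Shortest legal Up/Down path: some up hops, then only down hops."""
    if src not in level or dst not in level:
        return None  # switch is cut off from the root by failures
    start = (src, False)  # state: (switch, has a down hop been taken?)
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        u, gone_down = state
        if u == dst:
            path = []
            while state is not None:
                path.append(state[0])
                state = parent[state]
            return path[::-1]
        for v in adj[u]:
            up = is_up(u, v, level)
            if gone_down and up:
                continue  # down -> up turn is forbidden
            nxt = (v, gone_down or not up)
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


# Example: a 4x4 HyperX where switch (0, 0) has lost five of its six links.
nodes, links = hyperx_links((4, 4))
faults = {frozenset(((0, 0), v))
          for v in [(0, 1), (0, 2), (0, 3), (1, 0), (2, 0)]}
alive = links - faults
adj, level = bfs_levels(nodes, alive, root=(1, 1))
print(updown_path(adj, level, (0, 0), (2, 2)))
```

Because the levels are recomputed on the surviving links, every pair of switches that remains connected to the root still has a legal route (up the BFS tree to a common ancestor, then down), which is why an Up/Down subnetwork is a natural candidate for an escape path in a faulty network. How such escape routes are combined with the standard routing algorithms is specific to the paper and is not modeled here.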