Achieving High-Performance Fault-Tolerant Routing in HyperX Interconnection Networks

Interconnection networks are key actors that condition the performance of current large datacenter and supercomputer systems. Both topology and routing are critical aspects that must be carefully considered for a competitive system network design. Moreover, when daily failures are expected, this tan...

Celý popis

Uloženo v:
Podrobná bibliografie
Vydáno v:SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis s. 472 - 483
Hlavní autoři: Camarero, Cristobal, Cano, Alejandro, Martinez, Carmen, Beivide, Ramon
Médium: Konferenční příspěvek
Jazyk:angličtina
Vydáno: IEEE 17.11.2024
Témata:
On-line přístup:Získat plný text
Tagy: Přidat tag
Žádné tagy, Buďte první, kdo vytvoří štítek k tomuto záznamu!
Popis
Shrnutí:Interconnection networks are key actors that condition the performance of current large datacenter and supercomputer systems. Both topology and routing are critical aspects that must be carefully considered for a competitive system network design. Moreover, when daily failures are expected, this tandem should exhibit resilience and robustness. Low-diameter direct networks, including HyperX, are cheaper than typical Fat Trees. But, to be really competitive, they have to employ evolved routing algorithms to both balance traffic and tolerate failures.In this paper, SurePath, an efficient fault-tolerant routing mechanism for HyperX topologies is introduced and evaluated. SurePath leverages routes provided by standard routing algorithms and uses a deadlock avoidance mechanism based on an Up/Down escape subnetwork. This scheme not only prevents deadlock but also allows for a fault-tolerant solution for these networks. SurePath is thoroughly evaluated in the paper under different traffic patterns, showing no performance degradation under extremely faulty scenarios.
DOI:10.1109/SCW63240.2024.00069