Experiences in Managing High-performance Computing Management and Support Tools while Upgrading a Campus Cluster

Bibliographic Details
Published in: SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 607-612
Main Authors: Chen, Yuwu; Cooper, Trevor; Irving, Christopher; Tatineni, Mahidhar; Wolter, Nicole; Mishin, Dmitry; Sivagnanam, Subhashini
Format: Conference paper
Language: English
Published: IEEE, 17 November 2024
Description
Summary: The Triton Shared Computing Cluster (TSCC) [1] is the primary campus research computing system of the San Diego Supercomputer Center ("Center" in the remaining text). This paper describes the transition from TSCC 1.0 to TSCC 2.0, focusing on the implementation of new high-performance computing (HPC) infrastructure components and management strategies. We detail our approach to overcoming challenges posed by node heterogeneity, enhancing job scheduling efficiency, and improving resource allocation and billing fairness. The legacy TSCC 1.0 is described first, focusing on some critical issues we want to solve under TSCC 2.0. The HPC tools under TSCC 2.0 are then described. Lastly, the best practices and experiences learned are discussed.
DOI: 10.1109/SCW63240.2024.00084