Experiences in Managing High-performance Computing Management and Support Tools while Upgrading a Campus Cluster

Bibliographic details
Published in: SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 607-612
Main authors: Chen, Yuwu; Cooper, Trevor; Irving, Christopher; Tatineni, Mahidhar; Wolter, Nicole; Mishin, Dmitry; Sivagnanam, Subhashini
Format: Conference paper
Language: English
Published: IEEE, 17 November 2024
Online access: Full text
Description
Summary: The Triton Shared Computing Cluster (TSCC) [1] is the primary campus research computing system of the San Diego Supercomputer Center ("the Center" in the remaining text). This paper describes the transition from TSCC 1.0 to TSCC 2.0, focusing on the implementation of new high-performance computing (HPC) infrastructure components and management strategies. We detail our approach to overcoming challenges posed by node heterogeneity, enhancing job scheduling efficiency, and improving resource allocation and billing fairness. The legacy TSCC 1.0 is described first, with a focus on the critical issues we want to solve under TSCC 2.0. The HPC tools used under TSCC 2.0 are then described. Lastly, best practices and lessons learned are discussed.
DOI: 10.1109/SCW63240.2024.00084