Experiences in Managing High-performance Computing Management and Support Tools while Upgrading a Campus Cluster

The Triton Shared Computing Cluster (TSCC) [1] is the San Diego Supercomputer Center ("Center" in the remaining text)'s primary campus research computing system. This paper describes the transition from TSCC 1.0 to TSCC 2.0, focusing on the implementation of new high-performance compu...

Bibliographic details
Published in: SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 607-612
Main authors: Chen, Yuwu; Cooper, Trevor; Irving, Christopher; Tatineni, Mahidhar; Wolter, Nicole; Mishin, Dmitry; Sivagnanam, Subhashini
Format: Conference paper
Language: English
Publication details: IEEE, 17 November 2024
Subjects:
Abstract The Triton Shared Computing Cluster (TSCC) [1] is the primary campus research computing system of the San Diego Supercomputer Center (hereafter "the Center"). This paper describes the transition from TSCC 1.0 to TSCC 2.0, focusing on the implementation of new high-performance computing (HPC) infrastructure components and management strategies. We detail our approach to overcoming challenges posed by node heterogeneity, enhancing job scheduling efficiency, and improving the fairness of resource allocation and billing. We first describe the legacy TSCC 1.0, focusing on the critical issues we set out to solve in TSCC 2.0. We then describe the HPC tools deployed under TSCC 2.0. Lastly, we discuss best practices and lessons learned.
Author Chen, Yuwu
Sivagnanam, Subhashini
Irving, Christopher
Wolter, Nicole
Mishin, Dmitry
Tatineni, Mahidhar
Cooper, Trevor
Author_xml – sequence: 1
  givenname: Yuwu
  surname: Chen
  fullname: Chen, Yuwu
  email: ychen64@sdsc.edu
  organization: University of California San Diego,San Diego Supercomputer Center,San Diego,USA
– sequence: 2
  givenname: Trevor
  surname: Cooper
  fullname: Cooper, Trevor
  email: tcooper@sdsc.edu
  organization: University of California San Diego,San Diego Supercomputer Center,San Diego,USA
– sequence: 3
  givenname: Christopher
  surname: Irving
  fullname: Irving, Christopher
  email: cirving@sdsc.edu
  organization: University of California San Diego,San Diego Supercomputer Center,San Diego,USA
– sequence: 4
  givenname: Mahidhar
  surname: Tatineni
  fullname: Tatineni, Mahidhar
  email: mahidhar@sdsc.edu
  organization: University of California San Diego,San Diego Supercomputer Center,San Diego,USA
– sequence: 5
  givenname: Nicole
  surname: Wolter
  fullname: Wolter, Nicole
  email: nickel@sdsc.edu
  organization: University of California San Diego,San Diego Supercomputer Center,San Diego,USA
– sequence: 6
  givenname: Dmitry
  surname: Mishin
  fullname: Mishin, Dmitry
  email: dmishin@sdsc.edu
  organization: University of California San Diego,San Diego Supercomputer Center,San Diego,USA
– sequence: 7
  givenname: Subhashini
  surname: Sivagnanam
  fullname: Sivagnanam, Subhashini
  email: sivagnan@sdsc.edu
  organization: University of California San Diego,San Diego Supercomputer Center,San Diego,USA
CODEN IEEPAD
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/SCW63240.2024.00084
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 9798350355543
EndPage 612
ExternalDocumentID 10820654
Genre orig-research
GroupedDBID 6IE
6IL
ACM
ALMA_UNASSIGNED_HOLDINGS
CBEJK
RIE
RIL
ID FETCH-LOGICAL-a256t-22a8c4d071304ef3a44dab3b5e409982e9e10e9e66e09f54bf19e810218cc3103
IEDL.DBID RIE
ISICitedReferencesCount 0
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001451792300064&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
IngestDate Wed Aug 27 01:59:34 EDT 2025
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-a256t-22a8c4d071304ef3a44dab3b5e409982e9e10e9e66e09f54bf19e810218cc3103
PageCount 6
ParticipantIDs ieee_primary_10820654
PublicationCentury 2000
PublicationDate 2024-Nov.-17
PublicationDateYYYYMMDD 2024-11-17
PublicationDate_xml – month: 11
  year: 2024
  text: 2024-Nov.-17
  day: 17
PublicationDecade 2020
PublicationTitle SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis
PublicationTitleAbbrev SC-W
PublicationYear 2024
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssib060584085
Score 1.8898883
SourceID ieee
SourceType Publisher
StartPage 607
SubjectTerms Best practices
campus cluster
Conferences
Focusing
High performance computing
HPC
Processor scheduling
Resource management
Supercomputers
upgrade
User support
Title Experiences in Managing High-performance Computing Management and Support Tools while Upgrading a Campus Cluster
URI https://ieeexplore.ieee.org/document/10820654
WOSCitedRecordID wos001451792300064&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
link http://cvtisr.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwlV07T8MwELZoxcAEiCLe8sBqcFInjueKigFVlWihW-XHGSpVSdQ08PfxJS2ZGFgiKx4s-fy48933fYTcp1waCdIyFbxRJrxXzJihYs5Za4wX3DdA2rcXOZlki4Wa7sDqDRYGAJriM3jAZpPLd4Wt8aks7HBkG09Ej_SklC1Ya794ML2HbF07ZqGIq8fX0TuSkfMQBcbIkd1SmHYaKs0VMj7-5-AnZNCB8ej095o5JQeQn5Gyoyiu6Cqne7khinUbrOzQALSVbcCurtKF6txR1PMMvjedFcW6ot-f4Xig8_Jj0xTVU00xK1FXdLSukUthQObjp9nome3EE5gOXsyWxbHOrHAYhHIBfqiFcNoMTQIholNZDAoiHj5pClz5RBgfKchQ6DuzFsXHzkk_L3K4IFRy5Sx3qXFCC2MTY3VwVDIV7C8T79QlGeB0LcuWH2O5n6mrP_5fkyO0CCL6InlD-ttNDbfk0H5tV9XmrrHqDxGTpfo
linkProvider IEEE
linkToHtml http://cvtisr.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwlZ27T8MwEMYtKEgwAaKI8vTAanBS5-G5oiqiVJVooVvlJ1Sqkqhp4N_H57RkYmCJomSIZDvxXe6-34fQXUwTmZhEEe6iUcKs5UTKLidaKyWlZdR6Ie3bMBmN0tmMjzdida-FMcb45jNzD6e-lq9zVcGvMveGA208YrtoL2IsDGq51nb5QIEPeF0btlBA-cNr7x1w5NTlgSFQsmuIaeOi4jeR_tE_H3-M2o0cD49_N5oTtGOyU1Q0kOISLzK8NRzC0LlBikYPgGvjBrjV9LpgkWkMjp4u-saTPF-W-PvTfSDwtPhY-bZ6LDDUJaoS95YV0BTaaNp_nPQGZGOfQISLY9YkDEWqmIY0lDJju4IxLWRXRsbldDwNDTcBdYc4NpTbiEkbcJOC1XeqFNiPnaFWlmfmHOGEcq2ojqVmgkkVSSVcqJJytwKSyGreQW0YrnlREzLm25G6-OP6LToYTF6G8-HT6PkSHcLsgL4vSK5Qa72qzDXaV1_rRbm68TP8A-AZqUE
openUrl ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=SC24-W%3A+Workshops+of+the+International+Conference+for+High+Performance+Computing%2C+Networking%2C+Storage+and+Analysis&rft.atitle=Experiences+in+Managing+High-performance+Computing+Management+and+Support+Tools+while+Upgrading+a+Campus+Cluster&rft.au=Chen%2C+Yuwu&rft.au=Cooper%2C+Trevor&rft.au=Irving%2C+Christopher&rft.au=Tatineni%2C+Mahidhar&rft.date=2024-11-17&rft.pub=IEEE&rft.spage=607&rft.epage=612&rft_id=info:doi/10.1109%2FSCW63240.2024.00084&rft.externalDocID=10820654