A Hierarchical Deep Learning Approach for Predicting Job Queue Times in HPC Systems
Accurate wait-time prediction for HPC jobs contributes to a positive user experience but has historically been a challenging task. Previous models lack the accuracy needed for confident predictions, and many were developed before the rise of deep learning. In this work, we investigate and develop TRO...
| Published in: | SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis pp. 621 - 628 |
|---|---|
| Main Authors: | Lovell, Austin; Wisniewski, Philip; Rodenbeck, Sarah; Ashish |
| Format: | Conference Proceeding |
| Language: | English |
| Published: | IEEE, 17.11.2024 |
| Subjects: | |
| Abstract | Accurate wait-time prediction for HPC jobs contributes to a positive user experience but has historically been a challenging task. Previous models lack the accuracy needed for confident predictions, and many were developed before the rise of deep learning. In this work, we investigate and develop TROUT, a neural network-based model to accurately predict wait times for jobs submitted to the Anvil HPC cluster. Data was taken from the Slurm Workload Manager on the cluster and transformed before performing additional feature engineering from jobs' priorities, partitions, and states. We developed a hierarchical model that classifies job queue times into bins before applying regression, outperforming traditional methods. The model was then integrated into a CLI tool for queue time prediction. This study explores which queue time prediction methods are most applicable for modern HPC systems and shows that deep learning-based prediction models are viable solutions. |
|---|---|
| Author | Rodenbeck, Sarah; Ashish; Lovell, Austin; Wisniewski, Philip |
| Author_xml | – sequence: 1 givenname: Austin surname: Lovell fullname: Lovell, Austin email: lovella@purdue.edu organization: Purdue University,Department of Computer Science,Indiana,USA – sequence: 2 givenname: Philip surname: Wisniewski fullname: Wisniewski, Philip email: pwisnie@purdue.edu organization: Purdue University,Department of Computer Science,Indiana,USA – sequence: 3 givenname: Sarah surname: Rodenbeck fullname: Rodenbeck, Sarah email: srodenb@purdue.edu organization: Purdue University,Rosen Center for Advanced Computing,Indiana,USA – sequence: 4 surname: Ashish fullname: Ashish email: ashish@purdue.edu organization: Purdue University,Rosen Center for Advanced Computing,Indiana,USA |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DBID | 6IE 6IL CBEJK RIE RIL |
| DOI | 10.1109/SCW63240.2024.00086 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Xplore POP ALL; IEEE Xplore All Conference Proceedings; IEEE Electronic Library (IEL); IEEE Proceedings Order Plans (POP All) 1998-Present |
| DatabaseTitleList | |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| EISBN | 9798350355543 |
| EndPage | 628 |
| ExternalDocumentID | 10820612 |
| Genre | orig-research |
| GrantInformation_xml | – fundername: National Science Foundation funderid: 10.13039/100000001 |
| GroupedDBID | 6IE 6IL ACM ALMA_UNASSIGNED_HOLDINGS CBEJK RIE RIL |
| ID | FETCH-LOGICAL-a256t-1d12f1b103202189235b0f28df91ed6dd15fade39c0ea777e7632896ca1c4be03 |
| IEDL.DBID | RIE |
| ISICitedReferencesCount | 0 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001451792300066&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| IngestDate | Wed Aug 27 02:01:54 EDT 2025 |
| IsPeerReviewed | false |
| IsScholarly | false |
| Language | English |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-a256t-1d12f1b103202189235b0f28df91ed6dd15fade39c0ea777e7632896ca1c4be03 |
| PageCount | 8 |
| ParticipantIDs | ieee_primary_10820612 |
| PublicationCentury | 2000 |
| PublicationDate | 2024-Nov.-17 |
| PublicationDateYYYYMMDD | 2024-11-17 |
| PublicationDate_xml | – month: 11 year: 2024 text: 2024-Nov.-17 day: 17 |
| PublicationDecade | 2020 |
| PublicationTitle | SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis |
| PublicationTitleAbbrev | SC-W |
| PublicationYear | 2024 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| SSID | ssib060584085 |
| Score | 1.8893526 |
| Snippet | Accurate wait-time prediction for HPC jobs contributes to a positive user experience but has historically been a challenging task. Previous models lack the... |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 621 |
| SubjectTerms | Accuracy; computational efficiency; Computational modeling; Conferences; Deep learning; High performance computing; machine learning; neural networks; operations research; performance optimization; Predictive models; queue management; Queueing analysis; resource allocation; Resource management; User experience |
| Title | A Hierarchical Deep Learning Approach for Predicting Job Queue Times in HPC Systems |
| URI | https://ieeexplore.ieee.org/document/10820612 |
| WOSCitedRecordID | wos001451792300066&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| linkProvider | IEEE |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=SC24-W%3A+Workshops+of+the+International+Conference+for+High+Performance+Computing%2C+Networking%2C+Storage+and+Analysis&rft.atitle=A+Hierarchical+Deep+Learning+Approach+for+Predicting+Job+Queue+Times+in+HPC+Systems&rft.au=Lovell%2C+Austin&rft.au=Wisniewski%2C+Philip&rft.au=Rodenbeck%2C+Sarah&rft.au=Ashish&rft.date=2024-11-17&rft.pub=IEEE&rft.spage=621&rft.epage=628&rft_id=info:doi/10.1109%2FSCW63240.2024.00086&rft.externalDocID=10820612 |
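
The abstract describes a hierarchical model that first classifies a job's queue time into a bin and then applies regression. The record gives no architecture details, so the following is only a minimal sketch of that two-stage idea, not the TROUT implementation. The bin edges, the two-element feature layout, and the names `BIN_EDGES`, `bin_index`, `fit_line`, and `HierarchicalWaitPredictor` are all hypothetical. To stay dependency-free, a nearest-centroid classifier and a per-bin least-squares line stand in for the neural networks the paper reports using.

```python
from statistics import mean

# Hypothetical bin edges in minutes; the paper's actual bins are not given here.
BIN_EDGES = [10.0, 60.0]  # bins: [0, 10), [10, 60), [60, inf)


def bin_index(wait_minutes, edges=BIN_EDGES):
    """Return the index of the bin that a wait time falls into."""
    for i, edge in enumerate(edges):
        if wait_minutes < edge:
            return i
    return len(edges)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    if var == 0:  # all x equal: fall back to predicting the mean
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return slope, my - slope * mx


class HierarchicalWaitPredictor:
    """Stage 1 picks a bin; stage 2 applies that bin's regressor."""

    def fit(self, features, waits):
        # Group training rows by the bin of their observed wait time.
        groups = {}
        for x, y in zip(features, waits):
            groups.setdefault(bin_index(y), []).append((x, y))
        self.centroids = {}
        self.lines = {}
        for b, rows in groups.items():
            xs = [x for x, _ in rows]
            ys = [y for _, y in rows]
            # Stage 1 stand-in: one feature centroid per bin.
            self.centroids[b] = [mean(col) for col in zip(*xs)]
            # Stage 2 stand-in: per-bin line on the first feature only.
            self.lines[b] = fit_line([x[0] for x in xs], ys)
        return self

    def predict(self, x):
        # Classify: choose the bin whose centroid is closest to x.
        def sq_dist(b):
            return sum((a - c) ** 2 for a, c in zip(x, self.centroids[b]))

        b = min(self.centroids, key=sq_dist)
        # Regress: apply the regressor trained only on that bin's jobs.
        slope, intercept = self.lines[b]
        return b, slope * x[0] + intercept
```

Calling `HierarchicalWaitPredictor().fit(features, waits).predict(x)` returns the predicted bin index together with a wait estimate from that bin's regressor. Training a separate regressor per bin lets each one specialize on a narrow range of wait times, which is one plausible reason a classify-then-regress design could outperform a single regressor over the whole range.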