A Hierarchical Deep Learning Approach for Predicting Job Queue Times in HPC Systems

Accurate wait-time prediction for HPC jobs contributes to a positive user experience but has historically been a challenging task. Previous models lack the accuracy needed for confident predictions, and many were developed before the rise of deep learning. In this work, we investigate and develop TRO...

Bibliographic Details
Published in: SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 621-628
Main Authors: Lovell, Austin; Wisniewski, Philip; Rodenbeck, Sarah; Ashish
Format: Conference Proceeding
Language: English
Published: IEEE 17.11.2024
Abstract Accurate wait-time prediction for HPC jobs contributes to a positive user experience but has historically been a challenging task. Previous models lack the accuracy needed for confident predictions, and many were developed before the rise of deep learning. In this work, we investigate and develop TROUT, a neural network-based model to accurately predict wait times for jobs submitted to the Anvil HPC cluster. Data was taken from the Slurm Workload Manager on the cluster and transformed before performing additional feature engineering from jobs' priorities, partitions, and states. We developed a hierarchical model that classifies job queue times into bins before applying regression, outperforming traditional methods. The model was then integrated into a CLI tool for queue time prediction. This study explores which queue time prediction methods are most applicable for modern HPC systems and shows that deep learning-based prediction models are viable solutions.
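The classify-then-regress idea described in the abstract can be illustrated with a minimal sketch. It is not TROUT itself: the paper's model is neural network-based, whereas this sketch swaps in a nearest-centroid classifier and per-bin least-squares lines so that it runs with only the Python standard library. The bin edges, the single "jobs ahead" feature, the class and function names, and the training data are all hypothetical and are not taken from the paper.

```python
from bisect import bisect_right
from statistics import mean

# Hypothetical bin edges in minutes (assumption, not from the paper):
# bin 0 is under 10 min, bin 1 is 10 to under 60 min, bin 2 is 60 min or more.
BIN_EDGES = [10.0, 60.0]


def bin_of(wait_minutes):
    """Map a wait time to its bin index using BIN_EDGES."""
    return bisect_right(BIN_EDGES, wait_minutes)


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; constant fit if x has no spread."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    if var == 0:
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return a, my - a * mx


class HierarchicalWaitModel:
    """Stage 1 picks a bin; stage 2 applies that bin's own regressor."""

    def fit(self, features, waits):
        # Group training jobs by the bin their observed wait falls into.
        groups = {}
        for x, w in zip(features, waits):
            groups.setdefault(bin_of(w), []).append((x, w))
        # Stage 1 stand-in: nearest-centroid classifier on one feature.
        self.centroids = {b: mean(x for x, _ in pts)
                          for b, pts in groups.items()}
        # Stage 2 stand-in: one linear regressor per bin.
        self.lines = {b: fit_line([x for x, _ in pts], [w for _, w in pts])
                      for b, pts in groups.items()}
        return self

    def predict_bin(self, x):
        """Return the bin whose feature centroid is closest to x."""
        return min(self.centroids, key=lambda b: abs(self.centroids[b] - x))

    def predict(self, x):
        """Return (predicted bin, predicted wait in minutes)."""
        b = self.predict_bin(x)
        a, c = self.lines[b]
        return b, a * x + c


# Synthetic data: feature is a hypothetical count of jobs ahead in the queue.
JOBS_AHEAD = [1.0, 2.0, 3.0, 10.0, 12.0, 14.0, 40.0, 50.0, 60.0]
WAIT_MIN = [2.0, 4.0, 6.0, 20.0, 24.0, 28.0, 120.0, 150.0, 180.0]

model = HierarchicalWaitModel().fit(JOBS_AHEAD, WAIT_MIN)
predicted_bin, predicted_wait = model.predict(11.0)
```

Splitting the problem this way lets each regressor specialize on a narrow range of wait times, so a few very long waits do not dominate the fit for short ones; the trade-off is that a wrong bin choice in stage 1 sends the job to the wrong regressor.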
Author Rodenbeck, Sarah
Ashish
Lovell, Austin
Wisniewski, Philip
Author_xml – sequence: 1
  givenname: Austin
  surname: Lovell
  fullname: Lovell, Austin
  email: lovella@purdue.edu
  organization: Purdue University, Department of Computer Science, Indiana, USA
– sequence: 2
  givenname: Philip
  surname: Wisniewski
  fullname: Wisniewski, Philip
  email: pwisnie@purdue.edu
  organization: Purdue University, Department of Computer Science, Indiana, USA
– sequence: 3
  givenname: Sarah
  surname: Rodenbeck
  fullname: Rodenbeck, Sarah
  email: srodenb@purdue.edu
  organization: Purdue University, Rosen Center for Advanced Computing, Indiana, USA
– sequence: 4
  surname: Ashish
  fullname: Ashish
  email: ashish@purdue.edu
  organization: Purdue University, Rosen Center for Advanced Computing, Indiana, USA
CODEN IEEPAD
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/SCW63240.2024.00086
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 9798350355543
EndPage 628
ExternalDocumentID 10820612
Genre orig-research
GrantInformation_xml – fundername: National Science Foundation
  funderid: 10.13039/100000001
GroupedDBID 6IE
6IL
ACM
ALMA_UNASSIGNED_HOLDINGS
CBEJK
RIE
RIL
ID FETCH-LOGICAL-a256t-1d12f1b103202189235b0f28df91ed6dd15fade39c0ea777e7632896ca1c4be03
IEDL.DBID RIE
ISICitedReferencesCount 0
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001451792300066&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
IngestDate Wed Aug 27 02:01:54 EDT 2025
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-a256t-1d12f1b103202189235b0f28df91ed6dd15fade39c0ea777e7632896ca1c4be03
PageCount 8
ParticipantIDs ieee_primary_10820612
PublicationCentury 2000
PublicationDate 2024-Nov.-17
PublicationDateYYYYMMDD 2024-11-17
PublicationDate_xml – month: 11
  year: 2024
  text: 2024-Nov.-17
  day: 17
PublicationDecade 2020
PublicationTitle SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis
PublicationTitleAbbrev SC-W
PublicationYear 2024
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssib060584085
Score 1.8893526
Snippet Accurate wait-time prediction for HPC jobs contributes to a positive user experience but has historically been a challenging task. Previous models lack the...
SourceID ieee
SourceType Publisher
StartPage 621
SubjectTerms Accuracy
computational efficiency
Computational modeling
Conferences
Deep learning
High performance computing
machine learning
neural networks
operations research
performance optimization
Predictive models
queue management
Queueing analysis
resource allocation
Resource management
User experience
Title A Hierarchical Deep Learning Approach for Predicting Job Queue Times in HPC Systems
URI https://ieeexplore.ieee.org/document/10820612
WOSCitedRecordID wos001451792300066&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
link http://cvtisr.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwlV09T8MwELWgYmACRBHf8sAasJPGH2NVqCKGKqggulX-uKBKKK1Kwu_H56bAwsBmWbIsnZ34ns_vPUJunHJKe5Em1kpIsLKUKDAhkWOVqFLnA8z10WxCTiZqNtNlR1aPXBgAiI_P4BabsZbvl67Fq7LwhaPaOHoK70opNmSt7ebB8h6qdXXKQpzpu-noFcXIWUCBKWpkMyRM__JQiUfI-OCfkx-S_g8Zj5bfx8wR2YH6mEyHtFggdzhambzTe4AV7bRS3-iwEwqnISMNY7EWg6-b6ePS0qcWWqCR-UEXNS3KEe1Uy_vkZfzwPCqSzh8hMSFRaRLueVpxy6MHOlchVcstq1LlK83BC-95XhkPmXYMjJQSwr8k4CvhDHcDCyw7Ib16WcMpoQOpDWPOQu4QceUqC7guFVqJzJiwimekjxGZrzYSGPNtMM7_6L8g-xh0JO1xeUl6zbqFK7LnPpvFx_o6LtwXmf6YPA
linkProvider IEEE
linkToHtml http://cvtisr.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwlV1NSwMxEA1SBT2pWPHbHLyuJrvdTXIs1bJqLZVW7K3kY1YKsi216-83k27ViwdvIRACmexmXibvPUKurLRSuSyOjBEQYWUpkqB9IseKrIit8zDXBbMJ0e_L8VgNarJ64MIAQHh8BtfYDLV8N7MVXpX5LxzVxtFTeDNteeCzomuttw8W-FCvq9YW4kzdDDuvKEfOPA6MUSWbIWX6l4tKOES6u_-cfo80f-h4dPB90OyTDSgPyLBN8ymyh4OZyTu9BZjTWi31jbZrqXDqc1I_Fqsx-L6ZPswMfa6gAhq4H3Ra0nzQobVueZO8dO9GnTyqHRIi7VOVZcQdjwtueHBB59Ina6lhRSxdoTi4zDmeFtpBoiwDLYQA_zfxCCuzmtuWAZYckkY5K-GI0JZQmjFrILWIuVKZeGQXZ0pmidY-jsekiSsyma9EMCbrxTj5o_-SbOejp96kd99_PCU7GACk8HFxRhrLRQXnZMt-Lqcfi4sQxC-4YZuD
openUrl ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=SC24-W%3A+Workshops+of+the+International+Conference+for+High+Performance+Computing%2C+Networking%2C+Storage+and+Analysis&rft.atitle=A+Hierarchical+Deep+Learning+Approach+for+Predicting+Job+Queue+Times+in+HPC+Systems&rft.au=Lovell%2C+Austin&rft.au=Wisniewski%2C+Philip&rft.au=Rodenbeck%2C+Sarah&rft.au=Ashish&rft.date=2024-11-17&rft.pub=IEEE&rft.spage=621&rft.epage=628&rft_id=info:doi/10.1109%2FSCW63240.2024.00086&rft.externalDocID=10820612