A Hierarchical Deep Learning Approach for Predicting Job Queue Times in HPC Systems

Accurate wait-time prediction for HPC jobs contributes to a positive user experience but has historically been a challenging task. Previous models lack the accuracy needed for confident predictions, and many were developed before the rise of deep learning.In this work, we investigate and develop TRO...

Celý popis

Uložené v:
Podrobná bibliografia
Vydané v:SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis s. 621 - 628
Hlavní autori: Lovell, Austin, Wisniewski, Philip, Rodenbeck, Sarah, Ashish
Médium: Konferenčný príspevok..
Jazyk:English
Vydavateľské údaje: IEEE 17.11.2024
Predmet:
On-line prístup:Získať plný text
Tagy: Pridať tag
Žiadne tagy, Buďte prvý, kto otaguje tento záznam!
Popis
Shrnutí:Accurate wait-time prediction for HPC jobs contributes to a positive user experience but has historically been a challenging task. Previous models lack the accuracy needed for confident predictions, and many were developed before the rise of deep learning.In this work, we investigate and develop TROUT, a neural network-based model to accurately predict wait times for jobs submitted to the Anvil HPC cluster. Data was taken from the Slurm Workload Manager on the cluster and transformed before performing additional feature engineering from jobs' priorities, partitions, and states. We developed a hierarchical model that classifies job queue times into bins before applying regression, outperforming traditional methods. The model was then integrated into a CLI tool for queue time prediction. This study explores which queue time prediction methods are most applicable for modern HPC systems and shows that deep learning-based prediction models are viable solutions.
DOI:10.1109/SCW63240.2024.00086