GPARS: Graph predictive algorithm for efficient resource scheduling in heterogeneous GPU clusters


Bibliographic details
Published in: Future Generation Computer Systems, Vol. 152, pp. 127-137
Main authors: Wang, Sheng; Chen, Shiping; Shi, Yumei
Format: Journal Article
Language: English
Published: Elsevier B.V., 1 March 2024
Subjects: Graph attention networks; Heterogeneous GPU clusters; Job duration; Resource scheduling; Resource utilization; Waiting time
ISSN: 0167-739X (print), 1872-7115 (electronic)
Abstract Efficient resource scheduling in heterogeneous graphics processing unit (GPU) clusters is critical for maximizing system performance and optimizing resource utilization. However, prior research on resource scheduling typically employed machine learning (ML) algorithms to estimate job durations or GPU utilization in the cluster from training progress and task speed. Regrettably, these studies often overlooked the performance variation among different GPU types within these clusters, as well as the spatiotemporal correlations among jobs. To address these limitations, this paper introduces the graph predictive algorithm for efficient resource scheduling (GPARS), designed specifically for heterogeneous clusters. GPARS leverages spatiotemporal correlations among jobs and utilizes graph attention networks (GANs) for precise job duration prediction. Building upon the prediction results, we develop a dynamic objective function to allocate suitable GPU types to newly submitted jobs. Through a comprehensive analysis of Alibaba's heterogeneous GPU cluster, we examine the impact of GPU capacity and type on job completion time (JCT) and resource utilization. Our evaluation, using real traces from Alibaba and Philly, substantiates the effectiveness of GPARS: it achieves a 10.29% reduction in waiting time and an average improvement of 7.47% in resource utilization compared with the original scheduling method. These findings underscore GPARS's superior performance in enhancing resource scheduling within heterogeneous GPU clusters.
Highlights:
• After analyzing Alibaba's GPU cluster, we study the impact of GPU capacity and type on job completion time and resource use. Our findings highlight overcrowding on weaker GPUs and load imbalance on high-end machines.
• Our GAN-based approach outperforms the baselines, achieving an RMSE of 0.0237 and an MAE of 0.0073, and our study confirms spatiotemporal correlations in job durations within the cluster.
• Leveraging these predictions, we introduce GPARS, an efficient scheduling approach for job-GPU allocation. Evaluated on Alibaba and Philly traces, GPARS substantially cuts waiting time by 10.29%.
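The record carries no code; as a rough, hedged illustration of the prediction component, the sketch below implements a minimal single-head graph attention layer and a duration-regression head in PyTorch (a library the paper cites). The job-node features, graph construction, layer sizes, and the Leaky ReLU slope are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch, NOT the authors' code: a single-head graph attention
# layer plus a regression head for normalized job-duration prediction.
# Feature set, graph construction, and layer sizes are assumed here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """One graph attention layer: attend over neighbouring job nodes."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # shared feature map
        self.a = nn.Linear(2 * out_dim, 1, bias=False)   # attention scorer

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim) job features; adj: (N, N) adjacency with self-loops.
        h = self.W(x)
        n = h.size(0)
        # Attention logits e_ij = LeakyReLU(a^T [h_i || h_j]).
        hi = h.unsqueeze(1).expand(n, n, -1)
        hj = h.unsqueeze(0).expand(n, n, -1)
        e = F.leaky_relu(self.a(torch.cat([hi, hj], dim=-1)).squeeze(-1), 0.2)
        e = e.masked_fill(adj == 0, float("-inf"))       # neighbours only
        alpha = torch.softmax(e, dim=1)                  # attention weights
        return alpha @ h                                 # aggregated embeddings

class DurationPredictor(nn.Module):
    """Two attention layers followed by a per-job duration regressor."""
    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        self.gat1 = GraphAttentionLayer(in_dim, hidden)
        self.gat2 = GraphAttentionLayer(hidden, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x, adj):
        h = F.leaky_relu(self.gat1(x, adj), 0.2)
        h = F.leaky_relu(self.gat2(h, adj), 0.2)
        return self.head(h).squeeze(-1)                  # normalized durations
```

Trained against logged, normalized durations with an MSE loss, a model of this shape is one plausible way to arrive at RMSE/MAE figures of the kind the highlights report; the actual GPARS architecture may differ.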
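The "dynamic objective function" for matching a new job to a GPU type is likewise described only at a high level. The sketch below is one hedged reading of it: score each candidate GPU type by the job's predicted duration on that type plus a load-balancing penalty, then pick the minimizer. The weight `beta` and the queued-load measure are invented for illustration and are not taken from the paper.

```python
# Hypothetical allocation sketch, not the paper's actual objective:
# pick the GPU type minimizing predicted duration plus a weighted
# load penalty, reflecting the abstract's observation of overcrowded
# weak GPUs and imbalanced high-end machines.
from typing import Dict

def allocate_gpu_type(
    predicted_duration: Dict[str, float],  # GPU type -> predicted duration
    queued_load: Dict[str, float],         # GPU type -> pending work
    beta: float = 0.5,                     # illustrative trade-off weight
) -> str:
    def score(gpu_type: str) -> float:
        return predicted_duration[gpu_type] + beta * queued_load[gpu_type]
    return min(predicted_duration, key=score)

# Example: the job runs fastest on a V100, but the long V100 queue
# tips the score toward a T4 with these illustrative numbers.
choice = allocate_gpu_type(
    predicted_duration={"V100": 1.0, "T4": 1.6},
    queued_load={"V100": 2.0, "T4": 0.2},
)
print(choice)  # -> "T4"
```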
Authors
– Wang, Sheng (Business School, University of Shanghai for Science and Technology, Shanghai 200082, PR China)
– Chen, Shiping (chensp@usst.edu.cn; Business School, University of Shanghai for Science and Technology, Shanghai 200082, PR China)
– Shi, Yumei (School of Mathematics and Finance, Chuzhou University, Chuzhou 239000, Anhui, PR China)
CitedBy (Crossref DOIs): 10.3390/app14114697; 10.3390/electronics14051048; 10.1007/s10723-025-09809-2; 10.1016/j.future.2025.107883; 10.1109/TPDS.2025.3577796; 10.1016/j.engappai.2025.110337
Copyright 2023 Elsevier B.V.
DOI 10.1016/j.future.2023.10.022
Discipline Computer Science
EISSN 1872-7115
EndPage 137
ISICitedReferencesCount 7
ISSN 0167-739X
IsPeerReviewed true
IsScholarly true
Keywords Graph attention networks
Resource scheduling
Resource utilization
Heterogeneous GPU clusters
Job duration
Waiting time
Language English
PageCount 11
PublicationDate March 2024
PublicationTitle Future generation computer systems
PublicationYear 2024
Publisher Elsevier B.V
StartPage 127
Title GPARS: Graph predictive algorithm for efficient resource scheduling in heterogeneous GPU clusters
URI https://dx.doi.org/10.1016/j.future.2023.10.022
Volume 152