GPARS: Graph predictive algorithm for efficient resource scheduling in heterogeneous GPU clusters


Bibliographic details
Published in: Future Generation Computer Systems, Vol. 152, pp. 127-137
Main authors: Wang, Sheng; Chen, Shiping; Shi, Yumei
Format: Journal Article
Language: English
Published: Elsevier B.V., 1 March 2024
Subjects: Graph attention networks; Heterogeneous GPU clusters; Job duration; Resource scheduling; Resource utilization; Waiting time
ISSN: 0167-739X (print), 1872-7115 (electronic)
Abstract Efficient resource scheduling in heterogeneous graphics processing unit (GPU) clusters is critical for maximizing system performance and optimizing resource utilization. However, prior research on resource scheduling typically employed machine learning (ML) algorithms to estimate job durations or GPU utilization in the cluster from training progress and task speed. Regrettably, these studies often overlooked the performance variation among different GPU types within these clusters, as well as the spatiotemporal correlations among jobs. To address these limitations, this paper introduces the graph predictive algorithm for efficient resource scheduling (GPARS), designed specifically for heterogeneous clusters. GPARS leverages spatiotemporal correlations among jobs and utilizes graph attention networks (GANs) for precise job duration prediction. Building upon the prediction results, we develop a dynamic objective function to allocate suitable GPU types to newly submitted jobs. Through a comprehensive analysis of Alibaba's heterogeneous GPU cluster, we examine the impact of GPU capacity and type on job completion time (JCT) and resource utilization. Our evaluation, using real traces from Alibaba and Philly, substantiates the effectiveness of GPARS: it achieves a 10.29% reduction in waiting time and an average improvement of 7.47% in resource utilization compared with the original scheduling method. These findings underscore GPARS's superior performance in enhancing resource scheduling within heterogeneous GPU clusters.
Highlights:
• After analyzing Alibaba's GPU cluster, we study the impact of GPU capacity and type on job completion time and resource use. Our findings highlight overcrowding on weaker GPUs and load imbalance on high-end machines.
• Our GAN-based approach outperforms the baselines, achieving an RMSE of 0.0237 and an MAE of 0.0073, and our study confirms spatiotemporal correlations in job durations within the cluster.
• Leveraging these predictions, we introduce GPARS, an efficient scheduling approach for job-GPU allocation. Evaluated on Alibaba and Philly traces, GPARS substantially cuts waiting time by 10.29%.
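The record carries no code; as a rough, hedged illustration of the prediction component, the sketch below implements a minimal single-head graph attention layer and a duration-regression head in PyTorch (a library the paper cites). The job-node features, graph construction, layer sizes, and the Leaky ReLU slope are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch, NOT the authors' code: a single-head graph attention
# layer plus a regression head for normalized job-duration prediction.
# Feature set, graph construction, and layer sizes are assumed here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """One graph attention layer: attend over neighbouring job nodes."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # shared feature map
        self.a = nn.Linear(2 * out_dim, 1, bias=False)   # attention scorer

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim) job features; adj: (N, N) adjacency with self-loops.
        h = self.W(x)
        n = h.size(0)
        # Attention logits e_ij = LeakyReLU(a^T [h_i || h_j]).
        hi = h.unsqueeze(1).expand(n, n, -1)
        hj = h.unsqueeze(0).expand(n, n, -1)
        e = F.leaky_relu(self.a(torch.cat([hi, hj], dim=-1)).squeeze(-1), 0.2)
        e = e.masked_fill(adj == 0, float("-inf"))       # neighbours only
        alpha = torch.softmax(e, dim=1)                  # attention weights
        return alpha @ h                                 # aggregated embeddings

class DurationPredictor(nn.Module):
    """Two attention layers followed by a per-job duration regressor."""
    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        self.gat1 = GraphAttentionLayer(in_dim, hidden)
        self.gat2 = GraphAttentionLayer(hidden, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x, adj):
        h = F.leaky_relu(self.gat1(x, adj), 0.2)
        h = F.leaky_relu(self.gat2(h, adj), 0.2)
        return self.head(h).squeeze(-1)                  # normalized durations
```

Trained against logged, normalized durations with an MSE loss, a model of this shape is one plausible way to arrive at RMSE/MAE figures of the kind the highlights report; the actual GPARS architecture may differ.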
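The "dynamic objective function" for matching a new job to a GPU type is likewise described only at a high level. The sketch below is one hedged reading of it: score each candidate GPU type by the job's predicted duration on that type plus a load-balancing penalty, then pick the minimizer. The weight `beta` and the queued-load measure are invented for illustration and are not taken from the paper.

```python
# Hypothetical allocation sketch, not the paper's actual objective:
# pick the GPU type minimizing predicted duration plus a weighted
# load penalty, reflecting the abstract's observation of overcrowded
# weak GPUs and imbalanced high-end machines.
from typing import Dict

def allocate_gpu_type(
    predicted_duration: Dict[str, float],  # GPU type -> predicted duration
    queued_load: Dict[str, float],         # GPU type -> pending work
    beta: float = 0.5,                     # illustrative trade-off weight
) -> str:
    def score(gpu_type: str) -> float:
        return predicted_duration[gpu_type] + beta * queued_load[gpu_type]
    return min(predicted_duration, key=score)

# Example: the job runs fastest on a V100, but the long V100 queue
# tips the score toward a T4 with these illustrative numbers.
choice = allocate_gpu_type(
    predicted_duration={"V100": 1.0, "T4": 1.6},
    queued_load={"V100": 2.0, "T4": 0.2},
)
print(choice)  # -> "T4"
```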
Authors
– Wang, Sheng (Business School, University of Shanghai for Science and Technology, Shanghai 200082, PR China)
– Chen, Shiping (chensp@usst.edu.cn; Business School, University of Shanghai for Science and Technology, Shanghai 200082, PR China)
– Shi, Yumei (School of Mathematics and Finance, Chuzhou University, Chuzhou 239000, Anhui, PR China)
CitedBy (Crossref DOIs): 10.3390/app14114697; 10.3390/electronics14051048; 10.1007/s10723-025-09809-2; 10.1016/j.future.2025.107883; 10.1109/TPDS.2025.3577796; 10.1016/j.engappai.2025.110337
Copyright 2023 Elsevier B.V.
DOI 10.1016/j.future.2023.10.022
Discipline Computer Science
EISSN 1872-7115
EndPage 137
ISICitedReferencesCount 7
ISSN 0167-739X
IsPeerReviewed true
IsScholarly true
Keywords Graph attention networks
Resource scheduling
Resource utilization
Heterogeneous GPU clusters
Job duration
Waiting time
Language English
PageCount 11
PublicationDate March 2024
PublicationTitle Future generation computer systems
PublicationYear 2024
Publisher Elsevier B.V
StartPage 127
Title GPARS: Graph predictive algorithm for efficient resource scheduling in heterogeneous GPU clusters
URI https://dx.doi.org/10.1016/j.future.2023.10.022
Volume 152