GARLSched: Generative adversarial deep reinforcement learning task scheduling optimization for large-scale high performance computing systems

Bibliographic Details
Published in:	Future generation computer systems Vol. 135; pp. 259 - 269
Main Authors:	Li, Jingbo, Zhang, Xingjun, Wei, Jia, Ji, Zeyu, Wei, Zheng
Format:	Journal Article
Language:	English
Published:	Elsevier B.V. 01.10.2022
Subjects:	Deep reinforcement learning; Distributed systems; Expert guidance; High performance computing; Task scheduling
ISSN:	0167-739X, 1872-7115

Description
Summary:	Efficient task scheduling has become increasingly complex as the number and variety of tasks proliferate and computing resources grow in large-scale distributed high-performance computing (HPC) systems. Deep reinforcement learning (DRL) methods have achieved some success in scheduling problems. However, due to the exogeneity of the tasks and the sparsity of the reward, learning a DRL control policy requires a significant amount of training time and data and cannot guarantee effective convergence. Meanwhile, based on their understanding of HPC system characteristics, experts have developed various scheduling policies with acceptable performance for different optimization goals. However, these heuristic methods can neither adapt to environmental changes nor optimize for specific workloads. Therefore, the generative adversarial reinforcement learning scheduling (GARLSched) algorithm is proposed to effectively guide the learning of DRL in large-scale dynamic scheduling problems based on the optimal policy in the expert pool. In addition, the task embedding-based discriminator network effectively improves and stabilizes the learning process. Experiments show that, compared with heuristic and DRL scheduling algorithms, GARLSched can learn high-quality scheduling policies for various workloads and optimization objectives. Furthermore, the learned models perform stably even when applied to unseen workloads, making them more practical in HPC systems.
• An expert guidance algorithm based on a generative adversarial method is proposed.
• The ability of policy learning is improved for large-scale dynamic task scheduling.
• The expert pool is constructed from multiple existing policies to select the best expert's action.
• A task embedding-based discriminator is designed to achieve better classification ability.
• Experiments are conducted on real HPC system workloads.
• GARLSched can learn high-quality policies for various workloads and objectives.
• The learned models perform stably even when applied to unseen workloads.
DOI:	10.1016/j.future.2022.04.032