SLearn: A Case for Task Sampling Based Learning for Cluster Job Scheduling

Detailed bibliography
Published in:	IEEE Transactions on Cloud Computing, Vol. 11, No. 3, pp. 2664-2680
Main authors:	Jajoo, Akshay; Hu, Y. Charlie; Lin, Xiaojun; Deng, Nan
Format:	Journal Article
Language:	English
Publication details:	Piscataway: IEEE, 01.07.2023; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects:	Art history; big data applications; Cloud computing; cluster scheduling; Clusters; Completion time; data-parallel applications; distributed applications; Distributed systems; History; Learning; online learning; Production; Resource management; Run time (computers); Runtime; Sampling; sampling-based learning; Schedules; scheduling and task partitioning; Task analysis; Task scheduling
ISSN:	2168-7161, 2372-0018

Description
Summary:	The ability to accurately estimate job runtime properties allows a scheduler to schedule jobs effectively. State-of-the-art online cluster job schedulers use history-based learning, which uses past job execution information to estimate the runtime properties of newly arrived jobs. However, with fast-paced development in cluster technology (in both hardware and software) and changing user inputs, job runtime properties can change over time, which leads to inaccurate predictions. In this article, we explore the potential and limitations of real-time learning of job runtime properties by proactively sampling and scheduling a small fraction of the tasks of each job. Such a task-sampling-based approach (learning in space) exploits the similarity among the runtime properties of the tasks of the same job and is inherently immune to changing job behavior. Our analytical and experimental analysis of 3 production traces with different skew and job distribution shows that learning in space can be substantially more accurate than history-based learning (learning in time). Our simulation and testbed evaluation on Azure of the two learning approaches, anchored in a generic job scheduler and using 3 production cluster job traces, shows that despite its online overhead, learning in space reduces the average Job Completion Time (JCT) by 1.28×, 1.56×, and 1.32× compared to the prior-art history-based predictor. We further analyze the experimental results to give intuitive explanations of why learning in space outperforms learning in time in these experiments. Finally, we show how sampling-based learning can be extended to schedule DAG jobs and achieve similar speedups over the prior-art history-based predictor.
DOI:	10.1109/TCC.2022.3222649
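
Illustration:	The contrast drawn in the abstract between history-based estimation and task sampling can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: the function names, the 10% sampling fraction, and the synthetic runtimes are all hypothetical, and the "runtime property" estimated here is simply the mean task runtime.

```python
import random
from statistics import mean


def history_estimate(past_job_means):
    """Learning in time: predict from the mean task runtimes of past jobs."""
    return mean(past_job_means)


def sample_estimate(task_runtimes, fraction=0.1, seed=0):
    """Learning in space: run a small fraction of the job's own tasks first
    and use their observed runtimes as the estimate.

    `fraction` and `seed` are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    k = max(1, int(len(task_runtimes) * fraction))
    pilot_tasks = rng.sample(task_runtimes, k)
    return mean(pilot_tasks)


# Synthetic data: past jobs averaged about 10 s per task, but the new job's
# behavior has changed and its tasks now take about 20 s each.
past_job_means = [9.0, 10.0, 11.0]
new_job_tasks = [20 + (i % 5) - 2 for i in range(100)]  # values in 18..22

true_mean = mean(new_job_tasks)
hist_error = abs(history_estimate(past_job_means) - true_mean)
samp_error = abs(sample_estimate(new_job_tasks) - true_mean)
print(f"history-based error: {hist_error:.2f} s")
print(f"sampling-based error: {samp_error:.2f} s")
```

In this toy setup the history-based estimate is stale because the job's behavior changed, while the sampled tasks reflect the current job. The sketch ignores the online overhead of running pilot tasks first, which the abstract notes as a cost of the sampling approach.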