Spread-n-Share: Improving Application Performance and Cluster Throughput with Resource-aware Job Placement

Bibliographic details
Published in: SC19: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15
Main authors: Tang, Xiongchao; Wang, Haojie; Ma, Xiaosong; El-Sayed, Nosayba; Zhai, Jidong; Chen, Wenguang; Aboulnaga, Ashraf
Format: Conference paper
Language: English
Publication details: ACM, 17 November 2019
ISSN: 2167-4337
Description
Summary: Traditional batch job schedulers adopt the Compact-n-Exclusive (CE) strategy, packing processes of a parallel job into as few compute nodes as possible. While CE minimizes inter-node network communication, it often causes self-contention among tasks of a resource-intensive application. Recent studies have used virtual containers to balance CPU utilization and memory capacity across physical nodes, but the imbalance in cache and memory bandwidth usage is still under-investigated. In this work, we propose Spread-n-Share (SNS): a new batch scheduling strategy that automatically scales resource-bound applications out onto more nodes to alleviate their performance bottleneck, and co-locates jobs in a resource-compatible manner. We implement Uberun, a prototype scheduler to validate SNS, considering shared-cache capacity and memory bandwidth as two types of performance-critical shared resources. Experimental results using 12 diverse cluster workloads show that SNS improves the overall system throughput by 19.8% on average over CE, while achieving an average individual job speedup of 1.8%.
DOI: 10.1145/3295500.3356152
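
Illustration: the summary describes spreading a resource-bound job over more nodes and co-locating compatible jobs on the nodes it leaves partly free. The following is a minimal Python sketch of that idea only, not Uberun's actual algorithm, which this record does not describe. The `Node` and `Job` classes, the `place` function, the greedy node order, and all capacity figures are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Free resources on one node (hypothetical units).
    cores: int
    cache_mb: float
    mem_bw_gbs: float


@dataclass
class Job:
    # Per-process demands; assumed to be known in advance.
    name: str
    procs: int
    cache_mb_per_proc: float
    bw_gbs_per_proc: float


def max_procs_on(node: Node, job: Job) -> int:
    """Largest number of the job's processes that fit in the node's
    free cores, shared-cache capacity, and memory bandwidth."""
    limits = [node.cores]
    if job.cache_mb_per_proc > 0:
        limits.append(int(node.cache_mb // job.cache_mb_per_proc))
    if job.bw_gbs_per_proc > 0:
        limits.append(int(node.mem_bw_gbs // job.bw_gbs_per_proc))
    return max(0, min(limits))


def place(job: Job, nodes: list[Node]):
    """Greedy placement: give each node, in order, as many processes as
    its free resources allow. Returns {node_index: procs}, or None
    (leaving the nodes unchanged) if the job does not fit."""
    plan = {}
    remaining = job.procs
    for i, node in enumerate(nodes):
        if remaining == 0:
            break
        n = min(remaining, max_procs_on(node, job))
        if n > 0:
            plan[i] = n
            remaining -= n
    if remaining > 0:
        return None
    # Commit the plan by subtracting the job's demand from each node.
    for i, n in plan.items():
        nodes[i].cores -= n
        nodes[i].cache_mb -= n * job.cache_mb_per_proc
        nodes[i].mem_bw_gbs -= n * job.bw_gbs_per_proc
    return plan


# Two identical nodes: 16 cores, 32 MB shared cache, 100 GB/s bandwidth.
nodes = [Node(16, 32.0, 100.0), Node(16, 32.0, 100.0)]
# A bandwidth-bound job and a compute-bound job, 16 processes each.
job_a = Job("bandwidth-bound", 16, 1.0, 12.5)
job_b = Job("compute-bound", 16, 0.5, 0.0)
plan_a = place(job_a, nodes)
plan_b = place(job_b, nodes)
print(plan_a, plan_b)
```

In this toy setup, a compact and exclusive placement would put all 16 processes of the bandwidth-bound job on one node, asking for 200 GB/s from a node that offers 100 GB/s. The sketch instead caps that job at 8 processes per node and lets the compute-bound job fill the remaining cores on both nodes.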