Spread-n-Share: Improving Application Performance and Cluster Throughput with Resource-aware Job Placement
| Published in: | SC19: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15 |
|---|---|
| Main authors: | |
| Format: | Conference paper |
| Language: | English |
| Publication details: | ACM, 17 November 2019 |
| Subject: | |
| ISSN: | 2167-4337 |
| Summary: | Traditional batch job schedulers adopt the Compact-n-Exclusive (CE) strategy, packing processes of a parallel job into as few compute nodes as possible. While CE minimizes inter-node network communication, it often brings self-contention among tasks of a resource-intensive application. Recent studies have used virtual containers to balance CPU utilization and memory capacity across physical nodes, but the imbalance in cache and memory bandwidth usage is still under-investigated. In this work, we propose Spread-n-Share (SNS): a new batch scheduling strategy that automatically scales resource-bound applications out onto more nodes to alleviate their performance bottleneck, and co-locates jobs in a resource-compatible manner. We implement Uberun, a prototype scheduler to validate SNS, considering shared-cache capacity and memory bandwidth as two types of performance-critical shared resources. Experimental results using 12 diverse cluster workloads show that SNS improves the overall system throughput by 19.8% on average over CE, while achieving an average individual job speedup of 1.8%. |
|---|---|
| DOI: | 10.1145/3295500.3356152 |
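
The contrast between compact placement and the spread-and-share idea described in the summary can be illustrated with a minimal Python sketch. This is not Uberun's algorithm: the `Node` and `Job` types, the per-process bandwidth figures, and the helper names are all hypothetical, and only memory bandwidth is modeled (shared-cache capacity is left out). It shows how a bandwidth-bound job might be given more nodes than a compact placement would use, and how a simple compatibility check might decide whether two jobs can share a node.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Node:
    cores: int
    mem_bw_gbs: float  # hypothetical per-node memory bandwidth capacity (GB/s)


@dataclass(frozen=True)
class Job:
    name: str
    procs: int
    bw_per_proc_gbs: float  # hypothetical per-process bandwidth demand (GB/s)


def compact_nodes(job: Job, node: Node) -> int:
    """Compact placement: fill every core, using as few nodes as possible."""
    return ceil(job.procs / node.cores)


def spread_nodes(job: Job, node: Node) -> int:
    """Spread placement: cap processes per node so that their combined
    bandwidth demand stays within the node's capacity."""
    if job.bw_per_proc_gbs > 0:
        by_bw = int(node.mem_bw_gbs // job.bw_per_proc_gbs)
    else:
        by_bw = node.cores
    per_node = max(1, min(node.cores, by_bw))
    return ceil(job.procs / per_node)


def can_share(a: Job, procs_a: int, b: Job, procs_b: int, node: Node) -> bool:
    """Two jobs are treated as compatible on one node if both the core
    count and the summed bandwidth demand fit."""
    cores_ok = procs_a + procs_b <= node.cores
    bw_needed = procs_a * a.bw_per_proc_gbs + procs_b * b.bw_per_proc_gbs
    return cores_ok and bw_needed <= node.mem_bw_gbs


node = Node(cores=16, mem_bw_gbs=100.0)
stream = Job("stream", procs=16, bw_per_proc_gbs=10.0)   # bandwidth-bound
compute = Job("compute", procs=16, bw_per_proc_gbs=1.0)  # compute-bound
```

With these made-up numbers, the bandwidth-bound job is spread over two nodes instead of one, and the cores it leaves idle on each node can be shared with the compute-bound job without exceeding the bandwidth budget.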