Reducing Fragmentation on 3D Torus-Based HPC Systems Using Packing-Based Job Scheduling and Job Placement Reconfiguration


Bibliographic Details
Published in: 2017 16th International Symposium on Parallel and Distributed Computing (ISPDC), pp. 34-43
Main authors: Li, Kangkang; Malawski, Maciej; Nabrzyski, Jarek
Format: Conference Proceeding
Language: English
Published: IEEE, 1 July 2017
Description
Summary: We address the topology-aware job scheduling and placement problems on 3D torus-based high-performance computing systems, with the objective of reducing system fragmentation. In our previous work, we proposed a job placement algorithm based on a local migration process, which aims at reducing the internal fragmentation caused by using a convex prism shape for job allocation. However, HPC systems are prone to external fragmentation as well. Hence, in this paper, we mainly strive to reduce the external fragmentation introduced by the job scheduling and placement processes. First, from the job scheduling perspective, we propose a packing-based job scheduling strategy, which reduces external fragmentation by using the First Come First Served + backfilling strategy. Second, we review the migration-based job placement algorithm from our previous work. Third, to reduce the external fragmentation resulting from running jobs scattered across the system, we propose a job placement reconfiguration algorithm, which uses a global migration process to rearrange the placement of the running jobs across the system. In the off-line scenario, both local and global migration are emulated virtual processes with no migration overhead. In the on-line scenario, however, migration is a real process and incurs a migration delay. Therefore, we propose a buffer-based on-line scheduling model, which helps to avoid the delay of local migration. The evaluation results validate the efficiency of our approach in reducing system fragmentation and improving system utilization.
DOI:10.1109/ISPDC.2017.11
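
Illustrative note: the summary names a First Come First Served + backfilling strategy without giving its details. The sketch below shows only the generic idea (EASY-style backfilling over a flat pool of nodes) and is not the paper's packing-based algorithm: it ignores the 3D torus topology and job shapes entirely. The names Job and fcfs_backfill are hypothetical, all jobs are assumed to be queued at time 0, and runtime estimates are assumed to be exact.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int    # number of nodes requested
    runtime: int  # estimated runtime (assumed exact in this sketch)


def fcfs_backfill(jobs, total_nodes):
    """Return {job name: start time} for FCFS with EASY-style backfilling.

    Assumption: a flat node pool, no topology, all jobs queued at time 0.
    """
    if any(j.nodes > total_nodes for j in jobs):
        raise ValueError("a job requests more nodes than the system has")
    queue = list(jobs)
    running = []  # min-heap of (end_time, nodes)
    free, now, starts = total_nodes, 0, {}

    def start(job):
        nonlocal free
        starts[job.name] = now
        free -= job.nodes
        heapq.heappush(running, (now + job.runtime, job.nodes))

    while queue:
        # FCFS part: start jobs from the head of the queue while they fit.
        while queue and queue[0].nodes <= free:
            start(queue.pop(0))
        if not queue:
            break
        # The head job is blocked: find the earliest time it can start
        # (its reservation) and the nodes left over at that time.
        head, avail, shadow = queue[0], free, now
        for end, n in sorted(running):
            avail, shadow = avail + n, end
            if avail >= head.nodes:
                break
        extra = avail - head.nodes
        # Backfilling part: start later jobs that fit now and cannot
        # delay the head job's reservation.
        remaining = [head]
        for job in queue[1:]:
            ends_in_time = now + job.runtime <= shadow
            if job.nodes <= free and (ends_in_time or job.nodes <= extra):
                start(job)
                if not ends_in_time:
                    extra -= job.nodes
            else:
                remaining.append(job)
        queue = remaining
        # Advance time to the next completion and release its nodes.
        now, n = heapq.heappop(running)
        free += n
        while running and running[0][0] == now:
            free += heapq.heappop(running)[1]
    return starts


example = [Job("A", 6, 10), Job("B", 6, 5), Job("C", 2, 8), Job("D", 2, 20)]
print(fcfs_backfill(example, total_nodes=8))
```

In this example, job B is blocked behind A, and the small jobs C and D are started in the idle nodes ahead of it without postponing B's earliest possible start. Under plain FCFS those nodes would sit idle until B could run, which is the kind of unused capacity the summary refers to as fragmentation.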