A High-Level API for Dynamic Load Balancing in Large-Scale Parameter Sweeps
| Published in: | International journal of parallel programming, Vol. 53, No. 4, p. 25 |
|---|---|
| Main authors: | , , |
| Format: | Journal Article |
| Language: | English |
| Published: | New York: Springer US, 1 August 2025; Springer Nature B.V. |
| Keywords: | |
| ISSN: | 0885-7458, 1573-7640 |
| Online access: | Full text |
| Abstract: | Parameter sweep studies, such as virtual drug screening, may exhibit significant load imbalance during batched execution on large-scale clusters, resulting in idle and thus wasted resources. Work stealing is a popular method for dynamic load balancing in such scenarios. However, we demonstrate that work stealing alone falls short in cases where a few ranks generate long-running, high-cost jobs, particularly towards the end of a computation. To address this challenge, we propose high-cost probing, a mechanism for distributing high-cost jobs across workers early during program execution by leveraging user-provided cost hints. We extend the Celerity programming model for distributed accelerator computing with a high-level API tailored to expressing parameter sweep-style workflows that may benefit from high-cost probing. We demonstrate the effectiveness of our approach on synthetic benchmarks as well as a real-world virtual screening application with a highly irregular workload, achieving a 73 percentage point reduction in load imbalance on 128 GPUs. |
|---|---|
| DOI: | 10.1007/s10766-025-00804-4 |