A High-Level API for Dynamic Load Balancing in Large-Scale Parameter Sweeps
| Published in: | International Journal of Parallel Programming, Vol. 53, No. 4, p. 25 |
|---|---|
| Main authors: | , , |
| Format: | Journal Article |
| Language: | English |
| Published: | New York; Springer US; 01.08.2025; Springer Nature B.V |
| Subjects: | |
| ISSN: | 0885-7458, 1573-7640 |
| Online access: | Get full text |
| Summary: | Parameter sweep studies, such as virtual drug screening, may exhibit significant load imbalance during batched execution on large-scale clusters, resulting in idle and thus wasted resources. Work stealing is a popular method for dynamic load balancing in such scenarios. However, we demonstrate that work stealing alone falls short in cases where a few ranks generate long-running, high-cost jobs, particularly towards the end of a computation. To address this challenge, we propose high-cost probing, a mechanism for distributing high-cost jobs across workers early during program execution by leveraging user-provided cost hints. We extend the Celerity programming model for distributed accelerator computing with a high-level API tailored to expressing parameter sweep-style workflows that may benefit from high-cost probing. We demonstrate the effectiveness of our approach on synthetic benchmarks as well as a real-world virtual screening application with a highly irregular workload, achieving a 73 percentage point reduction in load imbalance on 128 GPUs. |
| DOI: | 10.1007/s10766-025-00804-4 |