A Deep Dive into Task-Based Parallelism in Python


Detailed Bibliography
Published in: 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1147-1149
Main Authors: Ruys, William, Lee, Hochan, You, Bozhi, Talati, Shreya, Park, Jaeyoung, Almgren-Bell, James, Yan, Yineng, Fernando, Milinda, Biros, George, Erez, Mattan, Burtscher, Martin, Rossbach, Christopher J., Pingali, Keshav, Gligoric, Milos
Format: Conference paper
Language: English
Published: IEEE, 27 May 2024
Description
Summary: Modern Python programs in high-performance computing call into compiled libraries and kernels for performance-critical tasks. However, effectively parallelizing these finer-grained, and often dynamic, kernels across modern heterogeneous platforms remains a challenge. First, we perform an experimental study to examine the impact of Python's Global Interpreter Lock (GIL), and potential speedups under a GIL-less PEP 703 future, to guide runtime design. Using our optimized runtime, we explore scheduling tasks with constraints that require resources across multiple, potentially diverse, devices through the introduction of new programming abstractions and runtime mechanisms. We extend an existing Python tasking library, Parla, to augment its performance and add support for such multi-device tasks. Our experimental analysis, using task graphs from synthetic and real applications, shows at least a 3× (and up to 6×) performance improvement over its predecessor in scenarios with high GIL contention. When scheduling multi-GPU tasks, we observe an 8× reduction in per-task launching overhead compared to a multi-process system.
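The GIL contention the study measures can be illustrated with a minimal sketch (not taken from the paper): a pure-Python CPU-bound function holds the GIL, so running it on two threads yields little or no speedup over sequential execution, whereas compiled kernels that release the GIL (e.g. BLAS calls under NumPy) can genuinely overlap. The function names here are illustrative, not part of Parla's API.

```python
import threading
import time

def cpu_bound(n):
    # Pure-Python loop: holds the GIL the whole time, so two threads
    # running this cannot make progress simultaneously.
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_threads(workers, n):
    """Run cpu_bound on `workers` threads and time the whole batch."""
    results = [None] * workers

    def worker(idx):
        results[idx] = cpu_bound(n)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start, results

# Under the GIL, two threads of pure-Python work take roughly as long as
# the same work run back-to-back on one thread; a GIL-less interpreter
# (PEP 703) or a kernel that releases the GIL would overlap instead.
elapsed, results = run_threads(2, 200_000)
print(f"2 threads: {elapsed:.3f}s, results agree: {results[0] == results[1]}")
```

Measuring this kind of threaded vs. sequential ratio, across task granularities, is the sort of experiment the paper uses to guide its runtime design.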
DOI: 10.1109/IPDPSW63119.2024.00187