Dynamic Buffer Management in Massively Parallel Systems: The Power of Randomness

Bibliographic Details
Published in: ACM Transactions on Parallel Computing, Vol. 12, No. 1
Authors: Pham, Minh; Yuan, Yongke; Li, Hao; Mou, Chengcheng; Tu, Yicheng; Xu, Zichen; Meng, Jinghan
Format: Journal Article
Language: English
Published: February 11, 2025
ISSN: 2329-4957
Description
Abstract: Massively parallel systems, such as Graphics Processing Units (GPUs), play an increasingly crucial role in today's data-intensive computing. The unique challenges associated with developing system software for massively parallel hardware to support numerous parallel threads efficiently are of paramount importance. One such challenge is the design of a dynamic memory allocator to allocate memory at runtime. Traditionally, memory allocators have relied on maintaining a global data structure, such as a queue of free pages. However, in the context of massively parallel systems, accessing such global data structures can quickly become a bottleneck even with multiple queues in place. This paper presents a novel approach to dynamic memory allocation that eliminates the need for a centralized data structure. Our proposed approach revolves around letting threads employ random search procedures to locate free pages. Through mathematical proofs and extensive experiments, we demonstrate that the basic random search design achieves lower latency than the best-known existing solution in most situations. Furthermore, we develop more advanced techniques and algorithms to tackle the challenge of warp divergence and further enhance performance when free memory is limited. Building upon these advancements, our mathematical proofs and experimental results affirm that these advanced designs can yield an order of magnitude improvement over the basic design and consistently outperform the state-of-the-art by up to two orders of magnitude. To illustrate the practical implications of our work, we integrate our memory management techniques into two GPU algorithms: a hash join and a group-by. Both case studies provide compelling evidence of our approach's pronounced performance gains.
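
The abstract's core mechanism invites a concrete illustration. Below is a minimal CUDA sketch of the basic random-search idea as the abstract describes it: rather than popping from a centralized free-page queue, each thread probes randomly chosen page slots and claims the first free one with an atomic compare-and-swap. Every identifier here (page_state, random_alloc, NUM_PAGES, the probe limit) is hypothetical and not taken from the paper.

#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

#define NUM_PAGES (1 << 20)   // hypothetical page-pool size
#define FREE 0
#define USED 1

// Per-page state flags; __device__ globals are zero-initialized,
// so every page starts out FREE.
__device__ int page_state[NUM_PAGES];

// Probe randomly chosen slots and claim the first free page with an
// atomic compare-and-swap. Returns the page index, or -1 after
// max_probes consecutive misses.
__device__ int random_alloc(curandState *rng, int max_probes) {
    for (int i = 0; i < max_probes; ++i) {
        int p = curand(rng) % NUM_PAGES;   // random candidate slot
        if (atomicCAS(&page_state[p], FREE, USED) == FREE)
            return p;                      // successfully claimed
    }
    return -1;                             // pool looks full; caller can retry
}

__global__ void alloc_kernel(int *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    curandState rng;
    curand_init(1234ULL, tid, 0, &rng);    // independent per-thread RNG stream
    out[tid] = random_alloc(&rng, 64);
}

int main() {
    const int n = 1024;
    int *d_out, h_out[1024];
    cudaMalloc(&d_out, n * sizeof(int));
    alloc_kernel<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("thread 0 claimed page %d\n", h_out[0]);
    cudaFree(d_out);
    return 0;
}

Intuitively, if a fraction f of the pages is free, each probe succeeds with probability f, so a thread needs 1/f probes in expectation; with ample free memory this stays small and no thread ever contends on shared allocator state. The paper's advanced designs, which address warp divergence and scarce free memory, go beyond this sketch.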
DOI: 10.1145/3701623