Optimizing Breadth-First Search at Scale Using Hardware-Accelerated Space Consistency


Detailed Description

Bibliographic Details
Published in: Proceedings - International Conference on High Performance Computing, pp. 23-33
Main Author: Ibrahim, Khaled
Format: Conference Paper
Language: English
Published: IEEE, 01.12.2019
ISSN: 2640-0316
Description
Abstract: Graph traversal is a critical building block in many algorithms. Traversing a large graph using breadth-first search, although conceptually simple, is time-consuming in distributed-memory environments due to the amount of exchanged data. In this work, we present both an efficient algorithmic approach to carry out the traversal and a low-overhead runtime that provides efficient primitives to implement the algorithm. Our algorithm relies on constructing a traversal composed of partially consistent trees until all vertices are discovered. We resolve such inconsistency through election and exchange steps. The election phase relies on communicating compressed vertices through collectives that do not stress the bisection bandwidth of the interconnect. We leverage the space consistency programming abstraction to allow an efficient overlap of computation with communication. We extend the model to leverage hardware-accelerated collectives and provide primitives for one-sided broadcast and sparse reduction. We present the algorithm and runtime designs and show the results of applying our techniques on the Bluegene/Q architecture. We achieve 1040 GTEPS on a single rack (1K nodes), which is better than the best-known algorithms on the same architecture. We also achieve superior scalability compared with other implementations up to 32K nodes.
DOI: 10.1109/HiPC.2019.00015
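
The abstract describes reconciling partially consistent BFS trees through an election step over compressed vertices. The sketch below is only an illustration of that election idea, not the paper's implementation: it assumes each rank keeps a dense parent array with a "not discovered" sentinel and uses a plain MPI MIN-allreduce as the election collective, whereas the paper relies on compressed vertex exchange, one-sided broadcast, sparse reduction, and hardware-accelerated collectives under the space consistency model.

/* Sketch: resolving partially consistent BFS parent proposals by election.
 * Assumptions (not from the paper): each rank stores a dense parent[]
 * array over all NV vertices, NO_PARENT marks "not discovered locally",
 * and a MIN-allreduce plays the role of the election collective. */
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

#define NV 8                      /* toy graph size */
#define NO_PARENT UINT64_MAX      /* sentinel: vertex not discovered locally */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank proposes parents only for vertices it discovered in the
     * current level; everything else stays at the sentinel. */
    uint64_t proposed[NV], elected[NV];
    for (int v = 0; v < NV; ++v)
        proposed[v] = NO_PARENT;
    /* Toy proposals: even ranks claim even vertices, odd ranks claim odd
     * vertices, each naming itself as the parent. */
    for (int v = rank % 2; v < NV; v += 2)
        proposed[v] = (uint64_t)rank;

    /* Election step: the smallest proposed parent id wins on every rank,
     * so all ranks agree on one consistent tree afterwards. */
    MPI_Allreduce(proposed, elected, NV, MPI_UINT64_T, MPI_MIN, MPI_COMM_WORLD);

    if (rank == 0)
        for (int v = 0; v < NV; ++v)
            printf("vertex %d -> parent %llu\n", v,
                   (unsigned long long)elected[v]);

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and run with at least two ranks, every rank ends up with the same elected parent per vertex; the dense allreduce stands in for the paper's bandwidth-conscious exchange of compressed vertices.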