MDLoader: A Hybrid Model-Driven Data Loader for Distributed Graph Neural Network Training


Published in: SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1046-1057
Main authors: Bae, Jonghyun; Choi, Jong Youl; Pasini, Massimiliano Lupo; Mehta, Kshitij; Zhang, Pei; Ibrahim, Khaled Z.
Format: Conference paper
Language: English
Publication details: IEEE, 17 November 2024
Description
Summary: Scalable data management is essential for processing large scientific datasets on HPC platforms for distributed deep learning. In-memory distributed storage is preferred for its speed, enabling rapid, random, and frequent data access required by stochastic optimizers. Processes use one-sided or collective communication to fetch remote data, with optimal performance depending on (i) dataset characteristics, (ii) training scale, and (iii) interconnection network. Empirical analysis shows collective communication excels with larger mini-batch sizes and/or fewer processes, whereas one-sided communication outperforms at larger scales. We propose MDLoader, a hybrid in-memory data loader for distributed graph neural network training. MDLoader features a model-driven performance estimator that dynamically selects between one-sided and collective communication at the beginning of training using Tree of Parzen Estimators (TPE). Evaluations on NERSC Perlmutter and OLCF Summit show MDLoader outperforms single-backend loaders by up to 2.83× and predicts the suitable communication method with success rates of 96.3% (Perlmutter) and 94.3% (Summit).
DOI: 10.1109/SCW63240.2024.00145
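
Illustrative note: the summary describes choosing between one-sided and collective communication from the mini-batch size and process count. The sketch below shows one way such a choice could be expressed. It is not MDLoader's estimator and does not use TPE; it substitutes a simple latency/bandwidth cost model. All names (`CostParams`, `choose_backend`) and all constants are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostParams:
    """Hypothetical cost constants (seconds, bytes/second); not taken from the paper."""
    alpha_one_sided: float = 5e-6   # assumed latency of one remote get
    alpha_collective: float = 2e-5  # assumed per-peer latency of a collective exchange
    bandwidth: float = 1e10         # assumed link bandwidth


def one_sided_cost(batch_size: int, sample_bytes: int, p: CostParams) -> float:
    # One remote get per sample: the latency term is paid batch_size times.
    return batch_size * (p.alpha_one_sided + sample_bytes / p.bandwidth)


def collective_cost(batch_size: int, num_procs: int, sample_bytes: int,
                    p: CostParams) -> float:
    # One exchange involving every process: the latency term grows with num_procs.
    return num_procs * p.alpha_collective + batch_size * sample_bytes / p.bandwidth


def choose_backend(batch_size: int, num_procs: int, sample_bytes: int,
                   p: CostParams = CostParams()) -> str:
    """Return the backend with the lower estimated per-batch fetch time."""
    one = one_sided_cost(batch_size, sample_bytes, p)
    coll = collective_cost(batch_size, num_procs, sample_bytes, p)
    return "collective" if coll < one else "one-sided"


if __name__ == "__main__":
    # Large batch, few processes versus small batch, many processes.
    print(choose_backend(batch_size=512, num_procs=64, sample_bytes=4096))
    print(choose_backend(batch_size=32, num_procs=4096, sample_bytes=4096))
```

Under these assumed constants the toy model follows the trend stated in the summary: the collective path is preferred for larger mini-batches or fewer processes, and the one-sided path at larger scales. A real estimator would be fitted to measurements on the target system.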