UMA-MF: A Unified Multi-CPU/GPU Asynchronous Computing Framework for SGD-Based Matrix Factorization

Bibliographic details
Published in: IEEE Transactions on Parallel and Distributed Systems, Vol. 34, No. 11, pp. 2978-2993
Main authors: Huang, Yizhi; Liu, Yan; Bai, Yang; Chen, Si; Li, Renfa
Format: Journal Article
Language: English
Published: New York, IEEE, 1 November 2023
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN: 1045-9219, 1558-2183
Description
Summary: Recent research has shown that collaborative computing of CPUs and GPUs in the same system can effectively accelerate large-scale SGD-based matrix factorization (MF), but its scalability is limited by parameter synchronization in the server. Theoretically, asynchronous methods can overcome this shortcoming. However, through a series of tests, observations, and analyses, we find that developing an effective asynchronous multi-CPU/GPU MF framework faces several major design challenges: underutilized CPUs, high communication overhead, and asynchronous data safety. This article presents a unified multi-CPU/GPU asynchronous computing framework for SGD-based matrix factorization, named UMA-MF. UMA-MF treats the CPUs and GPUs in the system as distributed workers that train matrix datasets in parallel and update feature parameters asynchronously. It provides a cache-friendly CPU external working mode that improves the CPU's cache hit rate and thereby promotes efficient use of the CPUs. It offers an algorithm to find the shortest communication ring topology of heterogeneous CPU/GPU workers and builds computing-communication pipelines to minimize communication overhead. It implements a wait-free structure and load-balanced data distribution to achieve asynchronous data safety. UMA-MF can effectively accelerate SGD-based MF on multi-CPU/GPU systems in an asynchronous way. On a physical platform with configurations ranging from a single-processor system to a 2CPUs-4GPUs system, for five common datasets (Netflix, R1, R2, Goodreads, and de-dense), UMA-MF achieves up to 3.56x speedup compared with HCC-MF, the state-of-the-art multi-CPU/GPU synchronous computing framework for SGD-based MF. UMA-MF also shows good scalability: when the system is scaled to 2CPUs-4GPUs, the training-time speedup of UMA-MF reaches 70% to 97% of the ideal speedup.
DOI: 10.1109/TPDS.2023.3317535
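
The summary above refers to SGD-based matrix factorization without stating the update rule. As a rough illustration only, the following is a minimal single-threaded sketch of the textbook SGD update for MF with L2 regularization. It is not UMA-MF and does not reproduce its asynchronous workers, ring topology, or wait-free structure. The names sgd_mf and rmse, the hyperparameters (rank, lr, reg, epochs), and the toy ratings are all assumptions made for this example.

```python
import random


def sgd_mf(ratings, n_users, n_items, rank=2, lr=0.05, reg=0.02,
           epochs=300, seed=0):
    """Factorize sparse (user, item, rating) triples into factors P and Q.

    Textbook SGD with L2 regularization; hyperparameters are illustrative
    assumptions, not values taken from the article.
    """
    rng = random.Random(seed)
    # Small positive random initialization of the feature matrices.
    P = [[rng.uniform(0.0, 0.5) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(0.0, 0.5) for _ in range(rank)] for _ in range(n_items)]
    samples = list(ratings)
    for _ in range(epochs):
        rng.shuffle(samples)  # visit the known ratings in random order
        for u, i, r in samples:
            pred = sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            err = r - pred
            for k in range(rank):
                pu, qi = P[u][k], Q[i][k]
                # Gradient step on the squared error plus L2 penalty.
                P[u][k] += lr * (err * qi - reg * pu)
                Q[i][k] += lr * (err * pu - reg * qi)
    return P, Q


def rmse(ratings, P, Q):
    """Root-mean-square error of the factorization on the given ratings."""
    total = 0.0
    for u, i, r in ratings:
        pred = sum(pu * qi for pu, qi in zip(P[u], Q[i]))
        total += (r - pred) ** 2
    return (total / len(ratings)) ** 0.5


if __name__ == "__main__":
    # Toy 3x3 rating matrix with six known entries (made-up data).
    toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
    P, Q = sgd_mf(toy, n_users=3, n_items=3)
    print("training RMSE:", rmse(toy, P, Q))
```

Each update touches only one row of P and one row of Q. That locality is what lets parallel frameworks spread ratings across workers, and it is also why concurrent updates to the same rows raise the data safety concern the summary mentions.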