MatFactory: A Framework for High-Performance Matrix Factorization on FPGAs


Bibliographic details
Published in: Digest of technical papers - IEEE/ACM International Conference on Computer-Aided Design, pp. 1-9
Main authors: Zhang, Mingzhe; Hao, Xiaochen; Rong, Hongbo; Chen, Wenguang
Format: Conference proceeding
Language: English
Published: ACM, 27 October 2024
ISSN:1558-2434
Description
Abstract: Matrix factorization is a widely used, powerful tool in signal processing, machine learning, and high-performance computing. FPGAs are suitable platforms for accelerating matrix factorization, as they can build wide and deep pipelines with favorable power efficiency. Factorizing matrices on FPGAs is thus desirable; however, no infrastructure for matrix factorization on FPGAs exists so far, because it involves several challenges: applicability and scalability of the circuit, pipelining of irregular computing patterns, and effective data caching given the limited memory bandwidth. We propose MatFactory, a novel framework that enables fast development of high-performance algorithms for factorizing matrices on FPGAs. We extract common key operators from various factorization algorithms and provide a convenient streaming interface that explicitly moves and manages data through the memory hierarchy. With this interface support, the operators can be easily reused as building blocks and composed into diverse in-BRAM non-blocked factorization algorithms as well as in-DRAM blocked factorization algorithms. We evaluate MatFactory with three typical algorithms (Cholesky, LU, and QR) on an Intel A10 FPGA. Our non-blocked factorization achieves a 4.0-10.7× speedup over the Vitis Library on a Xilinx Alveo U280 FPGA, and the blocked implementation further achieves 1.65-1.88× the performance of the non-blocked version. To the best of our knowledge, this is the first framework that systematically designs and accommodates various matrix factorization algorithms for FPGAs, and it can be easily extended to support more LAPACK routines in general.
DOI:10.1145/3676536.3676780
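
Illustration: The abstract's distinction between non-blocked and blocked factorization, and the idea of composing a blocked algorithm from reusable operators, can be illustrated with a minimal CPU-side sketch of Cholesky factorization in plain Python. This is not MatFactory's streaming interface, which this record does not describe. The three operators used here (diagonal-block factorization, triangular solve, symmetric trailing update) are the standard textbook decomposition and are only an assumption about what the paper's "common key operators" might look like. All function names are hypothetical.

```python
import math


def cholesky_unblocked(a):
    """Non-blocked Cholesky: return lower-triangular L with L * L^T = a.

    Only the lower triangle of `a` is read.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        l[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = s / l[j][j]
    return l


def solve_right_lower_transposed(b, l):
    """Triangular solve: return X such that X * L^T = b (L lower triangular)."""
    m, n = len(b), len(l)
    x = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            s = b[i][j] - sum(x[i][k] * l[j][k] for k in range(j))
            x[i][j] = s / l[j][j]
    return x


def cholesky_blocked(a, bs):
    """Blocked Cholesky composed from the two operators above plus an update.

    `bs` is the block size; each step touches one diagonal block, one panel,
    and the trailing submatrix.
    """
    n = len(a)
    w = [row[:] for row in a]  # working copy; its lower triangle becomes L
    for k in range(0, n, bs):
        e = min(k + bs, n)
        # Operator 1: factor the diagonal block with the non-blocked routine.
        lkk = cholesky_unblocked([row[k:e] for row in w[k:e]])
        for i in range(k, e):
            for j in range(k, e):
                w[i][j] = lkk[i - k][j - k]
        if e == n:
            break
        # Operator 2: solve for the panel below the diagonal block.
        panel = solve_right_lower_transposed([row[k:e] for row in w[e:n]], lkk)
        for i in range(e, n):
            for j in range(k, e):
                w[i][j] = panel[i - e][j - k]
        # Operator 3: symmetric update of the trailing lower triangle.
        for i in range(e, n):
            for j in range(e, i + 1):
                w[i][j] -= sum(w[i][c] * w[j][c] for c in range(k, e))
    # Discard the untouched upper triangle of the working copy.
    return [[w[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    a = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
    for row in cholesky_blocked(a, bs=2):
        print(row)
```

In this sketch the blocked variant only ever hands the non-blocked routine a block of at most `bs` rows and columns, which loosely mirrors the abstract's split between a small factorization held in fast on-chip memory and a larger one staged through external memory.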