MatFactory: A Framework for High-Performance Matrix Factorization on FPGAs

Bibliographic Details
Published in:	Digest of technical papers - IEEE/ACM International Conference on Computer-Aided Design, pp. 1-9
Main Authors:	Zhang, Mingzhe; Hao, Xiaochen; Rong, Hongbo; Chen, Wenguang
Format:	Conference Proceeding
Language:	English
Published:	ACM 27.10.2024
Subjects:	Design automation; Field programmable gate arrays; Hardware; High performance computing; Libraries; Machine learning; Pipeline processing; Scalability; Signal processing; Signal processing algorithms
ISSN:	1558-2434

Description
Summary:	Matrix factorization is a widely used and powerful tool in signal processing, machine learning, and high-performance computing. FPGAs are suitable platforms for accelerating matrix factorization, as they can build wide and deep pipelines with favorable power efficiency. Factorizing matrices on FPGAs is thus desirable; however, no infrastructure for matrix factorization on FPGAs exists so far, because it involves several challenges: applicability and scalability of the circuit, pipelining of irregular computing patterns, and effective data caching given the limited memory bandwidth. We propose MatFactory, a novel framework that enables fast development of high-performance algorithms for factorizing matrices on FPGAs. We extract common key operators from various factorization algorithms and provide a convenient streaming interface that explicitly moves and manages data through the memory hierarchy. With this interface support, the operators can be easily reused as building blocks and composed into diverse in-BRAM non-blocked factorization algorithms as well as in-DRAM blocked factorization algorithms. We evaluate MatFactory with three typical algorithms (Cholesky, LU, and QR) on an Intel A10 FPGA. Our non-blocked factorization achieves a 4.0-10.7× speedup over the Vitis Library on a Xilinx Alveo U280 FPGA, and the blocked implementation further achieves 1.65-1.88× the performance of the non-blocked version. To the best of our knowledge, this is the first framework that systematically designs and accommodates various matrix factorization algorithms for FPGAs, and it can be easily extended to support more LAPACK routines in general.
DOI:	10.1145/3676536.3676780
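
Note:	The summary distinguishes non-blocked from blocked factorization and describes composing shared operators into both. As a rough illustration of that general idea only, the following plain-Python sketch builds a textbook blocked Cholesky factorization out of three small operators. It is not MatFactory's interface or operator set: the function names (`cholesky_unblocked`, `solve_panel`, `cholesky_blocked`) and the block-size parameter `bs` are hypothetical, and nothing here models FPGA streaming, BRAM, or DRAM.

```python
import math


def cholesky_unblocked(a):
    """Return lower-triangular L with L * L^T == a.

    `a` is a symmetric positive-definite matrix as a list of lists;
    only its lower triangle is read.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        l[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = s / l[j][j]
    return l


def solve_panel(panel, l_diag):
    """Solve X * l_diag^T = panel for X, one row at a time."""
    b = len(l_diag)
    out = []
    for row in panel:
        x = [0.0] * b
        for j in range(b):
            s = row[j] - sum(x[k] * l_diag[j][k] for k in range(j))
            x[j] = s / l_diag[j][j]
        out.append(x)
    return out


def cholesky_blocked(a, bs):
    """Blocked Cholesky composed from the operators above (hypothetical sketch)."""
    n = len(a)
    w = [row[:] for row in a]            # working copy of the input
    l = [[0.0] * n for _ in range(n)]
    for s in range(0, n, bs):
        e = min(s + bs, n)
        # 1. Factor the diagonal block with the non-blocked kernel.
        l11 = cholesky_unblocked([w[i][s:e] for i in range(s, e)])
        for i in range(s, e):
            l[i][s:e] = l11[i - s]
        # 2. Triangular solve for the panel below the diagonal block.
        l21 = solve_panel([w[i][s:e] for i in range(e, n)], l11)
        for i in range(e, n):
            l[i][s:e] = l21[i - e]
        # 3. Update the trailing submatrix: A22 -= L21 * L21^T.
        for i in range(e, n):
            for j in range(e, n):
                w[i][j] -= sum(l21[i - e][k] * l21[j - e][k]
                               for k in range(e - s))
    return l
```

With `bs` equal to the matrix size the blocked routine reduces to a single call of the non-blocked kernel, which is one way to see the two variants as compositions of the same building blocks.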