A data locality-aware design framework for reconfigurable sparse matrix-vector multiplication kernel

Detailed bibliography
Published in: Digest of Technical Papers - IEEE/ACM International Conference on Computer-Aided Design, pp. 1-6
Main authors: Sicheng Li, Yandan Wang, Wujie Wen, Yu Wang, Yiran Chen, Hai Li
Format: Conference paper
Language: English
Published: ACM, 01.11.2016
ISSN: 1558-2434
Description
Summary: Sparse matrix-vector multiplication (SpMV) is an important computational kernel in many applications. To improve its performance, software libraries designed for SpMV computation have been introduced, e.g., the MKL library for CPUs and the cuSPARSE library for GPUs. However, the computational throughput of these libraries falls far below the peak floating-point performance offered by the hardware platforms, because the efficiency of the SpMV kernel is severely constrained by limited memory bandwidth and irregular data access patterns. In this work, we propose a data locality-aware design framework for FPGA-based SpMV acceleration. We first incorporate the hardware constraints into sparse matrix compression at the software level to regularize memory allocation and accesses. Moreover, a distributed architecture composed of processing elements is developed to improve computation parallelism. We implement the reconfigurable SpMV kernel on a Convey HC-2ex and evaluate it using the University of Florida sparse matrix collection. The experiments demonstrate an average computational efficiency of 48.2%, substantially higher than that of the CPU and GPU implementations. Our FPGA-based kernel has a runtime comparable to the GPU and achieves a 2.1× reduction over the CPU. Moreover, our design obtains substantial energy savings, consuming 9.3× and 5.6× less energy than the CPU and GPU implementations, respectively.
DOI: 10.1145/2966986.2966987
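
The abstract attributes the low efficiency of library SpMV kernels to limited memory bandwidth and irregular data accesses. As a point of reference only, and not the compression scheme or FPGA architecture proposed in the paper, the minimal sketch below shows a baseline CSR-format SpMV in C; the indirect load x[col_idx[j]] is the irregular, bandwidth-limited access pattern the abstract refers to.

```c
#include <stddef.h>

/* Baseline CSR (compressed sparse row) SpMV: y = A * x.
 * row_ptr has n_rows + 1 entries; col_idx and vals hold the nonzeros
 * of each row contiguously. The indirect load x[col_idx[j]] produces
 * irregular memory accesses, which is why SpMV is typically
 * memory-bandwidth bound on CPUs and GPUs. */
void spmv_csr(size_t n_rows,
              const size_t *row_ptr,
              const size_t *col_idx,
              const double *vals,
              const double *x,
              double *y)
{
    for (size_t i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (size_t j = row_ptr[i]; j < row_ptr[i + 1]; ++j) {
            sum += vals[j] * x[col_idx[j]];
        }
        y[i] = sum;
    }
}
```

The framework described in the abstract targets exactly this bottleneck by folding the FPGA's hardware constraints into the sparse matrix compression step, so that memory allocation and accesses become regular enough to keep the processing elements busy.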