cuTensor-Tubal: Efficient Primitives for Tubal-Rank Tensor Learning Operations on GPUs
| Published in: | IEEE Transactions on Parallel and Distributed Systems, Vol. 31, No. 3, pp. 595-610 |
|---|---|
| Main Authors: | , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | New York: IEEE, 01.03.2020 (The Institute of Electrical and Electronics Engineers, Inc.) |
| Subjects: | |
| ISSN: | 1045-9219, 1558-2183 |
| Online Access: | Get full text |
| Summary: | Tensors are the cornerstone data structures in high-performance computing, big data analysis, and machine learning. However, tensor computations are compute-intensive, and the running time increases rapidly with the tensor size. Designing high-performance primitives on parallel architectures such as GPUs is therefore critical to meeting ever-growing data processing demands. Existing GPU basic linear algebra subroutine (BLAS) libraries (e.g., NVIDIA cuBLAS) do not provide tensor primitives, so researchers have to implement and optimize their own tensor algorithms case by case, which is inefficient and error-prone. In this paper, we develop the cuTensor-tubal library of seven key primitives for the tubal-rank tensor model on GPUs: t-FFT, inverse t-FFT, t-product, t-SVD, t-QR, t-inverse, and t-normalization. cuTensor-tubal adopts a frequency-domain computation scheme to expose the separability in the frequency domain, then maps the tube-wise and slice-wise parallelisms onto the single instruction multiple thread (SIMT) GPU architecture. To achieve good performance, we optimize the data transfers and memory accesses, and design batched and streamed parallelization schemes for tensor operations with data-independent and data-dependent computation patterns, respectively. In the evaluations of t-product, t-SVD, t-QR, t-inverse, and t-normalization, cuTensor-tubal achieves maximum speedups of 16.91x, 27.03x, 38.97x, 22.36x, and 15.43x, respectively, over CPU implementations running on dual 10-core Xeon CPUs. Two applications, namely t-SVD-based video compression and low-tubal-rank tensor completion, are tested using our library and achieve maximum speedups of 9.80x and 269.26x over multi-core CPU implementations. |
| DOI: | 10.1109/TPDS.2019.2940192 |
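
The summary describes the library's core frequency-domain scheme: a t-FFT along each tube turns a tensor operation into independent per-frequency-slice matrix operations, which map naturally onto batched GPU kernels. Below is a minimal NumPy sketch of that scheme for the t-product; it illustrates the general technique only and is not cuTensor-tubal's actual API (the function name `t_product` and its signature are our own for illustration).

```python
import numpy as np

def t_product(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """t-product of A (n1 x n2 x n3) with B (n2 x n4 x n3) via the FFT.

    Step 1 (t-FFT): FFT along the third mode (each tube).
    Step 2: the computation separates into n3 independent matrix
            products, one per frequency slice -- the data-independent
            pattern a GPU library can run as a single batched GEMM.
    Step 3 (inverse t-FFT): transform the result back.
    """
    n1, n2, n3 = A.shape
    assert B.shape[0] == n2 and B.shape[2] == n3
    Af = np.fft.fft(A, axis=2)                  # t-FFT of both inputs
    Bf = np.fft.fft(B, axis=2)
    Cf = np.empty((n1, B.shape[1], n3), dtype=complex)
    for k in range(n3):                         # independent frequency slices
        Cf[:, :, k] = Af[:, :, k] @ Bf[:, :, k]
    # For real inputs the inverse FFT is real up to round-off error.
    return np.real(np.fft.ifft(Cf, axis=2))

# Usage: multiply a 4 x 5 x 6 tensor by a 5 x 3 x 6 tensor.
A = np.random.rand(4, 5, 6)
B = np.random.rand(5, 3, 6)
C = t_product(A, B)
print(C.shape)  # (4, 3, 6)
```

Data-dependent primitives such as t-SVD follow the same outline, except that the per-slice factorizations cannot be fused into one batched call; per the summary, cuTensor-tubal overlaps them with streamed parallelization instead.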