Distributed and heterogeneous tensor–vector contraction algorithms for high performance computing

Bibliographic details
Published in: Future Generation Computer Systems, Volume 166; p. 107698
Main authors: Martinez-Ferrer, Pedro J.; Yzelman, Albert-Jan; Beltran, Vicenç
Format: Journal Article
Language: English
Published: Elsevier B.V., 1 May 2025
Subjects: Distributed memory; GPU; High bandwidth memory; Mixed precision; Task-based parallelization; Tensor contraction
ISSN: 0167-739X
Abstract The tensor–vector contraction (TVC) is the most memory-bound operation of its class and a core component of the higher-order power method (HOPM). This paper brings distributed-memory parallelization to a native TVC algorithm for dense tensors that overall remains oblivious to contraction mode, tensor splitting, and tensor order. Similarly, we propose a novel distributed HOPM, namely dHOPM3, that can save up to one order of magnitude of streamed memory and is about twice as costly in terms of data movement as a distributed TVC operation (dTVC) when using task-based parallelization. The numerical experiments carried out in this work on three different architectures featuring multicore processors and accelerators confirm that the performance of dTVC and dHOPM3 remains relatively close to the peak system memory bandwidth (50%–80%, depending on the architecture) and on par with STREAM benchmark figures. In strong-scalability scenarios, our native multicore implementations of these two algorithms can achieve similar and sometimes even greater performance figures than those based upon state-of-the-art CUDA batched kernels. Finally, we demonstrate that both computation and communication can benefit from mixed-precision arithmetic, even in cases where the hardware does not support low-precision data types natively.
• A novel distributed tensor–vector contraction algorithm is introduced.
• Analytical formulae are derived for minimal data movement and best throughput.
• A best-in-class higher-order power method with task-based parallelization is given.
• Our dTVC and dHOPM3 algorithms reach 50% to 80% of the theoretical peak bandwidth.
• Our cached, mixed-precision ad-hoc kernels speed up computation and communication.
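For readers unfamiliar with the operation named in the abstract, the following is a minimal single-node sketch of a dense tensor–vector contraction. It is not the paper's dTVC algorithm and involves no distribution, tasking, or mixed precision. The function name `tvc`, the row-major flat storage, and the loop order are all assumptions chosen for illustration. The sketch does show why the operation is memory-bound: each tensor element is read once and used in a single multiply-add.

```python
from math import prod


def tvc(tensor, dims, mode, vector):
    """Contract a dense tensor with a vector along one mode.

    Illustrative assumptions (not taken from the paper): `tensor` is a flat
    list in row-major order, `dims` gives its dimensions, and the result is
    returned as a flat row-major list with the contracted mode removed.
    """
    n_k = dims[mode]
    if len(tensor) != prod(dims):
        raise ValueError("tensor length must equal the product of dims")
    if len(vector) != n_k:
        raise ValueError("vector length must match the contracted dimension")

    u = prod(dims[:mode])       # combined size of the modes before `mode`
    v = prod(dims[mode + 1:])   # combined size of the modes after `mode`

    out = [0.0] * (u * v)
    for i in range(u):
        for l in range(n_k):
            x_l = vector[l]
            base = (i * n_k + l) * v  # start of a contiguous run of v elements
            for j in range(v):
                # Each tensor element is touched exactly once.
                out[i * v + j] += tensor[base + j] * x_l
    return out


if __name__ == "__main__":
    dims = (2, 3, 2)
    tensor = [float(t) for t in range(prod(dims))]
    print(tvc(tensor, dims, 1, [1.0, 1.0, 1.0]))
```

Viewing the tensor as a `u × n_k × v` block lets the same three loops serve any contraction mode and any tensor order, which is one simple way to read the phrase "oblivious to contraction mode" in the abstract.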
ArticleNumber 107698
Author Beltran, Vicenç
Martinez-Ferrer, Pedro J.
Yzelman, Albert-Jan
Author_xml – sequence: 1
  givenname: Pedro J.
  orcidid: 0000-0002-6097-6676
  surname: Martinez-Ferrer
  fullname: Martinez-Ferrer, Pedro J.
  email: pedro.martinez.ferrer@upc.edu
  organization: Departament d’Arquitectura de Computadors (DAC), Universitat Politècnica de Catalunya - BarcelonaTech (UPC), Campus Nord, Edif. D6, C. Jordi Girona 1-3, Barcelona, 08034, Cataluña, Spain
– sequence: 2
  givenname: Albert-Jan
  orcidid: 0000-0001-8842-3689
  surname: Yzelman
  fullname: Yzelman, Albert-Jan
  email: albert-jan@yzelman.net
  organization: Computing Systems Laboratory, Huawei Technologies Switzerland AG, Thurgauerstrasse 80, Zürich, 8050, Dübendorf, Switzerland
– sequence: 3
  givenname: Vicenç
  orcidid: 0000-0002-3580-9630
  surname: Beltran
  fullname: Beltran, Vicenç
  email: vbeltran@bsc.es
  organization: Barcelona Supercomputing Center (BSC), Pl. Eusebi Güell 1-3, Barcelona, 08034, Cataluña, Spain
ContentType Journal Article
Copyright 2025 Elsevier B.V.
DOI 10.1016/j.future.2024.107698
Discipline Computer Science
IsPeerReviewed true
IsScholarly true
Keywords Distributed memory
Tensor contraction
Task-based parallelization
Mixed precision
GPU
High bandwidth memory
Language English
PublicationDate May 2025
PublicationTitle Future generation computer systems
Publisher Elsevier B.V