Distributed and heterogeneous tensor–vector contraction algorithms for high performance computing
| Published in: | Future generation computer systems Vol. 166; p. 107698 |
|---|---|
| Main Authors: | Martinez-Ferrer, Pedro J.; Yzelman, Albert-Jan; Beltran, Vicenç |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier B.V, 01.05.2025 |
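| Subjects: | Distributed memory; GPU; High bandwidth memory; Mixed precision; Task-based parallelization; Tensor contraction |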
| ISSN: | 0167-739X |
| Abstract | The tensor–vector contraction (TVC) is the most memory-bound operation of its class and a core component of the higher-order power method (HOPM). This paper brings distributed-memory parallelization to a native TVC algorithm for dense tensors that overall remains oblivious to contraction mode, tensor splitting, and tensor order. Similarly, we propose a novel distributed HOPM, namely dHOPM3, that can save up to one order of magnitude of streamed memory and is about twice as costly in terms of data movement as a distributed TVC operation (dTVC) when using task-based parallelization. The numerical experiments carried out in this work on three different architectures featuring multicore and accelerators confirm that the performances of dTVC and dHOPM3 remain relatively close to the peak system memory bandwidth (50%–80%, depending on the architecture) and on par with STREAM benchmark figures. On strong scalability scenarios, our native multicore implementations of these two algorithms can achieve similar and sometimes even greater performance figures than those based upon state-of-the-art CUDA batched kernels. Finally, we demonstrate that both computation and communication can benefit from mixed precision arithmetic also in cases where the hardware does not support low precision data types natively.
• A novel distributed tensor–vector contraction algorithm is introduced. • Analytical formulae are derived for minimal data movement and best throughput. • A best-in-class higher-order power method with task-based parallelization is given. • Our dTVC and dHOPM3 algorithms reach 50% to 80% of the theoretical peak bandwidth. • Our cached, mixed-precision ad hoc kernels speed up computation and communication. |
|---|---|
| ArticleNumber | 107698 |
| Author | Beltran, Vicenç; Martinez-Ferrer, Pedro J.; Yzelman, Albert-Jan |
| Author_xml | – sequence: 1 givenname: Pedro J. orcidid: 0000-0002-6097-6676 surname: Martinez-Ferrer fullname: Martinez-Ferrer, Pedro J. email: pedro.martinez.ferrer@upc.edu organization: Departament d’Arquitectura de Computadors (DAC), Universitat Politècnica de Catalunya - BarcelonaTech (UPC), Campus Nord, Edif. D6, C. Jordi Girona 1-3, Barcelona, 08034, Cataluña, Spain – sequence: 2 givenname: Albert-Jan orcidid: 0000-0001-8842-3689 surname: Yzelman fullname: Yzelman, Albert-Jan email: albert-jan@yzelman.net organization: Computing Systems Laboratory, Huawei Technologies Switzerland AG, Thurgauerstrasse 80, Zürich, 8050, Dübendorf, Switzerland – sequence: 3 givenname: Vicenç orcidid: 0000-0002-3580-9630 surname: Beltran fullname: Beltran, Vicenç email: vbeltran@bsc.es organization: Barcelona Supercomputing Center (BSC), Pl. Eusebi Güell 1-3, Barcelona, 08034, Cataluña, Spain |
| ContentType | Journal Article |
| Copyright | 2025 Elsevier B.V. |
| Copyright_xml | – notice: 2025 Elsevier B.V. |
| DBID | AAYXX CITATION |
| DOI | 10.1016/j.future.2024.107698 |
| DatabaseName | CrossRef |
| DatabaseTitle | CrossRef |
| DatabaseTitleList | |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Computer Science |
| ExternalDocumentID | 10_1016_j_future_2024_107698 S0167739X24006629 |
| ISICitedReferencesCount | 0 |
| ISSN | 0167-739X |
| IngestDate | Sat Nov 29 06:59:23 EST 2025 Sat May 03 15:56:10 EDT 2025 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Keywords | Distributed memory; Tensor contraction; Task-based parallelization; Mixed precision; GPU; High bandwidth memory |
| Language | English |
| LinkModel | OpenURL |
| ORCID | 0000-0002-3580-9630 0000-0002-6097-6676 0000-0001-8842-3689 |
| ParticipantIDs | crossref_primary_10_1016_j_future_2024_107698 elsevier_sciencedirect_doi_10_1016_j_future_2024_107698 |
| PublicationCentury | 2000 |
| PublicationDate | May 2025 2025-05-00 |
| PublicationDateYYYYMMDD | 2025-05-01 |
| PublicationDate_xml | – month: 05 year: 2025 text: May 2025 |
| PublicationDecade | 2020 |
| PublicationTitle | Future generation computer systems |
| PublicationYear | 2025 |
| Publisher | Elsevier B.V |
| Publisher_xml | – name: Elsevier B.V |
| SSID | ssj0001731 |
| Score | 2.42885 |
| SourceID | crossref elsevier |
| SourceType | Index Database Publisher |
| StartPage | 107698 |
| SubjectTerms | Distributed memory; GPU; High bandwidth memory; Mixed precision; Task-based parallelization; Tensor contraction |
| Title | Distributed and heterogeneous tensor–vector contraction algorithms for high performance computing |
| URI | https://dx.doi.org/10.1016/j.future.2024.107698 |
| Volume | 166 |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| journalDatabaseRights | – providerCode: PRVESC databaseName: Elsevier SD Freedom Collection Journals 2021 issn: 0167-739X databaseCode: AIEXJ dateStart: 19950201 customDbUrl: isFulltext: true dateEnd: 99991231 titleUrlDefault: https://www.sciencedirect.com omitProxy: false ssIdentifier: ssj0001731 providerName: Elsevier |
| linkProvider | Elsevier |
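
For readers unfamiliar with the two operations named in the abstract, the following is a minimal single-process sketch, not the paper's dTVC or dHOPM3 implementation. It assumes a dense tensor stored as a flat row-major list and shows (a) a tensor–vector contraction along an arbitrary mode `k`, written so that the same loop nest serves every mode, and (b) the classic higher-order power method built on top of it to estimate a rank-1 approximation. The function names `tvc` and `hopm`, the uniform starting vectors, and the fixed iteration count are illustrative assumptions and do not come from the record.

```python
from math import prod, sqrt


def tvc(data, shape, k, x):
    """Contract mode k of a dense row-major tensor with the vector x.

    The tensor is viewed as a 3-D array of size left x n_k x right, so one
    loop nest handles every contraction mode. Returns the flat result and
    its shape (the input shape with mode k removed).
    """
    n_k = shape[k]
    assert len(x) == n_k
    left = prod(shape[:k])        # product of the dimensions before mode k
    right = prod(shape[k + 1:])   # product of the dimensions after mode k
    out = [0.0] * (left * right)
    for i in range(left):
        for l in range(n_k):
            base = (i * n_k + l) * right
            xl = x[l]
            for j in range(right):
                out[i * right + j] += data[base + j] * xl
    return out, shape[:k] + shape[k + 1:]


def hopm(data, shape, iters=20):
    """Classic HOPM sketch: returns (sigma, list of unit vectors).

    Assumes a nonzero tensor whose dominant rank-1 term is not orthogonal
    to the uniform starting vectors (hypothetical choice for illustration).
    """
    d = len(shape)
    vecs = [[1.0 / sqrt(n)] * n for n in shape]
    sigma = 0.0
    for _ in range(iters):
        for k in range(d):
            cur, cur_shape = data, tuple(shape)
            # Contract the highest modes first so that the indices of the
            # remaining lower modes stay valid after each contraction.
            for m in reversed(range(d)):
                if m != k:
                    cur, cur_shape = tvc(cur, cur_shape, m, vecs[m])
            sigma = sqrt(sum(v * v for v in cur))
            vecs[k] = [v / sigma for v in cur]
    return sigma, vecs


if __name__ == "__main__":
    # Rank-1 test tensor a (x) b (x) c with norms 5, 3 and 3.
    a, b, c = [3.0, 4.0], [1.0, 2.0, 2.0], [2.0, 1.0, 2.0]
    t = [ai * bj * ck for ai in a for bj in b for ck in c]
    sigma, vecs = hopm(t, (2, 3, 3))
    print(sigma, vecs[0])
```

Each HOPM update for an order-d tensor performs d − 1 contractions, and the first of them streams the whole tensor, which is why the operation is memory-bound. The abstract's claims about reduced streamed memory concern the authors' dHOPM3 algorithm and are not reproduced by this naive sketch.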