Programming parallel dense matrix factorizations and inversion for new-generation NUMA architectures

Published in: Journal of parallel and distributed computing, Volume 175, pp. 51-65
Main authors: Catalán, Sandra; Igual, Francisco D.; Herrero, José R.; Rodríguez-Sánchez, Rafael; Quintana-Ortí, Enrique S.
Format: Journal Article
Language: English
Publication details: Elsevier Inc, 01.05.2023
ISSN:0743-7315, 1096-0848
Abstract We propose a methodology to address the programmability issues derived from the emergence of new-generation shared-memory NUMA architectures. For this purpose, we employ dense matrix factorizations and matrix inversion (DMFI) as a use case, and we target two modern architectures (AMD Rome and Huawei Kunpeng 920) that exhibit configurable NUMA topologies. Our methodology pursues performance portability across different NUMA configurations by proposing multi-domain implementations for DMFI plus a hybrid task- and loop-level parallelization that configures multi-threaded executions to fix core-to-data binding, exploiting locality at the expense of minor code modifications. In addition, we introduce a generalization of the multi-domain implementations for DMFI that offers support for virtually any NUMA topology in present and future architectures. Our experimentation on the two target architectures for three representative dense linear algebra operations validates the proposal, reveals insights on the necessity of adapting both the codes and their execution to improve data access locality, and reports performance across architectures and inter- and intra-socket NUMA configurations competitive with state-of-the-art message-passing implementations, maintaining the ease of development usually associated with shared-memory programming.
• Exposure of the performance penalty introduced by NUMA-oblivious implementations.
• Demonstration that a high-level approach can largely diminish the programming effort.
• Demonstration of performance boost when algorithms span across several NUMA domains.
• Validation via matrix factorization and inversion on state-of-the-art NUMA servers.
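The multi-domain idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a tiled, right-looking Cholesky factorization as the example operation and a 2D block-cyclic map from tiles to NUMA domains; the abstract specifies neither. The names `owner`, `tiled_cholesky`, `dom_rows`, and `dom_cols` are hypothetical. Pure Python performs no thread or memory binding, so the domain ids are labels only; the sketch shows how each trailing-update tile could be attributed to one domain so that work on a tile stays with the domain holding its data.

```python
import math


def owner(ti, tj, dom_rows, dom_cols):
    """Hypothetical 2D block-cyclic map from tile (ti, tj) to a domain id."""
    return (ti % dom_rows) * dom_cols + (tj % dom_cols)


def tiled_cholesky(a, b, dom_rows=2, dom_cols=2):
    """Factor the SPD matrix `a` (list of lists) in place as L * L^T.

    Only the lower triangle is overwritten (with L); `b` is the tile size.
    Returns a dict counting trailing-update tiles per owning domain.
    """
    n = len(a)
    nt = (n + b - 1) // b  # number of tile rows/columns
    work = {}
    for k in range(nt):
        k0, k1 = k * b, min((k + 1) * b, n)
        # Factor the diagonal tile column by column.
        for j in range(k0, k1):
            a[j][j] = math.sqrt(a[j][j] - sum(a[j][p] ** 2 for p in range(k0, j)))
            for i in range(j + 1, k1):
                s = sum(a[i][p] * a[j][p] for p in range(k0, j))
                a[i][j] = (a[i][j] - s) / a[j][j]
        # Triangular solve for the panel below the diagonal tile.
        for i in range(k1, n):
            for j in range(k0, k1):
                s = sum(a[i][p] * a[j][p] for p in range(k0, j))
                a[i][j] = (a[i][j] - s) / a[j][j]
        # Trailing update, one tile at a time; each tile is charged to its owner.
        for ti in range(k + 1, nt):
            for tj in range(k + 1, ti + 1):
                d = owner(ti, tj, dom_rows, dom_cols)
                work[d] = work.get(d, 0) + 1
                for i in range(ti * b, min((ti + 1) * b, n)):
                    for j in range(tj * b, min((tj + 1) * b, n)):
                        if j <= i:  # lower triangle only
                            a[i][j] -= sum(a[i][p] * a[j][p] for p in range(k0, k1))
    return work
```

In a real shared-memory code, the per-tile loop would be where tasks are created and threads are bound to the cores of the owning domain, for example through OpenMP affinity settings; that part is outside what this sketch shows.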
Author_xml – sequence: 1
  givenname: Sandra
  surname: Catalán
  fullname: Catalán, Sandra
  organization: Departamento de Arquitectura de Computadores y Automatica, Universidad Complutense de Madrid, Madrid, Spain
– sequence: 2
  givenname: Francisco D.
  surname: Igual
  fullname: Igual, Francisco D.
  organization: Departamento de Arquitectura de Computadores y Automatica, Universidad Complutense de Madrid, Madrid, Spain
– sequence: 3
  givenname: José R.
  surname: Herrero
  fullname: Herrero, José R.
  organization: Departament d'Arquitectura de Computadors, Universitat Politecnica de Catalunya, Barcelona, Spain
– sequence: 4
  givenname: Rafael
  surname: Rodríguez-Sánchez
  fullname: Rodríguez-Sánchez, Rafael
  organization: Departamento de Arquitectura de Computadores y Automatica, Universidad Complutense de Madrid, Madrid, Spain
– sequence: 5
  givenname: Enrique S.
  surname: Quintana-Ortí
  fullname: Quintana-Ortí, Enrique S.
  email: quintana@disca.upv.es
  organization: Departamento de Informatica de Sistemas y Computadores, Universitat Politecnica de Valencia, Valencia, Spain
CitedBy_id crossref_primary_10_1016_j_future_2023_07_005
ContentType Journal Article
Copyright 2023 The Author(s)
Copyright_xml – notice: 2023 The Author(s)
DOI 10.1016/j.jpdc.2023.01.004
Discipline Computer Science
EISSN 1096-0848
EndPage 65
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Keywords NUMA architectures
Portability
Shared memory programming
Chiplets
Dense linear algebra
Language English
License This is an open access article under the CC BY-NC-ND license.
OpenAccessLink https://dx.doi.org/10.1016/j.jpdc.2023.01.004
PageCount 15
PublicationCentury 2000
PublicationDate May 2023
2023-05-00
PublicationDateYYYYMMDD 2023-05-01
PublicationDate_xml – month: 05
  year: 2023
  text: May 2023
PublicationDecade 2020
PublicationTitle Journal of parallel and distributed computing
PublicationYear 2023
Publisher Elsevier Inc
Publisher_xml – name: Elsevier Inc
StartPage 51
SubjectTerms Chiplets
Dense linear algebra
NUMA architectures
Portability
Shared memory programming
Title Programming parallel dense matrix factorizations and inversion for new-generation NUMA architectures
URI https://dx.doi.org/10.1016/j.jpdc.2023.01.004
Volume 175