Programming parallel dense matrix factorizations and inversion for new-generation NUMA architectures

Detailed bibliography
Published in:	Journal of Parallel and Distributed Computing, Vol. 175, pp. 51-65
Main authors:	Catalán, Sandra; Igual, Francisco D.; Herrero, José R.; Rodríguez-Sánchez, Rafael; Quintana-Ortí, Enrique S.
Format:	Journal Article
Language:	English
Published:	Elsevier Inc, 1 May 2023
Subjects:	Chiplets; Dense linear algebra; NUMA architectures; Portability; Shared memory programming
ISSN:	0743-7315, 1096-0848

Abstract	We propose a methodology to address the programmability issues derived from the emergence of new-generation shared-memory NUMA architectures. For this purpose, we employ dense matrix factorizations and matrix inversion (DMFI) as a use case, and we target two modern architectures (AMD Rome and Huawei Kunpeng 920) that exhibit configurable NUMA topologies. Our methodology pursues performance portability across different NUMA configurations by proposing multi-domain implementations for DMFI plus a hybrid task- and loop-level parallelization that configures multi-threaded executions to fix core-to-data binding, exploiting locality at the expense of minor code modifications. In addition, we introduce a generalization of the multi-domain implementations for DMFI that offers support for virtually any NUMA topology in present and future architectures. Our experimentation on the two target architectures for three representative dense linear algebra operations validates the proposal, reveals insights on the necessity of adapting both the codes and their execution to improve data access locality, and reports performance across architectures and inter- and intra-socket NUMA configurations competitive with state-of-the-art message-passing implementations, maintaining the ease of development usually associated with shared-memory programming.
• Exposure of the performance penalty introduced by NUMA-oblivious implementations.
• Demonstration that a high-level approach can largely diminish the programming effort.
• Demonstration of performance boost when algorithms span across several NUMA domains.
• Validation via matrix factorization and inversion on state-of-the-art NUMA servers.
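The abstract does not spell out how matrix data are assigned to NUMA domains. As a purely illustrative sketch, the following assumes a 2D block-cyclic "owner" map over tiles, a layout familiar from message-passing libraries, and groups the trailing-update tiles of one step of a tiled Cholesky-style factorization by the domain that owns them. All names (`tile_owner`, `trailing_update_tasks`, the grid parameters) are hypothetical and are not taken from the paper.

```python
from collections import Counter


def tile_owner(i, j, grid_rows, grid_cols):
    """Id of the NUMA domain owning tile (i, j) under an assumed
    2D block-cyclic map over a grid_rows x grid_cols grid of domains."""
    return (i % grid_rows) * grid_cols + (j % grid_cols)


def lower_tiles(num_tiles):
    """Yield the (i, j) indices of the tiles in the lower triangle."""
    for i in range(num_tiles):
        for j in range(i + 1):
            yield i, j


def tiles_per_domain(num_tiles, grid_rows, grid_cols):
    """Count how many lower-triangular tiles each domain owns."""
    counts = Counter()
    for i, j in lower_tiles(num_tiles):
        counts[tile_owner(i, j, grid_rows, grid_cols)] += 1
    return counts


def trailing_update_tasks(k, num_tiles, grid_rows, grid_cols):
    """Group the tiles updated after step k by owning domain, so that the
    threads bound to a domain only touch tiles stored in that domain."""
    tasks = {d: [] for d in range(grid_rows * grid_cols)}
    for i in range(k + 1, num_tiles):
        for j in range(k + 1, i + 1):
            tasks[tile_owner(i, j, grid_rows, grid_cols)].append((i, j))
    return tasks


if __name__ == "__main__":
    # 4 x 4 tiles distributed over a hypothetical 2 x 2 grid of NUMA domains.
    print(dict(tiles_per_domain(4, 2, 2)))
    print(trailing_update_tasks(0, 4, 2, 2))
```

In a real shared-memory code, each per-domain task list would be run by threads pinned to the cores of that domain (for example via the operating system's or runtime's affinity controls), which is one way to read the "core-to-data binding" mentioned above; the sketch only computes the mapping and performs no numerical work.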
Author	Herrero, José R.; Rodríguez-Sánchez, Rafael; Catalán, Sandra; Quintana-Ortí, Enrique S.; Igual, Francisco D.
Author_xml
– sequence: 1; givenname: Sandra; surname: Catalán; fullname: Catalán, Sandra; organization: Departamento de Arquitectura de Computadores y Automatica, Universidad Complutense de Madrid, Madrid, Spain
– sequence: 2; givenname: Francisco D.; surname: Igual; fullname: Igual, Francisco D.; organization: Departamento de Arquitectura de Computadores y Automatica, Universidad Complutense de Madrid, Madrid, Spain
– sequence: 3; givenname: José R.; surname: Herrero; fullname: Herrero, José R.; organization: Departament d'Arquitectura de Computadors, Universitat Politecnica de Catalunya, Barcelona, Spain
– sequence: 4; givenname: Rafael; surname: Rodríguez-Sánchez; fullname: Rodríguez-Sánchez, Rafael; organization: Departamento de Arquitectura de Computadores y Automatica, Universidad Complutense de Madrid, Madrid, Spain
– sequence: 5; givenname: Enrique S.; surname: Quintana-Ortí; fullname: Quintana-Ortí, Enrique S.; email: quintana@disca.upv.es; organization: Departamento de Informatica de Sistemas y Computadores, Universitat Politecnica de Valencia, Valencia, Spain
CitedBy_id	crossref_primary_10_1016_j_future_2023_07_005
Cites_doi	10.1145/2925987 10.3390/electronics7120359 10.1145/2764454 10.1109/MM.2021.3085578 10.1145/77626.79170 10.1145/2508834.2513149 10.1109/TCAD.2020.2970019 10.1145/3264491 10.1145/3199605 10.1145/216585.216588 10.1002/cpe.1463 10.1145/1356052.1356053 10.1145/1527286.1527288 10.1016/j.compeleceng.2015.06.009 10.1016/j.future.2021.11.008 10.3390/electronics10161984 10.1109/TPDS.2017.2787123
ContentType	Journal Article
Copyright	2023 The Author(s)
Copyright_xml	– notice: 2023 The Author(s)
DBID	6I. AAFTH AAYXX CITATION
DOI	10.1016/j.jpdc.2023.01.004
DatabaseName	ScienceDirect Open Access Titles Elsevier:ScienceDirect:Open Access CrossRef
DatabaseTitle	CrossRef
DatabaseTitleList
DeliveryMethod	fulltext_linktorsrc
Discipline	Computer Science
EISSN	1096-0848
EndPage	65
ExternalDocumentID	10_1016_j_jpdc_2023_01_004 S0743731523000047
ISICitedReferencesCount	1
ISSN	0743-7315
IngestDate	Sat Nov 29 07:17:14 EST 2025 Tue Nov 18 22:12:41 EST 2025 Fri Feb 23 02:38:37 EST 2024
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	true
IsScholarly	true
Keywords	NUMA architectures; Portability; Shared memory programming; Chiplets; Dense linear algebra
Language	English
License	This is an open access article under the CC BY-NC-ND license.
LinkModel	OpenURL
OpenAccessLink	https://dx.doi.org/10.1016/j.jpdc.2023.01.004
PageCount	15
PublicationCentury	2000
PublicationDate	May 2023 2023-05-00
PublicationDateYYYYMMDD	2023-05-01
PublicationDate_xml	– month: 05 year: 2023 text: May 2023
PublicationDecade	2020
PublicationTitle	Journal of parallel and distributed computing
PublicationYear	2023
Publisher	Elsevier Inc
Publisher_xml	– name: Elsevier Inc
StartPage	51
SubjectTerms	Chiplets; Dense linear algebra; NUMA architectures; Portability; Shared memory programming
Title	Programming parallel dense matrix factorizations and inversion for new-generation NUMA architectures
URI	https://dx.doi.org/10.1016/j.jpdc.2023.01.004
Volume	175