A multithreaded CUDA and OpenMP based power‐aware programming framework for multi‐node GPU systems
| Published in: | Concurrency and Computation, Volume 35, Issue 25 |
|---|---|
| Main author: | Czarnul, Paweł |
| Format: | Journal Article |
| Language: | English |
| Published: | Hoboken: Wiley Subscription Services, Inc., 15.11.2023 |
| Topics: | Graphics processing units; Nodes; Optimization; Power; Streams |
| ISSN: | 1532-0626, 1532-0634 |
| Online access: | Get full text |
| Abstract | In this article, we propose a framework for programming a parallel application for a multi-node system, with one or more graphical processing units (GPUs) per node, using an OpenMP+extended CUDA API. OpenMP is used to launch threads responsible for managing particular GPUs, and extended CUDA calls allow transferring data and launching kernels on local and remote GPUs. The framework hides inter-node MPI communication from the programmer. For optimization, the implementation takes advantage of the MPI_THREAD_MULTIPLE mode, allowing multiple threads to handle distinct GPUs and overlapping communication and computations transparently using multiple CUDA streams. The solution parallelizes data across available GPUs in order to minimize execution time and supports a power-aware mode in which GPUs are automatically selected for computations using a greedy approach so as not to exceed an imposed power limit. We implemented and benchmarked three parallel applications: finding the largest divisors; verifying the Collatz conjecture; finding patterns in vectors. These were tested on three different systems: a GPU cluster with 16 nodes, each with an NVIDIA GTX 1060 GPU; a powerful 2-node system, one node with 8 NVIDIA Quadro RTX 6000 GPUs and the second with 4 NVIDIA Quadro RTX 5000 GPUs; and a heterogeneous environment with one node with 2 NVIDIA RTX 2080 GPUs and 2 nodes with NVIDIA GTX 1060 GPUs. We demonstrated the effectiveness of the framework through execution times versus power caps within ranges of 100–1400 W, 250–3000 W, and 125–600 W for these systems, respectively, as well as gains from using two versus one CUDA stream per GPU. Finally, we showed that for the testbed applications the solution obtains high speed-ups, between 89.3% and 97.4% of the theoretically assessed ideal ones, for 16 nodes and 2 CUDA streams, demonstrating very good parallel efficiency. (An illustrative code sketch of the programming pattern described here follows the record below.) |
|---|---|
| Author | Czarnul, Paweł |
| Author affiliation | Paweł Czarnul (ORCID 0000-0002-4918-9196), Department of Computer Architecture, Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, Gdansk, Poland |
| Cited by | 10.1002/cpe.8003; 10.1142/S0129156424400068 |
| ContentType | Journal Article |
| Copyright | 2023 John Wiley & Sons, Ltd. |
| DOI | 10.1002/cpe.7897 |
| DatabaseName | CrossRef Computer and Information Systems Abstracts Technology Research Database ProQuest Computer Science Collection Advanced Technologies Database with Aerospace Computer and Information Systems Abstracts Academic Computer and Information Systems Abstracts Professional |
| DatabaseTitle | CrossRef Computer and Information Systems Abstracts Technology Research Database Computer and Information Systems Abstracts – Academic Advanced Technologies Database with Aerospace ProQuest Computer Science Collection Computer and Information Systems Abstracts Professional |
| DatabaseTitleList | CrossRef Computer and Information Systems Abstracts |
| Discipline | Computer Science |
| EISSN | 1532-0634 |
| ISSN | 1532-0626 |
| IsDoiOpenAccess | false |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 25 |
| Language | English |
| ORCID | 0000-0002-4918-9196 |
| OpenAccessLink | https://onlinelibrary.wiley.com/doi/pdfdirect/10.1002/cpe.7897 |
| PublicationDate | 2023-11-15 |
| PublicationPlace | Hoboken |
| PublicationTitle | Concurrency and computation |
| PublicationYear | 2023 |
| Publisher | Wiley Subscription Services, Inc |
| SubjectTerms | Graphics processing units Nodes Optimization Power Streams |
| Title | A multithreaded CUDA and OpenMP based power‐aware programming framework for multi‐node GPU systems |
| URI | https://www.proquest.com/docview/2878204877 |
| Volume | 35 |
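The abstract above outlines a multithreaded programming pattern: MPI initialized in MPI_THREAD_MULTIPLE mode, one OpenMP thread per GPU, multiple CUDA streams per GPU to overlap work, and a greedy power-aware selection of GPUs. The following is a minimal sketch of that general pattern only, not the framework's actual API: the kernel, the per-GPU power values, the 300 W limit, and the helper `selectGpusUnderPowerCap` are illustrative assumptions, and the framework's hidden MPI data distribution is not reproduced.

```cpp
// Illustrative sketch only; build e.g. with: nvcc -Xcompiler -fopenmp sketch.cu -lmpi
#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Hypothetical kernel standing in for an application kernel (e.g., pattern search).
__global__ void processChunk(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;  // placeholder computation
}

// Greedy selection: take GPUs in order while the sum of their (assumed) power caps
// stays within the imposed limit. Real power readings (e.g., via NVML) are not shown.
std::vector<int> selectGpusUnderPowerCap(const std::vector<double>& gpuPowerW, double limitW) {
    std::vector<int> chosen;
    double total = 0.0;
    for (int g = 0; g < (int)gpuPowerW.size(); ++g) {
        if (total + gpuPowerW[g] <= limitW) {
            chosen.push_back(g);
            total += gpuPowerW[g];
        }
    }
    return chosen;
}

int main(int argc, char** argv) {
    // MPI must permit concurrent calls from multiple threads, since each
    // GPU-managing OpenMP thread may communicate independently.
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    // Assumed per-GPU power caps (watts) and node power limit; purely illustrative.
    std::vector<double> gpuPowerW(deviceCount, 120.0);
    std::vector<int> gpus = selectGpusUnderPowerCap(gpuPowerW, 300.0);

    if (!gpus.empty()) {
        // One OpenMP thread per selected GPU; two CUDA streams per GPU so that
        // work on alternate data chunks can be issued independently and overlapped.
        #pragma omp parallel num_threads((int)gpus.size())
        {
            int gpu = gpus[omp_get_thread_num()];
            cudaSetDevice(gpu);

            cudaStream_t streams[2];
            cudaStreamCreate(&streams[0]);
            cudaStreamCreate(&streams[1]);

            const int n = 1 << 20;
            float *dIn = nullptr, *dOut = nullptr;
            cudaMalloc(&dIn, n * sizeof(float));
            cudaMalloc(&dOut, n * sizeof(float));

            // In the framework described above, input chunks would arrive from remote
            // nodes via MPI calls hidden behind the extended CUDA API; here the device
            // buffers are simply processed in place, alternating between the two streams.
            for (int chunk = 0; chunk < 4; ++chunk) {
                cudaStream_t s = streams[chunk % 2];
                processChunk<<<(n + 255) / 256, 256, 0, s>>>(dIn, dOut, n);
            }
            cudaDeviceSynchronize();

            cudaFree(dIn);
            cudaFree(dOut);
            cudaStreamDestroy(streams[0]);
            cudaStreamDestroy(streams[1]);
        }
    }

    MPI_Finalize();
    return 0;
}
```

In the framework itself, data distribution, inter-node transfers, and stream management are handled internally; the sketch only makes the threading, stream, and power-cap structure visible.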