A multithreaded CUDA and OpenMP based power‐aware programming framework for multi‐node GPU systems
| Published in: | Concurrency and Computation, Volume 35, Issue 25 |
|---|---|
| Main author: | Czarnul, Paweł |
| Format: | Journal Article |
| Language: | English |
| Published: | Hoboken: Wiley Subscription Services, Inc., 15.11.2023 |
| Topics: | Graphics processing units; Nodes; Optimization; Power; Streams |
| ISSN: | 1532-0626, 1532-0634 |
| Online access: | Get full text |
| Abstract | In this article, we propose a framework for programming a parallel application for a multi-node system, with one or more graphical processing units (GPUs) per node, using an OpenMP+extended CUDA API. OpenMP is used to launch threads responsible for managing particular GPUs, and extended CUDA calls allow transferring data and launching kernels on local and remote GPUs. The framework hides inter-node MPI communication from the programmer. For optimization, the implementation takes advantage of the MPI_THREAD_MULTIPLE mode, allowing multiple threads to handle distinct GPUs and overlapping communication and computations transparently using multiple CUDA streams. The solution parallelizes data across available GPUs in order to minimize execution time and supports a power-aware mode in which GPUs are automatically selected for computations using a greedy approach so as not to exceed an imposed power limit. We implemented and benchmarked three parallel applications: finding the largest divisors; verifying the Collatz conjecture; finding patterns in vectors. These were tested on three different systems: a GPU cluster with 16 nodes, each with an NVIDIA GTX 1060 GPU; a powerful 2-node system, one node with 8 NVIDIA Quadro RTX 6000 GPUs and the second with 4 NVIDIA Quadro RTX 5000 GPUs; and a heterogeneous environment with one node with 2 NVIDIA RTX 2080 GPUs and 2 nodes with NVIDIA GTX 1060 GPUs. We demonstrated the effectiveness of the framework through execution times versus power caps within ranges of 100–1400 W, 250–3000 W, and 125–600 W for these systems, respectively, as well as gains from using two versus one CUDA stream per GPU. Finally, we showed that for the testbed applications the solution obtains high speed-ups, between 89.3% and 97.4% of the theoretically assessed ideal ones, for 16 nodes and 2 CUDA streams, demonstrating very good parallel efficiency. (An illustrative code sketch of the programming pattern described here follows the record below.) |
|---|---|
| Author | Czarnul, Paweł |
| Author affiliation | Paweł Czarnul (ORCID 0000-0002-4918-9196), Department of Computer Architecture, Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, Gdansk, Poland |
| Cited by | 10.1002/cpe.8003; 10.1142/S0129156424400068 |
| ContentType | Journal Article |
| Copyright | 2023 John Wiley & Sons, Ltd. |
| DOI | 10.1002/cpe.7897 |
| DatabaseName | CrossRef Computer and Information Systems Abstracts Technology Research Database ProQuest Computer Science Collection Advanced Technologies Database with Aerospace Computer and Information Systems Abstracts Academic Computer and Information Systems Abstracts Professional |
| DatabaseTitle | CrossRef Computer and Information Systems Abstracts Technology Research Database Computer and Information Systems Abstracts – Academic Advanced Technologies Database with Aerospace ProQuest Computer Science Collection Computer and Information Systems Abstracts Professional |
| DatabaseTitleList | CrossRef Computer and Information Systems Abstracts |
| Discipline | Computer Science |
| EISSN | 1532-0634 |
| ISSN | 1532-0626 |
| IsDoiOpenAccess | false |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 25 |
| Language | English |
| ORCID | 0000-0002-4918-9196 |
| OpenAccessLink | https://onlinelibrary.wiley.com/doi/pdfdirect/10.1002/cpe.7897 |
| PublicationDate | 2023-11-15 |
| PublicationPlace | Hoboken |
| PublicationTitle | Concurrency and computation |
| PublicationYear | 2023 |
| Publisher | Wiley Subscription Services, Inc |
| SubjectTerms | Graphics processing units Nodes Optimization Power Streams |
| Title | A multithreaded CUDA and OpenMP based power‐aware programming framework for multi‐node GPU systems |
| URI | https://www.proquest.com/docview/2878204877 |
| Volume | 35 |
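The abstract above outlines a multithreaded programming pattern: MPI initialized in MPI_THREAD_MULTIPLE mode, one OpenMP thread per GPU, multiple CUDA streams per GPU to overlap work, and a greedy power-aware selection of GPUs. The following is a minimal sketch of that general pattern only, not the framework's actual API: the kernel, the per-GPU power values, the 300 W limit, and the helper `selectGpusUnderPowerCap` are illustrative assumptions, and the framework's hidden MPI data distribution is not reproduced.

```cpp
// Illustrative sketch only; build e.g. with: nvcc -Xcompiler -fopenmp sketch.cu -lmpi
#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Hypothetical kernel standing in for an application kernel (e.g., pattern search).
__global__ void processChunk(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;  // placeholder computation
}

// Greedy selection: take GPUs in order while the sum of their (assumed) power caps
// stays within the imposed limit. Real power readings (e.g., via NVML) are not shown.
std::vector<int> selectGpusUnderPowerCap(const std::vector<double>& gpuPowerW, double limitW) {
    std::vector<int> chosen;
    double total = 0.0;
    for (int g = 0; g < (int)gpuPowerW.size(); ++g) {
        if (total + gpuPowerW[g] <= limitW) {
            chosen.push_back(g);
            total += gpuPowerW[g];
        }
    }
    return chosen;
}

int main(int argc, char** argv) {
    // MPI must permit concurrent calls from multiple threads, since each
    // GPU-managing OpenMP thread may communicate independently.
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    // Assumed per-GPU power caps (watts) and node power limit; purely illustrative.
    std::vector<double> gpuPowerW(deviceCount, 120.0);
    std::vector<int> gpus = selectGpusUnderPowerCap(gpuPowerW, 300.0);

    if (!gpus.empty()) {
        // One OpenMP thread per selected GPU; two CUDA streams per GPU so that
        // work on alternate data chunks can be issued independently and overlapped.
        #pragma omp parallel num_threads((int)gpus.size())
        {
            int gpu = gpus[omp_get_thread_num()];
            cudaSetDevice(gpu);

            cudaStream_t streams[2];
            cudaStreamCreate(&streams[0]);
            cudaStreamCreate(&streams[1]);

            const int n = 1 << 20;
            float *dIn = nullptr, *dOut = nullptr;
            cudaMalloc(&dIn, n * sizeof(float));
            cudaMalloc(&dOut, n * sizeof(float));

            // In the framework described above, input chunks would arrive from remote
            // nodes via MPI calls hidden behind the extended CUDA API; here the device
            // buffers are simply processed in place, alternating between the two streams.
            for (int chunk = 0; chunk < 4; ++chunk) {
                cudaStream_t s = streams[chunk % 2];
                processChunk<<<(n + 255) / 256, 256, 0, s>>>(dIn, dOut, n);
            }
            cudaDeviceSynchronize();

            cudaFree(dIn);
            cudaFree(dOut);
            cudaStreamDestroy(streams[0]);
            cudaStreamDestroy(streams[1]);
        }
    }

    MPI_Finalize();
    return 0;
}
```

In the framework itself, data distribution, inter-node transfers, and stream management are handled internally; the sketch only makes the threading, stream, and power-cap structure visible.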