A multithreaded CUDA and OpenMP based power‐aware programming framework for multi‐node GPU systems


Bibliographic details
Published in: Concurrency and Computation, Volume 35, Issue 25
Main author: Czarnul, Paweł
Format: Journal Article
Language: English
Published: Hoboken: Wiley Subscription Services, Inc., 15 November 2023
Subjects: Graphics processing units; Nodes; Optimization; Power; Streams
ISSN: 1532-0626 (print); 1532-0634 (electronic)
DOI: 10.1002/cpe.7897
Full text: https://onlinelibrary.wiley.com/doi/pdfdirect/10.1002/cpe.7897
ProQuest record: https://www.proquest.com/docview/2878204877
Abstract: In this article, we propose a framework for programming a parallel application for a multi-node system, with one or more graphics processing units (GPUs) per node, using an OpenMP+extended CUDA API. OpenMP is used to launch threads responsible for managing particular GPUs, and extended CUDA calls allow the programmer to transfer data and launch kernels on local and remote GPUs. The framework hides inter-node MPI communication from the programmer. For optimization, the implementation takes advantage of the MPI_THREAD_MULTIPLE mode, which allows multiple threads to handle distinct GPUs and to transparently overlap communication and computation using multiple CUDA streams. The solution parallelizes data across the available GPUs in order to minimize execution time, and it supports a power-aware mode in which GPUs are automatically selected for computation using a greedy approach so that an imposed power limit is not exceeded. We implemented and benchmarked three parallel applications: finding the largest divisors, verification of the Collatz conjecture, and finding patterns in vectors. These were tested on three different systems: a GPU cluster with 16 nodes, each with an NVIDIA GTX 1060 GPU; a powerful 2-node system, one node with 8 NVIDIA Quadro RTX 6000 GPUs and the other with 4 NVIDIA Quadro RTX 5000 GPUs; and a heterogeneous environment with one node with 2 NVIDIA RTX 2080 GPUs and 2 nodes with NVIDIA GTX 1060 GPUs. We demonstrate the effectiveness of the framework through execution times versus power caps within ranges of 100–1400 W, 250–3000 W, and 125–600 W for these systems, respectively, as well as through gains from using two versus one CUDA stream per GPU. Finally, we show that for the testbed applications the solution obtains speed-ups between 89.3% and 97.4% of the theoretically assessed ideal ones, for 16 nodes and 2 CUDA streams, demonstrating very good parallel efficiency.
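The per-GPU threading model described in the abstract maps naturally onto plain OpenMP plus the CUDA runtime. Below is a minimal sketch of that pattern using only standard OpenMP and CUDA runtime calls; the paper's extended API (its actual function names and signatures) is not reproduced in this record, so none of it appears here, and the MPI routing to remote GPUs is omitted.

    #include <cstdio>
    #include <cuda_runtime.h>
    #include <omp.h>

    // One OpenMP thread per local GPU: each thread binds to "its" device
    // and then issues all transfers and kernel launches for that device.
    int main() {
        int ngpus = 0;
        cudaGetDeviceCount(&ngpus);
    #pragma omp parallel num_threads(ngpus)
        {
            int gpu = omp_get_thread_num();
            cudaSetDevice(gpu);  // bind the calling thread to GPU 'gpu'
            // ... allocate device buffers, copy input, launch kernels here;
            // in the paper's framework, extended calls would additionally
            // route work to remote GPUs over MPI, which this sketch omits.
            printf("thread %d manages local GPU %d\n", gpu, gpu);
        }
        return 0;
    }

Compiled with, for example, nvcc -Xcompiler -fopenmp. Binding one host thread per device is what lets each GPU be driven independently, which in turn is why the framework needs MPI initialized in MPI_THREAD_MULTIPLE mode.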
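The overlap of communication and computation via multiple CUDA streams, which the abstract credits for part of the measured gains, can be illustrated with stock CUDA runtime calls. A minimal two-stream sketch follows, under the usual assumptions (pinned host memory, a device that can overlap copies with kernels); the kernel and data sizes are placeholders, not the paper's workloads.

    #include <cuda_runtime.h>

    __global__ void square(float* d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= d[i];
    }

    int main() {
        const int N = 1 << 20, CHUNK = N / 2;
        float *h, *d;
        cudaMallocHost((void**)&h, N * sizeof(float));  // pinned memory: required for truly async copies
        cudaMalloc((void**)&d, N * sizeof(float));
        for (int i = 0; i < N; ++i) h[i] = (float)i;

        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);

        // Split the data in two; each chunk's H2D copy, kernel, and D2H copy
        // run in their own stream, so chunk 1's transfers overlap chunk 0's
        // compute (and vice versa).
        for (int c = 0; c < 2; ++c) {
            float* hp = h + c * CHUNK;
            float* dp = d + c * CHUNK;
            cudaMemcpyAsync(dp, hp, CHUNK * sizeof(float), cudaMemcpyHostToDevice, s[c]);
            square<<<(CHUNK + 255) / 256, 256, 0, s[c]>>>(dp, CHUNK);
            cudaMemcpyAsync(hp, dp, CHUNK * sizeof(float), cudaMemcpyDeviceToHost, s[c]);
        }
        cudaDeviceSynchronize();
        cudaStreamDestroy(s[0]);
        cudaStreamDestroy(s[1]);
        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }

This is the mechanism behind the reported gains from two versus one stream per GPU: with a single stream the copy-compute-copy sequence serializes, whereas with two streams transfer time largely hides behind kernel time.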
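The power-aware mode is described in this record only as "a greedy approach" that keeps total power under an imposed cap; the concrete selection policy is in the paper itself. The sketch below shows one plausible greedy shape, with made-up performance and power ratings; the Gpu struct and selectGpus helper are illustrative, not the framework's API.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Gpu { int id; double perf; double power_w; };  // hypothetical per-GPU ratings

    // Greedy selection: take the fastest GPUs first while the summed power
    // rating stays under the cap; skip any GPU whose addition would exceed it.
    std::vector<int> selectGpus(std::vector<Gpu> gpus, double cap_w) {
        std::sort(gpus.begin(), gpus.end(),
                  [](const Gpu& a, const Gpu& b) { return a.perf > b.perf; });
        std::vector<int> chosen;
        double used = 0.0;
        for (const Gpu& g : gpus) {
            if (used + g.power_w <= cap_w) {
                chosen.push_back(g.id);
                used += g.power_w;
            }
        }
        return chosen;
    }

    int main() {
        // Illustrative numbers only: a 450 W cap over four mixed GPUs.
        std::vector<Gpu> gpus = {{0, 1.0, 120}, {1, 1.0, 120}, {2, 2.5, 215}, {3, 2.5, 215}};
        for (int id : selectGpus(gpus, 450.0))
            printf("use GPU %d\n", id);
        return 0;
    }

With these numbers the two faster GPUs are chosen first (430 W committed), and neither 120 W GPU fits under the remaining 20 W, so the cap trims the active GPU set, which matches the abstract's description of selecting GPUs for computation rather than, say, throttling clocks.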
Author: Czarnul, Paweł (ORCID: 0000-0002-4918-9196)
Affiliation: Department of Computer Architecture, Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, Gdansk, Poland
Copyright: 2023 John Wiley & Sons, Ltd.