Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations

The effective deployment of applications exhibiting irregular nested parallelism on GPUs is still an open problem. A naïve mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational patterns...

Bibliographic details
Published in: Proceedings of the International Conference on Parallel Processing, pp. 979-988
Main authors: Da Li, Hancheng Wu, Michela Becchi
Format: Conference paper; Journal Article
Language: English
Publication details: IEEE, 2015-09-01
Subject:
ISSN: 0190-3918
Abstract The effective deployment of applications exhibiting irregular nested parallelism on GPUs is still an open problem. A naïve mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational patterns exhibiting nested parallelism: irregular nested loops and parallel recursive computations. In particular, we focus on recursive algorithms operating on trees and graphs. We propose different parallelization templates aimed at increasing the GPU utilization of these codes. Specifically, we investigate mechanisms to effectively distribute irregular work to streaming multiprocessors and GPU cores. Some of our parallelization templates rely on dynamic parallelism, a feature recently introduced by NVIDIA in their Kepler GPUs and announced as part of the OpenCL 2.0 standard. We propose mechanisms to maximize the work performed by nested kernels and minimize the overhead due to their invocation. Our results show that the use of our parallelization templates on applications with irregular nested loops can lead to a 2-6x speedup over baseline GPU codes that do not include load balancing mechanisms. The use of nested parallelism-based parallelization templates on recursive tree traversal algorithms can lead to substantial speedups (up to 15-24x) over optimized CPU implementations. However, the benefits of nested parallelism are still unclear in the presence of recursive applications operating on graphs, especially when recursive code variants require expensive synchronization. In these cases, a flat parallelization of iterative versions of the considered algorithms may be preferable.
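Illustrative note: the underutilization the abstract describes can be made concrete with a toy cost model. The sketch below is not taken from the paper. It assumes a warp of 32 threads running in lockstep, counts abstract "steps" rather than real time, and uses hypothetical function names and a hypothetical threshold parameter. It compares a naive mapping (one outer-loop iteration per thread) with a simple load-balancing scheme that defers large iterations and lets a whole warp share each of them.

```python
import math

WARP_SIZE = 32  # threads that execute in lockstep on NVIDIA GPUs


def thread_mapped_steps(inner_counts, warp_size=WARP_SIZE):
    """Lockstep steps when each thread runs one outer iteration serially.

    A warp finishes only when its slowest thread does, so each warp costs
    the maximum inner trip count among the outer iterations it holds.
    """
    steps = 0
    for start in range(0, len(inner_counts), warp_size):
        steps += max(inner_counts[start:start + warp_size])
    return steps


def deferred_steps(inner_counts, threshold, warp_size=WARP_SIZE):
    """Lockstep steps when large outer iterations are deferred and shared.

    Iterations with at most `threshold` inner iterations stay thread-mapped.
    Larger ones are set aside (their thread idles in the first phase) and
    each is then processed by a full warp, warp_size inner iterations per step.
    """
    small = [n if n <= threshold else 0 for n in inner_counts]
    large = [n for n in inner_counts if n > threshold]
    steps = thread_mapped_steps(small, warp_size)
    steps += sum(math.ceil(n / warp_size) for n in large)
    return steps


# Hypothetical workload: 64 outer iterations, one of them far heavier.
counts = [4] * 31 + [1024] + [4] * 32
print(thread_mapped_steps(counts))           # 1028
print(deferred_steps(counts, threshold=32))  # 40
```

In this made-up workload a single heavy iteration stalls 31 other threads under the naive mapping. The step counts only illustrate the effect and are not comparable to the speedups reported in the abstract.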
Author Becchi, Michela
Da Li
Hancheng Wu
Author_xml – sequence: 1
  surname: Da Li
  fullname: Da Li
  email: da.li@mail.missouri.edu
  organization: Dept. of Electr. & Comput. Eng., Univ. of Missouri-Columbia, Columbia, MO, USA
– sequence: 2
  surname: Hancheng Wu
  fullname: Hancheng Wu
  email: hancheng.wu@mail.missouri.edu
  organization: Dept. of Electr. & Comput. Eng., Univ. of Missouri-Columbia, Columbia, MO, USA
– sequence: 3
  givenname: Michela
  surname: Becchi
  fullname: Becchi, Michela
  email: becchim@missouri.edu
  organization: Dept. of Electr. & Comput. Eng., Univ. of Missouri-Columbia, Columbia, MO, USA
CODEN IEEPAD
ContentType Conference Proceeding
Journal Article
DOI 10.1109/ICPP.2015.107
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
Computer and Information Systems Abstracts
Technology Research Database
ProQuest Computer Science Collection
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE/IET Electronic Library (IEL) (UW System Shared)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISBN 9781467375870
146737587X
EndPage 988
ExternalDocumentID 7349653
Genre orig-research
ISICitedReferencesCount 9
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000379202700099&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
ISSN 0190-3918
IngestDate Fri Jul 11 16:07:00 EDT 2025
Wed Aug 27 02:55:25 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
PQID 1816011769
PQPubID 23500
PageCount 10
ParticipantIDs proquest_miscellaneous_1816011769
ieee_primary_7349653
PublicationCentury 2000
PublicationDate 20150901
PublicationDateYYYYMMDD 2015-09-01
PublicationDate_xml – month: 09
  year: 2015
  text: 20150901
  day: 01
PublicationDecade 2010
PublicationTitle Proceedings of the International Conference on Parallel Processing
PublicationTitleAbbrev ICPP
PublicationYear 2015
Publisher IEEE
Publisher_xml – name: IEEE
SourceID proquest
ieee
SourceType Aggregation Database
Publisher
StartPage 979
SubjectTerms Algorithms
Computation
Graphics processing units
Graphs
Hardware
Heuristic algorithms
Instruction sets
Kernel
Load management
Nested loops
Parallel processing
Recursive
Synchronism
Trees
Title Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations
URI https://ieeexplore.ieee.org/document/7349653
https://www.proquest.com/docview/1816011769