Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations

The effective deployment of applications exhibiting irregular nested parallelism on GPUs is still an open problem. A naïve mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational patterns...

Bibliographic details
Published in:	Proceedings of the International Conference on Parallel Processing, pp. 979-988
Main authors:	Da Li, Hancheng Wu, Michela Becchi
Format:	Conference paper, Journal Article
Language:	English
Published:	IEEE, 1 September 2015
Subjects:	Algorithms; Computation; Graphics processing units; Graphs; Hardware; Heuristic algorithms; Instruction sets; Kernel; Load management; Nested loops; Parallel processing; Recursive; Synchronism; Trees
ISSN:	0190-3918

Abstract	The effective deployment of applications exhibiting irregular nested parallelism on GPUs is still an open problem. A naïve mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational patterns exhibiting nested parallelism: irregular nested loops and parallel recursive computations. In particular, we focus on recursive algorithms operating on trees and graphs. We propose different parallelization templates aimed at increasing the GPU utilization of these codes. Specifically, we investigate mechanisms to effectively distribute irregular work to streaming multiprocessors and GPU cores. Some of our parallelization templates rely on dynamic parallelism, a feature recently introduced by NVIDIA in their Kepler GPUs and announced as part of the OpenCL 2.0 standard. We propose mechanisms to maximize the work performed by nested kernels and minimize the overhead due to their invocation. Our results show that the use of our parallelization templates on applications with irregular nested loops can lead to a 2-6x speedup over baseline GPU codes that do not include load balancing mechanisms. The use of nested parallelism-based parallelization templates on recursive tree traversal algorithms can lead to substantial speedups (up to 15-24x) over optimized CPU implementations. However, the benefits of nested parallelism are still unclear in the presence of recursive applications operating on graphs, especially when recursive code variants require expensive synchronization. In these cases, a flat parallelization of iterative versions of the considered algorithms may be preferable.
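
The underutilization described in the abstract can be illustrated with a toy cost model. The sketch below is not taken from the paper and does not reproduce any of its templates. It assumes a warp of 32 lockstep threads, counts one "step" per inner-loop iteration executed by a warp, and compares a naive mapping (one outer iteration per thread) with a hypothetical threshold-based split that spreads heavy inner loops across a whole warp. The function names, the threshold, and the workload are all illustrative assumptions.

```python
from math import ceil

WARP = 32  # assumed number of threads that execute in lockstep


def thread_mapped_steps(trip_counts):
    """One outer iteration per thread: each warp runs as long as its slowest thread."""
    steps = 0
    for start in range(0, len(trip_counts), WARP):
        steps += max(trip_counts[start:start + WARP])
    return steps


def split_steps(trip_counts, threshold):
    """Hypothetical load-balanced variant: light iterations stay thread-mapped,
    heavy ones are deferred and their inner loops are spread across a warp."""
    small = [n for n in trip_counts if n <= threshold]
    large = [n for n in trip_counts if n > threshold]
    steps = thread_mapped_steps(small)
    steps += sum(ceil(n / WARP) for n in large)
    return steps


# Made-up workload: 63 light outer iterations and a single heavy one.
counts = [4] * 31 + [1024] + [4] * 32
baseline = thread_mapped_steps(counts)
balanced = split_steps(counts, threshold=32)
print(baseline, balanced)
```

In this model a single heavy iteration stalls the 31 other threads of its warp in the naive mapping, while the split variant keeps all lanes busy on it. The model ignores kernel-launch and synchronization overheads, which the abstract identifies as the costs that nested-kernel approaches must keep low.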
Author affiliations	1. Da Li, da.li@mail.missouri.edu, Dept. of Electr. & Comput. Eng., Univ. of Missouri-Columbia, Columbia, MO, USA
	2. Hancheng Wu, hancheng.wu@mail.missouri.edu, Dept. of Electr. & Comput. Eng., Univ. of Missouri-Columbia, Columbia, MO, USA
	3. Michela Becchi, becchim@missouri.edu, Dept. of Electr. & Comput. Eng., Univ. of Missouri-Columbia, Columbia, MO, USA
CODEN	IEEPAD
DOI	10.1109/ICPP.2015.107
Discipline	Computer Science
EISBN	9781467375870 146737587X
ISICitedReferencesCount	9
IsPeerReviewed	false
IsScholarly	true
PageCount	10
URI	https://ieeexplore.ieee.org/document/7349653 https://www.proquest.com/docview/1816011769