Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations
The effective deployment of applications exhibiting irregular nested parallelism on GPUs is still an open problem. A naïve mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational patterns...
| Published in: | Proceedings of the International Conference on Parallel Processing, pp. 979-988 |
|---|---|
| Main authors: | Da Li, Hancheng Wu, Michela Becchi |
| Format: | Conference Paper, Journal Article |
| Language: | English |
| Published: | IEEE, 1 September 2015 |
| Subjects: | |
| ISSN: | 0190-3918 |
| Online access: | Get full text |
| Abstract | The effective deployment of applications exhibiting irregular nested parallelism on GPUs is still an open problem. A naïve mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational patterns exhibiting nested parallelism: irregular nested loops and parallel recursive computations. In particular, we focus on recursive algorithms operating on trees and graphs. We propose different parallelization templates aimed at increasing the GPU utilization of these codes. Specifically, we investigate mechanisms to effectively distribute irregular work to streaming multiprocessors and GPU cores. Some of our parallelization templates rely on dynamic parallelism, a feature recently introduced by NVIDIA in their Kepler GPUs and announced as part of the OpenCL 2.0 standard. We propose mechanisms to maximize the work performed by nested kernels and minimize the overhead due to their invocation. Our results show that the use of our parallelization templates on applications with irregular nested loops can lead to a 2-6x speedup over baseline GPU codes that do not include load balancing mechanisms. The use of nested parallelism-based parallelization templates on recursive tree traversal algorithms can lead to substantial speedups (up to 15-24x) over optimized CPU implementations. However, the benefits of nested parallelism are still unclear in the presence of recursive applications operating on graphs, especially when recursive code variants require expensive synchronization. In these cases, a flat parallelization of iterative versions of the considered algorithms may be preferable. |
|---|---|
| Author | Michela Becchi; Da Li; Hancheng Wu |
| Author_xml | 1. Da Li (da.li@mail.missouri.edu), Dept. of Electr. & Comput. Eng., Univ. of Missouri-Columbia, Columbia, MO, USA; 2. Hancheng Wu (hancheng.wu@mail.missouri.edu), Dept. of Electr. & Comput. Eng., Univ. of Missouri-Columbia, Columbia, MO, USA; 3. Michela Becchi (becchim@missouri.edu), Dept. of Electr. & Comput. Eng., Univ. of Missouri-Columbia, Columbia, MO, USA |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding; Journal Article |
| DOI | 10.1109/ICPP.2015.107 |
| Discipline | Computer Science |
| EISBN | 9781467375870, 146737587X |
| EndPage | 988 |
| ExternalDocumentID | 7349653 |
| Genre | orig-research |
| ISICitedReferencesCount | 9 |
| ISSN | 0190-3918 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| PageCount | 10 |
| PublicationCentury | 2000 |
| PublicationDate | 20150901 |
| PublicationDateYYYYMMDD | 2015-09-01 |
| PublicationDate_xml | – month: 09 year: 2015 text: 20150901 day: 01 |
| PublicationDecade | 2010 |
| PublicationTitle | Proceedings of the International Conference on Parallel Processing |
| PublicationTitleAbbrev | ICPP |
| PublicationYear | 2015 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| StartPage | 979 |
| SubjectTerms | Algorithms; Computation; Graphics processing units; Graphs; Hardware; Heuristic algorithms; Instruction sets; Kernel; Load management; Nested loops; Parallel processing; Recursive; Synchronism; Trees |
| Title | Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations |
| URI | https://ieeexplore.ieee.org/document/7349653 https://www.proquest.com/docview/1816011769 |