Algorithmic strategies for optimizing the parallel reduction primitive in CUDA

Bibliographic Details
Published in:	2012 International Conference on High Performance Computing and Simulation pp. 511 - 519
Main Authors:	Martin, P. J.; Ayuso, L. F.; Torres, R.; Gavilanes, A.
Format:	Conference Proceeding
Language:	English
Published:	IEEE 01.07.2012
Subjects:	Arrays; CUDA; data-parallel algorithms; GPGPU; Graphics processing unit; Instruction sets; Kernel; Optimization; parallel reduction; segmented parallel reduction; Synchronization; Throughput
ISBN:	9781467323598, 1467323594

Abstract	Many general-purpose applications exploit Graphics Processing Units (GPUs) by executing a set of well-known data-parallel primitives. These primitives are usually invoked from the host many times, so their throughput has a great impact on the performance of the overall system. Thus, the study of novel algorithmic strategies to optimize their implementation on current devices is a topic of interest to the GPU community. In this paper we focus on optimizing the reduction primitive, which reduces a data sequence to a single value using a binary associative operator. Although tree-based and sequential-based algorithms have already been implemented on GPUs, their performance had not yet been compared. Thus, our first contribution is an experimental study of state-of-the-art reduction algorithms on CUDA. Next, we introduce two algorithmic optimizations that are integrated into the fastest solution (a sequential-based algorithm), further improving its throughput. Finally, we apply the same methodology to the segmented version of the primitive, which applies when the input is composed of several independent segments. In this case, it is not clear which algorithm performs best, since throughput depends heavily on the distribution of segments along the input. According to our results, tree-based algorithms run faster for small segments, while sequential methods are better for medium and large ones.
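
The primitives named in the abstract can be illustrated with a minimal host-side sketch. This is a sequential C++ model of the access patterns, not the authors' CUDA kernels, and it makes no claim about their two optimizations. The function names (`tree_reduce`, `chunked_reduce`, `segmented_reduce`), the `workers` parameter, and the representation of segments as a list of start offsets are all assumptions introduced here for illustration. On a GPU, each inner-loop iteration of `tree_reduce` and each chunk of `chunked_reduce` would roughly correspond to one thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Tree-based strategy (hypothetical helper): at each step, element i absorbs
// its neighbour at distance `stride`. Operand order is preserved, so the
// operator only needs to be associative, not commutative.
template <typename T, typename Op>
T tree_reduce(std::vector<T> data, Op op) {
    assert(!data.empty());
    const std::size_t n = data.size();
    for (std::size_t stride = 1; stride < n; stride *= 2)
        for (std::size_t i = 0; i + stride < n; i += 2 * stride)
            data[i] = op(data[i], data[i + stride]);
    return data[0];
}

// Sequential-based strategy (hypothetical helper): each of `workers` chunks
// is reduced serially, then the few partial results are combined.
template <typename T, typename Op>
T chunked_reduce(const std::vector<T>& data, std::size_t workers, Op op) {
    assert(!data.empty() && workers > 0);
    const std::size_t chunk = (data.size() + workers - 1) / workers;
    std::vector<T> partial;
    for (std::size_t begin = 0; begin < data.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, data.size());
        T acc = data[begin];
        for (std::size_t i = begin + 1; i < end; ++i) acc = op(acc, data[i]);
        partial.push_back(acc);
    }
    return tree_reduce(partial, op);
}

// Segmented reduction (hypothetical helper): `starts` holds the first index
// of each non-empty segment, in increasing order; one result per segment.
template <typename T, typename Op>
std::vector<T> segmented_reduce(const std::vector<T>& data,
                                const std::vector<std::size_t>& starts, Op op) {
    std::vector<T> out;
    for (std::size_t s = 0; s < starts.size(); ++s) {
        const std::size_t begin = starts[s];
        const std::size_t end = (s + 1 < starts.size()) ? starts[s + 1] : data.size();
        assert(begin < end && end <= data.size());
        T acc = data[begin];
        for (std::size_t i = begin + 1; i < end; ++i) acc = op(acc, data[i]);
        out.push_back(acc);
    }
    return out;
}
```

For example, with the input `{1, 2, 3, 4, 5}` and integer addition, both `tree_reduce` and `chunked_reduce` (with two workers) yield 15, while `segmented_reduce` with segment starts `{0, 2}` yields one sum per segment, 3 and 12.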
Author_xml	1. Martin, P. J. (pjmartin@sip.ucm.es), Dept. de Sist. Informaticos y Comput., Univ. Complutense de Madrid, Madrid, Spain; 2. Ayuso, L. F. (lf.ayuso@fdi.ucm.es), Dept. de Sist. Informaticos y Comput., Univ. Complutense de Madrid, Madrid, Spain; 3. Torres, R. (r.torres@fdi.ucm.es), Dept. de Sist. Informaticos y Comput., Univ. Complutense de Madrid, Madrid, Spain; 4. Gavilanes, A. (agav@sip.ucm.es), Dept. de Sist. Informaticos y Comput., Univ. Complutense de Madrid, Madrid, Spain
DOI	10.1109/HPCSim.2012.6266966
EISBN	1467323616 9781467323611 1467323624 9781467323628
EndPage	519
ExternalDocumentID	6266966
Genre	orig-research
IsPeerReviewed	false
IsScholarly	false
PageCount	9
PublicationDate	2012-July
PublicationTitleAbbrev	HPCSim
StartPage	511
URI	https://ieeexplore.ieee.org/document/6266966