Algorithmic strategies for optimizing the parallel reduction primitive in CUDA

Many general-purpose applications exploit Graphics Processing Units (GPUs) by executing a set of well-known data-parallel primitives. Those primitives are usually invoked from the host many times, so their throughput has a great impact on the performance of the overall system. Thus, the study of nove...

Bibliographic Details
Published in: 2012 International Conference on High Performance Computing and Simulation, pp. 511-519
Main Authors: Martin, P. J., Ayuso, L. F., Torres, R., Gavilanes, A.
Format: Conference Proceeding
Language: English
Published: IEEE 01.07.2012
ISBN: 9781467323598, 1467323594
Abstract Many general-purpose applications exploit Graphics Processing Units (GPUs) by executing a set of well-known data-parallel primitives. These primitives are usually invoked from the host many times, so their throughput has a large impact on the performance of the overall system. The study of novel algorithmic strategies to optimize their implementation on current devices is therefore of interest to the GPU community. In this paper we focus on optimizing the reduction primitive, which reduces a data sequence to a single value using a binary associative operator. Although tree-based and sequential-based algorithms have already been implemented on GPUs, their performance had not yet been compared. Thus, our first contribution is an experimental study of state-of-the-art reduction algorithms on CUDA. Next, we introduce two algorithmic optimizations that are integrated into the fastest solution (a sequential-based algorithm), further improving its throughput. Finally, we apply the same methodology to the segmented version of the primitive, which applies when the input is composed of several independent segments. In this case, it is not clear which algorithm performs best, since throughput depends heavily on the distribution of segments along the input. According to our results, tree-based algorithms run faster for small segments, while sequential methods are better for medium and large ones.
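The primitives named in the abstract can be illustrated with a small CPU-side sketch. This is not the authors' CUDA code: the function names, the `workers` parameter, and the head-flag encoding of segments are all assumptions made for illustration. It only shows the combination order of a tree-based reduction, of a sequential-based reduction (each worker scans a contiguous chunk, then the partial results are combined), and of a segmented reduction. Every function relies only on the operator being associative.

```python
from functools import reduce
from operator import add


def tree_reduce(values, op=add):
    """Tree-based reduction: combine adjacent pairs level by level."""
    data = list(values)
    if not data:
        raise ValueError("cannot reduce an empty sequence")
    while len(data) > 1:
        # On a GPU, each pair at a given level would be handled by one thread.
        nxt = [op(data[i], data[i + 1]) for i in range(0, len(data) - 1, 2)]
        if len(data) % 2:
            nxt.append(data[-1])  # odd element is carried to the next level
        data = nxt
    return data[0]


def sequential_reduce(values, workers=4, op=add):
    """Sequential-based reduction: each (hypothetical) worker reduces one
    contiguous chunk in order, then the partial results are combined."""
    data = list(values)
    if not data:
        raise ValueError("cannot reduce an empty sequence")
    chunk = -(-len(data) // workers)  # ceiling division
    partials = [reduce(op, data[s:s + chunk])
                for s in range(0, len(data), chunk)]
    return reduce(op, partials)


def segmented_reduce(values, head_flags, op=add):
    """Segmented reduction: one result per segment. A truthy head flag marks
    the first element of a segment (an assumed encoding)."""
    results = []
    for value, flag in zip(values, head_flags):
        if flag or not results:
            results.append(value)  # start a new segment
        else:
            results[-1] = op(results[-1], value)
    return results


if __name__ == "__main__":
    print(tree_reduce(range(1, 11)))
    print(sequential_reduce(range(1, 11), workers=3))
    print(segmented_reduce([1, 2, 3, 4, 5, 6], [1, 0, 0, 1, 0, 1]))
```

Because both reductions keep the elements in their original order, they also give the expected result for associative but non-commutative operators such as string concatenation.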
Author_xml – sequence: 1
  givenname: P. J.
  surname: Martin
  fullname: Martin, P. J.
  email: pjmartin@sip.ucm.es
  organization: Dept. de Sist. Informaticos y Comput., Univ. Complutense de Madrid, Madrid, Spain
– sequence: 2
  givenname: L. F.
  surname: Ayuso
  fullname: Ayuso, L. F.
  email: lf.ayuso@fdi.ucm.es
  organization: Dept. de Sist. Informaticos y Comput., Univ. Complutense de Madrid, Madrid, Spain
– sequence: 3
  givenname: R.
  surname: Torres
  fullname: Torres, R.
  email: r.torres@fdi.ucm.es
  organization: Dept. de Sist. Informaticos y Comput., Univ. Complutense de Madrid, Madrid, Spain
– sequence: 4
  givenname: A.
  surname: Gavilanes
  fullname: Gavilanes, A.
  email: agav@sip.ucm.es
  organization: Dept. de Sist. Informaticos y Comput., Univ. Complutense de Madrid, Madrid, Spain
ContentType Conference Proceeding
DOI 10.1109/HPCSim.2012.6266966
EISBN 1467323616
9781467323611
1467323624
9781467323628
EndPage 519
ExternalDocumentID 6266966
Genre orig-research
ISBN 9781467323598
1467323594
IngestDate Wed Aug 27 04:36:09 EDT 2025
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
PageCount 9
ParticipantIDs ieee_primary_6266966
PublicationCentury 2000
PublicationDate 2012-July
PublicationDateYYYYMMDD 2012-07-01
PublicationDate_xml – month: 07
  year: 2012
  text: 2012-July
PublicationDecade 2010
PublicationTitle 2012 International Conference on High Performance Computing and Simulation
PublicationTitleAbbrev HPCSim
PublicationYear 2012
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 511
SubjectTerms Arrays
CUDA
data-parallel algorithms
GPGPU
Graphics processing unit
Instruction sets
Kernel
Optimization
parallel reduction
segmented parallel reduction
Synchronization
Throughput
Title Algorithmic strategies for optimizing the parallel reduction primitive in CUDA
URI https://ieeexplore.ieee.org/document/6266966