Low communication FMM-accelerated FFT on GPUs

Communication-avoiding algorithms have been a subject of growing interest in the last decade due to the growth of distributed memory systems and the disproportionate increase of computational throughput to communication bandwidth. For distributed 1D FFTs, communication costs quickly dominate executi...

Full description

Bibliographic details
Published in:	International Conference for High Performance Computing, Networking, Storage and Analysis (Online), pp. 1-11
Main author:	Cecka, Cris
Format:	Conference proceeding
Language:	English
Published:	New York, NY, USA: ACM, 12 November 2017
Series:	ACM Conferences
Subjects:	Analytical models; Computational modeling; Costs; FFT; FMM; GPU; Memory management; Multi-GPU; Predictive models; Tensors; Throughput
ISBN:	9781450351140, 145035114X
ISSN:	2167-4337
Online access:	Full text

Abstract	Communication-avoiding algorithms have been a subject of growing interest in the last decade due to the growth of distributed memory systems and the disproportionate increase of computational throughput to communication bandwidth. For distributed 1D FFTs, communication costs quickly dominate execution time as all industry-standard implementations perform three all-to-all transpositions of the data. In this work, we reformulate an existing algorithm that employs the Fast Multipole Method to reduce the communication requirements to approximately a single all-to-all transpose. We present a detailed and clear implementation strategy that relies heavily on existing library primitives, demonstrate that our strategy achieves consistent speed-ups between 1.3x and 2.2x against cuFFTXT on up to eight NVIDIA Tesla P100 GPUs, and develop an accurate compute model to analyze the performance and dependencies of the algorithm.
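For context on the "three all-to-all transpositions" mentioned in the abstract, the following is a minimal single-process sketch of the textbook six-step FFT decomposition, in which each matrix transpose stands in for what would be an all-to-all exchange on a distributed system. It is not the FMM-based algorithm of the paper and not how cuFFTXT is implemented; the function names (`dft`, `transpose`, `six_step_fft`) and the factorization N = n1 * n2 are illustrative assumptions.

```python
import cmath


def dft(v):
    """Naive O(n^2) DFT, used here for the short row transforms."""
    n = len(v)
    return [
        sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def transpose(m):
    """Matrix transpose; in a distributed FFT this is an all-to-all exchange."""
    return [list(col) for col in zip(*m)]


def six_step_fft(x, n1, n2):
    """1D DFT of length n1*n2 via the six-step decomposition (hypothetical helper)."""
    n = n1 * n2
    assert len(x) == n
    # View the input as an n1 x n2 row-major matrix: a[j1][j2] = x[j1*n2 + j2].
    a = [list(x[r * n2:(r + 1) * n2]) for r in range(n1)]
    b = transpose(a)                      # transpose 1: n2 x n1
    c = [dft(row) for row in b]           # n2 transforms of length n1
    # Twiddle factors: multiply c[j2][k1] by exp(-2*pi*i*j2*k1/n).
    c = [
        [c[j2][k1] * cmath.exp(-2j * cmath.pi * j2 * k1 / n) for k1 in range(n1)]
        for j2 in range(n2)
    ]
    d = transpose(c)                      # transpose 2: n1 x n2
    e = [dft(row) for row in d]           # n1 transforms of length n2
    f = transpose(e)                      # transpose 3: restores natural output order
    return [val for row in f for val in row]
```

Under these assumptions, `six_step_fft(x, 3, 4)` agrees with `dft(x)` for any length-12 input up to rounding error. The three `transpose` calls are the steps whose communication cost the abstract says the reformulated algorithm reduces to approximately one.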
Author	Cecka, Cris
Author_xml	– sequence: 1 givenname: Cris surname: Cecka fullname: Cecka, Cris email: ccecka@nvidia.com organization: NVIDIA
ContentType	Conference Proceeding
Copyright	2017 ACM
Copyright_xml	– notice: 2017 ACM
DBID	6IE 6IL CBEJK RIE RIL
DOI	10.1145/3126908.3126919
DatabaseName	IEEE Electronic Library (IEL) Conference Proceedings IEEE Xplore POP ALL IEEE Xplore All Conference Proceedings IEEE/IET Electronic Library IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml	– sequence: 1 dbid: RIE name: IEEE/IET Electronic Library url: https://ieeexplore.ieee.org/ sourceTypes: Publisher
DeliveryMethod	fulltext_linktorsrc
Discipline	Computer Science
EISBN	9781450351140 145035114X
EISSN	2167-4337
EndPage	11
ExternalDocumentID	9926122
Genre	orig-research
GroupedDBID	6IE 6IF 6IL 6IN ABLEC ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK IEGSK OCL RIB RIC RIE RIL 6IH 6IK AAWTH ADZIZ CHZPO IPLJI
ID	FETCH-LOGICAL-a317t-f1e6589ac0a9d91551d38ebafb1d05fca8287d8c04e783333f4dafaa3d4b62293
IEDL.DBID	RIE
ISBN	9781450351140 145035114X
ISICitedReferencesCount	6
ISICitedReferencesURI	http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000458161700054&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
IngestDate	Wed Aug 27 02:19:14 EDT 2025 Wed Jan 31 06:44:57 EST 2024 Wed Jan 31 06:44:12 EST 2024
IsPeerReviewed	false
IsScholarly	false
Keywords	FFT multi-GPU GPU FMM
Language	English
License	Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
LinkModel	DirectLink
MeetingName	SC '17: The International Conference for High Performance Computing, Networking, Storage and Analysis
MergedId	FETCHMERGED-LOGICAL-a317t-f1e6589ac0a9d91551d38ebafb1d05fca8287d8c04e783333f4dafaa3d4b62293
PageCount	11
ParticipantIDs	acm_books_10_1145_3126908_3126919_brief ieee_primary_9926122 acm_books_10_1145_3126908_3126919
PublicationCentury	2000
PublicationDate	20171112 2017-Nov.-12
PublicationDateYYYYMMDD	2017-11-12
PublicationDate_xml	– month: 11 year: 2017 text: 20171112 day: 12
PublicationDecade	2010
PublicationPlace	New York, NY, USA
PublicationPlace_xml	– name: New York, NY, USA
PublicationSeriesTitle	ACM Conferences
PublicationTitle	International Conference for High Performance Computing, Networking, Storage and Analysis (Online)
PublicationTitleAbbrev	SC
PublicationYear	2017
Publisher	ACM
Publisher_xml	– name: ACM
SSID	ssib050161540 ssj0003204180
Score	1.699429
SourceID	ieee acm
SourceType	Publisher
StartPage	1
SubjectTerms	Analytical models Computational modeling Costs FFT FMM GPU Memory management Multi-GPU Predictive models Tensors Throughput
Title	Low communication FMM-accelerated FFT on GPUs
URI	https://ieeexplore.ieee.org/document/9926122