Low communication FMM-accelerated FFT on GPUs

Communication-avoiding algorithms have been a subject of growing interest in the last decade due to the growth of distributed memory systems and the disproportionate increase of computational throughput to communication bandwidth. For distributed 1D FFTs, communication costs quickly dominate executi...

Detailed description

Bibliographic details
Published in: International Conference for High Performance Computing, Networking, Storage and Analysis (Online), pp. 1-11
Main author: Cecka, Cris
Format: Conference proceeding
Language: English
Published: New York, NY, USA: ACM, 12 November 2017
Series: ACM Conferences
ISBN: 9781450351140, 145035114X
ISSN: 2167-4337
Online access: Full text
Abstract: Communication-avoiding algorithms have been a subject of growing interest in the last decade due to the growth of distributed memory systems and the disproportionate increase of computational throughput to communication bandwidth. For distributed 1D FFTs, communication costs quickly dominate execution time as all industry-standard implementations perform three all-to-all transpositions of the data. In this work, we reformulate an existing algorithm that employs the Fast Multipole Method to reduce the communication requirements to approximately a single all-to-all transpose. We present a detailed and clear implementation strategy that relies heavily on existing library primitives, demonstrate that our strategy achieves consistent speed-ups between 1.3x and 2.2x against cuFFTXT on up to eight NVIDIA Tesla P100 GPUs, and develop an accurate compute model to analyze the performance and dependencies of the algorithm.
Author: Cecka, Cris
Organization: NVIDIA
Email: ccecka@nvidia.com
ContentType Conference Proceeding
Copyright 2017 ACM
DOI 10.1145/3126908.3126919
Discipline Computer Science
Genre orig-research
ISICitedReferencesCount 6
Keywords FFT
multi-GPU
GPU
FMM
License Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
MeetingName SC '17: The International Conference for High Performance Computing, Networking, Storage and Analysis
SubjectTerms Analytical models
Computational modeling
Costs
FFT
FMM
GPU
Memory management
Multi-GPU
Predictive models
Tensors
Throughput
URI https://ieeexplore.ieee.org/document/9926122