Auto-tuning 3-D FFT library for CUDA GPUs

Bibliographic Details
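Published in: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pp. 1-10
Main Authors: Nukada, Akira; Matsuoka, Satoshi
Format: Conference Proceeding
Language: English
Published: New York, NY, USA: ACM, 14.11.2009
Series: ACM Conferences
Subjects: Bandwidth; Codes; Graphics processing units; Hardware; Instruction sets; Kernel; Programming; Registers; Three-dimensional displays; Transforms; Computing methodologies -- Computer graphics -- Graphics systems and interfaces -- Graphics processors; Mathematics of computing -- Mathematical analysis -- Functional analysis -- Approximation; Mathematics of computing -- Mathematical analysis -- Numerical analysis -- Computation of transforms; Mathematics of computing -- Mathematical software; Theory of computation -- Design and analysis of algorithms -- Approximation algorithms analysis
ISBN: 1605587443, 9781605587448
ISSN: 2167-4329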
Abstract: Existing implementations of FFTs on GPUs are optimized for specific transform sizes, such as powers of two, and exhibit unstable and peaky performance, i.e., they do not perform as well at other sizes that appear in practice. Our new auto-tuning 3-D FFT on CUDA generates high-performance CUDA kernels for FFTs of varying transform sizes, alleviating this problem. Although auto-tuning has been implemented on GPUs for dense kernels such as DGEMM and stencils, this is the first instance in which it has been applied comprehensively to bandwidth-intensive and complex kernels such as 3-D FFTs. Bandwidth-intensive optimizations, such as selecting the number of threads and inserting padding to avoid bank conflicts on shared memory, are systematically applied. Our resulting autotuner is fast and results in performance that essentially beats all 3-D FFT implementations on a single processor to date, and moreover exhibits stable performance irrespective of problem sizes or the underlying GPU hardware.
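
Illustrative note (not part of the catalogued paper): the abstract mentions inserting padding to avoid shared-memory bank conflicts. The sketch below is a minimal host-side model of that idea in Python, not the authors' auto-tuner. It assumes shared memory is split into 16 word-wide banks read by 16 threads at a time, and that thread t reads word t * row_stride (one column of a row-major tile). The function names conflict_degree and pick_padding, the bank and thread counts, and the max_pad search limit are all hypothetical choices made for illustration.

```python
from collections import Counter


def conflict_degree(row_stride, num_banks=16, num_threads=16):
    """Largest number of threads that land on the same bank.

    Assumed access pattern: thread t reads word t * row_stride,
    i.e. one column of a row-major tile held in shared memory.
    A result of 1 means the access is conflict-free in this model.
    """
    hits = Counter((t * row_stride) % num_banks for t in range(num_threads))
    return max(hits.values())


def pick_padding(width, num_banks=16, num_threads=16, max_pad=4):
    """Return the smallest padding (in words) with the lowest conflict degree.

    This is a brute-force search over 0..max_pad, standing in for the kind
    of parameter sweep an auto-tuner might perform.
    """
    candidates = range(max_pad + 1)
    return min(
        candidates,
        key=lambda pad: (conflict_degree(width + pad, num_banks, num_threads), pad),
    )


if __name__ == "__main__":
    # Print tile width, chosen padding, and conflict degree before and after.
    for width in (12, 16, 17):
        pad = pick_padding(width)
        print(width, pad, conflict_degree(width), conflict_degree(width + pad))
```

In this model a tile whose row length shares a factor with the bank count maps several threads to one bank, and padding each row by one word makes the stride coprime with the bank count. A real tuner would measure kernel run times on the target GPU rather than rely on such a model.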
Author affiliations:
Nukada, Akira: Tokyo Institute of Technology and Japan Science and Technology Agency, CREST
Matsuoka, Satoshi: Tokyo Institute of Technology and National Institute of Informatics and Japan Science and Technology Agency, CREST
Copyright: 2009 ACM
DOI: 10.1145/1654059.1654090
Meeting: SC '09: International Conference for High Performance Computing, Networking, Storage and Analysis