Low communication FMM-accelerated FFT on GPUs

Bibliographic details
Published in: International Conference for High Performance Computing, Networking, Storage and Analysis (Online), pp. 1-11
Main author: Cecka, Cris
Format: Conference paper
Language: English
Published: New York, NY, USA: ACM, 12 November 2017
Series: ACM Conferences
ISBN: 9781450351140, 145035114X
ISSN: 2167-4337
Description
Summary: Communication-avoiding algorithms have been a subject of growing interest in the last decade due to the growth of distributed memory systems and the disproportionate increase of computational throughput to communication bandwidth. For distributed 1D FFTs, communication costs quickly dominate execution time as all industry-standard implementations perform three all-to-all transpositions of the data. In this work, we reformulate an existing algorithm that employs the Fast Multipole Method to reduce the communication requirements to approximately a single all-to-all transpose. We present a detailed and clear implementation strategy that relies heavily on existing library primitives, demonstrate that our strategy achieves consistent speed-ups between 1.3x and 2.2x against cuFFTXT on up to eight NVIDIA Tesla P100 GPUs, and develop an accurate compute model to analyze the performance and dependencies of the algorithm.
DOI: 10.1145/3126908.3126919