Low communication FMM-accelerated FFT on GPUs

Detailed bibliography
Published in: International Conference for High Performance Computing, Networking, Storage and Analysis (Online), pp. 1-11
Main author: Cecka, Cris
Medium: Conference paper
Language: English
Publication details: New York, NY, USA: ACM, 12 November 2017
Series: ACM Conferences
ISBN: 9781450351140, 145035114X
ISSN: 2167-4337
Description
Abstract: Communication-avoiding algorithms have been a subject of growing interest in the last decade due to the growth of distributed memory systems and the disproportionate increase of computational throughput to communication bandwidth. For distributed 1D FFTs, communication costs quickly dominate execution time as all industry-standard implementations perform three all-to-all transpositions of the data. In this work, we reformulate an existing algorithm that employs the Fast Multipole Method to reduce the communication requirements to approximately a single all-to-all transpose. We present a detailed and clear implementation strategy that relies heavily on existing library primitives, demonstrate that our strategy achieves consistent speed-ups between 1.3x and 2.2x against cuFFTXT on up to eight NVIDIA Tesla P100 GPUs, and develop an accurate compute model to analyze the performance and dependencies of the algorithm.
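To illustrate where the three all-to-all transpositions mentioned in the abstract can come from, here is a minimal single-process sketch of the classic six-step decomposition of a length N = n1*n2 DFT. It is NOT the paper's FMM-based algorithm and NOT cuFFTXT's implementation. The names `dft`, `transpose`, and `six_step_fft` are hypothetical, the sub-transforms use a naive O(n^2) DFT for clarity, and `transpose` only stands in for the all-to-all exchange that a distributed version would perform between devices.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT with the exp(-2*pi*i*j*k/n) sign convention."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def transpose(m):
    """Matrix transpose; in a distributed FFT this step is an all-to-all exchange."""
    return [list(col) for col in zip(*m)]


def six_step_fft(x, n1, n2):
    """Length n1*n2 DFT via the six-step scheme (three transposes)."""
    n = n1 * n2
    assert len(x) == n
    # View x as an n1 x n2 row-major matrix: a[i][j] = x[n2*i + j].
    a = [list(x[n2 * i: n2 * (i + 1)]) for i in range(n1)]
    b = transpose(a)                      # transpose #1 -> n2 x n1
    c = [dft(row) for row in b]           # n2 row-local DFTs of length n1
    for r in range(n2):                   # twiddle factors exp(-2*pi*i*r*k1/n)
        for k1 in range(n1):
            c[r][k1] *= cmath.exp(-2j * cmath.pi * r * k1 / n)
    d = transpose(c)                      # transpose #2 -> n1 x n2
    e = [dft(row) for row in d]           # n1 row-local DFTs of length n2
    f = transpose(e)                      # transpose #3 -> n2 x n1
    # Flatten row-major, so X[k1 + n1*k2] = f[k2][k1] (natural output order).
    return [v for row in f for v in row]


x = [complex(i % 5, (3 * i) % 7) for i in range(12)]
print(max(abs(a - b) for a, b in zip(six_step_fft(x, 3, 4), dft(x))))
```

If each device owns a contiguous block of rows, the row DFTs and twiddle multiplications are local, so the communication is concentrated in the three transposes; the abstract reports that the FMM reformulation reduces this to approximately one.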
DOI: 10.1145/3126908.3126919