Low communication FMM-accelerated FFT on GPUs

Bibliographic details
Published in:	International Conference for High Performance Computing, Networking, Storage and Analysis (Online), pp. 1-11
Main author:	Cecka, Cris
Format:	Conference paper
Language:	English
Publication details:	New York, NY, USA: ACM, 12 November 2017
Series:	ACM Conferences
Subjects:	Analytical models; Computational modeling; Costs; FFT; FMM; GPU; Memory management; Multi-GPU; Predictive models; Tensors; Throughput
ISBN:	9781450351140, 145035114X
ISSN:	2167-4337

Description
Summary:	Communication-avoiding algorithms have been a subject of growing interest in the last decade due to the growth of distributed memory systems and the disproportionate increase of computational throughput to communication bandwidth. For distributed 1D FFTs, communication costs quickly dominate execution time as all industry-standard implementations perform three all-to-all transpositions of the data. In this work, we reformulate an existing algorithm that employs the Fast Multipole Method to reduce the communication requirements to approximately a single all-to-all transpose. We present a detailed and clear implementation strategy that relies heavily on existing library primitives, demonstrate that our strategy achieves consistent speed-ups between 1.3x and 2.2x against cuFFTXT on up to eight NVIDIA Tesla P100 GPUs, and develop an accurate compute model to analyze the performance and dependencies of the algorithm.
DOI:	10.1145/3126908.3126919