Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning


Bibliographic Details
Published in:	2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1-12
Main Authors:	Williams, Samuel; Oliker, Leonid; Carter, Jonathan; Shalf, John
Format:	Conference paper
Language:	English
Published:	New York, NY, USA: ACM, 12 November 2011; IEEE
Series:	ACM Conferences
Subjects:	Auto-tuning; BlueGene; Distribution functions; Hybrid programming models; Lattice Boltzmann; Lattices; Mathematics of computing > Mathematical analysis > Numerical analysis; Mathematics of computing > Mathematical analysis > Numerical analysis > Numerical differentiation; Mathematics of computing > Mathematical software; Multicore processing; OpenMP; Optimization; SIMD; Theory of computation > Design and analysis of algorithms; Three dimensional displays; Tuning; Vectors
ISBN:	145030771X, 9781450307710
ISSN:	2167-4329

Description
Summary:	We are witnessing a rapid evolution of HPC node architectures and on-chip parallelism as power and cooling constraints limit increases in microprocessor clock speeds. In this work, we demonstrate a hierarchical approach towards effectively extracting performance for a variety of emerging multicore-based supercomputing platforms. Our examined application is a structured grid-based Lattice Boltzmann computation that simulates homogeneous isotropic turbulence in magnetohydrodynamics. First, we examine sophisticated sequential auto-tuning techniques including loop transformations, virtual vectorization, and use of ISA-specific intrinsics. Next, we present a variety of parallel optimization approaches including programming model exploration (flat MPI, MPI/OpenMP, and MPI/Pthreads), as well as data and thread decomposition strategies designed to mitigate communication bottlenecks. Finally, we evaluate the impact of our hierarchical tuning techniques using a variety of problem sizes via large-scale simulations on state-of-the-art Cray XT4, Cray XE6, and IBM BlueGene/P platforms. Results show that our unique tuning approach improves performance and energy requirements by up to 3.4x using 49,152 cores, while providing a portable optimization methodology for a variety of numerical methods on forthcoming HPC systems.
DOI:	10.1145/2063384.2063458