Optimizations of Unstructured Aerodynamics Computations for Many-core Architectures

Bibliographic details
Published in:	IEEE Transactions on Parallel and Distributed Systems, Vol. 29, No. 10, pp. 2317-2332
Main authors:	Al Farhan, Mohammed A., Keyes, David E.
Format:	Journal Article
Language:	English
Publication details:	New York: IEEE, 1 October 2018; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects:	Aerodynamics; AVX-512; computational aerodynamics; Computational modeling; Computer architecture; Computer memory; data-level parallelism; Floating point arithmetic; Hardware; Intel Xeon Phi; Kernel; Knights Landing; Landing; Microprocessors; Optimization; Optimization techniques; Parallel processing; Performance optimizations; Routines; SIMD; Supercomputers; thread-level parallelism; unstructured meshes
ISSN:	1045-9219, 1558-2183

Description
Summary:	We investigate several state-of-the-practice shared-memory optimization techniques applied to key routines of an unstructured computational aerodynamics application with irregular memory accesses. We illustrate them on the Intel Knights Landing processor, a representative of the processors in contemporary leading supercomputers, identifying and addressing performance challenges without compromising the floating-point numerics of the original code. We employ low- and high-level architecture-specific code optimizations involving thread- and data-level parallelism. Our approach is based upon a multi-level hierarchical distribution of work and data across both the threads and the SIMD units within every hardware core. On a 64-core Knights Landing chip, we achieve nearly 2.9x speedup of the dominant routines relative to the baseline. These routines exhibit almost linear strong scalability up to 64 threads, and thereafter some improvement with hyperthreading. At substantially fewer watts, we achieve up to 1.7x speedup relative to the performance of 72 threads of a 36-core Haswell CPU and roughly equivalent performance to 112 threads of a 56-core Skylake scalable processor. These optimizations are expected to be of value for many other unstructured-mesh PDE-based scientific applications as multi- and many-core architectures evolve.
DOI:	10.1109/TPDS.2018.2826533