Exploiting concurrent kernel execution on graphic processing units

Graphics processing units (GPUs) have been accepted as a powerful and viable coprocessor solution in the high-performance computing domain. In order to maximize the benefit of GPUs for a multicore platform, a mechanism is needed for CPU threads in a parallel application to share this computing resource...

Bibliographic details
Published in:	2011 International Conference on High Performance Computing and Simulation, pp. 24-32
Main authors:	Lingyuan Wang, Miaoqing Huang, Tarek El-Ghazawi
Format:	Conference paper
Language:	English, Japanese
Publication details:	IEEE, 01.07.2011
Subjects:	Concurrent kernel execution; Context; GPU computing; Graphics processing unit; Instruction sets; Kernel; Manuals; Multi-threaded programming; Programming; Switches

Description
Summary:	Graphics processing units (GPUs) have been accepted as a powerful and viable coprocessor solution in the high-performance computing domain. In order to maximize the benefit of GPUs for a multicore platform, a mechanism is needed for CPU threads in a parallel application to share this computing resource for efficient execution. NVIDIA's Fermi architecture pioneers the feature of concurrent kernel execution; however, only kernels of the same thread context can execute in parallel. In order to make the best use of a GPU device in a multi-threaded application environment, this paper explores techniques to effectively share a context, i.e., context funneling, which can be done either manually at the application level or automatically by the GPU runtime starting from CUDA v4.0. For synthetic microbenchmark tests, we find that both funneling mechanisms are more capable of exploiting the benefit of concurrent kernel execution than traditional context switching, therefore improving overall application performance. We also find that the manual funneling mechanism provides the highest performance and more explicit control, while CUDA v4.0 provides better productivity with good performance. Finally, we assess the impact of such techniques on a compact application benchmark, SSCA#3 - SAR sensor processing.
DOI:	10.1109/HPCSim.2011.5999803
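
The manual, application-level funneling mentioned in the summary can be pictured with a small host-only sketch. This is not the paper's implementation and it issues no GPU calls. It only illustrates the pattern of several CPU threads handing their work to a single owner thread, which in a real program would hold the one shared GPU context and launch the kernels. The names `run_funneled`, `owner`, and `producer` are hypothetical, chosen for this illustration.

```python
import queue
import threading


def run_funneled(num_producers, tasks_per_producer):
    """Funnel work items from several CPU threads through one owner thread.

    Host-only illustration: the owner thread stands in for the single
    thread that would hold the shared GPU context. No GPU API is called.
    """
    work = queue.Queue()
    launched = []  # appended to by the owner thread only

    def owner():
        while True:
            item = work.get()
            if item is None:  # sentinel: all producers have finished
                break
            # Assumption: a real program would launch a kernel here,
            # for example into a stream reserved for this producer.
            launched.append(item)

    def producer(pid):
        # Each CPU thread submits its requests instead of touching
        # the GPU through a context of its own.
        for task in range(tasks_per_producer):
            work.put((pid, task))

    owner_thread = threading.Thread(target=owner)
    owner_thread.start()

    producers = [
        threading.Thread(target=producer, args=(pid,))
        for pid in range(num_producers)
    ]
    for thread in producers:
        thread.start()
    for thread in producers:
        thread.join()

    work.put(None)  # tell the owner thread to stop
    owner_thread.join()
    return launched
```

Calling `run_funneled(4, 3)` returns twelve `(producer, task)` pairs. The interleaving between producers varies from run to run, but each producer's own tasks stay in submission order because the queue is first-in, first-out.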