Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures

Bibliographic Details
Published in:	IEEE Transactions on Parallel and Distributed Systems, Volume 22, Issue 1, pp. 105-118
Main Authors:	Jang, Byunghyun; Schaa, Dana; Mistry, Perhaad; Kaeli, David
Format:	Journal Article
Language:	English
Published:	New York: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.01.2011
Subjects:	Acceleration; Architecture; Benchmarking; Computation; Computer architecture; Concurrent computing; data parallelism; data-parallel architectures; Delay; General-purpose computation on GPUs (GPGPUs); GPU computing; Graphics processing unit; Kernel; Kernels; Logic design; memory access pattern; memory coalescing; memory optimization; memory selection; Microprocessors; Parallel architectures; Parallel processing; Performance enhancement; Transformations; vectorization
ISSN:	1045-9219, 1558-2183

Description
Summary:	The introduction of General-Purpose computation on GPUs (GPGPUs) has changed the landscape for the future of parallel computing. At the core of this phenomenon are massively multithreaded, data-parallel architectures possessing impressive acceleration ratings, offering low-cost supercomputing together with attractive power budgets. Even given the numerous benefits provided by GPGPUs, there remain a number of barriers that delay wider adoption of these architectures. One major issue is the heterogeneous and distributed nature of the memory subsystem commonly found on data-parallel architectures. Application acceleration is highly dependent on being able to utilize the memory subsystem effectively so that all execution units remain busy. In this paper, we present techniques for enhancing the memory efficiency of applications on data-parallel architectures, based on the analysis and characterization of memory access patterns in loop bodies; we target vectorization via data transformation to benefit vector-based architectures (e.g., AMD GPUs) and algorithmic memory selection for scalar-based architectures (e.g., NVIDIA GPUs). We demonstrate the effectiveness of our proposed methods with kernels from a wide range of benchmark suites. For the benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4× and 13.5× over baseline GPU implementations on each platform, respectively) by applying our proposed methodology.
DOI:	10.1109/TPDS.2010.107