The Batched DOACROSS loop parallelization algorithm

Bibliographic details
Published in:	2015 International Conference on High Performance Computing and Simulation (HPCS), pp. 476-483
Main authors:	Lucas, Divino Cesar S.; Araujo, Guido
Format:	Conference paper; Journal Article
Language:	English
Published:	IEEE, 01.07.2015
Subjects:	Algorithm design and analysis; Algorithms; Architecture; Computation; Computer architecture; Computer simulation; Fine-grain Parallelism; Instruction sets; Loop Parallelization Algorithm; Multicore Processors; Parallel processing; Quantitative analysis; Synchronism; Synchronization
ISBN:	1467378127, 9781467378123

Description
Summary:	Parallelizing loops that contain loop-carried dependencies has long been considered very difficult, mainly because of the overhead of communicating dependencies between iterations. Despite considerable effort to devise effective parallelization techniques for such loops, the problem is still far from solved. For many loops, neither older (DOACROSS) nor newer (DSWP) techniques have been able to offer a solution. This paper presents a qualitative and quantitative analysis of the synchronization costs of these two loop parallelization algorithms on two modern computer architectures (ARM A9 MPCore and Intel Ivy Bridge). Our results show that at least 30% of the execution time of the programs we parallelized is spent on synchronization and data communication. We also show that, besides the problem being hard, these architectures sit at opposite ends of the axis of commonly accepted requisites for efficient loop parallelization. As a consequence, both techniques struggle to speed up several programs effectively. Moreover, this paper presents a novel algorithm, called Batched DOACROSS (BDX), that capitalizes on the advantages of DSWP and DOACROSS while minimizing their deficiencies. BDX does not require new hardware mechanisms (as DSWP does) and uses thread-local buffers to reduce DOACROSS synchronization overheads.
DOI:	10.1109/HPCSim.2015.7237079
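
The summary describes BDX only at a high level, so the paper's actual design cannot be recovered from this record. As a rough illustration of the general idea it names (amortizing DOACROSS-style synchronization through thread-local buffers), the Python sketch below may help. It is an assumption-laden toy, not the authors' algorithm: the names `work`, `sequential`, `batched_doacross`, and the `batch` parameter are all hypothetical, and the loop body is invented for the example. Each thread fills a local buffer with the dependency-free part of a batch of iterations, then waits once and posts once per batch instead of once per iteration.

```python
import threading

MOD = 1_000_003

def work(i):
    # Hypothetical dependency-free part of iteration i.
    return (i * i + 1) % 7

def sequential(n):
    # Reference loop: acc carries a dependency from one iteration to the next.
    acc = 0
    for i in range(n):
        acc = (acc * 3 + work(i)) % MOD
    return acc

def batched_doacross(n, num_threads=4, batch=8):
    """Toy batched DOACROSS-style schedule (illustrative assumption only).

    Returns (final carried value, number of wait/post pairs executed).
    With batch=1 it degenerates to one synchronization per iteration.
    """
    num_batches = (n + batch - 1) // batch
    # ready[b] is set once the carried value entering batch b is available.
    ready = [threading.Event() for _ in range(num_batches + 1)]
    carried = [0] * (num_batches + 1)
    ready[0].set()
    sync_ops = [0] * num_threads  # one slot per thread, no sharing

    def worker(tid):
        # Batches are dealt round-robin; each thread visits its own in order.
        for b in range(tid, num_batches, num_threads):
            lo, hi = b * batch, min((b + 1) * batch, n)
            local = [work(i) for i in range(lo, hi)]  # thread-local buffer
            ready[b].wait()           # single wait for the whole batch
            acc = carried[b]
            for v in local:           # dependent part, applied from the buffer
                acc = (acc * 3 + v) % MOD
            carried[b + 1] = acc
            ready[b + 1].set()        # single post for the whole batch
            sync_ops[tid] += 1

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return carried[num_batches], sum(sync_ops)
```

In this sketch, 100 iterations with `batch=8` need 13 wait/post pairs, against 100 with `batch=1`, and both produce the same value as `sequential(100)`. CPython's global interpreter lock means the example shows only the synchronization structure, not a real speedup.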