Time stamp algorithms for runtime parallelization of DOACROSS loops with dynamic dependences

Bibliographic Details
Published in:	IEEE Transactions on Parallel and Distributed Systems, Vol. 12, no. 5, pp. 433-450
Main Authors:	Xu, C.-Z., Chaudhary, V.
Format:	Journal Article
Language:	English
Published:	New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.05.2001
Subjects:	Algorithm design and analysis; Algorithms; Computation; Computational fluid dynamics; Computational modeling; Computer Society; Dynamics; Fluid dynamics; Gain; Parallel processing; Performance analysis; Runtime; Sequential analysis; Servers; Sparse matrices; Studies; Tradeoffs; Workload
ISSN:	1045-9219, 1558-2183

Description
Summary:	This paper presents a time stamp algorithm for runtime parallelization of general DOACROSS loops that have indirect access patterns. The algorithm follows the INSPECTOR/EXECUTOR scheme and exploits parallelism at a fine-grained memory reference level. It features a parallel inspector and improves upon previous algorithms of the same generality by exploiting parallelism among consecutive reads of the same memory element. Two variants of the algorithm are considered: One allows partially concurrent reads (PCR) and the other allows fully concurrent reads (FCR). Analyses of their time complexities derive a necessary condition with respect to the iteration workload for runtime parallelization. Experimental results for a Gaussian elimination loop, as well as an extensive set of synthetic loops on a 12-way SMP server, show that the time stamp algorithms outperform iteration-level parallelization techniques in most test cases and gain speedups over sequential execution for loops that have heavy iteration workloads. The PCR algorithm performs best because it makes a better trade-off between maximizing the parallelism and minimizing the analysis overhead. For loops with light or unknown iteration loads, an alternative speculative runtime parallelization technique is preferred.
DOI:	10.1109/71.926166
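
Illustration:	The summary describes an inspector/executor scheme that stamps individual memory references and lets consecutive reads of the same element proceed together. The sketch below is a minimal illustration of that general idea, not the paper's PCR or FCR algorithm: it assumes a loop of the form a[writes[i]] = f(a[reads[i]]), uses a sequential inspector (the paper's inspector is parallel), and simulates each stage with ordinary loops. All function and variable names are hypothetical.

```python
def inspect(reads, writes):
    """Assign a stamp (stage number) to each read and write reference.

    Assumed loop body: a[writes[i]] = f(a[reads[i]]).
    Reads that follow the same write share one stamp, so they may run together.
    """
    last_write = {}  # element -> stamp of its most recent write
    last_read = {}   # element -> stamp of its most recent read
    read_stamp, write_stamp = [], []
    for r, w in zip(reads, writes):
        # A read waits only for the latest earlier write to the same element.
        rs = last_write.get(r, 0) + 1
        last_read[r] = rs
        # A write waits for its own iteration's read and for earlier
        # reads and writes of the element it overwrites.
        ws = max(rs, last_write.get(w, 0), last_read.get(w, 0)) + 1
        last_write[w] = ws
        read_stamp.append(rs)
        write_stamp.append(ws)
    return read_stamp, write_stamp


def execute(a, reads, writes, f, read_stamp, write_stamp):
    """Run the loop stage by stage.

    References within one stage touch no conflicting elements, so a real
    executor could run them in parallel; here they are simulated in order.
    """
    a = list(a)
    n = len(reads)
    tmp = [None] * n  # value loaded by each iteration's read
    for stage in range(1, max(write_stamp, default=0) + 1):
        for i in range(n):
            if read_stamp[i] == stage:
                tmp[i] = a[reads[i]]
        for i in range(n):
            if write_stamp[i] == stage:
                a[writes[i]] = f(tmp[i])
    return a


def run_sequential(a, reads, writes, f):
    """Reference result: the original loop executed in program order."""
    a = list(a)
    for r, w in zip(reads, writes):
        a[w] = f(a[r])
    return a


if __name__ == "__main__":
    reads, writes = [0, 0, 0, 1], [1, 2, 3, 0]
    rs, ws = inspect(reads, writes)
    print(rs, ws)  # the three reads of element 0 share a stamp
    print(execute([10, 0, 0, 0], reads, writes, lambda x: x + 1, rs, ws))
```

In this sketch the number of stages bounds the parallel execution time, while the inspector pass is pure overhead. That is the trade-off the summary refers to when it says runtime parallelization pays off only for loops with heavy iteration workloads.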