The Data Locality of Work Stealing
| Published in: | Theory of Computing Systems, Vol. 35, No. 3, pp. 321–347 |
|---|---|
| Main authors: | |
| Format: | Journal Article |
| Language: | English |
| Published: | New York: Springer Nature B.V., 01.05.2002 |
| Subjects: | |
| ISSN: | 1432-4350, 1433-0490 |
| Summary: | This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines, where the movement of data to and from the cache is controlled solely by the hardware. We present lower and upper bounds on the number of cache misses when using work stealing, and introduce a locality-guided work-stealing algorithm together with its experimental validation. As a lower bound, we show that a work-stealing application that exhibits good data locality on a uniprocessor may exhibit poor data locality on a multiprocessor. In particular, we show a family of multithreaded computations $G_n$ whose members perform $\Theta(n)$ operations (work) and incur a constant number of cache misses on a uniprocessor, while even on two processors the total number of cache misses soars to $\Omega(n)$. On the other hand, we show a tight upper bound on the number of cache misses that nested-parallel computations, a large and important class of computations, incur due to multiprocessing. In particular, for nested-parallel computations, we show that on $P$ processors a multiprocessor execution incurs an expected $O(C \lceil m/s \rceil P T_\infty)$ more misses than the uniprocessor execution. Here $m$ is the execution time of an instruction incurring a cache miss, $s$ is the steal time, $C$ is the size of the cache, and $T_\infty$ is the number of nodes on the longest chain of dependencies. Based on this, we give strong execution-time bounds for nested-parallel computations using work stealing. For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional workloads but improves performance by up to 50% over static partitioning under multiprogrammed workloads. Furthermore, locality-guided work stealing improves the performance of work stealing by up to 80%. |
|---|---|
| DOI: | 10.1007/s00224-002-1057-3 |
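
For quick reference, the upper bound quoted in the summary can be written out explicitly. The names $M_P(C)$ and $M_1(C)$ for the multiprocessor and uniprocessor cache-miss counts are introduced here only for illustration and do not appear in the record:

```latex
% Expected cache misses of a nested-parallel computation under work stealing
% on P processors, with cache size C, cache-miss service time m, steal time s,
% and critical-path length T_infty. M_P(C) and M_1(C) are illustrative names
% for the multiprocessor and uniprocessor miss counts.
\[
  \mathbb{E}\bigl[ M_P(C) \bigr] \;\le\; M_1(C)
  \;+\; O\!\left( C \left\lceil \tfrac{m}{s} \right\rceil P\, T_{\infty} \right)
\]
```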
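
The summary describes locality-guided work stealing only at the level of "a thread may have an affinity for a processor." The sketch below is a minimal, illustrative reading of that idea, not the paper's implementation: each worker keeps the usual steal deque plus an affinity "mailbox", an affinity-tagged task is pushed to both, and an atomic flag keeps the duplicate from running twice. All names here (`LocalityGuidedWorkStealing`, `mailboxes`, `claimed`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of locality-guided work stealing: each worker owns a
// steal deque plus an affinity "mailbox". A task tagged with an affinity is
// pushed to both the submitter's deque and the preferred worker's mailbox;
// an atomic "claimed" flag ensures it runs exactly once. Workers drain their
// mailbox first (locality), then their own deque, then steal from a victim.
public class LocalityGuidedWorkStealing {

    static final class Task {
        final Runnable body;
        final int affinity;                              // preferred worker, -1 if none
        final AtomicBoolean claimed = new AtomicBoolean(false);

        Task(Runnable body, int affinity) { this.body = body; this.affinity = affinity; }

        // Runs the body at most once; returns true only for the claiming caller.
        boolean runOnce() {
            if (!claimed.compareAndSet(false, true)) return false;
            body.run();
            return true;
        }
    }

    final int p;                                         // number of workers
    final ConcurrentLinkedDeque<Task>[] deques;          // one steal deque per worker
    final ConcurrentLinkedDeque<Task>[] mailboxes;       // one affinity mailbox per worker
    final AtomicInteger pending = new AtomicInteger();   // tasks not yet executed

    @SuppressWarnings("unchecked")
    LocalityGuidedWorkStealing(int p) {
        this.p = p;
        deques = new ConcurrentLinkedDeque[p];
        mailboxes = new ConcurrentLinkedDeque[p];
        for (int i = 0; i < p; i++) {
            deques[i] = new ConcurrentLinkedDeque<>();
            mailboxes[i] = new ConcurrentLinkedDeque<>();
        }
    }

    // Submit from worker `from`; `affinity` is the worker that last touched the data.
    void submit(int from, Runnable body, int affinity) {
        Task t = new Task(body, affinity);
        pending.incrementAndGet();
        deques[from].addLast(t);                           // ordinary work-stealing push
        if (affinity >= 0) mailboxes[affinity].addLast(t); // locality hint
    }

    void runWorker(int id) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        while (pending.get() > 0) {
            Task t = mailboxes[id].pollFirst();                    // 1. affinity work
            if (t == null) t = deques[id].pollLast();              // 2. own deque (LIFO)
            if (t == null) t = deques[rnd.nextInt(p)].pollFirst(); // 3. steal (FIFO)
            if (t != null && t.runOnce()) pending.decrementAndGet();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int p = 4;
        LocalityGuidedWorkStealing pool = new LocalityGuidedWorkStealing(p);
        // Iterative data-parallel pattern: tag block i with the worker that
        // worked on it previously, so repeated passes hit a warm cache.
        for (int block = 0; block < 16; block++) {
            final int b = block;
            pool.submit(0, () -> System.out.println("block " + b + " ran on "
                    + Thread.currentThread().getName()), b % p);
        }
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < p; i++) {
            final int id = i;
            Thread w = new Thread(() -> pool.runWorker(id), "worker-" + id);
            workers.add(w);
            w.start();
        }
        for (Thread w : workers) w.join();
    }
}
```

In a real scheduler the deques would be lock-free work-stealing deques and the affinity tags would come from which processor touched the data in the previous iteration; this sketch only shows the scheduling order the idea suggests (mailbox first, then the local deque, then a random steal).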