Using intra-core loop-task accelerators to improve the productivity and performance of task-based parallel programs

Task-based parallel programming frameworks offer compelling productivity and performance benefits for modern chip multi-processors (CMPs). At the same time, CMPs also provide packed-SIMD units to exploit fine-grain data parallelism. Two fundamental challenges make using packed-SIMD units with task-p...

Celý popis

Uloženo v:
Podrobná bibliografie
Vydáno v:MICRO-50 : the 50th annual IEEE/ACM International Symposium on Microarchitecture : proceedings : October 14-18, 2017, Cambridge, MA s. 759 - 773
Hlavní autoři: Kim, Ji, Jiang, Shunning, Torng, Christopher, Wang, Moyang, Srinath, Shreesha, Ilbeyi, Berkin, Al-Hawaj, Khalid, Batten, Christopher
Médium: Konferenční příspěvek
Jazyk:angličtina
Vydáno: New York, NY, USA ACM 14.10.2017
Edice:ACM Conferences
Témata:
ISBN:1450349528, 9781450349529
ISSN:2379-3155
On-line přístup:Získat plný text
Tagy: Přidat tag
Žádné tagy, Buďte první, kdo vytvoří štítek k tomuto záznamu!
Popis
Shrnutí:Task-based parallel programming frameworks offer compelling productivity and performance benefits for modern chip multi-processors (CMPs). At the same time, CMPs also provide packed-SIMD units to exploit fine-grain data parallelism. Two fundamental challenges make using packed-SIMD units with task-parallel programs particularly difficult: (1) the intra-core parallel abstraction gap; and (2) inefficient execution of irregular tasks. To address these challenges, we propose augmenting CMPs with intra-core loop-task accelerators (LTAs). We introduce a lightweight hint in the instruction set to elegantly encode loop-task execution and an LTA microarchitectural template that can be configured at design time for different amounts of spatial/temporal decoupling to efficiently execute both regular and irregular loop tasks. Compared to an in-order CMP baseline, CMP+LTA results in an average speedup of 4.2X (1.8X area normalized) and similar energy efficiency. Compared to an out-of-order CMP baseline, CMP+LTA results in an average speedup of 2.3X (1.5X area normalized) and also improves energy efficiency by 3.2X. Our work suggests augmenting CMPs with lightweight LTAs can improve performance and efficiency on both regular and irregular loop-task parallel programs with minimal software changes.
ISBN:1450349528
9781450349529
ISSN:2379-3155
DOI:10.1145/3123939.3136952