Using intra-core loop-task accelerators to improve the productivity and performance of task-based parallel programs

Bibliographic Details
Published in: MICRO-50: the 50th annual IEEE/ACM International Symposium on Microarchitecture: proceedings: October 14-18, 2017, Cambridge, MA, pp. 759-773
Main Authors: Kim, Ji; Jiang, Shunning; Torng, Christopher; Wang, Moyang; Srinath, Shreesha; Ilbeyi, Berkin; Al-Hawaj, Khalid; Batten, Christopher
Format: Conference Proceeding
Language: English
Published: New York, NY, USA: ACM, 14 October 2017
Series: ACM Conferences
ISBN: 1450349528, 9781450349529
ISSN: 2379-3155
Description
Summary: Task-based parallel programming frameworks offer compelling productivity and performance benefits for modern chip multi-processors (CMPs). At the same time, CMPs also provide packed-SIMD units to exploit fine-grain data parallelism. Two fundamental challenges make using packed-SIMD units with task-parallel programs particularly difficult: (1) the intra-core parallel abstraction gap; and (2) inefficient execution of irregular tasks. To address these challenges, we propose augmenting CMPs with intra-core loop-task accelerators (LTAs). We introduce a lightweight hint in the instruction set to elegantly encode loop-task execution and an LTA microarchitectural template that can be configured at design time for different amounts of spatial/temporal decoupling to efficiently execute both regular and irregular loop tasks. Compared to an in-order CMP baseline, CMP+LTA results in an average speedup of 4.2X (1.8X area normalized) and similar energy efficiency. Compared to an out-of-order CMP baseline, CMP+LTA results in an average speedup of 2.3X (1.5X area normalized) and also improves energy efficiency by 3.2X. Our work suggests augmenting CMPs with lightweight LTAs can improve performance and efficiency on both regular and irregular loop-task parallel programs with minimal software changes.
DOI: 10.1145/3123939.3136952