BOLT: Optimizing OpenMP Parallel Regions with User-Level Threads

OpenMP is widely used by a number of applications, computational libraries, and runtime systems. As a result, multiple levels of the software stack use OpenMP independently of one another, often leading to nested parallel regions. Although exploiting such nested parallelism is a potential opportunit...

Full description

Saved in:

Bibliographic Details
Published in:	Proceedings / International Conference on Parallel Architectures and Compilation Techniques pp. 29 - 42
Main Authors:	Iwasaki, Shintaro, Amer, Abdelhalim, Taura, Kenjiro, Seo, Sangmin, Balaji, Pavan
Format:	Conference Proceeding
Language:	English
Published:	IEEE 01.09.2019
Subjects:	Concurrency control Fasteners Libraries Multithreading OpenMP Parallel processing Runtime Runtime Systems Software Task analysis User Level Threads
ISSN:	2641-7936
Online Access:	Get full text
Tags:	Add Tag No Tags, Be the first to tag this record!

Description
Summary:	OpenMP is widely used by a number of applications, computational libraries, and runtime systems. As a result, multiple levels of the software stack use OpenMP independently of one another, often leading to nested parallel regions. Although exploiting such nested parallelism is a potential opportunity for performance improvement, it often causes destructive performance with leading OpenMP runtimes because of their reliance on heavyweight OS-level threads. User-level threads (ULTs) are more lightweight alternatives but existing ULT-based runtimes suffer from several shortcomings: 1) thread management costs remain significant and outweigh the benefits from additional parallelism; 2) the shift to ULTs often hurts the more common flat parallelism case; and 3) absence of user control over thread-to-CPU binding, a critical feature on modern systems. This paper presents BOLT, a practical ULT-based OpenMP runtime system that efficiently supports both flat and nested parallelism. This is accomplished on three fronts: 1) advanced data reuse and thread synchronization strategies; 2) thread coordination that adapts to the level of oversubscription; and 3) an implementation of the modern OpenMP thread-to-CPU binding interface tailored to ULT-based runtimes. The result is a highly optimized runtime that transparently achieves similar performance compared with leading state-of-the-art widely used OpenMP runtimes under flat parallelism, while outperforming all existing runtimes under nested parallelism.
ISSN:	2641-7936
DOI:	10.1109/PACT.2019.00011