An Efficient GPU Implementation of Inclusion-Based Pointer Analysis

Bibliographic Details
Published in:	IEEE Transactions on Parallel and Distributed Systems, Vol. 27, no. 2, pp. 353-366
Main Authors:	Su, Yu; Ye, Ding; Xue, Jingling; Liao, Xiang-Ke
Format:	Journal Article
Language:	English
Published:	New York: IEEE, 01.02.2016; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects:	Adaptation models; Algorithm design and analysis; Algorithms; Balancing; Benchmarks; Compilers; Computer information security; GPGPU; Graphics processing units; Graphs; Instruction sets; Optimization; Parallel graph algorithms; Partitioning algorithms; Pointer analysis; Switches; Synchronization; Vectors; Workload
ISSN:	1045-9219, 1558-2183

Description
Summary:	We present an efficient GPU implementation of Andersen's whole-program inclusion-based pointer analysis, a fundamental analysis on which many others are built, including those used in optimising compilers, bug detection and security analysis. Andersen's algorithm makes extensive modifications to the graph that represents the pointer-manipulating statements in a program. These modifications are highly irregular, input-dependent and statically unpredictable, making it much more challenging to balance such graph workloads across a multitude of GPU cores than those handled by traditional graph algorithms such as DFS and BFS. To parallelise Andersen's analysis efficiently on GPUs, we introduce an imbalance-aware workload partitioning scheme that divides the workload dynamically among the concurrent warps, initially in a warp-centric manner (during the coarse-grain stage) and later switching to a task-pool-based model when a workload imbalance is detected (during the fine-grain stage). We further improve its performance by using an adaptive group propagation scheme to reduce some redundant traversals. For a set of 14 C benchmarks, our parallel implementation of Andersen's analysis achieves a significant speedup of 46 percent on average over the state of the art on an NVIDIA Tesla K20c GPU.
DOI:	10.1109/TPDS.2015.2397933
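
Note:	The summary describes Andersen's analysis as repeatedly modifying a constraint graph built from pointer-manipulating statements. As a rough illustration of why that workload is irregular, the following is a minimal sequential worklist sketch of inclusion-based pointer analysis in Python. It is not the authors' GPU implementation and does not model warps, the task pool, or group propagation; the function name `andersen` and the tuple-based statement encoding are assumptions made only for this example.

```python
from collections import defaultdict, deque

def andersen(addr_of, copies, loads, stores):
    """Hypothetical helper: each argument is a list of (lhs, rhs) pairs.
    addr_of: lhs = &rhs   copies: lhs = rhs
    loads:   lhs = *rhs   stores: *lhs = rhs
    """
    pts = defaultdict(set)          # variable -> locations it may point to
    succ = defaultdict(set)         # edge src -> dst means pts(src) is a subset of pts(dst)
    loads_from = defaultdict(list)  # q -> [p, ...] for each statement p = *q
    stores_to = defaultdict(list)   # p -> [q, ...] for each statement *p = q

    for p, a in addr_of:
        pts[p].add(a)
    for p, q in copies:
        succ[q].add(p)
    for p, q in loads:
        loads_from[q].append(p)
    for p, q in stores:
        stores_to[p].append(q)

    worklist = deque(pts.keys())
    while worklist:
        n = worklist.popleft()
        # Load and store constraints add new graph edges as pts(n) grows;
        # this is the input-dependent graph modification the summary mentions.
        for a in list(pts[n]):
            for p in loads_from[n]:   # p = *n  gives edge a -> p
                if p not in succ[a]:
                    succ[a].add(p)
                    worklist.append(a)
            for q in stores_to[n]:    # *n = q  gives edge q -> a
                if a not in succ[q]:
                    succ[q].add(a)
                    worklist.append(q)
        # Propagate points-to facts along the current outgoing edges of n.
        for m in list(succ[n]):
            before = len(pts[m])
            pts[m] |= pts[n]
            if len(pts[m]) != before:
                worklist.append(m)
    return {v: s for v, s in pts.items() if s}

# Example program: p = &a; q = &b; *p = q; r = *p;
result = andersen(
    addr_of=[("p", "a"), ("q", "b")],
    copies=[],
    loads=[("r", "p")],
    stores=[("p", "q")],
)
```

In this example the store through `p` makes `a` point to `b`, and the load through `p` then makes `r` point to `b` as well. The amount of work per node depends on how large its points-to set becomes at run time, which is the kind of imbalance a dynamic partitioning scheme has to cope with.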