Fast and Efficient Graph Traversal Algorithm for CPUs: Maximizing Single-Node Efficiency

Bibliographic Details
Published in:	2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp. 378-389
Main Authors:	Chhugani, J.; Satish, N.; Kim, Changkyu; Sewall, J.; Dubey, P.
Format:	Conference Proceeding
Language:	English
Published:	IEEE 01.05.2012
Subjects:	Arrays; Bandwidth; efficient; Graph traversal; Instruction sets; multi-socket; Partitioning algorithms; single node; Sockets
ISBN:	1467309753, 9781467309752
ISSN:	1530-2075

Description
Summary:	Graph-based structures are being increasingly used to model data and relations among data in a number of fields. Graph-based databases are becoming more popular as a means to better represent such data. Graph traversal is a key component in graph algorithms such as reachability and graph matching. Since the scale of data stored and queried in these databases is increasing, it is important to obtain high performing implementations of graph traversal that can efficiently utilize the processing power of modern processors. In this work, we present a scalable Breadth-First Search Traversal algorithm for modern multi-socket, multi-core CPUs. Our algorithm uses lock- and atomic-free operations on a cache-resident structure for arbitrary sized graphs to filter out expensive main memory accesses, and completely and efficiently utilizes all available bandwidth resources. We propose a work distribution approach for multi-socket platforms that ensures load-balancing while keeping cross-socket communication low. We provide a detailed analytical model that accurately projects the performance of our single- and multi-socket traversal algorithms to within 5-10% of obtained performance. Our analytical model serves as a useful tool to analyze performance bottlenecks on modern CPUs. When measured on various synthetic and real-world graphs with a wide range of graph sizes, vertex degrees and graph diameters, our implementation on a dual-socket Intel® Xeon® X5570 (Intel microarchitecture code name Nehalem) system achieves 1.5X-13.2X performance speedup over the best reported numbers. We achieve around 1 Billion traversed edges per second on a scale-free R-MAT graph with 64M vertices and 2 Billion edges on a dual-socket Nehalem system. Our optimized algorithm is useful as a building block for efficient multi-node implementations and future exascale systems, thereby allowing them to ride the trend of increasing per-node compute and bandwidth resources.
DOI:	10.1109/IPDPS.2012.43
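
Illustration:	The summary describes a breadth-first traversal that consults a small, cache-resident structure before touching main memory. The sketch below is a simplified, single-threaded illustration of that general idea, not the authors' implementation: it assumes the cache-resident structure is a one-bit-per-vertex visited bitmap, and the function name `bfs_with_bitmap`, the adjacency-list input, and the edge counter are hypothetical choices made for this example. The paper's lock-free multi-threading and multi-socket work distribution are not reproduced here.

```python
def bfs_with_bitmap(adj, source):
    """Level-by-level BFS over an adjacency list.

    Returns (depth, edges_traversed), where depth[v] is the BFS level of v
    or -1 if v is unreachable from source.
    """
    n = len(adj)
    # Assumption: one bit per vertex, so the bitmap is far smaller than the
    # depth array and is more likely to stay in cache.
    visited = bytearray((n + 7) // 8)
    depth = [-1] * n

    visited[source >> 3] |= 1 << (source & 7)
    depth[source] = 0
    frontier = [source]
    edges_traversed = 0
    level = 0

    while frontier:
        level += 1
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                edges_traversed += 1
                byte, mask = v >> 3, 1 << (v & 7)
                if visited[byte] & mask:
                    # Filtered by the bitmap: the larger depth array
                    # is never touched for this edge.
                    continue
                visited[byte] |= mask
                depth[v] = level
                next_frontier.append(v)
        frontier = next_frontier

    return depth, edges_traversed


# Small example: a 4-cycle (0-1-3-2-0) plus an isolated vertex 4.
example_graph = [[1, 2], [0, 3], [0, 3], [1, 2], []]
example_depth, example_edges = bfs_with_bitmap(example_graph, 0)
```

Dividing the returned edge count by the measured wall-clock time gives a traversed-edges-per-second figure, the metric quoted in the summary.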