Program generation for the all-pairs shortest path problem

A recent trend in computing are domain-specific program generators, designed to alleviate the effort of porting and re-optimizing libraries for fast-changing and increasingly complex computing platforms. Examples include ATLAS, SPIRAL, and the codelet generator in FFTW. Each of these generators prod...

Full description

Saved in:

Bibliographic Details
Published in:	PACT 2006 : proceedings of the Fifteenth International Conference on Parallel Architectures and Compilation Techniques : September 16-20, 2006, Seattle, Washington, USA. pp. 222 - 232
Main Authors:	Han, Sung-Chul, Franchetti, Franz, Puschel, Markus
Format:	Conference Proceeding
Language:	English
Published:	ACM 01.09.2006
Subjects:	Algorithms Automatic programming blocking empirical search Floyd-Warshall algorithm Generators Libraries Open area test sites Shortest path problem SIMD vectorization tiling
Online Access:	Get full text
Tags:	Add Tag No Tags, Be the first to tag this record!

Description
Summary:	A recent trend in computing are domain-specific program generators, designed to alleviate the effort of porting and re-optimizing libraries for fast-changing and increasingly complex computing platforms. Examples include ATLAS, SPIRAL, and the codelet generator in FFTW. Each of these generators produces highly optimized source code directly from a problem specification. In this paper, we extend this list by a program generator for the well-known Floyd-Warshall (FW) algorithm that solves the all-pairs shortest path problem, which is important in a wide range of engineering applications. As the first contribution, we derive variants of the FW algorithm that make it possible to apply many of the optimization techniques developed for matrix-matrix multiplication. The second contribution is the actual program generator, which uses tiling, loop unrolling, and SIMD vectorization combined with a hill climbing search to produce the best code (float or integer) for a given platform. Using the program generator, we demonstrate a speedup over a straightforward single-precision implementation of up to a factor of 1.3 on Pentium 4 and 1.8 on Athlon 64. Use of 4-way vectorization further improves the performance by another factor of up to 5.7 on Pentium 4 and 3.0 on Athlon 64. For data type short integers, 8-way vectorization provides a speed-up of up to 4.6 on Pentium 4 and 5.0 on Athlon 64 over the best scalar code.
DOI:	10.1145/1152154.1152189