Algorithmic optimizations of a conjugate gradient solver on shared memory architectures

Bibliographic Details
Published in:	International journal of parallel, emergent and distributed systems Vol. 21, no. 5, pp. 345-363
Main Authors:	Löf, Henrik; Rantakokko, Jarmo
Format:	Journal Article
Language:	English
Published:	Taylor & Francis Group 01.10.2006
Subjects:	Bandwidth minimization; Conjugate gradients; Iterative solvers; OpenMP; Reversed Cuthill-McKee; Shared memory programming
ISSN:	1744-5760, 1744-5779

Description
Summary:	OpenMP is an architecture-independent language for programming in the shared memory model. It is designed to be simple yet powerful in its programming abstractions. Unfortunately, these architecture-independent abstractions sometimes come at the price of low parallel performance. This is especially true for applications with unstructured data access patterns running on distributed shared memory (DSM) systems, where proper data distribution and algorithmic optimizations play a vital role in performance. In this article, we investigate ways of improving the performance of an industrial-class conjugate gradient (CG) solver implemented in OpenMP and running on two types of shared memory systems. We evaluate bandwidth minimization, graph partitioning, and reformulations of the original algorithm that reduce the number of global barriers. Through a detailed analysis of barrier time and memory system performance, we find that bandwidth minimization is the most important optimization, reducing both L2 cache misses and remote memory accesses. On a uniform memory system, we get perfect scaling. On a NUMA system, the algorithmic optimizations significantly improve performance, leaving the system-dependent global reduction operations as the bottleneck.
DOI:	10.1080/17445760600568139
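
Illustration:	The summary names bandwidth minimization, and the subject terms name reversed Cuthill-McKee (RCM) ordering, as the most effective optimization. The following is a minimal sketch of the general RCM idea in plain Python, not the authors' implementation. The function names `bandwidth` and `reverse_cuthill_mckee` and the six-vertex example graph are hypothetical and chosen only for illustration. The sketch renumbers the vertices of a sparse matrix graph by breadth-first search from a low-degree vertex, visiting neighbours in order of increasing degree, and then reverses the result, which tends to pull nonzero entries toward the diagonal.

```python
from collections import deque


def bandwidth(adj, order):
    """Largest index distance |pos[u] - pos[v]| over all edges under `order`."""
    pos = {v: k for k, v in enumerate(order)}
    return max((abs(pos[u] - pos[v]) for u in adj for v in adj[u]), default=0)


def reverse_cuthill_mckee(adj):
    """Return an RCM ordering of an undirected graph given as {vertex: [neighbours]}."""
    degree = {v: len(adj[v]) for v in adj}
    visited = set()
    order = []
    # Cover every connected component, seeding each from a minimum-degree vertex.
    for start in sorted(adj, key=lambda v: (degree[v], v)):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Enqueue unvisited neighbours in order of increasing degree.
            for w in sorted(adj[v], key=lambda w: (degree[w], w)):
                if w not in visited:
                    visited.add(w)
                    queue.append(w)
    order.reverse()  # the "reverse" step of reverse Cuthill-McKee
    return order


# Hypothetical example: a chain of six unknowns with scrambled labels,
# connected as 0-3-1-5-2-4.
adj = {0: [3], 3: [0, 1], 1: [3, 5], 5: [1, 2], 2: [5, 4], 4: [2]}

natural = sorted(adj)
rcm = reverse_cuthill_mckee(adj)

print("bandwidth, natural ordering:", bandwidth(adj, natural))
print("bandwidth, RCM ordering:    ", bandwidth(adj, rcm))
```

In a sparse matrix-vector product, a smaller bandwidth means each row touches vector entries that lie close together in memory, which is consistent with the reduction in cache misses and remote accesses that the summary reports. A production solver would more likely use a library routine than a hand-written ordering.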