Parallel matrix transpose algorithms on distributed memory concurrent computers
This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. We assume that the matrix is distributed over a P × Q processor template with a block cyclic data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability....
| Published in: | Parallel Computing, Vol. 21, No. 9, pp. 1387-1405 |
|---|---|
| Main Authors: | , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Amsterdam: Elsevier B.V., 01.09.1995 |
| ISSN: | 0167-8191, 1872-7336 |
| Summary: | This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. We assume that the matrix is distributed over a P × Q processor template with a block cyclic data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The communication schemes of the algorithms are determined by the greatest common divisor (GCD) of P and Q. If P and Q are relatively prime, the matrix transpose algorithm involves complete exchange communication. If P and Q are not relatively prime, processors are divided into GCD groups and the communication operations are overlapped for different groups of processors. Processors transpose GCD wrapped diagonal blocks simultaneously, and the matrix can be transposed in LCM/GCD steps, where LCM is the least common multiple of P and Q. The algorithms make use of non-blocking, point-to-point communication between processors. The use of non-blocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A · B, the algorithms are used to compute parallel multiplications of transposed matrices, C = Aᵀ · Bᵀ, in the PUMMA package [5]. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer. |
|---|---|
| ISSN: | 0167-8191, 1872-7336 |
| DOI: | 10.1016/0167-8191(95)00016-H |
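As an illustration of the counting argument in the abstract (not code from the paper), the following minimal Python sketch computes the GCD, LCM, and the resulting LCM/GCD step count for a P × Q processor template, along with the block-cyclic owner of a given block; the function names `transpose_schedule` and `owner` are hypothetical, introduced only for this example.

```python
from math import gcd

def transpose_schedule(P, Q):
    """For a P x Q template, return (GCD, LCM, steps), where the
    abstract states the transpose completes in LCM/GCD steps."""
    g = gcd(P, Q)
    lcm = P * Q // g
    return g, lcm, lcm // g

def owner(I, J, P, Q):
    """Block-cyclic distribution: block (I, J) lives on processor
    (I mod P, J mod Q) of the P x Q template; the transposed block
    (J, I) lives on (J mod P, I mod Q)."""
    return (I % P, J % Q)

# P = 4, Q = 6: GCD = 2, so processors split into 2 groups and the
# transpose takes LCM/GCD = 12/2 = 6 steps.
g, lcm, steps = transpose_schedule(4, 6)

# P = 2, Q = 3 are relatively prime (GCD = 1), giving the complete
# exchange case with LCM = 6 steps.
```
When GCD(P, Q) = 1 the group structure degenerates and every step is part of one complete exchange, matching the abstract's distinction between the relatively prime and non-relatively-prime cases.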