Efficient pre-processing in the parallel block-Jacobi SVD algorithm

Bibliographic Details
Published in:	Parallel computing Vol. 32; no. 2; pp. 166 - 176
Main Authors:	Okša, Gabriel, Vajteršic, Marián
Format:	Journal Article
Language:	English
Published:	Elsevier B.V., 01.02.2006
Subjects:	Cluster of personal computers; LQ factorization; Message passing interface; Parallel computation; QR factorization with column pivoting; Singular value decomposition; Two-sided block-Jacobi method
ISSN:	0167-8191, 1872-7336

Description
Summary:	One way to speed up the computation of the singular value decomposition of a given matrix $A \in \mathbb{C}^{m \times n}$, $m \geq n$, by the parallel two-sided block-Jacobi method is to apply pre-processing steps that concentrate the Frobenius norm near the diagonal. Such a concentration is expected to reduce the number of outer parallel iteration steps needed for the convergence of the entire algorithm. It is shown experimentally that QR factorization with complete column pivoting, optionally followed by LQ factorization of the R-factor, can substantially decrease the number of outer parallel iteration steps; the details depend on the condition number and on the distribution of singular values, including their multiplicity. A subset of ill-conditioned matrices has been identified for which the dynamic ordering becomes inefficient. The best results in numerical experiments performed on a cluster of personal computers were achieved for well-conditioned matrices with a multiple minimal singular value, where the number of parallel iteration steps was reduced by two orders of magnitude. However, the gain in speed, as measured by the total parallel execution time, depends decisively on the implementation of the distributed QR and LQ factorizations on a given parallel architecture. In general, a reduction of the total parallel execution time of up to one order of magnitude has been achieved.
DOI:	10.1016/j.parco.2005.06.006
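
The pre-processing named in the summary (QR factorization with column pivoting, optionally followed by LQ factorization of the R-factor) can be illustrated with a small serial sketch. This is not the authors' distributed implementation: it assumes a real, full-column-rank test matrix, uses a hand-written modified Gram-Schmidt routine for the pivoted QR step, and the helper names `qr_column_pivoting` and `off_diagonal_fraction` are hypothetical. It only measures how much of the Frobenius norm lies off the diagonal before and after the LQ step.

```python
import numpy as np


def qr_column_pivoting(a):
    """Modified Gram-Schmidt QR with column pivoting.

    Returns q, r, perm such that a[:, perm] ~= q @ r.
    Illustrative only; assumes a has full column rank.
    """
    work = np.array(a, dtype=float)
    m, n = work.shape
    q = np.zeros((m, n))
    r = np.zeros((n, n))
    perm = np.arange(n)
    for k in range(n):
        # Swap the remaining column of largest norm into position k.
        j = k + int(np.argmax(np.linalg.norm(work[:, k:], axis=0)))
        work[:, [k, j]] = work[:, [j, k]]
        r[:, [k, j]] = r[:, [j, k]]
        perm[[k, j]] = perm[[j, k]]
        # Normalize column k and orthogonalize the remaining columns against it.
        r[k, k] = np.linalg.norm(work[:, k])
        q[:, k] = work[:, k] / r[k, k]
        r[k, k + 1:] = q[:, k] @ work[:, k + 1:]
        work[:, k + 1:] -= np.outer(q[:, k], r[k, k + 1:])
    return q, r, perm


def off_diagonal_fraction(b):
    """Fraction of the Frobenius norm of a square matrix lying off the diagonal."""
    off = b - np.diag(np.diag(b))
    return np.linalg.norm(off) / np.linalg.norm(b)


# Assumed test data: a random real 8 x 5 matrix (m >= n).
rng = np.random.default_rng(0)
a = rng.standard_normal((8, 5))

# Step 1: QR with column pivoting, a[:, perm] = q @ r, r upper triangular.
q, r, perm = qr_column_pivoting(a)

# Step 2 (optional): LQ factorization of r, obtained from the QR of r.T.
# r.T = q2 @ r2  implies  r = r2.T @ q2.T, so the L-factor is r2.T.
q2, r2 = np.linalg.qr(r.T)
l = r2.T

off_r = off_diagonal_fraction(r)
off_l = off_diagonal_fraction(l)
print("off-diagonal fraction of R:", off_r)
print("off-diagonal fraction of L:", off_l)
```

The factors R and L have the same singular values as A, so a Jacobi-type SVD iteration could start from either of them instead of A. For production use, a library routine for pivoted QR, such as LAPACK's `xGEQP3` or `scipy.linalg.qr` with `pivoting=True`, would be the more robust choice than the Gram-Schmidt loop shown here.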