A framework for high-performance matrix multiplication based on hierarchical abstractions, algorithms and optimized low-level kernels

Detailed Bibliography
Published in: Concurrency and Computation, Vol. 14, No. 10, pp. 805-839
Main Authors: Valsalam, Vinod; Skjellum, Anthony
Format: Journal Article
Language: English
Publication Details: Chichester, UK: John Wiley & Sons, Ltd, 25 August 2002
ISSN: 1532-0626, 1532-0634
Description
Summary: Despite extensive research, optimal performance has not easily been available previously for matrix multiplication (especially for large matrices) on most architectures because of the lack of a structured approach and the limitations imposed by matrix storage formats. A simple but effective framework is presented here that lays the foundation for building high-performance matrix-multiplication codes in a structured, portable and efficient manner. The resulting codes are validated on three different representative RISC and CISC architectures on which they significantly outperform highly optimized libraries such as ATLAS and other competing methodologies reported in the literature. The main component of the proposed approach is a hierarchical storage format that efficiently generalizes the applicability of the memory hierarchy friendly Morton ordering to arbitrary-sized matrices. The storage format supports polyalgorithms, which are shown here to be essential for obtaining the best possible performance for a range of problem sizes. Several algorithmic advances are made in this paper, including an oscillating iterative algorithm for matrix multiplication and a variable recursion cutoff criterion for Strassen's algorithm. The authors expose the need to standardize linear algebra kernel interfaces, distinct from the BLAS, for writing portable high-performance code. These kernel routines operate on small blocks that fit in the L1 cache. The performance advantages of the proposed framework can be effectively delivered to new and existing applications through the use of object-oriented or compiler-based approaches. Copyright © 2002 John Wiley & Sons, Ltd.
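The Morton (Z-order) ordering mentioned in the abstract maps a 2-D matrix index to a 1-D storage offset by interleaving the bits of the row and column coordinates, so that small sub-blocks of the matrix occupy contiguous memory. The sketch below illustrates only this basic bit-interleaving idea; it is not the paper's actual storage format, which generalizes Morton ordering to arbitrary-sized (non-power-of-two) matrices.

```python
def morton_index(row: int, col: int) -> int:
    """Interleave the bits of (row, col) into a single Z-order index.

    Bit i of `row` lands at position 2*i + 1 and bit i of `col` at
    position 2*i, so the four elements of any aligned 2x2 block get
    consecutive indices -- the cache-locality property that
    Morton-ordered storage formats exploit.
    """
    idx = 0
    for bit in range(16):  # supports matrix dimensions up to 2**16
        idx |= ((row >> bit) & 1) << (2 * bit + 1)
        idx |= ((col >> bit) & 1) << (2 * bit)
    return idx

# Traversal order of a 4x4 matrix under Morton ordering: each aligned
# 2x2 sub-block is stored contiguously, and the blocks themselves
# recurse in the same Z pattern.
order = sorted(((r, c) for r in range(4) for c in range(4)),
               key=lambda rc: morton_index(*rc))
```

Because each recursion level keeps a quadrant of the matrix contiguous, a blocked matrix-multiplication kernel working on such a layout touches long runs of consecutive memory at every level of the cache hierarchy, which is the motivation the abstract gives for generalizing this ordering.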
Bibliography: ArticleID: CPE630
DOI: 10.1002/cpe.630