Restructuring and implementations of 2D matrix transpose algorithm using SSE4 vector instructions

Bibliographic details
Published in:	2015 International Conference on Applied Research in Computer Science and Engineering (ICAR), pp. 1-7
Main author:	Zekri, Ahmed S.
Format:	Conference paper
Language:	English
Published:	IEEE, 1 October 2015
Subjects:	Kernel; Multicore processing; Parallel processing; Program processors; Registers; Signal processing algorithms; Yttrium
Online access:	Full text

Description
Summary:	Current general-purpose processors are augmented with vector instructions that can process many elements of matrices and vectors in parallel. Transposing a matrix in place is a key kernel operation required by many scientific and engineering applications to shuttle data before, during, or after processing. This operation increases the traffic on the memory bus, and hence clever techniques such as blocking are required to enhance performance. In this paper, we present an enhanced version of a previously published algorithm for transposing a matrix on two-dimensional processor arrays. We restructured this algorithm to fit the one-dimensional vector register architecture added to general-purpose CPUs. We implemented the new vector algorithm using the Intel SSE4 vector instruction set and compared its performance with the standard sequential algorithm and with an existing implementation of Eklundh's algorithm. We also studied automatic compiler optimizations and their effect on the vectorization of the algorithm. The best of our implementations showed a maximum speedup of 1.6 over the sequential algorithm and almost equal performance to the implementation of Eklundh's algorithm.
DOI:	10.1109/ARCSE.2015.7338144
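
Illustration:	The summary mentions blocking as a way to reduce memory traffic during an in-place transpose. The sketch below is a minimal, portable C example of that general idea next to the standard sequential loop. It is not the paper's SSE4 algorithm and not Eklundh's algorithm; the function names `transpose_naive` and `transpose_blocked`, the row-major square `float` layout, and the `block` parameter are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential baseline (assumed form): swap every element above the
 * diagonal of an n x n row-major matrix with its mirror element. */
void transpose_naive(float *a, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = i + 1; j < n; ++j) {
            float t = a[i * n + j];
            a[i * n + j] = a[j * n + i];
            a[j * n + i] = t;
        }
    }
}

/* Blocked variant (illustrative): visit the upper triangle tile by tile
 * so that a tile and its mirror tile are handled together.
 * `block` is a hypothetical tuning parameter, e.g. 4 or 8. */
void transpose_blocked(float *a, size_t n, size_t block)
{
    assert(block > 0);
    for (size_t bi = 0; bi < n; bi += block) {
        size_t i_end = (bi + block < n) ? bi + block : n;
        for (size_t bj = bi; bj < n; bj += block) {
            size_t j_end = (bj + block < n) ? bj + block : n;
            for (size_t i = bi; i < i_end; ++i) {
                /* In a diagonal tile, start past the diagonal so that
                 * each pair of elements is swapped exactly once. */
                size_t j_start = (bi == bj) ? i + 1 : bj;
                for (size_t j = j_start; j < j_end; ++j) {
                    float t = a[i * n + j];
                    a[i * n + j] = a[j * n + i];
                    a[j * n + i] = t;
                }
            }
        }
    }
}
```

Both functions should leave the matrix in the same state; a vectorized version would typically replace the innermost swap loop of the blocked variant with register-level shuffles on small tiles.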