High-Performance Reduction Circuits Using Deeply Pipelined Operators on FPGAs


Bibliographic details
Published in:	IEEE Transactions on Parallel and Distributed Systems, vol. 18, no. 10, pp. 1377-1392
Main authors:	Zhuo, Ling; Morris, G.R.; Prasanna, V.K.
Format:	Journal Article
Language:	English
Published:	New York: IEEE, 1 October 2007; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects:	Acceleration; Adders; Buffers; C.3.e Reconfigurable hardware; Circuits; Clocks; Consumption; Delay; Design; Design engineering; Design methodology; Devices; Field programmable gate arrays; Floating point arithmetic; G.1.0.g Parallel algorithms; Hazards; Parallel processing; Pipeline processing; Reduction
ISSN:	1045-9219, 1558-2183

Description
Abstract:	Field-programmable gate arrays (FPGAs) have become an attractive option for accelerating scientific applications. Many scientific operations such as matrix-vector multiplication and dot product involve the reduction of a sequentially produced stream of values. Unfortunately, because of the pipelining in FPGA-based floating-point units, data hazards may occur during these sequential reduction operations. Improperly designed reduction circuits can adversely impact performance, impose unrealistic buffer requirements, and consume a significant portion of the FPGA. In this paper, we identify two basic methods for designing serial reduction circuits: the tree-traversal method and the striding method. Using accumulation as an example, we analyze the design trade-offs among the number of adders, buffer size, and latency. We then propose high-performance and area-efficient designs using each method. The proposed designs reduce multiple sets of sequentially delivered floating-point values without stalling the pipeline or imposing unrealistic buffer requirements. Using a Xilinx Virtex-II Pro FPGA as the target device, we implemented our designs and present performance and area results.
DOI:	10.1109/TPDS.2007.1068
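
Illustration:	The abstract names a data hazard in pipelined accumulation but this record does not describe the circuits themselves. The following is a minimal software sketch of one common way to avoid that hazard, assuming that "striding" means interleaving as many partial sums as the adder has pipeline stages. The function name `strided_accumulate` and the parameter `alpha` (adder pipeline depth) are hypothetical, and the code is not the authors' design.

```python
from collections import deque


def strided_accumulate(values, alpha):
    """Sum `values` with a simulated adder that has `alpha` pipeline stages.

    Assumption (not taken from the paper): a result issued in cycle t
    only becomes available in cycle t + alpha, so each new input is
    added to the result issued alpha cycles earlier. This produces
    alpha interleaved partial sums instead of one running total.
    """
    if alpha < 1:
        raise ValueError("alpha must be at least 1")

    # The adder pipeline, pre-filled with zeros: one slot per stage.
    pipeline = deque([0.0] * alpha)

    for v in values:
        ready = pipeline.popleft()   # result that has just left the adder
        pipeline.append(ready + v)   # issue the next addition, no stall

    # The pipeline now holds alpha partial sums. Combine them pairwise;
    # in hardware this final step is where extra adders, buffers, or
    # latency are spent, which this sketch does not model.
    partial = list(pipeline)
    while len(partial) > 1:
        partial = [sum(partial[j:j + 2]) for j in range(0, len(partial), 2)]
    return partial[0]


total = strided_accumulate([float(i) for i in range(1, 11)], alpha=4)
```

With `alpha=1` the loop degenerates to an ordinary running sum. Larger values of `alpha` show why a naive single accumulator would need a result that is still in flight.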