High-Performance Reduction Circuits Using Deeply Pipelined Operators on FPGAs

Bibliographic details
Published in:	IEEE Transactions on Parallel and Distributed Systems, Volume 18, Issue 10, pp. 1377-1392
Main authors:	Ling Zhuo, Morris, G.R., Prasanna, V.K.
Format:	Journal Article
Language:	English
Publication details:	New York: IEEE, 01.10.2007; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects:	Acceleration; Adders; Buffers; C.3.e Reconfigurable hardware; Circuits; Clocks; Consumption; Delay; Design; Design engineering; Design methodology; Devices; Field programmable gate arrays; Floating point arithmetic; G.1.0.g Parallel algorithms; Hazards; Parallel processing; Pipeline processing; Reduction
ISSN:	1045-9219, 1558-2183

Description
Summary:	Field-programmable gate arrays (FPGAs) have become an attractive option for accelerating scientific applications. Many scientific operations such as matrix-vector multiplication and dot product involve the reduction of a sequentially produced stream of values. Unfortunately, because of the pipelining in FPGA-based floating-point units, data hazards may occur during these sequential reduction operations. Improperly designed reduction circuits can adversely impact performance, impose unrealistic buffer requirements, and consume a significant portion of the FPGA. In this paper, we identify two basic methods for designing serial reduction circuits: the tree-traversal method and the striding method. Using accumulation as an example, we analyze the design trade-offs among the number of adders, buffer size, and latency. We then propose high-performance and area-efficient designs using each method. The proposed designs reduce multiple sets of sequentially delivered floating-point values without stalling the pipeline or imposing unrealistic buffer requirements. Using a Xilinx Virtex-II Pro FPGA as the target device, we implemented our designs and present performance and area results.
DOI:	10.1109/TPDS.2007.1068
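
Illustrative note: the summary above names a data hazard (a pipelined adder cannot reuse its own result until several cycles later) and a "striding method" for working around it. The software sketch below shows one common reading of that idea and is not the circuit from the paper: inputs are spread across `latency` interleaved partial sums, so each slot is updated only once every `latency` steps, and the partial sums are then combined pairwise. The function names, the `latency` parameter, and the final pairwise combine are assumptions made for illustration.

```python
from typing import Iterable, List


def strided_partial_sums(values: Iterable[float], latency: int) -> List[float]:
    """Accumulate inputs into `latency` interleaved partial sums.

    Assumption: `latency` models the adder pipeline depth. Slot i % latency
    is touched only every `latency` inputs, so in hardware the previous add
    to that slot would already have left the pipeline.
    """
    if latency < 1:
        raise ValueError("latency must be at least 1")
    partial = [0.0] * latency
    for i, v in enumerate(values):
        partial[i % latency] += v
    return partial


def pairwise_reduce(values: List[float]) -> float:
    """Combine values level by level, adding neighbouring pairs."""
    level = list(values)
    if not level:
        return 0.0
    while len(level) > 1:
        nxt = [level[j] + level[j + 1] for j in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element carries over to the next level
        level = nxt
    return level[0]


def strided_reduce(values: Iterable[float], latency: int) -> float:
    """Full reduction: strided accumulation, then a pairwise combine."""
    return pairwise_reduce(strided_partial_sums(values, latency))
```

For example, `strided_partial_sums([1, 2, 3, 4, 5, 6, 7], 3)` groups the inputs as (1, 4, 7), (2, 5), and (3, 6). The sketch ignores cycle timing and buffer sizing, which are the trade-offs the paper analyzes; floating-point results may also differ slightly from a strictly sequential sum because the additions are reordered.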