OctCNN: A High Throughput FPGA Accelerator for CNNs using Octave Convolution Algorithm

Detailed bibliography
Published in: IEEE Transactions on Computers, Volume 71, Issue 8, p. 1
Main authors: Lou, Wenqi; Gong, Lei; Wang, Chao; Du, Zidong; Zhou, Xuehai
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.01.2022
ISSN: 0018-9340, 1557-9956
Description
Summary: With the rapid development of convolutional neural networks (CNNs), FPGAs have become one of the most attractive candidates for deploying CNNs. However, previous FPGA solutions based on the traditional convolution are still limited by computational power. In this article, we introduce the octave convolution (OctConv) into CNN accelerator design for the first time to improve hardware acceleration efficiency, and we design a dedicated OctPU for mapping OctConv to FPGAs, which employs a parallel dataflow pattern to exploit the parallelism of OctConv. We then present a novel and scalable architecture that dynamically combines an inter-layer pipelined structure with a multi-layer reuse structure. Meanwhile, to obtain an optimized solution, we build a multidimensional performance and resource analysis model together with a two-stage search algorithm based on greedy and heuristic methods. We evaluate our proposal by implementing VGG16 and ResNet50 on the Xilinx VU9P FPGA. Experimental results show that our prototypes achieve an average of 3321 GOP/s for the convolutional layers of VGG16 and 2873 GOP/s for the overall ResNet50 using OctConv. Compared to previous works based on the traditional convolution, our prototypes achieve a 1.72× to 2.33× speedup in throughput and a 2.01× to 5.18× improvement in computational density. Our design also presents an excellent compromise between performance and generalization.
DOI: 10.1109/TC.2021.3110413