NN2FPGA: Optimizing CNN Inference on FPGAs With Binary Integer Programming

Bibliographic Details
Published in: IEEE transactions on computer-aided design of integrated circuits and systems, Vol. 44, No. 5, pp. 1807-1818
Main Authors: Bosio, Roberto; Minnella, Filippo; Urso, Teodoro; Casu, Mario R.; Lavagno, Luciano; Lazarescu, Mihai T.; Pasini, Paolo
Format: Journal Article
Language: English
Published: New York, IEEE, 1 May 2025
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN: 0278-0070, 1937-4151
Description
Summary: Skip connections have emerged as a key component of modern convolutional neural networks (CNNs) for computer vision tasks, allowing for the creation of more accurate and deeper models by addressing the vanishing gradient problem. However, existing field-programmable gate array (FPGA)-based accelerators for ResNets and MobileNetV2 often suffer decreased performance and increased computational latency because of how skip blocks are implemented. This article presents a novel framework for developing deep learning models on FPGAs that focuses on skip connections, with a unique approach to reduce buffering overhead, which results in more efficient use of resources in the implementation of the skip layer. The nn2fpga compiler follows a thorough set of high-level synthesis (HLS) design principles and optimization strategies, exploiting standard techniques in novel ways to effectively map skip connection-based networks into static dataflow accelerators. To maximize throughput and use the available resources efficiently, our compiler employs a fast and effective design space exploration method based on a binary integer programming model that accurately assigns FPGA resources to the network layers, first maximizing global throughput under resource constraints and then minimizing the resources needed for the achieved maximum throughput. Experimental results on the CIFAR-10 and ImageNet datasets demonstrate substantial gains in throughput (3× to 7× over past HLS-based work) for ResNet8, ResNet20, and MobileNetV2 models deployed on various Xilinx FPGA boards. Notably, MobileNetV2 deployed on the ZCU102 achieves a throughput of 2115 frames per second, a 10% speedup over a state-of-the-art, highly optimized manual register-transfer level implementation, showing that HLS can actually improve over manual design, thanks to the faster exploration of the design space.
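Illustrative note: the two-stage objective described in the summary (maximize throughput under a resource budget, then minimize resources at that throughput) can be sketched as a tiny one-hot selection problem. The sketch below is not the nn2fpga formulation. The function name `explore`, the `LAYERS` table, and all throughput and cost numbers are hypothetical, and it uses exhaustive search rather than an integer programming solver. It assumes a dataflow pipeline whose throughput is set by its slowest layer.

```python
from itertools import product

# Hypothetical per-layer options: (throughput, resource cost).
# Picking exactly one option per layer corresponds to one-hot binary variables.
LAYERS = [
    [(1, 2), (2, 4), (4, 8)],
    [(1, 3), (2, 6), (4, 12)],
    [(1, 1), (2, 2), (4, 4)],
]

def explore(layers, budget):
    """Return (throughput, cost, choice) or None if nothing fits the budget.

    Stage 1: maximize pipeline throughput (the minimum over layers).
    Stage 2: among equally fast choices, minimize the total cost.
    Both stages are folded into one lexicographic comparison.
    """
    best_key, best_choice = None, None
    for choice in product(*(range(len(opts)) for opts in layers)):
        cost = sum(layers[i][k][1] for i, k in enumerate(choice))
        if cost > budget:
            continue  # violates the resource constraint
        throughput = min(layers[i][k][0] for i, k in enumerate(choice))
        key = (throughput, -cost)  # higher throughput first, then lower cost
        if best_key is None or key > best_key:
            best_key, best_choice = key, choice
    if best_key is None:
        return None
    return best_key[0], -best_key[1], best_choice

if __name__ == "__main__":
    print(explore(LAYERS, budget=14))
```

With a budget of 14, no assignment can run every layer at throughput 4 (that would cost 24), so the search settles on throughput 2 at the lowest cost that reaches it. Exhaustive search grows exponentially with the number of layers, so a real network would need an actual integer programming solver.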
DOI: 10.1109/TCAD.2024.3507570