Research on Efficient CNN Acceleration Through Mixed Precision Quantization: A Comprehensive Methodology

Bibliographic Details
Published in: International Journal of Advanced Computer Science & Applications, Vol. 14, No. 12
Main Authors: He, Yizhi; Liu, Wenlong; Tahir, Muhammad; Li, Zhao; Zhang, Shaoshuang; Amur, Hussain Bux
Format: Journal Article
Language: English
Published: West Yorkshire: Science and Information (SAI) Organization Limited, 2023
ISSN: 2158-107X, 2156-5570
Description
Summary: To overcome the challenges of deploying Convolutional Neural Networks (CNNs) on edge computing devices with limited memory and computing resources, we propose a mixed-precision CNN computation method on a Field Programmable Gate Array (FPGA), built on a hardware-software co-design. First, we devise a CNN quantization method tailored to the fixed-point arithmetic of FPGAs, addressing the computational cost of floating-point parameters. We introduce a bit-width strategy search algorithm that assigns a bit-width to each layer according to the loss variation induced by quantizing it; retraining under the resulting strategy mitigates the degradation in inference accuracy. For the FPGA accelerator, we employ a flow-processing architecture with multiple Processing Elements (PEs) to support mixed-precision CNNs, together with a folding design method that shares PEs between layers and significantly reduces FPGA resource usage. We further design a data-reading scheme that inserts a register-set buffer between memory and the processing elements to reconcile mismatched memory-read and compute speeds. Our implementation of the mixed-precision ResNet20 model on the Kintex-7 Eco R2 development board achieves 91.68% inference accuracy on the CIFAR-10 dataset, an accuracy drop of only 1.21%, while computing 4.27 times faster than a Central Processing Unit (CPU). Compared to a unified 16-bit FPGA accelerator design, our approach achieves an 89-fold increase in computing speed at similar accuracy.
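
The per-layer bit-width assignment described in the summary can be pictured as a greedy sensitivity search: quantize one layer at a time, measure the loss increase it causes, and give loss-sensitive layers wider fixed-point formats. The Python sketch below illustrates this idea under assumed interfaces; the names quantize_weights, eval_loss, model.layers, the candidate bit-widths, and the tolerance tol are hypothetical placeholders, not the authors' implementation.

```python
# A minimal sketch of a greedy per-layer bit-width search.
# Assumes a model exposing a `layers` mapping whose entries hold
# NumPy weight arrays, plus an eval_loss(model) callable.
import copy

def quantize_weights(w, bits):
    """Symmetric fixed-point quantization of a NumPy weight array."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(float(w.max())), abs(float(w.min())), 1e-8) / qmax
    return (w / scale).round().clip(-qmax - 1, qmax) * scale

def search_bit_widths(model, eval_loss, candidate_bits=(4, 8, 16), tol=0.01):
    """Assign each layer the narrowest candidate bit-width whose
    quantization keeps the loss increase over FP32 below tol."""
    base_loss = eval_loss(model)
    strategy = {}
    for name, layer in model.layers.items():
        original = copy.deepcopy(layer.weights)
        for bits in candidate_bits:                    # try narrowest first
            layer.weights = quantize_weights(original, bits)
            if eval_loss(model) - base_loss <= tol:    # sensitivity test
                strategy[name] = bits
                break
        else:
            strategy[name] = candidate_bits[-1]        # widest fallback
        layer.weights = original                       # restore FP32 weights
    return strategy
```

After a strategy is fixed, the retraining step mentioned in the summary would fine-tune the quantized network to recover accuracy; that step is omitted from this sketch.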
DOI: 10.14569/IJACSA.2023.0141282