Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA


Detailed description

Saved in:
Bibliographic details
Published in: International Conference on Field-programmable Logic and Applications, pp. 1-8
Main authors: Yufei Ma, Suda, Naveen, Yu Cao, Jae-sun Seo, Vrudhula, Sarma
Format: Conference proceedings
Language: English
Published: EPFL, 01.08.2016
Subjects:
ISSN: 1946-1488
Online access: Full text
Abstract Despite its popularity, deploying Convolutional Neural Networks (CNNs) on a portable system is still challenging due to large data volume, intensive computation and frequent memory access. Although previous FPGA acceleration schemes generated by high-level synthesis tools (i.e., HLS, OpenCL) have allowed for fast design optimization, hardware inefficiency still exists when allocating FPGA resources to maximize parallelism and throughput. A direct hardware-level design (i.e., RTL) can improve the efficiency and achieve greater acceleration. However, this requires an in-depth understanding of both the algorithm structure and the FPGA system architecture. In this work, we present a scalable solution that integrates the flexibility of high-level synthesis and the finer-level optimization of an RTL implementation. The cornerstone is a compiler that analyzes the CNN structure and parameters, and automatically generates a set of modular and scalable computing primitives that can accelerate various deep learning algorithms. Integrating these modules together for end-to-end CNN implementations, this work quantitatively analyzes the compiler's design strategy to optimize the throughput of a given CNN model with the FPGA resource constraints. The proposed methodology is demonstrated on Altera Stratix-V GXA7 FPGA for AlexNet and NIN CNN models, achieving 114.5 GOPS and 117.3 GOPS, respectively. This represents a 1.9× improvement in throughput when compared to the OpenCL-based design. The results illustrate the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning.
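The flow the abstract describes, a compiler that reads CNN layer parameters and sizes modular compute primitives against the FPGA's resource budget, can be illustrated with a toy sketch. All names here (`ConvLayer`, `pick_unroll`) and the single-knob "unroll factor" resource model are hypothetical simplifications for illustration; the paper's actual compiler emits RTL modules, not Python.

```python
from dataclasses import dataclass


@dataclass
class ConvLayer:
    # Hypothetical per-layer parameters a CNN-to-RTL compiler would parse
    name: str
    out_ch: int   # number of output feature maps (filters)
    in_ch: int    # number of input feature maps
    k: int        # square kernel size
    out_dim: int  # output feature-map width/height

    def macs(self) -> int:
        # Multiply-accumulate operations for one inference of this layer
        return self.out_ch * self.in_ch * self.k * self.k * self.out_dim * self.out_dim


def pick_unroll(layers, dsp_budget, macs_per_dsp=1):
    """Toy resource allocation: pick one unroll factor (parallel MAC
    lanes) that fits the DSP budget, then estimate the cycle count."""
    unroll = max(1, dsp_budget // macs_per_dsp)
    total_macs = sum(l.macs() for l in layers)
    # Ceiling division per layer: each layer runs on the shared lanes
    cycles = sum(-(-l.macs() // unroll) for l in layers)
    return unroll, total_macs, cycles


# AlexNet's first conv layer as an example shape: 96 filters over
# 3 input channels, 11x11 kernels, 55x55 output maps (~105M MACs)
layers = [ConvLayer("conv1", 96, 3, 11, 55)]
unroll, total, cycles = pick_unroll(layers, dsp_budget=256)
print(unroll, total, cycles)
```

Sweeping `dsp_budget` in a sketch like this mirrors the quantitative trade-off the paper's compiler makes: more parallel lanes cut the cycle count (raising GOPS) until the device's DSP, BRAM, or bandwidth limit is reached.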
Author 1. Yufei Ma, yufeima@asu.edu, Sch. of Electr., Comput. & Energy Eng., Arizona State Univ., Tempe, AZ, USA
2. Naveen Suda, nsuda@asu.edu, Sch. of Electr., Comput. & Energy Eng., Arizona State Univ., Tempe, AZ, USA
3. Yu Cao, yu.cao@asu.edu, Sch. of Electr., Comput. & Energy Eng., Arizona State Univ., Tempe, AZ, USA
4. Jae-sun Seo, jaesun.seo@asu.edu, Sch. of Electr., Comput. & Energy Eng., Arizona State Univ., Tempe, AZ, USA
5. Sarma Vrudhula, vrudhula@asu.edu, Sch. of Comput., Inf., Decision Syst. Eng., Arizona State Univ., Tempe, AZ, USA
BookMark eNotkLtOwzAYRg0CiVKyI7H4BRLs2PFlrCJSkCKoaJkr2_ktBZy4ygUET8-tZzk6yzd8l-isjz0gdE1JRinRt9WmznJCRSYLKVkhTlCipcoV05oqzvNTtKCai5RypS5QMo6v5IeCS1WIBdpunQnGBsCmb3AXmzmYof2CBj_vauxid2iDmdrY4-hxGfv3GObfNAE_wjz8afqIw9uIYz9FXG3Wqyt07k0YITl6iV6qu115n9ZP64dyVadtzumUeicM5OAEWK-5JUILa3xBtfVcKPDWWWKASQXcNYQJxzwHlVNDiPdSM7ZEN_-7LQDsD0PbmeFzf7yBfQNGN1O0
ContentType Conference Proceeding
DOI 10.1109/FPL.2016.7577356
EISBN 9782839918442
2839918447
EISSN 1946-1488
EndPage 8
ExternalDocumentID 7577356
Genre orig-research
ISICitedReferencesCount 99
IsPeerReviewed false
IsScholarly false
Language English
PageCount 8
PublicationDate 2016-08
PublicationTitle International Conference on Field-programmable Logic and Applications
PublicationTitleAbbrev FPL
PublicationYear 2016
Publisher EPFL
StartPage 1
SubjectTerms Acceleration
Algorithm design and analysis
Convolution
Convolutional neural networks
Field programmable gate arrays
FPGA
Hardware
hardware acceleration
Kernel
Memory management
Title Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA
URI https://ieeexplore.ieee.org/document/7577356