Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA


Detailed description

Saved in:
Bibliographic details
Published in: International Conference on Field-programmable Logic and Applications, pp. 1-8
Main authors: Yufei Ma, Suda, Naveen, Yu Cao, Jae-sun Seo, Vrudhula, Sarma
Format: Conference proceedings
Language: English
Published: EPFL, 01.08.2016
Subjects:
ISSN: 1946-1488
Online access: Full text
Abstract Despite its popularity, deploying Convolutional Neural Networks (CNNs) on a portable system is still challenging due to large data volume, intensive computation and frequent memory access. Although previous FPGA acceleration schemes generated by high-level synthesis tools (i.e., HLS, OpenCL) have allowed for fast design optimization, hardware inefficiency still exists when allocating FPGA resources to maximize parallelism and throughput. A direct hardware-level design (i.e., RTL) can improve the efficiency and achieve greater acceleration. However, this requires an in-depth understanding of both the algorithm structure and the FPGA system architecture. In this work, we present a scalable solution that integrates the flexibility of high-level synthesis and the finer-level optimization of an RTL implementation. The cornerstone is a compiler that analyzes the CNN structure and parameters, and automatically generates a set of modular and scalable computing primitives that can accelerate various deep learning algorithms. Integrating these modules together for end-to-end CNN implementations, this work quantitatively analyzes the compiler's design strategy to optimize the throughput of a given CNN model with the FPGA resource constraints. The proposed methodology is demonstrated on Altera Stratix-V GXA7 FPGA for AlexNet and NIN CNN models, achieving 114.5 GOPS and 117.3 GOPS, respectively. This represents a 1.9× improvement in throughput when compared to the OpenCL-based design. The results illustrate the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning.
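The flow the abstract describes, a compiler that reads CNN layer parameters and sizes modular compute primitives against the FPGA's resource budget, can be illustrated with a toy sketch. All names here (`ConvLayer`, `pick_unroll`) and the single-knob "unroll factor" resource model are hypothetical simplifications for illustration; the paper's actual compiler emits RTL modules, not Python.

```python
from dataclasses import dataclass


@dataclass
class ConvLayer:
    # Hypothetical per-layer parameters a CNN-to-RTL compiler would parse
    name: str
    out_ch: int   # number of output feature maps (filters)
    in_ch: int    # number of input feature maps
    k: int        # square kernel size
    out_dim: int  # output feature-map width/height

    def macs(self) -> int:
        # Multiply-accumulate operations for one inference of this layer
        return self.out_ch * self.in_ch * self.k * self.k * self.out_dim * self.out_dim


def pick_unroll(layers, dsp_budget, macs_per_dsp=1):
    """Toy resource allocation: pick one unroll factor (parallel MAC
    lanes) that fits the DSP budget, then estimate the cycle count."""
    unroll = max(1, dsp_budget // macs_per_dsp)
    total_macs = sum(l.macs() for l in layers)
    # Ceiling division per layer: each layer runs on the shared lanes
    cycles = sum(-(-l.macs() // unroll) for l in layers)
    return unroll, total_macs, cycles


# AlexNet's first conv layer as an example shape: 96 filters over
# 3 input channels, 11x11 kernels, 55x55 output maps (~105M MACs)
layers = [ConvLayer("conv1", 96, 3, 11, 55)]
unroll, total, cycles = pick_unroll(layers, dsp_budget=256)
print(unroll, total, cycles)
```

Sweeping `dsp_budget` in a sketch like this mirrors the quantitative trade-off the paper's compiler makes: more parallel lanes cut the cycle count (raising GOPS) until the device's DSP, BRAM, or bandwidth limit is reached.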
Author 1. Yufei Ma, yufeima@asu.edu, Sch. of Electr., Comput. & Energy Eng., Arizona State Univ., Tempe, AZ, USA
2. Naveen Suda, nsuda@asu.edu, Sch. of Electr., Comput. & Energy Eng., Arizona State Univ., Tempe, AZ, USA
3. Yu Cao, yu.cao@asu.edu, Sch. of Electr., Comput. & Energy Eng., Arizona State Univ., Tempe, AZ, USA
4. Jae-sun Seo, jaesun.seo@asu.edu, Sch. of Electr., Comput. & Energy Eng., Arizona State Univ., Tempe, AZ, USA
5. Sarma Vrudhula, vrudhula@asu.edu, Sch. of Comput., Inf., Decision Syst. Eng., Arizona State Univ., Tempe, AZ, USA
BookMark eNotkLtOwzAYRg0CiVKyI7H4BRLs2PFlrCJSkCKoaJkr2_ktBZy4ygUET8-tZzk6yzd8l-isjz0gdE1JRinRt9WmznJCRSYLKVkhTlCipcoV05oqzvNTtKCai5RypS5QMo6v5IeCS1WIBdpunQnGBsCmb3AXmzmYof2CBj_vauxid2iDmdrY4-hxGfv3GObfNAE_wjz8afqIw9uIYz9FXG3Wqyt07k0YITl6iV6qu115n9ZP64dyVadtzumUeicM5OAEWK-5JUILa3xBtfVcKPDWWWKASQXcNYQJxzwHlVNDiPdSM7ZEN_-7LQDsD0PbmeFzf7yBfQNGN1O0
ContentType Conference Proceeding
DOI 10.1109/FPL.2016.7577356
EISBN 9782839918442
2839918447
EISSN 1946-1488
EndPage 8
ExternalDocumentID 7577356
Genre orig-research
ISICitedReferencesCount 99
IsPeerReviewed false
IsScholarly false
Language English
PageCount 8
PublicationDate 2016-08
PublicationTitle International Conference on Field-programmable Logic and Applications
PublicationTitleAbbrev FPL
PublicationYear 2016
Publisher EPFL
StartPage 1
SubjectTerms Acceleration
Algorithm design and analysis
Convolution
Convolutional neural networks
Field programmable gate arrays
FPGA
Hardware
hardware acceleration
Kernel
Memory management
Title Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA
URI https://ieeexplore.ieee.org/document/7577356