Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA

Published in: International Conference on Field-Programmable Logic and Applications, pp. 1 - 8
Main authors: Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma Vrudhula
Format: Conference paper
Language: English
Published: EPFL, 01.08.2016
Subjects:
ISSN: 1946-1488
Online access: Get full text
Abstract Despite its popularity, deploying Convolutional Neural Networks (CNNs) on a portable system is still challenging due to large data volume, intensive computation and frequent memory access. Although previous FPGA acceleration schemes generated by high-level synthesis tools (i.e., HLS, OpenCL) have allowed for fast design optimization, hardware inefficiency still exists when allocating FPGA resources to maximize parallelism and throughput. A direct hardware-level design (i.e., RTL) can improve the efficiency and achieve greater acceleration. However, this requires an in-depth understanding of both the algorithm structure and the FPGA system architecture. In this work, we present a scalable solution that integrates the flexibility of high-level synthesis and the finer level optimization of an RTL implementation. The cornerstone is a compiler that analyzes the CNN structure and parameters, and automatically generates a set of modular and scalable computing primitives that can accelerate various deep learning algorithms. Integrating these modules together for end-to-end CNN implementations, this work quantitatively analyzes the compiler's design strategy to optimize the throughput of a given CNN model with the FPGA resource constraints. The proposed methodology is demonstrated on Altera Stratix-V GXA7 FPGA for AlexNet and NIN CNN models, achieving 114.5 GOPS and 117.3 GOPS, respectively. This represents a 1.9× improvement in throughput when compared to the OpenCL-based design. The results illustrate the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning.
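The throughput figures quoted in the abstract (GOPS, giga-operations per second) follow from counting the multiply-accumulate operations a CNN layer performs and dividing by execution time. A minimal sketch of such an estimate, using the well-known AlexNet conv1 shape as an illustration (the helper names `conv_ops` and `gops` are hypothetical, not from the paper):

```python
def conv_ops(h_out, w_out, c_in, c_out, k):
    """Operation count for one convolutional layer: each output pixel of each
    output channel needs k*k*c_in multiply-accumulates, counted as 2 ops each."""
    return 2 * k * k * c_in * c_out * h_out * w_out

def gops(total_ops, seconds):
    """Throughput in giga-operations per second."""
    return total_ops / seconds / 1e9

# AlexNet conv1: 227x227x3 input, 96 filters of 11x11, stride 4 -> 55x55x96 output.
ops = conv_ops(h_out=55, w_out=55, c_in=3, c_out=96, k=11)
print(ops)               # total operations for conv1
print(gops(ops, 1e-3))   # achieved GOPS if the layer completes in 1 ms
```

Summing this count over all layers of a model and dividing by the end-to-end latency yields the per-model GOPS numbers the paper reports for AlexNet and NIN.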
Author Suda, Naveen
Vrudhula, Sarma
Yufei Ma
Yu Cao
Jae-sun Seo
Author_xml – sequence: 1
  surname: Yufei Ma
  fullname: Yufei Ma
  email: yufeima@asu.edu
  organization: Sch. of Electr., Comput. & Energy Eng., Arizona State Univ., Tempe, AZ, USA
– sequence: 2
  givenname: Naveen
  surname: Suda
  fullname: Suda, Naveen
  email: nsuda@asu.edu
  organization: Sch. of Electr., Comput. & Energy Eng., Arizona State Univ., Tempe, AZ, USA
– sequence: 3
  surname: Yu Cao
  fullname: Yu Cao
  email: yu.cao@asu.edu
  organization: Sch. of Electr., Comput. & Energy Eng., Arizona State Univ., Tempe, AZ, USA
– sequence: 4
  surname: Jae-sun Seo
  fullname: Jae-sun Seo
  email: jaesun.seo@asu.edu
  organization: Sch. of Electr., Comput. & Energy Eng., Arizona State Univ., Tempe, AZ, USA
– sequence: 5
  givenname: Sarma
  surname: Vrudhula
  fullname: Vrudhula, Sarma
  email: vrudhula@asu.edu
  organization: Sch. of Comput., Inf., Decision Syst. Eng., Arizona State Univ., Tempe, AZ, USA
ContentType Conference Proceeding
DOI 10.1109/FPL.2016.7577356
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Xplore Digital Library
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 9782839918442
2839918447
EISSN 1946-1488
EndPage 8
ExternalDocumentID 7577356
Genre orig-research
ISICitedReferencesCount 100
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
PageCount 8
ParticipantIDs ieee_primary_7577356
PublicationCentury 2000
PublicationDate 2016-08
PublicationDateYYYYMMDD 2016-08-01
PublicationDate_xml – month: 08
  year: 2016
  text: 2016-08
PublicationDecade 2010
PublicationTitle International Conference on Field-programmable Logic and Applications
PublicationTitleAbbrev FPL
PublicationYear 2016
Publisher EPFL
Publisher_xml – name: EPFL
SourceID ieee
SourceType Publisher
StartPage 1
SubjectTerms Acceleration
Algorithm design and analysis
Convolution
Convolutional neural networks
Field programmable gate arrays
FPGA
Hardware
hardware acceleration
Kernel
Memory management
Title Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA
URI https://ieeexplore.ieee.org/document/7577356