Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA
| Published in: | International Conference on Field-programmable Logic and Applications, pp. 1-8 |
|---|---|
| Main authors: | Yufei Ma; Naveen Suda; Yu Cao; Jae-sun Seo; Sarma Vrudhula |
| Format: | Conference proceeding |
| Language: | English |
| Published: | EPFL, 01.08.2016 |
| ISSN: | 1946-1488 |
| Online access: | Full text |
| Abstract | Despite its popularity, deploying Convolutional Neural Networks (CNNs) on a portable system is still challenging due to large data volume, intensive computation and frequent memory access. Although previous FPGA acceleration schemes generated by high-level synthesis tools (e.g., HLS, OpenCL) have allowed for fast design optimization, hardware inefficiency still exists when allocating FPGA resources to maximize parallelism and throughput. A direct hardware-level design (i.e., RTL) can improve the efficiency and achieve greater acceleration. However, this requires an in-depth understanding of both the algorithm structure and the FPGA system architecture. In this work, we present a scalable solution that integrates the flexibility of high-level synthesis and the finer level optimization of an RTL implementation. The cornerstone is a compiler that analyzes the CNN structure and parameters, and automatically generates a set of modular and scalable computing primitives that can accelerate various deep learning algorithms. Integrating these modules together for end-to-end CNN implementations, this work quantitatively analyzes the compiler's design strategy to optimize the throughput of a given CNN model under the FPGA resource constraints. The proposed methodology is demonstrated on Altera Stratix-V GXA7 FPGA for AlexNet and NIN CNN models, achieving 114.5 GOPS and 117.3 GOPS, respectively. This represents a 1.9× improvement in throughput when compared to the OpenCL-based design. The results illustrate the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning. |
|---|---|
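The abstract describes a compiler that picks a parallelization strategy for a given CNN under a fixed FPGA resource budget. As a rough illustration of that kind of resource-constrained search (this is a hypothetical sketch, not the paper's actual compiler; the layer shapes, the `Pm`/`Pn` unroll factors, and the cycle model are all assumptions), one can exhaustively search loop-unroll factors that minimize estimated convolution cycles subject to a multiplier budget:

```python
import math

# Hypothetical layer table: (name, output maps M, input maps N,
# kernel size K, output feature-map size R x R). Shapes loosely
# AlexNet-like, chosen only for illustration.
LAYERS = [("conv1", 96, 3, 11, 55), ("conv2", 256, 48, 5, 27)]

def cycles(Pm, Pn):
    """Estimated total cycles if Pm output maps and Pn input maps
    are computed in parallel: each layer needs ceil(M/Pm)*ceil(N/Pn)
    passes of K*K*R*R MAC cycles (a simplified cost model)."""
    total = 0
    for _, M, N, K, R in LAYERS:
        total += math.ceil(M / Pm) * math.ceil(N / Pn) * K * K * R * R
    return total

def best_unroll(mult_budget):
    """Brute-force search over (Pm, Pn) with Pm*Pn <= mult_budget,
    returning the pair that minimizes estimated cycles."""
    best = None
    for Pm in range(1, mult_budget + 1):
        for Pn in range(1, mult_budget // Pm + 1):
            c = cycles(Pm, Pn)
            if best is None or c < best[0]:
                best = (c, Pm, Pn)
    return best

cyc, Pm, Pn = best_unroll(256)  # e.g., a budget of 256 hardware multipliers
print("Pm =", Pm, "Pn =", Pn, "estimated cycles =", cyc)
```

The real compiler in the paper works at RTL with a far more detailed model (memory bandwidth, DSP/BRAM usage per primitive), but the shape of the problem is the same: maximize throughput of the whole network, not one layer, under shared resource limits.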
| Authors | Yufei Ma (yufeima@asu.edu), Naveen Suda (nsuda@asu.edu), Yu Cao (yu.cao@asu.edu), and Jae-sun Seo (jaesun.seo@asu.edu), School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ, USA; Sarma Vrudhula (vrudhula@asu.edu), School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ, USA |
| DOI | 10.1109/FPL.2016.7577356 |
| EISBN | 9782839918442; 2839918447 |
| EISSN | 1946-1488 |
| ISICitedReferencesCount | 99 |
| Subject terms | Acceleration; Algorithm design and analysis; Convolution; Convolutional neural networks; Field programmable gate arrays; FPGA; Hardware; Hardware acceleration; Kernel; Memory management |
| URI | https://ieeexplore.ieee.org/document/7577356 |