Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA
| Published in: | International Conference on Field-programmable Logic and Applications, pp. 1-8 |
|---|---|
| Main authors: | Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma Vrudhula |
| Format: | Conference Proceeding |
| Language: | English |
| Published: | EPFL, 01.08.2016 |
| Subjects: | FPGA; hardware acceleration; convolutional neural networks; field programmable gate arrays; acceleration; algorithm design and analysis; convolution; kernel; memory management |
| ISSN: | 1946-1488 |
| Online access: | Get full text |
| Abstract | Despite its popularity, deploying Convolutional Neural Networks (CNNs) on a portable system is still challenging due to large data volume, intensive computation and frequent memory access. Although previous FPGA acceleration schemes generated by high-level synthesis tools (e.g., HLS, OpenCL) have allowed for fast design optimization, hardware inefficiency still exists when allocating FPGA resources to maximize parallelism and throughput. A direct hardware-level design (i.e., RTL) can improve the efficiency and achieve greater acceleration. However, this requires an in-depth understanding of both the algorithm structure and the FPGA system architecture. In this work, we present a scalable solution that integrates the flexibility of high-level synthesis and the finer-level optimization of an RTL implementation. The cornerstone is a compiler that analyzes the CNN structure and parameters, and automatically generates a set of modular and scalable computing primitives that can accelerate various deep learning algorithms. Integrating these modules together for end-to-end CNN implementations, this work quantitatively analyzes the compiler's design strategy to optimize the throughput of a given CNN model with the FPGA resource constraints. The proposed methodology is demonstrated on an Altera Stratix-V GXA7 FPGA for the AlexNet and NIN CNN models, achieving 114.5 GOPS and 117.3 GOPS, respectively. This represents a 1.9× improvement in throughput when compared to the OpenCL-based design. The results illustrate the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning. |
|---|---|
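The abstract describes a compiler that allocates FPGA resources to maximize parallelism and throughput under resource constraints. The following is a hypothetical sketch of that kind of design-space trade-off, not the paper's actual algorithm: choosing loop-unroll (parallelism) factors for a single convolution layer so that the parallel multiply-accumulate (MAC) units fit a DSP budget. The layer dimensions, function names, and the DSP budget are all illustrative assumptions.

```python
# Hypothetical sketch (not the paper's compiler): pick loop-unroll factors
# for one convolution layer so the parallel MAC units fit a DSP budget.

def conv_macs(out_h, out_w, out_ch, in_ch, k):
    """Total multiply-accumulate operations in one convolution layer."""
    return out_h * out_w * out_ch * in_ch * k * k

def pick_unroll(dsp_budget, out_ch, in_ch):
    """Largest (output-channel, input-channel) unroll pair that divides the
    channel counts evenly and whose MAC-unit count fits the DSP budget."""
    best = (1, 1)
    for po in range(1, out_ch + 1):
        if out_ch % po:
            continue
        for pi in range(1, in_ch + 1):
            if in_ch % pi or po * pi > dsp_budget:
                continue
            if po * pi > best[0] * best[1]:
                best = (po, pi)
    return best

# An AlexNet-conv1-like layer: 55x55 outputs, 96 output / 3 input channels,
# 11x11 kernel, and an assumed budget of 256 DSP multipliers.
macs = conv_macs(55, 55, 96, 3, 11)            # 105,415,200 MACs
po, pi = pick_unroll(256, out_ch=96, in_ch=3)  # (48, 3): 144 parallel MACs
cycles = macs // (po * pi)                     # ideal cycles at po*pi MACs/cycle
```

For scale: at a 100 MHz clock, 144 parallel MACs give an ideal 2 × 144 × 10⁸ ≈ 28.8 GOP/s for this one layer; the 114.5 GOPS the paper reports for AlexNet comes from the full end-to-end design, not from a toy model like this.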
| Authors: | Yufei Ma (yufeima@asu.edu), Naveen Suda (nsuda@asu.edu), Yu Cao (yu.cao@asu.edu), Jae-sun Seo (jaesun.seo@asu.edu) — School of Electrical, Computer & Energy Engineering, Arizona State University, Tempe, AZ, USA; Sarma Vrudhula (vrudhula@asu.edu) — School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ, USA |
|---|---|
| DOI: | 10.1109/FPL.2016.7577356 |
| EISBN: | 9782839918442, 2839918447 |
| EISSN: | 1946-1488 |
| Pages: | 1-8 |
| Published: | 2016-08, EPFL |
| Publication title: | International Conference on Field-programmable Logic and Applications (FPL) |
| Genre: | Original research |
| Subjects: | Acceleration; Algorithm design and analysis; Convolution; Convolutional neural networks; Field programmable gate arrays; FPGA; Hardware; Hardware acceleration; Kernel; Memory management |
| Online access: | https://ieeexplore.ieee.org/document/7577356 |