Scale-out acceleration for machine learning
Saved in:

| Published in: | MICRO-50: the 50th Annual IEEE/ACM International Symposium on Microarchitecture: proceedings: October 14-18, 2017, Cambridge, MA, pp. 367-381 |
|---|---|
| Main authors: | Jongse Park, Hardik Sharma, Divya Mahajan, Joon Kyung Kim, Preston Olds, Hadi Esmaeilzadeh |
| Format: | Conference proceeding |
| Language: | English |
| Published: | New York, NY, USA: ACM, October 14, 2017 |
| Series: | ACM Conferences |
| Subjects: | cloud; accelerator; distributed; scale-out; machine learning |
| ISBN: | 1450349528, 9781450349529 |
| ISSN: | 2379-3155 |
| Online access: | Full text |
Abstract

The growing scale and complexity of Machine Learning (ML) algorithms have resulted in the prevalent use of distributed general-purpose systems. In a rather disjoint effort, the community is focusing mostly on high-performance single-node accelerators for learning. This work bridges these two paradigms and offers CoSMIC, a full computing stack comprising a language, a compiler, system software, a template architecture, and circuit generators, which together enable programmable acceleration of learning at scale. CoSMIC enables programmers to exploit scale-out acceleration using FPGAs and Programmable ASICs (P-ASICs) from a high-level, mathematical Domain-Specific Language (DSL). Yet CoSMIC does not require programmers to delve into the onerous task of system software development or hardware design.

CoSMIC achieves the three conflicting objectives of efficiency, automation, and programmability by integrating a novel multi-threaded template accelerator architecture and a cohesive stack that generates the hardware and software code from its high-level DSL. CoSMIC can accelerate a wide range of learning algorithms, which are most commonly trained using parallel variants of gradient descent. The key is to distribute the partial gradient calculations of the learning algorithms across the accelerator-augmented nodes of the scale-out system. Additionally, CoSMIC leverages the parallelizability of the algorithms to offer multi-threaded acceleration within each node. Multi-threading allows CoSMIC to efficiently exploit the numerous resources that are becoming available on modern FPGAs/P-ASICs by striking a balance between multi-threaded parallelism and single-threaded performance. CoSMIC takes advantage of algorithmic properties of ML to offer specialized system software that optimizes task allocation, role assignment, thread management, and inter-node communication.

We evaluate the versatility and efficiency of CoSMIC on 10 different machine learning applications from various domains. On average, a 16-node CoSMIC system with UltraScale+ FPGAs offers an 18.8× speedup over a 16-node Spark system with Xeon processors, while the programmer writes only 22-55 lines of code. CoSMIC also scales better than the state-of-the-art Spark: going from 4 to 16 nodes yields a 2.7× improvement with CoSMIC versus 1.8× with Spark. These results confirm that the full-stack approach of CoSMIC takes an effective and vital step towards enabling scale-out acceleration for machine learning.
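The parallelization strategy the abstract describes works because the gradient of a sum of per-example losses decomposes into a sum over data shards, so each node can compute its shard's contribution independently and the contributions can be summed into the exact full gradient. The sketch below is a minimal, hypothetical NumPy illustration of that pattern for a least-squares model; all function names and parameters are invented for illustration, and plain Python stands in for CoSMIC's DSL, generated accelerators, and system software.

```python
# Hypothetical sketch of the parallelization pattern from the abstract:
# each "node" computes a partial gradient on its shard of the training
# data, the shards' contributions are summed into the full gradient, and
# the model parameters are updated. All names here are illustrative; in
# CoSMIC the per-shard work runs on FPGA/P-ASIC accelerators and the
# aggregation is handled by its specialized system software.
import numpy as np

def partial_gradient(w, X_shard, y_shard):
    """Least-squares gradient contribution of one data shard: X^T (Xw - y)."""
    return X_shard.T @ (X_shard @ w - y_shard)

def distributed_gradient_descent(X, y, num_nodes=16, lr=0.5, steps=100):
    # Statically partition the training data across the scale-out nodes.
    X_shards = np.array_split(X, num_nodes)
    y_shards = np.array_split(y, num_nodes)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        # Each node's partial gradient is independent (and would run in
        # parallel on real hardware); their sum equals the full gradient.
        grads = [partial_gradient(w, Xs, ys)
                 for Xs, ys in zip(X_shards, y_shards)]
        w -= lr * sum(grads) / len(X)  # average gradient over all samples
    return w

# Toy usage: recover a planted linear model from synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(4096, 8))
w_true = rng.normal(size=8)
w_hat = distributed_gradient_descent(X, X @ w_true)
print(np.allclose(w_hat, w_true, atol=1e-3))  # True
```

Because only the summed partial gradients cross node boundaries, the per-step communication volume is proportional to the model size rather than the data size, which is what makes the scale-out distribution of gradient descent worthwhile.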
Authors

- Jongse Park, Georgia Institute of Technology (jspark@gatech.edu)
- Hardik Sharma, Georgia Institute of Technology (hsharma@gatech.edu)
- Divya Mahajan, Georgia Institute of Technology (divya_mahajan@gatech.edu)
- Joon Kyung Kim, Georgia Institute of Technology (jkkim@gatech.edu)
- Preston Olds, Georgia Institute of Technology (prestonolds@gatech.edu)
- Hadi Esmaeilzadeh, Georgia Institute of Technology and University of California (hadi@eng.ucsd.edu)
Record details

| Content type: | Conference Proceeding |
|---|---|
| Copyright: | 2017 ACM |
| DOI: | 10.1145/3123939.3123979 |
| Discipline: | Computer Science |
| Cited in Web of Science: | 23 |
| License: | Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org |
| Meeting: | MICRO-50: The 50th Annual IEEE/ACM International Symposium on Microarchitecture |
| Subject terms: | Acceleration; Accelerator; cloud; Computer systems organization → Architectures → Distributed architectures → Cloud computing; Computing methodologies → Machine learning; distributed; Field programmable gate arrays; Hardware; Hardware → Integrated circuits → Reconfigurable logic and FPGAs → Hardware accelerators; Machine learning; Machine learning algorithms; scale-out; Software algorithms; System software |
| URI: | https://ieeexplore.ieee.org/document/8686683 |

