Scale-out acceleration for machine learning

Bibliographic Details
Published in: MICRO-50 : the 50th annual IEEE/ACM International Symposium on Microarchitecture : proceedings : October 14-18, 2017, Cambridge, MA, pp. 367-381
Main Authors: Park, Jongse; Sharma, Hardik; Mahajan, Divya; Kim, Joon Kyung; Olds, Preston; Esmaeilzadeh, Hadi
Format: Conference Proceeding
Language: English
Published: New York, NY, USA: ACM, 14 October 2017
Series: ACM Conferences
Subjects:
ISBN: 1450349528, 9781450349529
ISSN: 2379-3155
Online Access: Full text
Abstract The growing scale and complexity of Machine Learning (ML) algorithms has resulted in prevalent use of distributed general-purpose systems. In a rather disjoint effort, the community is focusing mostly on high performance single-node accelerators for learning. This work bridges these two paradigms and offers CoSMIC, a full computing stack constituting language, compiler, system software, template architecture, and circuit generators, that enable programmable acceleration of learning at scale. CoSMIC enables programmers to exploit scale-out acceleration using FPGAs and Programmable ASICs (P-ASICs) from a high-level and mathematical Domain-Specific Language (DSL). Nonetheless, CoSMIC does not require programmers to delve into the onerous task of system software development or hardware design. CoSMIC achieves three conflicting objectives of efficiency, automation, and programmability, by integrating a novel multi-threaded template accelerator architecture and a cohesive stack that generates the hardware and software code from its high-level DSL. CoSMIC can accelerate a wide range of learning algorithms that are most commonly trained using parallel variants of gradient descent. The key is to distribute partial gradient calculations of the learning algorithms across the accelerator-augmented nodes of the scale-out system. Additionally, CoSMIC leverages the parallelizability of the algorithms to offer multi-threaded acceleration within each node. Multi-threading allows CoSMIC to efficiently exploit the numerous resources that are becoming available on modern FPGAs/P-ASICs by striking a balance between multi-threaded parallelism and single-threaded performance. CoSMIC takes advantage of algorithmic properties of ML to offer a specialized system software that optimizes task allocation, role-assignment, thread management, and internode communication. We evaluate the versatility and efficiency of CoSMIC for 10 different machine learning applications from various domains. On average, a 16-node CoSMIC with UltraScale+ FPGAs offers 18.8× speedup over a 16-node Spark system with Xeon processors while the programmer only writes 22--55 lines of code. CoSMIC offers higher scalability compared to the state-of-the-art Spark; scaling from 4 to 16 nodes with CoSMIC yields 2.7× improvements whereas Spark offers 1.8×. These results confirm that the full-stack approach of CoSMIC takes an effective and vital step towards enabling scale-out acceleration for machine learning.
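The scale-out mechanism the abstract describes (each node computing a partial gradient over its own data shard, with the partials aggregated before every model update) can be summarized with a minimal sketch. The following Python fragment illustrates that generic data-parallel gradient-descent pattern only; it is not CoSMIC's DSL or its generated code, and the names (partial_gradient, distributed_gd, shards) and the least-squares objective are assumptions made for this example.

# Minimal illustration (assumed example, not CoSMIC's DSL): data-parallel
# gradient descent in which every "node" computes a partial gradient over its
# own data shard and a coordinator averages the partials before each update.
import numpy as np

def partial_gradient(w, X, y):
    # Least-squares gradient for one shard; in CoSMIC this per-node work
    # would run on an accelerator-augmented node.
    return X.T @ (X @ w - y) / len(y)

def distributed_gd(shards, dim, lr=0.1, epochs=50):
    # Aggregate the per-shard partial gradients, mimicking the scale-out step.
    w = np.zeros(dim)
    for _ in range(epochs):
        grad = sum(partial_gradient(w, X, y) for X, y in shards) / len(shards)
        w -= lr * grad
    return w

# Four shards stand in for four nodes of a scale-out cluster.
rng = np.random.default_rng(0)
true_w = rng.normal(size=4)
shards = [(X, X @ true_w) for X in (rng.normal(size=(256, 4)) for _ in range(4))]
print(distributed_gd(shards, dim=4))

In the actual system, the per-node computation would be emitted by the CoSMIC compiler onto FPGAs or P-ASICs and the aggregation and internode communication handled by its system software; the sketch only shows the partial-gradient/aggregate structure the abstract relies on.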
Author Mahajan, Divya
Esmaeilzadeh, Hadi
Sharma, Hardik
Kim, Joon Kyung
Park, Jongse
Olds, Preston
Author_xml – sequence: 1
  givenname: Jongse
  surname: Park
  fullname: Park, Jongse
  email: jspark@gatech.edu
  organization: Georgia Institute of Technology
– sequence: 2
  givenname: Hardik
  surname: Sharma
  fullname: Sharma, Hardik
  email: hsharma@gatech.edu
  organization: Georgia Institute of Technology
– sequence: 3
  givenname: Divya
  surname: Mahajan
  fullname: Mahajan, Divya
  email: divya_mahajan@gatech.edu
  organization: Georgia Institute of Technology
– sequence: 4
  givenname: Joon Kyung
  surname: Kim
  fullname: Kim, Joon Kyung
  email: jkkim@gatech.edu
  organization: Georgia Institute of Technology
– sequence: 5
  givenname: Preston
  surname: Olds
  fullname: Olds, Preston
  email: prestonolds@gatech.edu
  organization: Georgia Institute of Technology
– sequence: 6
  givenname: Hadi
  surname: Esmaeilzadeh
  fullname: Esmaeilzadeh, Hadi
  email: hadi@eng.ucsd.edu
  organization: Georgia Institute of Technology and University of California
ContentType Conference Proceeding
Copyright 2017 ACM
Copyright_xml – notice: 2017 ACM
DOI 10.1145/3123939.3123979
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISBN 1450349528
9781450349529
EISSN 2379-3155
EndPage 381
ExternalDocumentID 8686683
Genre orig-research
ISBN 1450349528
9781450349529
ISICitedReferencesCount 23
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000455679300028&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
IngestDate Wed Aug 27 02:38:42 EDT 2025
Wed Jan 31 06:40:42 EST 2024
IsPeerReviewed false
IsScholarly true
Keywords cloud
accelerator
distributed
scale-out
machine learning
Language English
License Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org
LinkModel DirectLink
MeetingName MICRO-50: The 50th Annual IEEE/ACM International Symposium on Microarchitecture
PageCount 15
ParticipantIDs ieee_primary_8686683
acm_books_10_1145_3123939_3123979_brief
acm_books_10_1145_3123939_3123979
PublicationCentury 2000
PublicationDate 20171014
2017-Oct.
PublicationDateYYYYMMDD 2017-10-14
2017-10-01
PublicationDate_xml – month: 10
  year: 2017
  text: 20171014
  day: 14
PublicationDecade 2010
PublicationPlace New York, NY, USA
PublicationPlace_xml – name: New York, NY, USA
PublicationSeriesTitle ACM Conferences
PublicationTitle MICRO-50 : the 50th annual IEEE/ACM International Symposium on Microarchitecture : proceedings : October 14-18, 2017, Cambridge, MA
PublicationTitleAbbrev MICRO
PublicationYear 2017
Publisher ACM
Publisher_xml – name: ACM
SourceID ieee
acm
SourceType Publisher
StartPage 367
SubjectTerms Acceleration
Accelerator
cloud
Computer systems organization -- Architectures -- Distributed architectures -- Cloud computing
Computing methodologies -- Machine learning
distributed
Field programmable gate arrays
Hardware
Hardware -- Integrated circuits -- Reconfigurable logic and FPGAs -- Hardware accelerators
Machine learning
Machine learning algorithms
scale-out
Software algorithms
System software
Title Scale-out acceleration for machine learning
URI https://ieeexplore.ieee.org/document/8686683