ReTransformer: ReRAM-based Processing-in-Memory Architecture for Transformer Acceleration
| Published in: | Digest of technical papers - IEEE/ACM International Conference on Computer-Aided Design, pp. 1 - 9 |
|---|---|
| Main authors: | Yang, Xiaoxuan; Yan, Bonan; Li, Hai; Chen, Yiran |
| Format: | Conference Proceeding |
| Language: | English |
| Published: | Association for Computing Machinery, 02.11.2020 |
| ISSN: | 1558-2434 |
| Online access: | Full text |
| Abstract | Transformer has emerged as a popular deep neural network (DNN) model for Natural Language Processing (NLP) applications and has demonstrated excellent performance in neural machine translation, entity recognition, etc. However, the scaled dot-product attention mechanism in its auto-regressive decoder creates a performance bottleneck during inference. Transformer is also computationally and memory intensive and demands a hardware acceleration solution. Although researchers have successfully applied ReRAM-based Processing-in-Memory (PIM) to accelerate convolutional neural networks (CNNs) and recurrent neural networks (RNNs), the unique computation process of the scaled dot-product attention in Transformer makes it difficult to directly apply these designs. Moreover, how to handle intermediate results in matrix-matrix multiplication (MatMul) and how to design a pipeline at a finer granularity of Transformer remain unsolved. In this work, we propose ReTransformer, a ReRAM-based PIM architecture for Transformer acceleration. ReTransformer not only accelerates the scaled dot-product attention of Transformer using ReRAM-based PIM but also eliminates some data dependencies by avoiding writing back the intermediate results, using the proposed matrix decomposition technique. We further propose a new sub-matrix pipeline design for multi-head self-attention. Experimental results show that, compared to GPU and PipeLayer, ReTransformer improves computing efficiency by 23.21× and 3.25×, respectively. The corresponding overall power is reduced by 1086× and 2.82×, respectively. |
|---|---|
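For context on the bottleneck the abstract describes: scaled dot-product attention computes softmax(QK^T / sqrt(d_k))V, and the (seq_len × seq_len) score matrix QK^T is the intermediate MatMul result that a straightforward ReRAM PIM mapping would have to write back into crossbars before the second multiplication. The sketch below is a minimal NumPy illustration of this standard computation (following the original Transformer formulation), not the paper's ReRAM mapping or its matrix decomposition technique; all shapes, names, and sizes here are illustrative assumptions.

```python
# Minimal sketch of standard scaled dot-product attention, for one head.
# Illustrative only; not ReTransformer's ReRAM implementation.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) matrices for a single attention head."""
    d_k = Q.shape[-1]
    # Intermediate MatMul result: a (seq_len x seq_len) score matrix.
    # In a naive ReRAM PIM mapping this result would need to be written
    # into crossbars before the second MatMul -- the data dependency
    # that the paper's matrix decomposition technique aims to avoid.
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable row-wise softmax over the scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # second MatMul

# Toy usage: sequence length 4, head dimension 8 (assumed sizes).
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```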
| Authors | Yang, Xiaoxuan (xy92@duke.edu); Yan, Bonan (bonan.yan@duke.edu); Li, Hai (hai.li@duke.edu); Chen, Yiran (yiran.chen@duke.edu); all Duke University, Durham, NC, USA |
| ContentType | Conference Proceeding |
| DOI | 10.1145/3400302.3415640 |
| Discipline | Engineering |
| EISBN | 9781665423243 1665423242 |
| EISSN | 1558-2434 |
| EndPage | 9 |
| ExternalDocumentID | 9256523 |
| Genre | orig-research |
| Funding | ARO grant W911NF-19-2-0107; NSF grants 1955246, 1910299, and 1725456 |
| ISICitedReferencesCount | 92 |
| Language | English |
| PageCount | 9 |
| PublicationDate | 2020-11-02 |
| PublicationTitle | Digest of technical papers - IEEE/ACM International Conference on Computer-Aided Design |
| PublicationTitleAbbrev | ICCAD |
| PublicationYear | 2020 |
| Publisher | Association for Computing Machinery |
| StartPage | 1 |
| SubjectTerms | Acceleration; autoregressive decoder; Computational modeling; Computer architecture; convolutional neural networks; Decoding; deep neural network model; hardware acceleration solution; learning (artificial intelligence); mathematics computing; matrix decomposition; matrix multiplication; matrix-matrix multiplication; memory architecture; multi-threading; natural language processing; neural language processing applications; neural machine translation; performance evaluation; Pipelines; processing-in-memory; recurrent neural nets; recurrent neural networks; ReRAM; ReRAM-based PIM architecture; ReRAM-based processing-in-memory architecture; ReTransformer; scaled dot-product attention mechanism; submatrix pipeline design; Transformer; Virtual machine monitors |
| Title | ReTransformer: ReRAM-based Processing-in-Memory Architecture for Transformer Acceleration |
| URI | https://ieeexplore.ieee.org/document/9256523 |