Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks

Detailed bibliography
Published in: 2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 159-172
Main authors: Boroumand, Amirali; Ghose, Saugata; Akin, Berkin; Narayanaswami, Ravi; Oliveira, Geraldo F.; Ma, Xiaoyu; Shiu, Eric; Mutlu, Onur
Format: Conference paper
Language: English
Published: IEEE, 1 September 2021
Subjects: Analytical models; Artificial neural networks; Computational modeling; Image edge detection; Machine learning; Throughput; Visual analytics
Online access: https://ieeexplore.ieee.org/document/9563028
Abstract: Emerging edge computing platforms often contain machine learning (ML) accelerators that can accelerate inference for a wide range of neural network (NN) models. These models are designed to fit within the limited area and energy constraints of the edge computing platforms, each targeting various applications (e.g., face detection, speech recognition, translation, image captioning, video analytics). To understand how edge ML accelerators perform, we characterize the performance of a commercial Google Edge TPU, using 24 Google edge NN models (which span a wide range of NN model types) and analyzing each NN layer within each model. We find that the Edge TPU suffers from three major shortcomings: (1) it operates significantly below peak computational throughput, (2) it operates significantly below its theoretical energy efficiency, and (3) its memory system is a large energy and performance bottleneck. Our characterization reveals that the one-size-fits-all, monolithic design of the Edge TPU ignores the high degree of heterogeneity both across different NN models and across different NN layers within the same NN model, leading to the shortcomings we observe. We propose a new acceleration framework called Mensa. Mensa incorporates multiple heterogeneous edge ML accelerators (including both on-chip and near-data accelerators), each of which caters to the characteristics of a particular subset of NN models and layers. During NN inference, for each NN layer, Mensa decides which accelerator to schedule the layer on, taking into account both the optimality of each accelerator for the layer and layer-to-layer communication costs. Our comprehensive analysis of the Google edge NN models shows that all of the layers naturally group into a small number of clusters, which allows us to design an efficient implementation of Mensa for these models with only three specialized accelerators. Averaged across all 24 Google edge NN models, Mensa improves energy efficiency and throughput by 3.0x and 3.1x over the Edge TPU, and by 2.4x and 4.3x over Eyeriss v2, a state-of-the-art accelerator.
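To make the scheduling idea in the abstract concrete, the sketch below shows, in Python, a greedy simplification of Mensa's per-layer decision: each layer is assigned to the accelerator that minimizes its estimated execution cost plus the cost of moving activations from the previous layer's accelerator. This is illustrative only and not the paper's implementation; the Accelerator class, the cost models, and all constants are hypothetical stand-ins.

```python
# Hypothetical sketch of a Mensa-style per-layer scheduler. Not the paper's
# implementation: all names, cost models, and constants are invented.
from dataclasses import dataclass

@dataclass
class Accelerator:
    name: str
    peak_tops: float       # assumed peak throughput, tera-ops/s
    joules_per_op: float   # assumed per-operation energy

def layer_cost(layer: dict, acc: Accelerator) -> float:
    """Assumed cost model: latency at the layer's utilization on this
    accelerator, plus an arbitrarily weighted energy term."""
    util = layer["utilization"].get(acc.name, 0.1)   # default: poor fit
    latency = layer["ops"] / (acc.peak_tops * 1e12 * util)
    energy = layer["ops"] * acc.joules_per_op
    return latency + 0.5 * energy                    # 0.5: arbitrary weight

def transfer_cost(layer: dict, src, dst: Accelerator) -> float:
    """Assumed communication model: moving activations between two different
    accelerators costs time proportional to their size."""
    if src is None or src is dst:
        return 0.0
    return layer["activation_bytes"] / 10e9          # assume ~10 GB/s link

def schedule(layers: list, accelerators: list) -> list:
    """Greedy layer-to-accelerator assignment, as the abstract describes:
    pick the accelerator minimizing layer cost plus transfer cost."""
    plan, prev = [], None
    for layer in layers:
        best = min(accelerators,
                   key=lambda a: layer_cost(layer, a) + transfer_cost(layer, prev, a))
        plan.append((layer["name"], best.name))
        prev = best
    return plan

# Toy usage with two made-up accelerators and one made-up layer.
accs = [Accelerator("compute-heavy", 4.0, 1.0e-12),
        Accelerator("low-power", 0.5, 2.0e-13)]
layers = [{"name": "conv1", "ops": 1.0e9, "activation_bytes": 4.0e6,
           "utilization": {"compute-heavy": 0.8, "low-power": 0.3}}]
print(schedule(layers, accs))  # e.g. [('conv1', 'compute-heavy')]
```

Per the abstract, the actual framework additionally groups layers into a small number of clusters and provisions one specialized accelerator per cluster (three in the paper's design); the greedy pass above stands in only for the runtime scheduling decision.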
Authors and affiliations:
1. Boroumand, Amirali (Carnegie Mellon University)
2. Ghose, Saugata (University of Illinois Urbana-Champaign)
3. Akin, Berkin (Google)
4. Narayanaswami, Ravi (Google)
5. Oliveira, Geraldo F. (ETH Zürich)
6. Ma, Xiaoyu (Google)
7. Shiu, Eric (Google)
8. Mutlu, Onur (ETH Zürich)
CODEN: IEEPAD
DOI: 10.1109/PACT52795.2021.00019
EISBN: 9781665442787; 1665442786
Pages: 159-172 (14 pages)
IEEE Xplore document ID: 9563028
Genre: Original research
Funding: Semiconductor Research Corporation (funder ID: 10.13039/100000028)
Cited by (Web of Science): 56