A high-performance dataflow-centric optimization framework for deep learning inference on the edge


Saved in:
Detailed bibliography
Published in: Journal of systems architecture, Volume 152; p. 103180
Main authors: Zhang, Runhua, Jiang, Hongxu, Geng, Jinkun, Tian, Fangzheng, Ma, Yuhang, Wang, Haojie
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.07.2024
Subjects: Computation graph, Data locality, Dataflow-centric, Edge computing, Model inference
ISSN:1383-7621, 1873-6165
Online access: Get full text
Abstract Edge computing has emerged as a popular scenario for model inference. However, inference performance on edge devices (e.g., multi-core DSPs, FPGAs) suffers from inefficiency due to the lack of highly optimized inference frameworks. Previous model inference frameworks are mainly developed in an operator-centric way, which provides insufficient acceleration for edge-based inference. Besides, operator-centric frameworks incur significant costs for continuous development and maintenance. Targeting these drawbacks of operator-centric frameworks, we design Xenos, which automatically conducts dataflow-centric optimization of the computation graph and accelerates inference in two dimensions. Vertically, Xenos develops an operator linking technique that improves data locality by restructuring the inter-operator dataflow. Horizontally, Xenos develops a DSP-aware operator split technique that enables higher parallelism across multiple DSP units. Our evaluation demonstrates the effectiveness of the vertical and horizontal dataflow optimizations, which reduce inference time by 15.0%–84.9% and 17.9%–89.9%, respectively. Xenos also outperforms the widely used TVM by 1.1×–1.9×. Moreover, we extend Xenos to a distributed solution, d-Xenos, which employs multiple edge devices to jointly conduct the inference task and achieves a speedup of 3.68×–3.78× over a single device.
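The two optimization dimensions described in the abstract can be pictured on a toy computation graph. Below is a minimal sketch in Python/NumPy, not the Xenos implementation or its API: fused_conv_relu stands in for vertical operator linking (the intermediate tensor is consumed where it is produced rather than written back to off-chip memory), and split_conv stands in for a DSP-aware operator split (output channels divided across a hypothetical number of DSP cores). All function and parameter names are illustrative.

```python
# Toy illustration of dataflow-centric optimization (not the Xenos API).
import numpy as np

def conv1x1(x, w):
    # A 1x1 convolution over a (channels, pixels) activation is just a matmul.
    return w @ x

# Vertical dimension: operator linking / fusion.
# The intermediate tensor between Conv and ReLU is never materialized as a
# separate graph output; it is produced and consumed in one pass.
def fused_conv_relu(x, w):
    return np.maximum(conv1x1(x, w), 0.0)

# Horizontal dimension: DSP-aware operator split.
# The operator's output channels are divided into num_dsp independent slices;
# on real hardware each DSP core would compute one slice, here they run in a loop.
def split_conv(x, w, num_dsp=4):
    weight_slices = np.array_split(w, num_dsp, axis=0)
    partial_outputs = [conv1x1(x, ws) for ws in weight_slices]
    return np.concatenate(partial_outputs, axis=0)

if __name__ == "__main__":
    x = np.random.rand(64, 1024).astype(np.float32)   # 64 channels, 1024 pixels
    w = np.random.rand(128, 64).astype(np.float32)    # 128 output channels
    # Splitting is exact: stitching the per-core slices reproduces the full result.
    assert np.allclose(split_conv(x, w), conv1x1(x, w))
    print(fused_conv_relu(x, w).shape)                 # (128, 1024)
```

Because the split produces independent slices whose concatenation equals the original output, the two dimensions compose: a vertically linked operator chain can still be split horizontally across DSP cores.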
ArticleNumber 103180
Author Tian, Fangzheng
Geng, Jinkun
Wang, Haojie
Zhang, Runhua
Ma, Yuhang
Jiang, Hongxu
Author_xml – sequence: 1
  givenname: Runhua
  orcidid: 0000-0003-3487-5289
  surname: Zhang
  fullname: Zhang, Runhua
  email: rhzhang20@buaa.edu.cn
  organization: Beihang University, China
– sequence: 2
  givenname: Hongxu
  surname: Jiang
  fullname: Jiang, Hongxu
  email: jianghx@buaa.edu.cn
  organization: Beihang University, China
– sequence: 3
  givenname: Jinkun
  orcidid: 0000-0002-6574-8349
  surname: Geng
  fullname: Geng, Jinkun
  email: gjk1994@stanford.edu
  organization: Stanford University, United States of America
– sequence: 4
  givenname: Fangzheng
  surname: Tian
  fullname: Tian, Fangzheng
  email: amazingtian@buaa.edu.cn
  organization: Beihang University, China
– sequence: 5
  givenname: Yuhang
  surname: Ma
  fullname: Ma, Yuhang
  email: buaa_mayuhang@buaa.edu.cn
  organization: Beihang University, China
– sequence: 6
  givenname: Haojie
  surname: Wang
  fullname: Wang, Haojie
  email: wanghaojie@tsinghua.edu.cn
  organization: Tsinghua University, China
CitedBy_id crossref_primary_10_3390_electronics14071345
crossref_primary_10_1145_3716876
crossref_primary_10_1016_j_sysarc_2025_103508
crossref_primary_10_1088_2631_8695_adf412
Cites_doi 10.1145/3337821.3337892
10.1145/3229543.3229544
10.1145/2491956.2462176
10.1145/3322795.3331463
10.1109/TKDE.2019.2923638
10.1145/3343180.3343192
10.1016/j.jpdc.2008.09.002
10.1145/3469116.3470012
10.1109/TKDE.2013.132
10.1109/TNET.2021.3117042
10.1145/3373376.3378508
10.1145/3453483.3454083
10.1145/3322795.3331461
10.1145/3358192
10.1109/TKDE.2022.3178211
10.1145/3318216.3363312
10.1109/CVPR.2016.90
10.1109/CVPR.2019.00881
10.1145/3341301.3359630
10.14778/3229863.3236256
ContentType Journal Article
Copyright 2024 Elsevier B.V.
Copyright_xml – notice: 2024 Elsevier B.V.
DBID AAYXX
CITATION
DOI 10.1016/j.sysarc.2024.103180
DatabaseName CrossRef
DatabaseTitle CrossRef
DatabaseTitleList
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISSN 1873-6165
ExternalDocumentID 10_1016_j_sysarc_2024_103180
S1383762124001176
ID FETCH-LOGICAL-c306t-3bcb490b0968cbc474cea40dd38dc81ed8ebc6a3a212bd502f058df0cce7d1ff3
ISICitedReferencesCount 5
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001246467900001&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
ISSN 1383-7621
IngestDate Sat Nov 29 01:35:59 EST 2025
Tue Nov 18 21:45:05 EST 2025
Tue Jun 18 08:51:17 EDT 2024
IsPeerReviewed true
IsScholarly true
Keywords Computation graph
Dataflow-centric
Data locality
Model inference
Edge computing
Language English
LinkModel OpenURL
MergedId FETCHMERGED-LOGICAL-c306t-3bcb490b0968cbc474cea40dd38dc81ed8ebc6a3a212bd502f058df0cce7d1ff3
ORCID 0000-0002-6574-8349
0000-0003-3487-5289
ParticipantIDs crossref_primary_10_1016_j_sysarc_2024_103180
crossref_citationtrail_10_1016_j_sysarc_2024_103180
elsevier_sciencedirect_doi_10_1016_j_sysarc_2024_103180
PublicationCentury 2000
PublicationDate July 2024
2024-07-00
PublicationDateYYYYMMDD 2024-07-01
PublicationDate_xml – month: 07
  year: 2024
  text: July 2024
PublicationDecade 2020
PublicationTitle Journal of systems architecture
PublicationYear 2024
Publisher Elsevier B.V
Publisher_xml – name: Elsevier B.V
References Xilinx, Xilinx ZCU102
J. Geng, D. Li, S. Wang, Elasticpipe: An efficient and dynamic model-parallel solution to dnn training, in: Proceedings of the 10th Workshop on Scientific Cloud Computing, 2019, pp. 5–9.
Huang, Ma, Fan, Liu, Gong (b1) 2017
H. Wang, J. Zhai, M. Gao, Z. Ma, S. Tang, L. Zheng, Y. Li, K. Rong, Y. Chen, Z. Jia, PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections, in: 15th USENIX Symposium on Operating Systems Design and Implementation, OSDI 21, 2021, pp. 37–54.
Xiang, Kim (b50) 2019
W. Niu, J. Guan, Y. Wang, G. Agrawal, B. Ren, DNNFusion: accelerating deep neural networks execution with advanced operator fusion, in: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 2021, pp. 883–898.
Jiang, Sha, Zhang, Yang, Zhuge, Shi, Hu (b54) 2019; 18
NVIDIA, NVIDIA TensorRT
Dong, Gao, Huang, Wawrzynek, So, Keutzer (b42) 2021
J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, S. Amarasinghe, Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines, in: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, 2013.
Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto, Adam (b18) 2017
Marculescu, Stamoulis, Cai (b40) 2018
J. Geng, D. Li, S. Wang, Accelerating distributed machine learning by smart parameter server, in: Proceedings of the 3rd Asia-Pacific Workshop on Networking 2019, 2019, pp. 92–98.
M. Li, et al., Scaling distributed machine learning with the parameter server, in: OSDI’14.
Hong, Li, Liu, Yang, Tang (b43) 2022
Zhang, Lu, Hao, Li, Cheng, Li, Rupnow, Xiong, Huang, Shi (b55) 2020; 2
Hochstetler, Padidela, Chen, Yang, Fu (b2) 2018
Cheng, Li, Guo, Jiang, Geng, Bai, Wu, Xiong (b17) 2020; 32
Zhang, Jiang, Tian, Geng, Li, Ma, Zhu, Dong, Li, Wang (b56) 2023
L. Zhou, M.H. Samavatian, A. Bacha, S. Majumdar, R. Teodorescu, Adaptive parallel execution of deep neural networks on heterogeneous edge devices, in: Proceedings of the 4th ACM/IEEE Symposium on Edge Computing, 2019, pp. 195–208.
Dautov, Distefano (b5) 2021; 33
Wang, Hu, Wu (b29) 2020
Z. Jia, J. Thomas, T. Warzawski, M. Gao, M. Zaharia, A. Aiken, Optimizing DNN Computation with Relaxed Graph Substitutions, in: Proceedings of the 2nd Conference on Systems and Machine Learning, SysML ’19, 2019.
Xilinx, Xilinx U series
K. Wang, Z. Liu, Y. Lin, J. Lin, S. Han, Haq: Hardware-aware automated quantization with mixed precision, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8612–8620.
S. Laskaridis, A. Kouris, N.D. Lane, Adaptive inference through early-exit networks: Design, challenges and directions, in: Proceedings of the 5th International Workshop on Embedded and Mobile Deep Learning, 2021, pp. 1–6.
Hadidi, Cao, Ryoo, Kim (b39) 2019
L. Zheng, C. Jia, M. Sun, Z. Wu, C.H. Yu, A. Haj-Ali, Y. Wang, J. Yang, D. Zhuo, K. Sen, et al., Ansor: generating high-performance tensor programs for deep learning, in: 14th {USENIX} Symposium on Operating Systems Design and Implementation, {OSDI} 20, 2020, pp. 863–879.
H. Cui, J. Cipar, Q. Ho, J.K. Kim, S. Lee, A. Kumar, J. Wei, W. Dai, G.R. Ganger, P.B. Gibbons, et al., Exploiting bounded staleness to speed up big data analytics, in: 2014 USENIX Annual Technical Conference, USENIX ATC 14, 2014, pp. 37–48.
Geng, Li, Wang (b48) 2020
T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, et al., {TVM}: An automated {End-to-End} optimizing compiler for deep learning, in: 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018, pp. 578–594.
J. Geng, D. Li, S. Wang, Horizontal or vertical? a hybrid approach to large-scale distributed machine learning, in: Proceedings of the 10th Workshop on Scientific Cloud Computing, 2019, pp. 1–4.
Jiang, Xu, Xu, Wang, Qiao, Zhao (b20) 2022
M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., Tensorflow: A system for large-scale machine learning, in: 12th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16), 2016, pp. 265–283.
Wang, Xie (b3) 2020
Wang, Geng, Li (b13) 2022; 30
Li, Chen, You, Wang, Lin (b41) 2020
Grulich, Nawab (b6) 2018; 11
Devlin, Chang, Lee, Toutanova (b23) 2018
Geng, Li, Wang (b53) 2019
Z. Jia, O. Padon, J. Thomas, T. Warszawski, M. Zaharia, A. Aiken, TASO: optimizing deep learning computation with automatic generation of graph substitutions, in: Proceedings of the 27th ACM Symposium on Operating Systems Principles, 2019, pp. 47–62.
K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
Apache TVM, TVM-deploy
Hapfelmeier, Pfahringer, Kramer (b19) 2014; 26
Yao, Zhang, Yao, Wang, Ma, Zhang, Chu, Ji, Jia, Shen, Wu, Zhang, Tan, Kuang, Wu, Wu, Zhou, Yang (b4) 2022
Y. Cheng, D. Li, Z. Guo, B. Jiang, J. Lin, X. Fan, J. Geng, X. Yu, W. Bai, L. Qu, et al., Dlbooster: Boosting end-to-end deep learning workflows with offloading data preprocessing pipelines, in: Proceedings of the 48th International Conference on Parallel Processing, 2019, pp. 1–11.
Galjaard, Cox, Ghiassi, Chen, Birke (b38) 2021
J. Geng, D. Li, Y. Cheng, S. Wang, J. Li, HiPS: Hierarchical parameter synchronization in large-scale distributed machine learning, in: Proceedings of the 2018 Workshop on Network Meets AI & ML, 2018, pp. 1–7.
S. Zheng, Y. Liang, S. Wang, R. Chen, K. Sheng, Flextensor: An automatic schedule exploration and optimization framework for tensor computation on heterogeneous system, in: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020, pp. 859–873.
Li, Andersen, Park, Smola, Ahmed, Josifovski, Long, Shekita, Su (b25) 2014
Zhu, Wang, Xu, Cheng, Qiu, Yuan, Huang (b21) 2022
Tandon, Lei, Dimakis, Karampatziakis (b52) 2017
OpenXLA, XLA: Optimizing Compiler for Machine Learning
Patarasuk, Yuan (b11) 2009; 69
Texas Instruments, TI TMS320C6678
Radford, Narasimhan, Salimans, Sutskever (b24) 2018
Wang (10.1016/j.sysarc.2024.103180_b29) 2020
Howard (10.1016/j.sysarc.2024.103180_b18) 2017
Dong (10.1016/j.sysarc.2024.103180_b42) 2021
Wang (10.1016/j.sysarc.2024.103180_b13) 2022; 30
Geng (10.1016/j.sysarc.2024.103180_b48) 2020
Jiang (10.1016/j.sysarc.2024.103180_b54) 2019; 18
10.1016/j.sysarc.2024.103180_b22
Li (10.1016/j.sysarc.2024.103180_b41) 2020
10.1016/j.sysarc.2024.103180_b16
10.1016/j.sysarc.2024.103180_b14
10.1016/j.sysarc.2024.103180_b15
Devlin (10.1016/j.sysarc.2024.103180_b23) 2018
10.1016/j.sysarc.2024.103180_b30
10.1016/j.sysarc.2024.103180_b31
Geng (10.1016/j.sysarc.2024.103180_b53) 2019
10.1016/j.sysarc.2024.103180_b34
10.1016/j.sysarc.2024.103180_b35
10.1016/j.sysarc.2024.103180_b32
Radford (10.1016/j.sysarc.2024.103180_b24) 2018
10.1016/j.sysarc.2024.103180_b33
10.1016/j.sysarc.2024.103180_b27
10.1016/j.sysarc.2024.103180_b28
10.1016/j.sysarc.2024.103180_b26
Hapfelmeier (10.1016/j.sysarc.2024.103180_b19) 2014; 26
Tandon (10.1016/j.sysarc.2024.103180_b52) 2017
Hochstetler (10.1016/j.sysarc.2024.103180_b2) 2018
Cheng (10.1016/j.sysarc.2024.103180_b17) 2020; 32
Patarasuk (10.1016/j.sysarc.2024.103180_b11) 2009; 69
Li (10.1016/j.sysarc.2024.103180_b25) 2014
Huang (10.1016/j.sysarc.2024.103180_b1) 2017
10.1016/j.sysarc.2024.103180_b45
10.1016/j.sysarc.2024.103180_b46
Zhu (10.1016/j.sysarc.2024.103180_b21) 2022
Hadidi (10.1016/j.sysarc.2024.103180_b39) 2019
10.1016/j.sysarc.2024.103180_b44
Dautov (10.1016/j.sysarc.2024.103180_b5) 2021; 33
10.1016/j.sysarc.2024.103180_b36
10.1016/j.sysarc.2024.103180_b37
Grulich (10.1016/j.sysarc.2024.103180_b6) 2018; 11
Zhang (10.1016/j.sysarc.2024.103180_b56) 2023
Wang (10.1016/j.sysarc.2024.103180_b3) 2020
Xiang (10.1016/j.sysarc.2024.103180_b50) 2019
Zhang (10.1016/j.sysarc.2024.103180_b55) 2020; 2
10.1016/j.sysarc.2024.103180_b9
10.1016/j.sysarc.2024.103180_b51
10.1016/j.sysarc.2024.103180_b8
10.1016/j.sysarc.2024.103180_b12
10.1016/j.sysarc.2024.103180_b7
Marculescu (10.1016/j.sysarc.2024.103180_b40) 2018
10.1016/j.sysarc.2024.103180_b10
Hong (10.1016/j.sysarc.2024.103180_b43) 2022
Galjaard (10.1016/j.sysarc.2024.103180_b38) 2021
10.1016/j.sysarc.2024.103180_b49
10.1016/j.sysarc.2024.103180_b47
Jiang (10.1016/j.sysarc.2024.103180_b20) 2022
Yao (10.1016/j.sysarc.2024.103180_b4) 2022
References_xml – volume: 26
  start-page: 2072
  year: 2014
  end-page: 2076
  ident: b19
  article-title: Pruning incremental linear model trees with approximate lookahead
  publication-title: IEEE Trans. Knowl. Data Eng.
– start-page: 392
  year: 2019
  end-page: 405
  ident: b50
  article-title: Pipelined data-parallel CPU/GPU scheduling for multi-DNN real-time inference
  publication-title: 2019 IEEE Real-Time Systems Symposium
– reference: M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., Tensorflow: A system for large-scale machine learning, in: 12th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16), 2016, pp. 265–283.
– reference: OpenXLA, XLA: Optimizing Compiler for Machine Learning,
– reference: S. Zheng, Y. Liang, S. Wang, R. Chen, K. Sheng, Flextensor: An automatic schedule exploration and optimization framework for tensor computation on heterogeneous system, in: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020, pp. 859–873.
– reference: Xilinx, Xilinx ZCU102,
– start-page: 535
  year: 2023
  end-page: 545
  ident: b56
  article-title: Xenos: Dataflow-centric optimization to accelerate model inference on edge devices
  publication-title: International Conference on Database Systems for Advanced Applications
– reference: M. Li, et al., Scaling distributed machine learning with the parameter server, in: OSDI’14.
– reference: H. Cui, J. Cipar, Q. Ho, J.K. Kim, S. Lee, A. Kumar, J. Wei, W. Dai, G.R. Ganger, P.B. Gibbons, et al., Exploiting bounded staleness to speed up big data analytics, in: 2014 USENIX Annual Technical Conference, USENIX ATC 14, 2014, pp. 37–48.
– reference: T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, et al., {TVM}: An automated {End-to-End} optimizing compiler for deep learning, in: 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018, pp. 578–594.
– start-page: 50
  year: 2021
  end-page: 59
  ident: b42
  article-title: Hao: Hardware-aware neural architecture optimization for efficient inference
  publication-title: 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines
– start-page: 1
  year: 2018
  end-page: 8
  ident: b40
  article-title: Hardware-aware machine learning: Modeling and optimization
  publication-title: 2018 IEEE/ACM International Conference on Computer-Aided Design
– start-page: 1393
  year: 2020
  end-page: 1404
  ident: b48
  article-title: FELA: Incorporating flexible parallelism and elastic tuning to accelerate large-scale DML
  publication-title: 2020 IEEE 36th International Conference on Data Engineering
– volume: 33
  start-page: 55
  year: 2021
  end-page: 69
  ident: b5
  article-title: Automating IoT data-intensive application allocation in clustered edge computing
  publication-title: IEEE Trans. Knowl. Data Eng.
– start-page: 500
  year: 2020
  end-page: 518
  ident: b41
  article-title: Halo: Hardware-aware learning to optimize
  publication-title: European Conference on Computer Vision
– year: 2022
  ident: b43
  article-title: Multi-objective evolutionary optimization for hardware-aware neural network pruning
  publication-title: Fund. Res.
– start-page: 1
  year: 2022
  ident: b4
  article-title: Edge-cloud polarization and collaboration: A comprehensive survey for AI
  publication-title: IEEE Trans. Knowl. Data Eng.
– year: 2020
  ident: b29
  article-title: Kubeedge. ai: Ai platform for edge devices
– start-page: 1
  year: 2017
  end-page: 2
  ident: b1
  article-title: When deep learning meets edge computing
  publication-title: 2017 IEEE 25th International Conference on Network Protocols
– start-page: 767
  year: 2022
  end-page: 779
  ident: b20
  article-title: Fedmp: Federated learning through adaptive model pruning in heterogeneous edge computing
  publication-title: 2022 IEEE 38th International Conference on Data Engineering
– reference: J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, S. Amarasinghe, Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines, in: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, 2013.
– reference: J. Geng, D. Li, S. Wang, Elasticpipe: An efficient and dynamic model-parallel solution to dnn training, in: Proceedings of the 10th Workshop on Scientific Cloud Computing, 2019, pp. 5–9.
– reference: H. Wang, J. Zhai, M. Gao, Z. Ma, S. Tang, L. Zheng, Y. Li, K. Rong, Y. Chen, Z. Jia, PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections, in: 15th USENIX Symposium on Operating Systems Design and Implementation, OSDI 21, 2021, pp. 37–54.
– reference: Apache TVM, TVM-deploy,
– volume: 32
  start-page: 1802
  year: 2020
  end-page: 1814
  ident: b17
  article-title: Accelerating end-to-end deep learning workflow with codesign of data preprocessing and scheduling
  publication-title: IEEE Trans. Parallel Distrib. Syst.
– reference: J. Geng, D. Li, Y. Cheng, S. Wang, J. Li, HiPS: Hierarchical parameter synchronization in large-scale distributed machine learning, in: Proceedings of the 2018 Workshop on Network Meets AI & ML, 2018, pp. 1–7.
– reference: Xilinx, Xilinx U series,
– reference: J. Geng, D. Li, S. Wang, Accelerating distributed machine learning by smart parameter server, in: Proceedings of the 3rd Asia-Pacific Workshop on Networking 2019, 2019, pp. 92–98.
– reference: W. Niu, J. Guan, Y. Wang, G. Agrawal, B. Ren, DNNFusion: accelerating deep neural networks execution with advanced operator fusion, in: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 2021, pp. 883–898.
– start-page: 100
  year: 2019
  end-page: 111
  ident: b53
  article-title: Rima: an RDMA-accelerated model-parallelized solution to large-scale matrix factorization
  publication-title: 2019 IEEE 35th International Conference on Data Engineering
– volume: 30
  start-page: 572
  year: 2022
  end-page: 585
  ident: b13
  article-title: Impact of synchronization topology on DML performance: Both logical topology and physical topology
  publication-title: IEEE/ACM Trans. Netw.
– start-page: 2168
  year: 2022
  end-page: 2181
  ident: b21
  article-title: PSP: Progressive space pruning for efficient graph neural architecture search
  publication-title: 2022 IEEE 38th International Conference on Data Engineering
– reference: Z. Jia, O. Padon, J. Thomas, T. Warszawski, M. Zaharia, A. Aiken, TASO: optimizing deep learning computation with automatic generation of graph substitutions, in: Proceedings of the 27th ACM Symposium on Operating Systems Principles, 2019, pp. 47–62.
– reference: K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
– reference: NVIDIA, NVIDIA TensorRT,
– volume: 2
  start-page: 216
  year: 2020
  end-page: 229
  ident: b55
  article-title: SkyNet: a hardware-efficient method for object detection and tracking on embedded systems
  publication-title: Proc. Mach. Learn. Syst.
– reference: J. Geng, D. Li, S. Wang, Horizontal or vertical? a hybrid approach to large-scale distributed machine learning, in: Proceedings of the 10th Workshop on Scientific Cloud Computing, 2019, pp. 1–4.
– reference: K. Wang, Z. Liu, Y. Lin, J. Lin, S. Han, Haq: Hardware-aware automated quantization with mixed precision, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8612–8620.
– start-page: 3368
  year: 2017
  end-page: 3376
  ident: b52
  article-title: Gradient coding: Avoiding stragglers in distributed learning
  publication-title: International Conference on Machine Learning
– reference: L. Zheng, C. Jia, M. Sun, Z. Wu, C.H. Yu, A. Haj-Ali, Y. Wang, J. Yang, D. Zhuo, K. Sen, et al., Ansor: generating high-performance tensor programs for deep learning, in: 14th {USENIX} Symposium on Operating Systems Design and Implementation, {OSDI} 20, 2020, pp. 863–879.
– year: 2018
  ident: b24
  article-title: Improving language understanding by generative pre-training
– reference: S. Laskaridis, A. Kouris, N.D. Lane, Adaptive inference through early-exit networks: Design, challenges and directions, in: Proceedings of the 5th International Workshop on Embedded and Mobile Deep Learning, 2021, pp. 1–6.
– start-page: 1379
  year: 2020
  end-page: 1388
  ident: b3
  article-title: User preference based energy-aware mobile AR system with edge computing
  publication-title: IEEE INFOCOM 2020-IEEE Conference on Computer Communications
– reference: L. Zhou, M.H. Samavatian, A. Bacha, S. Majumdar, R. Teodorescu, Adaptive parallel execution of deep neural networks on heterogeneous edge devices, in: Proceedings of the 4th ACM/IEEE Symposium on Edge Computing, 2019, pp. 195–208.
– year: 2019
  ident: b39
  article-title: Collaborative execution of deep neural networks on internet of things devices
– start-page: 281
  year: 2021
  end-page: 286
  ident: b38
  article-title: Mema: Fast inference of multiple deep models
  publication-title: 2021 IEEE International Conference on Pervasive Computing and Communications Workshops and Other Affiliated Events
– volume: 11
  start-page: 2046
  year: 2018
  end-page: 2049
  ident: b6
  article-title: Collaborative edge and cloud neural networks for real-time video processing
  publication-title: Proc. VLDB Endow.
– reference: Z. Jia, J. Thomas, T. Warzawski, M. Gao, M. Zaharia, A. Aiken, Optimizing DNN Computation with Relaxed Graph Substitutions, in: Proceedings of the 2nd Conference on Systems and Machine Learning, SysML ’19, 2019.
– start-page: 583
  year: 2014
  end-page: 598
  ident: b25
  article-title: Scaling distributed machine learning with the parameter server
  publication-title: Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation
– volume: 18
  start-page: 1
  year: 2019
  end-page: 23
  ident: b54
  article-title: Achieving super-linear speedup across multi-fpga for real-time dnn inference
  publication-title: ACM Trans. Embedded Comput. Syst. (TECS)
– year: 2018
  ident: b23
  article-title: Bert: Pre-training of deep bidirectional transformers for language understanding
– reference: Y. Cheng, D. Li, Z. Guo, B. Jiang, J. Lin, X. Fan, J. Geng, X. Yu, W. Bai, L. Qu, et al., Dlbooster: Boosting end-to-end deep learning workflows with offloading data preprocessing pipelines, in: Proceedings of the 48th International Conference on Parallel Processing, 2019, pp. 1–11.
– start-page: 341
  year: 2018
  end-page: 343
  ident: b2
  article-title: Embedded deep learning for vehicular edge computing
  publication-title: 2018 IEEE/ACM Symposium on Edge Computing
– year: 2017
  ident: b18
  article-title: Mobilenets: Efficient convolutional neural networks for mobile vision applications
– volume: 69
  start-page: 117
  year: 2009
  end-page: 124
  ident: b11
  article-title: Bandwidth optimal all-reduce algorithms for clusters of workstations
  publication-title: J. Parallel Distrib. Comput.
– reference: Texas Instruments, TI TMS320C6678,
– start-page: 583
  year: 2014
  ident: 10.1016/j.sysarc.2024.103180_b25
  article-title: Scaling distributed machine learning with the parameter server
– ident: 10.1016/j.sysarc.2024.103180_b47
– ident: 10.1016/j.sysarc.2024.103180_b16
  doi: 10.1145/3337821.3337892
– start-page: 1
  year: 2018
  ident: 10.1016/j.sysarc.2024.103180_b40
  article-title: Hardware-aware machine learning: Modeling and optimization
– ident: 10.1016/j.sysarc.2024.103180_b30
– ident: 10.1016/j.sysarc.2024.103180_b12
  doi: 10.1145/3229543.3229544
– start-page: 341
  year: 2018
  ident: 10.1016/j.sysarc.2024.103180_b2
  article-title: Embedded deep learning for vehicular edge computing
– start-page: 1379
  year: 2020
  ident: 10.1016/j.sysarc.2024.103180_b3
  article-title: User preference based energy-aware mobile AR system with edge computing
– ident: 10.1016/j.sysarc.2024.103180_b15
– ident: 10.1016/j.sysarc.2024.103180_b34
  doi: 10.1145/2491956.2462176
– ident: 10.1016/j.sysarc.2024.103180_b49
  doi: 10.1145/3322795.3331463
– volume: 33
  start-page: 55
  issue: 1
  year: 2021
  ident: 10.1016/j.sysarc.2024.103180_b5
  article-title: Automating IoT data-intensive application allocation in clustered edge computing
  publication-title: IEEE Trans. Knowl. Data Eng.
  doi: 10.1109/TKDE.2019.2923638
– start-page: 50
  year: 2021
  ident: 10.1016/j.sysarc.2024.103180_b42
  article-title: Hao: Hardware-aware neural architecture optimization for efficient inference
– ident: 10.1016/j.sysarc.2024.103180_b51
  doi: 10.1145/3343180.3343192
– volume: 69
  start-page: 117
  issue: 2
  year: 2009
  ident: 10.1016/j.sysarc.2024.103180_b11
  article-title: Bandwidth optimal all-reduce algorithms for clusters of workstations
  publication-title: J. Parallel Distrib. Comput.
  doi: 10.1016/j.jpdc.2008.09.002
– ident: 10.1016/j.sysarc.2024.103180_b9
– start-page: 535
  year: 2023
  ident: 10.1016/j.sysarc.2024.103180_b56
  article-title: Xenos: Dataflow-centric optimization to accelerate model inference on edge devices
– start-page: 767
  year: 2022
  ident: 10.1016/j.sysarc.2024.103180_b20
  article-title: Fedmp: Federated learning through adaptive model pruning in heterogeneous edge computing
– ident: 10.1016/j.sysarc.2024.103180_b28
  doi: 10.1145/3469116.3470012
– year: 2018
  ident: 10.1016/j.sysarc.2024.103180_b24
– start-page: 3368
  year: 2017
  ident: 10.1016/j.sysarc.2024.103180_b52
  article-title: Gradient coding: Avoiding stragglers in distributed learning
– ident: 10.1016/j.sysarc.2024.103180_b31
– ident: 10.1016/j.sysarc.2024.103180_b10
– start-page: 281
  year: 2021
  ident: 10.1016/j.sysarc.2024.103180_b38
  article-title: Mema: Fast inference of multiple deep models
– ident: 10.1016/j.sysarc.2024.103180_b14
– ident: 10.1016/j.sysarc.2024.103180_b35
– year: 2017
  ident: 10.1016/j.sysarc.2024.103180_b18
– volume: 26
  start-page: 2072
  issue: 8
  year: 2014
  ident: 10.1016/j.sysarc.2024.103180_b19
  article-title: Pruning incremental linear model trees with approximate lookahead
  publication-title: IEEE Trans. Knowl. Data Eng.
  doi: 10.1109/TKDE.2013.132
– volume: 32
  start-page: 1802
  issue: 7
  year: 2020
  ident: 10.1016/j.sysarc.2024.103180_b17
  article-title: Accelerating end-to-end deep learning workflow with codesign of data preprocessing and scheduling
  publication-title: IEEE Trans. Parallel Distrib. Syst.
– volume: 30
  start-page: 572
  issue: 2
  year: 2022
  ident: 10.1016/j.sysarc.2024.103180_b13
  article-title: Impact of synchronization topology on DML performance: Both logical topology and physical topology
  publication-title: IEEE/ACM Trans. Netw.
  doi: 10.1109/TNET.2021.3117042
– ident: 10.1016/j.sysarc.2024.103180_b26
– ident: 10.1016/j.sysarc.2024.103180_b36
  doi: 10.1145/3373376.3378508
– year: 2020
  ident: 10.1016/j.sysarc.2024.103180_b29
– ident: 10.1016/j.sysarc.2024.103180_b45
– ident: 10.1016/j.sysarc.2024.103180_b32
– start-page: 500
  year: 2020
  ident: 10.1016/j.sysarc.2024.103180_b41
  article-title: Halo: Hardware-aware learning to optimize
– ident: 10.1016/j.sysarc.2024.103180_b33
  doi: 10.1145/3453483.3454083
– start-page: 1
  year: 2017
  ident: 10.1016/j.sysarc.2024.103180_b1
  article-title: When deep learning meets edge computing
– year: 2022
  ident: 10.1016/j.sysarc.2024.103180_b43
  article-title: Multi-objective evolutionary optimization for hardware-aware neural network pruning
  publication-title: Fund. Res.
– ident: 10.1016/j.sysarc.2024.103180_b46
  doi: 10.1145/3322795.3331461
– volume: 2
  start-page: 216
  year: 2020
  ident: 10.1016/j.sysarc.2024.103180_b55
  article-title: SkyNet: a hardware-efficient method for object detection and tracking on embedded systems
  publication-title: Proc. Mach. Learn. Syst.
– volume: 18
  start-page: 1
  issue: 5s
  year: 2019
  ident: 10.1016/j.sysarc.2024.103180_b54
  article-title: Achieving super-linear speedup across multi-fpga for real-time dnn inference
  publication-title: ACM Trans. Embedded Comput. Syst. (TECS)
  doi: 10.1145/3358192
– start-page: 392
  year: 2019
  ident: 10.1016/j.sysarc.2024.103180_b50
  article-title: Pipelined data-parallel CPU/GPU scheduling for multi-DNN real-time inference
– ident: 10.1016/j.sysarc.2024.103180_b7
– start-page: 1
  year: 2022
  ident: 10.1016/j.sysarc.2024.103180_b4
  article-title: Edge-cloud polarization and collaboration: A comprehensive survey for AI
  publication-title: IEEE Trans. Knowl. Data Eng.
  doi: 10.1109/TKDE.2022.3178211
– ident: 10.1016/j.sysarc.2024.103180_b37
  doi: 10.1145/3318216.3363312
– ident: 10.1016/j.sysarc.2024.103180_b27
– year: 2018
  ident: 10.1016/j.sysarc.2024.103180_b23
– year: 2019
  ident: 10.1016/j.sysarc.2024.103180_b39
– ident: 10.1016/j.sysarc.2024.103180_b22
  doi: 10.1109/CVPR.2016.90
– ident: 10.1016/j.sysarc.2024.103180_b44
  doi: 10.1109/CVPR.2019.00881
– start-page: 2168
  year: 2022
  ident: 10.1016/j.sysarc.2024.103180_b21
  article-title: PSP: Progressive space pruning for efficient graph neural architecture search
– ident: 10.1016/j.sysarc.2024.103180_b8
  doi: 10.1145/3341301.3359630
– start-page: 1393
  year: 2020
  ident: 10.1016/j.sysarc.2024.103180_b48
  article-title: FELA: Incorporating flexible parallelism and elastic tuning to accelerate large-scale DML
– start-page: 100
  year: 2019
  ident: 10.1016/j.sysarc.2024.103180_b53
  article-title: Rima: an RDMA-accelerated model-parallelized solution to large-scale matrix factorization
– volume: 11
  start-page: 2046
  issue: 12
  year: 2018
  ident: 10.1016/j.sysarc.2024.103180_b6
  article-title: Collaborative edge and cloud neural networks for real-time video processing
  publication-title: Proc. VLDB Endow.
  doi: 10.14778/3229863.3236256
SSID ssj0005512
Score 2.3741574
SourceID crossref
elsevier
SourceType Enrichment Source
Index Database
Publisher
StartPage 103180
SubjectTerms Computation graph
Data locality
Dataflow-centric
Edge computing
Model inference
Title A high-performance dataflow-centric optimization framework for deep learning inference on the edge
URI https://dx.doi.org/10.1016/j.sysarc.2024.103180
Volume 152
WOSCitedRecordID wos001246467900001&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
journalDatabaseRights – providerCode: PRVESC
  databaseName: Elsevier SD Freedom Collection Journals 2021
  customDbUrl:
  eissn: 1873-6165
  dateEnd: 99991231
  omitProxy: false
  ssIdentifier: ssj0005512
  issn: 1383-7621
  databaseCode: AIEXJ
  dateStart: 19960101
  isFulltext: true
  titleUrlDefault: https://www.sciencedirect.com
  providerName: Elsevier
link http://cvtisr.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwtV1LT9wwELa2Sw-9lD5VKFQ-9BZ5lY0dYh9XFQhQhaqKVnuL4kdg0Ta7gg1d8RP6qzuO7WzYrWg59BJFlj15zKfx2P5mBqGPgtFESiqIYbokMEMrIjMhieE2PRV45LqpQ_b9c3Z2xsdj8aXX-xViYW6nWVXx5VLM_6uqoQ2UbUNnH6HuVig0wD0oHa6gdrj-k-JHkU1BTOadiABLAy2ns5-koWJOVDQDQ_HDR2BGZeBnNZRDbcw81JK4aLhaLg-tO1SINshDK4fWJYW-ibpnExv70l_r6rJup4LTiW8-nlUXy7olAxnPE4aFct1hCrvN2iMYc3dp_JTrdywS1rJbg5GFVTEBIzy8Z4VdIltvR23xCVfhacPEu92GqwF8FHzPwD5gsOp-P6P22kzX8g8Dte0qd1JyKyV3Up6grSRLBe-jrdHJ4fh0RRhK3dl5ePsQiNmwBTff5s-OTsd5OX-Bnnsl4ZFDy0vUM9UrtB0qemBv4F8jOcLr4MHr4MFd8OAWPBgGYAseHMCDW_Bg6AjgwRY8b9C3o8PzT8fEV-EgCpaTC0KlkkzEEta6XEnFMqZMwWKtKdeKD43mRqqDghbgBEmdxkkZp1yXsVIm08OypG9Rv5pV5h3ChYwzyY1IDsqEgXShiiFNKFUyFaks2A6i4Zflyqeot5VSpvlDCttBpB01dyla_tI_C9rIvZvp3MccIPbgyN1HPuk9erbC_x7qL65rs4-eqtvF5Ob6g8fXb0wGpXs
linkProvider Elsevier
openUrl ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=A+high-performance+dataflow-centric+optimization+framework+for+deep+learning+inference+on+the+edge&rft.jtitle=Journal+of+systems+architecture&rft.au=Zhang%2C+Runhua&rft.au=Jiang%2C+Hongxu&rft.au=Geng%2C+Jinkun&rft.au=Tian%2C+Fangzheng&rft.date=2024-07-01&rft.issn=1383-7621&rft.volume=152&rft.spage=103180&rft_id=info:doi/10.1016%2Fj.sysarc.2024.103180&rft.externalDBID=n%2Fa&rft.externalDocID=10_1016_j_sysarc_2024_103180
thumbnail_l http://covers-cdn.summon.serialssolutions.com/index.aspx?isbn=/lc.gif&issn=1383-7621&client=summon
thumbnail_m http://covers-cdn.summon.serialssolutions.com/index.aspx?isbn=/mc.gif&issn=1383-7621&client=summon
thumbnail_s http://covers-cdn.summon.serialssolutions.com/index.aspx?isbn=/sc.gif&issn=1383-7621&client=summon