A high-performance dataflow-centric optimization framework for deep learning inference on the edge
| Published in: | Journal of systems architecture, Vol. 152, p. 103180 |
|---|---|
| Main authors: | Zhang, Runhua; Jiang, Hongxu; Geng, Jinkun; Tian, Fangzheng; Ma, Yuhang; Wang, Haojie |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier B.V., 01.07.2024 |
| Subjects: | Computation graph; Dataflow-centric; Data locality; Model inference; Edge computing |
| ISSN: | 1383-7621, 1873-6165 |
| Online access: | Get full text |
| Abstract | Edge computing has been emerging as a popular scenario for model inference. However, the inference performance on edge devices (e.g., multi-core DSP, FPGA, etc.) suffers from inefficiency due to the lack of highly optimized inference frameworks. Previous model inference frameworks are mainly developed in an operator-centric way, which provides insufficient acceleration for edge-based inference. Besides, the operator-centric framework incurs significant costs for continuous development and maintenance.
Targeting the existing drawbacks of operator-centric frameworks, we design Xenos, which can automatically conduct dataflow-centric optimization of the computation graph and accelerate inference in two dimensions. Vertically, Xenos develops an operator-linking technique to improve data locality by restructuring the inter-operator dataflow. Horizontally, Xenos develops a DSP-aware operator-split technique to enable higher parallelism across multiple DSP units. Our evaluation demonstrates the effectiveness of vertical and horizontal dataflow optimization, which reduce the inference time by 15.0%–84.9% and 17.9%–89.9%, respectively. Besides, Xenos also outperforms the widely used TVM by 1.1×–1.9×. Moreover, we extend Xenos to a distributed solution, which we call d-Xenos. d-Xenos employs multiple edge devices to jointly conduct the inference task and achieves a speedup of 3.68×–3.78× compared with the single device. |
|---|---|
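To make the abstract's two optimization dimensions concrete, the sketch below is a hypothetical illustration only, not the authors' Xenos implementation: `link_elementwise` mimics vertical operator linking by fusing a chain of elementwise operators so intermediate results stay local, and `split_matmul` mimics the horizontal DSP-aware operator split by partitioning a matrix multiplication across a configurable number of simulated DSP units. All function and parameter names are invented for illustration.

```python
import numpy as np

# Vertical dimension (operator linking, conceptual): fuse adjacent elementwise
# operators so data flows through all of them in one pass instead of writing
# every intermediate tensor back to shared memory.
def link_elementwise(stages):
    def fused(x):
        for _, fn in stages:
            x = fn(x)
        return x
    return "linked[" + "+".join(name for name, _ in stages) + "]", fused

# Horizontal dimension (operator split, conceptual): partition a matmul by output
# rows so each (simulated) DSP unit computes an independent slice.
def split_matmul(a, b, num_units):
    row_chunks = np.array_split(a, num_units, axis=0)   # one slice per unit
    partials = [chunk @ b for chunk in row_chunks]      # would run in parallel on real DSP units
    return np.concatenate(partials, axis=0)             # gather the partial outputs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((64, 128)).astype(np.float32)
    w = rng.standard_normal((128, 256)).astype(np.float32)
    bias = rng.standard_normal(256).astype(np.float32)

    # Matmul split across 4 hypothetical DSP units, followed by a linked bias-add + ReLU chain.
    stages = [("bias_add", lambda t: t + bias), ("relu", lambda t: np.maximum(t, 0.0))]
    name, fused = link_elementwise(stages)
    y = fused(split_matmul(x, w, num_units=4))

    # Sanity check against the unoptimized dataflow.
    ref = np.maximum(x @ w + bias, 0.0)
    assert np.allclose(y, ref, atol=1e-4)
    print(name, y.shape)
```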
| ArticleNumber | 103180 |
| Author | Tian, Fangzheng; Geng, Jinkun; Wang, Haojie; Zhang, Runhua; Ma, Yuhang; Jiang, Hongxu |
| Author_xml | 1. Zhang, Runhua (ORCID 0000-0003-3487-5289), rhzhang20@buaa.edu.cn, Beihang University, China; 2. Jiang, Hongxu, jianghx@buaa.edu.cn, Beihang University, China; 3. Geng, Jinkun (ORCID 0000-0002-6574-8349), gjk1994@stanford.edu, Stanford University, United States of America; 4. Tian, Fangzheng, amazingtian@buaa.edu.cn, Beihang University, China; 5. Ma, Yuhang, buaa_mayuhang@buaa.edu.cn, Beihang University, China; 6. Wang, Haojie, wanghaojie@tsinghua.edu.cn, Tsinghua University, China |
| CitedBy_id | crossref_primary_10_3390_electronics14071345 crossref_primary_10_1145_3716876 crossref_primary_10_1016_j_sysarc_2025_103508 crossref_primary_10_1088_2631_8695_adf412 |
| ContentType | Journal Article |
| Copyright | 2024 Elsevier B.V. |
| DOI | 10.1016/j.sysarc.2024.103180 |
| DatabaseName | CrossRef |
| DatabaseTitle | CrossRef |
| Discipline | Computer Science |
| EISSN | 1873-6165 |
| ExternalDocumentID | 10_1016_j_sysarc_2024_103180 S1383762124001176 |
| ISICitedReferencesCount | 5 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001246467900001&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| ISSN | 1383-7621 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Keywords | Computation graph; Dataflow-centric; Data locality; Model inference; Edge computing |
| Language | English |
| LinkModel | OpenURL |
| ORCID | 0000-0002-6574-8349 0000-0003-3487-5289 |
| PublicationCentury | 2000 |
| PublicationDate | July 2024 |
| PublicationDateYYYYMMDD | 2024-07-01 |
| PublicationDecade | 2020 |
| PublicationTitle | Journal of systems architecture |
| PublicationYear | 2024 |
| Publisher | Elsevier B.V |
| SourceID | crossref elsevier |
| SourceType | Enrichment Source Index Database Publisher |
| StartPage | 103180 |
| SubjectTerms | Computation graph; Data locality; Dataflow-centric; Edge computing; Model inference |
| Title | A high-performance dataflow-centric optimization framework for deep learning inference on the edge |
| URI | https://dx.doi.org/10.1016/j.sysarc.2024.103180 |
| Volume | 152 |
| WOSCitedRecordID | wos001246467900001&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| linkProvider | Elsevier |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=A+high-performance+dataflow-centric+optimization+framework+for+deep+learning+inference+on+the+edge&rft.jtitle=Journal+of+systems+architecture&rft.au=Zhang%2C+Runhua&rft.au=Jiang%2C+Hongxu&rft.au=Geng%2C+Jinkun&rft.au=Tian%2C+Fangzheng&rft.date=2024-07-01&rft.issn=1383-7621&rft.volume=152&rft.spage=103180&rft_id=info:doi/10.1016%2Fj.sysarc.2024.103180&rft.externalDBID=n%2Fa&rft.externalDocID=10_1016_j_sysarc_2024_103180 |