A high-performance dataflow-centric optimization framework for deep learning inference on the edge

Bibliographic Details
Published in: Journal of Systems Architecture, Vol. 152, p. 103180
Authors: Zhang, Runhua; Jiang, Hongxu; Geng, Jinkun; Tian, Fangzheng; Ma, Yuhang; Wang, Haojie
Format: Journal Article
Language: English
Published: Elsevier B.V., 1 July 2024
ISSN: 1383-7621, 1873-6165
Online access: Full text
Description
Summary: Edge computing has emerged as a popular scenario for model inference. However, inference performance on edge devices (e.g., multi-core DSPs, FPGAs) suffers from inefficiency due to the lack of highly optimized inference frameworks. Previous model inference frameworks are mainly developed in an operator-centric way, which provides insufficient acceleration for edge-based inference; the operator-centric approach also incurs significant costs for continuous development and maintenance. Targeting these drawbacks of operator-centric frameworks, we design Xenos, which automatically conducts dataflow-centric optimization of the computation graph and accelerates inference in two dimensions. Vertically, Xenos develops an operator-linking technique that improves data locality by restructuring the inter-operator dataflow. Horizontally, Xenos develops a DSP-aware operator-split technique that enables higher parallelism across multiple DSP units. Our evaluation demonstrates the effectiveness of the vertical and horizontal dataflow optimizations, which reduce inference time by 15.0%–84.9% and 17.9%–89.9%, respectively. Xenos also outperforms the widely used TVM by 1.1×–1.9×. Moreover, we extend Xenos to a distributed solution, d-Xenos, which employs multiple edge devices to jointly conduct the inference task and achieves a speedup of 3.68×–3.78× over a single device.
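
To make the two optimizations named in the abstract concrete, below is a minimal Python/NumPy sketch of the general ideas, not code from the paper: vertical operator linking fuses a chain of element-wise operators so intermediates stay local, and horizontal operator split partitions an input across compute units. The names Op, link_operators, and split_operator are hypothetical, and the "units" are simulated sequentially rather than running on real DSP cores.

    # Toy sketch (not Xenos code) of the two dataflow optimizations
    # described in the abstract, applied to a linear computation graph.
    import numpy as np

    class Op:
        """A graph node: applies `fn` element-wise to its input."""
        def __init__(self, name, fn):
            self.name, self.fn = name, fn

        def run(self, x):
            return self.fn(x)

    def link_operators(ops):
        """Vertical optimization: fuse a chain of element-wise operators
        into one operator, so intermediate tensors stay in local memory
        instead of being written back between stages (data locality)."""
        def fused(x):
            for op in ops:
                x = op.fn(x)
            return x
        return Op("+".join(o.name for o in ops), fused)

    def split_operator(op, x, num_units):
        """Horizontal optimization: partition the input along its first
        axis, run each slice on a separate unit (here: sequentially,
        standing in for parallel DSPs), and concatenate the results."""
        chunks = np.array_split(x, num_units, axis=0)
        return np.concatenate([op.run(c) for c in chunks], axis=0)

    ops = [Op("relu", lambda t: np.maximum(t, 0)),
           Op("scale", lambda t: t * 0.5),
           Op("bias", lambda t: t + 1.0)]

    x = np.random.randn(8, 4).astype(np.float32)
    fused = link_operators(ops)                # vertical: one pass over x
    y = split_operator(fused, x, num_units=4)  # horizontal: 4-way split

    # The optimized pipeline matches the unfused, unsplit reference.
    ref = x
    for op in ops:
        ref = op.run(ref)
    assert np.allclose(y, ref)

Both transformations preserve the graph's semantics (the final assertion checks this); the paper's contribution lies in applying them automatically and in a DSP-aware manner, which this sketch does not attempt.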
DOI: 10.1016/j.sysarc.2024.103180