Enabling On-Tiny-Device Model Personalization via Gradient Condensing and Alternant Partial Update

Bibliographic Details
Published in: 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7
Main Authors: Jia, Zhenge; Shi, Yiyang; Bao, Zeyu; Wang, Zirui; Pang, Xin; Liu, Huiguo; Duan, Yu; Shen, Zhaoyan; Zhao, Mengying
Format: Conference Paper
Language: English
Published: IEEE, 22 June 2025
Online Access: Full Text
Description
Summary: On-device training enables a model to adapt to user-specific data by fine-tuning a pre-trained model locally. As embedded devices become ubiquitous, on-device training is increasingly essential, since users can benefit from a personalized model without transmitting data and model parameters to the server. Despite significant efforts toward efficient training, on-device training still faces a major challenge: the prohibitive cost of multi-layer backpropagation strains the limited resources of tiny devices. In this paper, we propose an algorithm-system co-optimization framework, TinyMP, that enables self-adaptive on-tiny-device model personalization. To mitigate backpropagation costs, we introduce Gradient Condensing to condense the gradient map structure, significantly reducing the computational complexity and memory consumption of backpropagation while preserving model performance. To further reduce computation overhead, we propose Alternant Partial Update, a mechanism that locally and alternately selects essential parameters to update without requiring retraining or offline evolutionary search. Our framework is evaluated through extensive experiments using various CNN models (e.g., MobileNetV2, MCUNet) on embedded devices with minimal resources (e.g., OpenMV-H7 with less than 1 MB SRAM and 2 MB Flash). Experimental results show that our framework achieves up to 2.4× speedup, 80.8% memory saving, and 30.3% accuracy improvement on downstream tasks, outperforming SOTA approaches.
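
The abstract does not spell out the mechanics of Alternant Partial Update, but the general idea of alternating partial parameter updates can be illustrated with a short sketch. The following hypothetical PyTorch snippet is not TinyMP's actual algorithm: the round-robin block selection, group size, and use of a randomly initialized MobileNetV2 are placeholder assumptions. It shows only how freezing all parameters and unfreezing one alternating group of blocks per personalization round restricts weight-gradient computation and optimizer state to that group.

```python
# Hypothetical illustration of alternating partial updates; NOT TinyMP's
# actual algorithm. Round-robin selection stands in for the paper's local
# selection of essential parameters.
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

# Randomly initialized so the sketch runs offline; in practice a
# pre-trained checkpoint would be loaded for personalization.
model = mobilenet_v2(weights=None)

# Treat each feature stage plus the classifier as one updatable block.
blocks = list(model.features) + [model.classifier]

def select_trainable(round_idx: int, group_size: int = 3) -> None:
    """Freeze all parameters, then unfreeze one alternating block group."""
    for p in model.parameters():
        p.requires_grad = False
    start = (round_idx * group_size) % len(blocks)
    for blk in blocks[start:start + group_size]:
        for p in blk.parameters():
            p.requires_grad = True

for round_idx in range(4):               # a few personalization rounds
    select_trainable(round_idx)
    opt = torch.optim.SGD(
        [p for p in model.parameters() if p.requires_grad], lr=1e-3
    )
    x = torch.randn(2, 3, 224, 224)      # stand-in user batch
    y = torch.randint(0, 1000, (2,))
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()                      # weight gradients only for the group
    opt.step()
```

Per the abstract, TinyMP selects essential parameters locally rather than by a fixed schedule; the sketch above only demonstrates the memory-bounding effect of updating an alternating subset of parameters.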
DOI: 10.1109/DAC63849.2025.11132925