Enabling On-Tiny-Device Model Personalization via Gradient Condensing and Alternant Partial Update

Bibliographic Details
Published in: 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1–7
Main Authors: Jia, Zhenge, Shi, Yiyang, Bao, Zeyu, Wang, Zirui, Pang, Xin, Liu, Huiguo, Duan, Yu, Shen, Zhaoyan, Zhao, Mengying
Format: Conference Proceeding
Language: English
Published: IEEE, 22.06.2025
Description
Summary: On-device training enables a model to adapt to user-specific data by fine-tuning a pre-trained model locally. As embedded devices become ubiquitous, on-device training is increasingly essential, since users can benefit from a personalized model without transmitting data and model parameters to the server. Despite significant efforts toward efficient training, on-device training still faces a major challenge: the prohibitive cost of multi-layer backpropagation strains the limited resources of tiny devices. In this paper, we propose TinyMP, an algorithm-system co-optimization framework that enables self-adaptive on-tiny-device model personalization. To mitigate backpropagation costs, we introduce Gradient Condensing, which condenses the gradient map structure, significantly reducing the computational complexity and memory consumption of backpropagation while preserving model performance. To further reduce computation overhead, we propose Alternant Partial Update, a mechanism that locally and alternately selects essential parameters to update, without requiring retraining or an offline evolutionary search. Our framework is evaluated through extensive experiments using various CNN models (e.g., MobileNetV2, MCUNet) on embedded devices with minimal resources (e.g., OpenMV-H7, with less than 1 MB SRAM and 2 MB Flash). Experimental results show that our framework achieves up to a 2.4× speedup, 80.8% memory saving, and a 30.3% accuracy improvement on downstream tasks, outperforming SOTA approaches.
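
The abstract does not specify the condensing operator, but the core idea, shrinking the activation-gradient map before it is propagated to earlier layers, can be illustrated with a short NumPy sketch. Average pooling below is a stand-in assumption, not the paper's actual method, and condense_gradient is a hypothetical name.

    import numpy as np

    def condense_gradient(grad_map: np.ndarray, stride: int = 2) -> np.ndarray:
        """Shrink a (C, H, W) activation-gradient map by average pooling.

        Hypothetical stand-in for Gradient Condensing: a smaller gradient
        map means proportionally less compute and memory for the remaining
        backward pass through earlier layers.
        """
        c, h, w = grad_map.shape
        h2, w2 = h // stride, w // stride
        # Crop to a multiple of the stride, then average non-overlapping blocks.
        g = grad_map[:, :h2 * stride, :w2 * stride]
        g = g.reshape(c, h2, stride, w2, stride)
        return g.mean(axis=(2, 4))

    grad = np.random.randn(16, 32, 32).astype(np.float32)
    condensed = condense_gradient(grad, stride=2)
    print(grad.nbytes, "->", condensed.nbytes, "bytes")  # 4x smaller per map

Because both the compute and the memory of a convolution's backward pass scale with the gradient map's spatial size, a stride of 2 cuts each by roughly 4x in this toy setting.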
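
Alternant Partial Update is likewise only described at a high level, so the following is a minimal sketch under assumptions: layers are split into round-robin groups, and within the active group only parameters with above-median gradient magnitude are updated. The grouping, scoring rule, and function name are illustrative, not the paper's design.

    import numpy as np

    def alternant_partial_update(params, grads, step, lr=0.01, num_groups=4):
        """Update only one alternating subset of layers per training step.

        Illustrative sketch: because the selection is computed locally from
        the current gradients, no offline search or retraining is needed to
        decide which parameters to train.
        """
        active = step % num_groups
        for i, (p, g) in enumerate(zip(params, grads)):
            if i % num_groups != active:
                continue  # frozen this step: its update is skipped entirely
            mask = np.abs(g) >= np.median(np.abs(g))  # keep "essential" entries
            p -= lr * g * mask

    params = [np.random.randn(8, 8) for _ in range(8)]
    grads = [np.random.randn(8, 8) for _ in range(8)]
    for step in range(4):
        alternant_partial_update(params, grads, step)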
DOI: 10.1109/DAC63849.2025.11132925