Compressing and Fine-tuning DNNs for Efficient Inference in Mobile Device-Edge Continuum
| Published in: | 2024 IEEE International Mediterranean Conference on Communications and Networking (MeditCom), pp. 305-310 |
|---|---|
| Main authors: | , , , , |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 08.07.2024 |
| Subjects: | |
| Online access: | Get full text |
| Summary: | Pruning deep neural networks (DNNs) is a well-known technique that allows for a significant reduction in inference cost. However, this may severely degrade the accuracy achieved by the model unless the latter is properly fine-tuned, which may, in turn, result in increased computational cost and latency. Thus, when deploying a DNN in resource-constrained edge environments, it is critical to find the best trade-off between accuracy (hence, model complexity), latency, and energy consumption. In this work, we explore the different options for deploying a machine learning pipeline, encompassing pruning, fine-tuning, and inference, across a mobile device requesting inference tasks and an edge server, while considering privacy constraints on the data used for fine-tuning. Our experimental analysis provides insights for an efficient allocation of the pipeline tasks across the network edge and the mobile device in terms of energy and network costs, as the target inference latency and accuracy vary. In particular, our results highlight that the higher the edge server load and the number of inference requests, the more convenient it becomes to deploy the entire pipeline at the mobile device using a pruned model, with a cost reduction of up to a factor of two compared to deploying the whole pipeline at the edge. |
| DOI: | 10.1109/MeditCom61057.2024.10621155 |
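
As an illustration of the pruning-and-fine-tuning step discussed in the summary above, the following is a minimal sketch assuming PyTorch; the toy model, synthetic data, and 50% sparsity level are illustrative placeholders, not the setup used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy classifier standing in for the DNN served at the mobile device or edge.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))

# Prune 50% of the weights of each Linear layer by L1 magnitude.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)

# Fine-tune the pruned model (here on synthetic labelled data) to recover
# the accuracy lost to pruning; in the paper this step can be allocated
# either to the mobile device or to the edge server.
x = torch.randn(256, 64)
y = torch.randint(0, 10, (256,))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
for _ in range(5):
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()

# Make the pruning permanent (drop the re-parametrization) before inference.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")
```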