Edge-LLM: A Collaborative Framework for Large Language Model Serving in Edge Computing

Bibliographic Details
Published in: Proceedings (IEEE International Conference on Web Services. Online), pp. 799-809
Main Authors: Cai, Fenglong; Yuan, Dong; Yang, Zhe; Cui, Lizhen
Format: Conference paper
Language: English
Publication details: IEEE, 07.07.2024
ISSN: 2836-3868
Description
Summary: The rapid advancement and extensive deployment of Large Language Models (LLMs) are milestones in the field of artificial intelligence. Although Parameter-Efficient Transfer Learning (PETL), a.k.a. Adapter, methods have lowered the barrier to fine-tuning and inference on LLMs, it remains a challenge to efficiently deploy and fine-tune the many different adapter models needed by massive numbers of AI applications. With the popularity of SoC chips, the computing power of edge devices has improved significantly. To meet the computational demands of LLM applications and improve quality of service (QoS), we propose Edge-LLM, a server-node collaboration framework for large language model serving that efficiently utilizes edge resources to accelerate LLM fine-tuning and inference in resource-constrained scenarios. Within the framework, we implement an adaptive quantization strategy, an FM cache mechanism, and a value density first (VDF) scheduling algorithm to reduce GPU overhead and accelerate LLM computation. The experimental results demonstrate that Edge-LLM improves overall computational speed by a factor of 17, decreases the number of tasks experiencing timeouts by 63%, and reduces GPU overhead by up to 43%.
DOI: 10.1109/ICWS62655.2024.00099
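
The abstract names a value density first (VDF) scheduling algorithm, but this record does not describe how it works. Below is a minimal, illustrative Python sketch of the general value-density-first idea, assuming "value density" means a task's value divided by its estimated compute cost and that tasks which can no longer meet their deadline are dropped. The Task fields, the VDFScheduler class, and all names are hypothetical, chosen for illustration rather than taken from the paper.

import heapq
import itertools
import time
from dataclasses import dataclass

@dataclass
class Task:
    """A unit of LLM work (illustrative fields, not from the paper)."""
    name: str
    value: float      # benefit of completing the task, e.g. a QoS weight
    est_cost: float   # estimated compute time in seconds
    deadline: float   # absolute deadline as epoch seconds

    @property
    def value_density(self) -> float:
        # Value gained per unit of estimated compute cost.
        return self.value / self.est_cost

class VDFScheduler:
    """Greedy value-density-first queue: always dispatch the pending
    task with the highest value per unit of estimated compute cost."""

    def __init__(self) -> None:
        self._heap = []                # entries: (-density, seq, task)
        self._seq = itertools.count()  # tie-breaker so heapq never compares Tasks

    def submit(self, task: Task) -> None:
        heapq.heappush(self._heap, (-task.value_density, next(self._seq), task))

    def next_task(self, now: float | None = None):
        now = time.time() if now is None else now
        while self._heap:
            _, _, task = heapq.heappop(self._heap)
            # Skip tasks that can no longer finish before their deadline:
            # running them would spend GPU time for zero value.
            if now + task.est_cost <= task.deadline:
                return task
        return None

# Example: a cheap, urgent inference request outranks a heavier fine-tuning job.
if __name__ == "__main__":
    sched = VDFScheduler()
    now = time.time()
    sched.submit(Task("adapter-finetune", value=5.0, est_cost=10.0, deadline=now + 60))
    sched.submit(Task("chat-inference", value=2.0, est_cost=0.5, deadline=now + 2))
    print(sched.next_task().name)  # -> chat-inference (density 4.0 beats 0.5)

In this sketch the inference request, with value density 2.0 / 0.5 = 4.0, is dispatched before the fine-tuning job, whose density is 5.0 / 10.0 = 0.5, which matches the intuition of maximizing delivered value per unit of GPU time under resource constraints.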