Edge-LLM Inference With Cost-Aware Layer Allocation and Adaptive Scheduling

Bibliographic Details
Published in:	IEEE Access, Vol. 13, pp. 131614-131637
Main Authors:	Habibi, Sama; Ercetin, Ozgur
Format:	Journal Article
Language:	English
Published:	Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2025
Subjects:	Adaptive algorithms; Adaptive scheduling; Completion time; Computational modeling; Constraints; Costs; distributed AI; Dynamic scheduling; edge computing; fair incentive mechanism; Inference; Large language models; Memory devices; Memory management; Performance evaluation; Pipelines; Processor scheduling; Queueing; resource allocation; Resource management; Scalability; Scheduling
ISSN:	2169-3536

Description
Summary:	This paper addresses two key challenges in distributed Large Language Model (LLM) inference at the edge: 1) cost-efficient and fair task allocation, and 2) dynamic scheduling under deadline constraints. We propose two mechanisms: the Fair Cost-Efficient Incentive Mechanism (FCIM) for task and layer assignment, and the Adaptive Dynamic Scheduling Algorithm (ADSA) for execution scheduling on individual devices. FCIM is an auction-based mechanism that selects cost-effective, memory-feasible devices while minimizing task latency, reward cost, and device usage. Its adaptive reward design ensures positive utility and fairness, even under shifting system priorities. ADSA enables preemption-aware, deadline-driven scheduling by dynamically reordering tasks based on arrival time and workload characteristics. Simulations demonstrate that FCIM reduces communication overhead by 54.7% and task completion time by 36.9% compared to static and performance-driven baselines, while ADSA reduces queueing delay by 39% under strict deadline constraints.
DOI:	10.1109/ACCESS.2025.3592308
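
Illustration:	The summary describes ADSA as preemption-aware, deadline-driven scheduling on a single device. This record does not give the algorithm itself, so the sketch below is not ADSA. It shows plain preemptive earliest-deadline-first (EDF) scheduling, a standard technique swapped in here only to illustrate what deadline-driven reordering with preemption can look like. The `Task` fields, the `edf_schedule` function, and the example workload are all hypothetical.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical task model; the paper's actual task attributes are not in this record.
    name: str
    arrival: float   # time the task becomes available on the device
    work: float      # processing time the task needs on this device
    deadline: float  # absolute deadline


def edf_schedule(tasks):
    """Simulate preemptive earliest-deadline-first on one device.

    Returns a dict mapping each task name to its completion time.
    Task names are assumed to be unique.
    """
    pending = sorted(tasks, key=lambda t: t.arrival)
    ready = []   # min-heap of (deadline, name, remaining_work)
    finish = {}
    now = 0.0
    i = 0
    while i < len(pending) or ready:
        if not ready:
            # Device is idle: jump ahead to the next arrival.
            now = max(now, pending[i].arrival)
        # Admit every task that has arrived by the current time.
        while i < len(pending) and pending[i].arrival <= now:
            t = pending[i]
            heapq.heappush(ready, (t.deadline, t.name, t.work))
            i += 1
        deadline, name, remaining = heapq.heappop(ready)
        # Run until the task finishes or the next arrival, whichever is first.
        # Stopping at an arrival is what allows a more urgent task to preempt.
        next_arrival = pending[i].arrival if i < len(pending) else float("inf")
        run = min(remaining, next_arrival - now)
        now += run
        remaining -= run
        if remaining > 1e-12:
            heapq.heappush(ready, (deadline, name, remaining))
        else:
            finish[name] = now
    return finish


# Hypothetical workload: B arrives while A is running and has a tighter deadline.
tasks = [
    Task("A", arrival=0, work=4, deadline=10),
    Task("B", arrival=1, work=2, deadline=4),
    Task("C", arrival=2, work=1, deadline=12),
]
finish = edf_schedule(tasks)
print(finish)
```

In this example, task B preempts task A at time 1 because its deadline is earlier, and all three tasks finish before their deadlines. A scheduler that also weighs workload characteristics, as the summary says ADSA does, would need a different priority key than the deadline alone.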