MW-MADDPG: a meta-learning based decision-making method for collaborative UAV swarm

Bibliographic Details
Published in:	Frontiers in Neurorobotics, Vol. 17, article 1243174
Main Authors:	Zhao, Minrui, Wang, Gang, Fu, Qiang, Guo, Xiangke, Chen, Yu, Li, Tengda, Liu, XiangYu
Format:	Journal Article
Language:	English
Published:	Lausanne: Frontiers Research Foundation / Frontiers Media S.A., 21 September 2023
Subjects:	Algorithms; Cluster analysis; Collaboration; Decision making; Deep learning; Efficiency; Evacuations & rescues; Learning; MADDPG; meta learning; Model Agnostic Meta Learning (MAML); multi-agent reinforcement learning (MARL); Neuroscience; Reinforcement; Scheduling; UAV; Unmanned aerial vehicles
ISSN:	1662-5218

Description
Summary:	Unmanned Aerial Vehicles (UAVs) have come into widespread use in recent years because of their low lifecycle cost and minimal risk to humans. Multi-agent deep reinforcement learning shows significant potential for cooperative decision-making in UAV swarms. However, current approaches are challenged by diverse mission environments and mission time constraints. To address this, the present study proposes a meta-learning based multi-agent deep reinforcement learning approach. Specifically, it presents an improved MAML-based multi-agent deep deterministic policy gradient (MADDPG) algorithm that obtains an unbiased initialization network by automatically assigning weights to meta-learning trajectories. In addition, it introduces a Reward-TD prioritized experience replay technique, which takes both immediate reward and TD-error into account to improve the resilience and sample utilization of the algorithm. Experimental results show that the proposed approach effectively accomplishes the task in a new scenario, with significantly higher task success rate, average reward, and robustness than existing methods.
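The summary describes Reward-TD prioritized experience replay only at a high level: sampling priority depends on both immediate reward and TD-error. The sketch below shows one plausible way to combine the two, assuming a linear mix `lam * |td_error| + (1 - lam) * |reward|` raised to an exponent `alpha`, in the style of standard proportional prioritized replay with importance-sampling weights. The class name `RewardTDReplayBuffer`, the mixing rule, and the parameters `lam`, `alpha`, `eps`, and `beta` are hypothetical illustrations and are not taken from the paper.

```python
import numpy as np


class RewardTDReplayBuffer:
    """Hypothetical prioritized replay buffer mixing |TD-error| and |immediate reward|.

    Priority rule (an assumption, not MW-MADDPG's published formula):
        p_i = (lam * |td_error_i| + (1 - lam) * |reward_i| + eps) ** alpha
    """

    def __init__(self, capacity, lam=0.5, alpha=0.6, eps=1e-6, seed=0):
        self.capacity = capacity
        self.lam = lam          # weight on TD-error vs. immediate reward
        self.alpha = alpha      # how strongly priorities skew sampling (0 = uniform)
        self.eps = eps          # keeps every stored transition sampleable
        self.data = []
        self.rewards = np.zeros(capacity)
        self.priorities = np.zeros(capacity)
        self.pos = 0            # next slot to write (ring buffer)
        self.rng = np.random.default_rng(seed)

    def _priority(self, td_error, reward):
        mixed = self.lam * abs(td_error) + (1.0 - self.lam) * abs(reward)
        return (mixed + self.eps) ** self.alpha

    def add(self, transition, reward, td_error):
        """Store a transition, overwriting the oldest one once the buffer is full."""
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.rewards[self.pos] = reward
        self.priorities[self.pos] = self._priority(td_error, reward)
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        """Sample indices in proportion to priority; return importance-sampling weights."""
        n = len(self.data)
        probs = self.priorities[:n] / self.priorities[:n].sum()
        idx = self.rng.choice(n, size=batch_size, p=probs)
        # Importance-sampling weights offset the bias of non-uniform sampling,
        # normalized so the largest weight is 1.
        weights = (n * probs[idx]) ** (-beta)
        weights /= weights.max()
        return [self.data[i] for i in idx], idx, weights

    def update(self, idx, td_errors):
        """Refresh priorities with new TD-errors after a learning step; rewards are fixed."""
        for i, d in zip(idx, td_errors):
            self.priorities[i] = self._priority(d, self.rewards[i])
```

In this sketch, setting `lam=1.0` reduces to ordinary TD-error-proportional prioritization, while lowering it lets transitions with large immediate rewards be replayed more often even when their TD-error is small.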
Reviewed by:	Mu Hua, University of Lincoln, United Kingdom; Yan Fang, Kennesaw State University, United States; Pengyu Yuan, Google, United States
Edited by:	Ming-Feng Ge, China University of Geosciences Wuhan, China
DOI:	10.3389/fnbot.2023.1243174