Distributed Neural Policy Gradient Algorithm for Global Convergence of Networked Multiagent Reinforcement Learning

Bibliographic Details
Published in: IEEE Transactions on Automatic Control, Vol. 70, No. 11, pp. 7109–7124
Main Authors: Dai, Pengcheng; Mo, Yuanqiu; Yu, Wenwu; Ren, Wei
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.11.2025
ISSN: 0018-9286, 1558-2523
Online Access: Full text
Description
Abstract: This article studies the networked multiagent reinforcement learning problem, where the objective of the agents is to collaboratively maximize the discounted average cumulative reward. Unlike existing methods, which suffer from limited expressiveness due to linear function approximation, we propose a distributed neural policy gradient algorithm that features two specially designed neural networks for approximating the Q-functions and policy functions of the agents. The algorithm consists of two key components: the distributed critic step and the decentralized actor step. In the distributed critic step, agents receive the approximate Q-function parameters of their neighboring agents over a time-varying communication network to collaboratively evaluate the joint policy. In the decentralized actor step, by contrast, each agent updates its local policy parameters based solely on its own approximate Q-function. In the convergence analysis, we first establish the global convergence of the agents for joint policy evaluation in the distributed critic step, and subsequently demonstrate the global convergence of the overall distributed neural policy gradient algorithm with respect to the objective function. Finally, the effectiveness of the proposed algorithm is demonstrated by comparing it with a centralized algorithm through simulations in a robot path planning environment.
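As a rough illustration of the two-step structure described in the abstract, and not the authors' algorithm or experimental setup, the sketch below uses small one-hidden-layer networks in NumPy, a fixed ring communication graph with a doubly stochastic mixing matrix, and random toy data. Every network size, learning rate, and topology choice here is an assumption made for illustration; the paper's actual neural approximators, time-varying network, and step sizes differ.

```python
# Minimal sketch of the two-step structure from the abstract:
# a distributed critic step (consensus on Q-function parameters plus a local
# TD update) followed by a decentralized actor step (purely local policy
# update). All sizes, rates, and the fixed ring graph are illustrative
# assumptions, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)

N_AGENTS, OBS_DIM, N_ACTIONS, HIDDEN = 4, 6, 3, 16
GAMMA, LR_CRITIC, LR_ACTOR = 0.95, 1e-2, 1e-3

def init_params(in_dim, out_dim):
    """One-hidden-layer network parameters (a stand-in for the paper's
    neural approximators)."""
    return {"W1": rng.normal(0, 0.3, (HIDDEN, in_dim)),
            "W2": rng.normal(0, 0.3, (out_dim, HIDDEN))}

def forward(params, x):
    h = np.tanh(params["W1"] @ x)
    return params["W2"] @ h, h

# Each agent keeps its own critic (Q-function) and actor (policy) parameters.
critics = [init_params(OBS_DIM + N_ACTIONS, 1) for _ in range(N_AGENTS)]
actors = [init_params(OBS_DIM, N_ACTIONS) for _ in range(N_AGENTS)]

# Doubly stochastic mixing matrix over a fixed ring graph (assumed topology;
# the paper uses time-varying communication networks).
W = np.zeros((N_AGENTS, N_AGENTS))
for i in range(N_AGENTS):
    W[i, i] = 0.5
    W[i, (i - 1) % N_AGENTS] = 0.25
    W[i, (i + 1) % N_AGENTS] = 0.25

def one_hot(a):
    v = np.zeros(N_ACTIONS)
    v[a] = 1.0
    return v

def distributed_critic_step(obs, acts, rewards, next_obs, next_acts):
    """Agents average neighbors' critic parameters (consensus), then take a
    local TD(0) gradient step using their own observed reward."""
    mixed = [{k: sum(W[i, j] * critics[j][k] for j in range(N_AGENTS))
              for k in critics[0]} for i in range(N_AGENTS)]
    for i in range(N_AGENTS):
        x = np.concatenate([obs, one_hot(acts[i])])
        x_next = np.concatenate([next_obs, one_hot(next_acts[i])])
        q, h = forward(mixed[i], x)
        q_next, _ = forward(mixed[i], x_next)
        td_err = rewards[i] + GAMMA * q_next[0] - q[0]
        # Gradient of 0.5 * td_err**2 w.r.t. the mixed critic parameters.
        grad_W2 = -td_err * h[None, :]
        grad_W1 = -td_err * (mixed[i]["W2"][0] * (1 - h**2))[:, None] * x[None, :]
        critics[i] = {"W1": mixed[i]["W1"] - LR_CRITIC * grad_W1,
                      "W2": mixed[i]["W2"] - LR_CRITIC * grad_W2}

def decentralized_actor_step(obs, acts):
    """Each agent updates its own policy using only its own critic."""
    for i in range(N_AGENTS):
        logits, h = forward(actors[i], obs)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        q, _ = forward(critics[i], np.concatenate([obs, one_hot(acts[i])]))
        # Policy-gradient-style ascent weighted by the agent's own Q estimate.
        dlogits = one_hot(acts[i]) - probs
        grad_W2 = q[0] * np.outer(dlogits, h)
        grad_W1 = q[0] * ((actors[i]["W2"].T @ dlogits) * (1 - h**2))[:, None] * obs[None, :]
        actors[i]["W2"] += LR_ACTOR * grad_W2
        actors[i]["W1"] += LR_ACTOR * grad_W1

# Toy rollout on random data, just to exercise both steps.
for _ in range(5):
    obs, next_obs = rng.normal(size=OBS_DIM), rng.normal(size=OBS_DIM)
    acts = rng.integers(N_ACTIONS, size=N_AGENTS)
    next_acts = rng.integers(N_ACTIONS, size=N_AGENTS)
    rewards = rng.normal(size=N_AGENTS)
    distributed_critic_step(obs, acts, rewards, next_obs, next_acts)
    decentralized_actor_step(obs, acts)
```

The design point mirrored here is that only critic (Q-function) parameters travel over the communication network, while each policy update uses purely local information, matching the distributed critic / decentralized actor split described in the abstract.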
DOI:10.1109/TAC.2025.3570065