Multi-Agent Deep Deterministic Policy Gradient Algorithm Based on Classification Experience Replay

Detailed bibliography
Published in: IEEE Advanced Information Technology, Electronic and Automation Control Conference (IAEAC) (Online), pp. 988-992
Main authors: Sun, Xiaoying; Chen, Jinchao; Du, Chenglie; Zhan, Mengying
Format: Conference paper
Language: English
Published: IEEE, 03.10.2022
ISSN: 2689-6621
Description
Summary: In recent years, multi-agent reinforcement learning has been applied in many fields, such as urban traffic control and autonomous UAV operations. Although the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm is a classic multi-agent reinforcement learning algorithm that has been used in various simulation environments, its training efficiency is low and its convergence is slow because of its original experience replay mechanism and network structure. The random experience replay the algorithm adopts breaks the temporal correlation between data samples, but it does not take advantage of important samples. The paper therefore proposes a Multi-Agent Deep Deterministic Policy Gradient method based on classification experience replay, which replaces the traditional random experience replay with classification experience replay: classified storage makes full use of important samples. At the same time, the Critic network and the Actor network are updated asynchronously, and the better-trained Critic network is used to guide the Actor network update. Finally, to verify the effectiveness of the proposed algorithm, the improved algorithm is compared with the traditional MADDPG method in a simulation environment.
DOI: 10.1109/IAEAC54830.2022.9929494
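
The abstract describes two mechanisms: classified storage of replay samples and asynchronous Critic/Actor updates. The Python sketch below is one plausible reading of those ideas, not the paper's actual implementation. The classification criterion (total absolute reward above a threshold), the half-and-half capacity split, the per-batch mixing ratio, the `actor_delay` interval, and the `update_critic`/`update_actor` helpers are all illustrative assumptions; the abstract does not specify any of them.

```python
import random
from collections import deque


class ClassifiedReplayBuffer:
    """Two-class replay buffer: 'important' vs. 'ordinary' transitions.

    The classification rule used here (total absolute reward above a
    threshold) is an assumption for illustration only; the abstract
    does not state the paper's actual criterion.
    """

    def __init__(self, capacity, reward_threshold=1.0, important_ratio=0.5):
        self.important = deque(maxlen=capacity // 2)  # high-value samples
        self.ordinary = deque(maxlen=capacity // 2)   # everything else
        self.reward_threshold = reward_threshold
        self.important_ratio = important_ratio        # batch share from 'important'

    def add(self, obs, actions, rewards, next_obs, done):
        transition = (obs, actions, rewards, next_obs, done)
        # Classified storage: route each transition by the assumed criterion.
        if abs(sum(rewards)) >= self.reward_threshold:
            self.important.append(transition)
        else:
            self.ordinary.append(transition)

    def sample(self, batch_size):
        # Mix the two classes so important samples are replayed more often
        # than uniform sampling over one flat buffer would allow. If a class
        # is short on samples, the batch simply comes out smaller.
        n_imp = min(int(batch_size * self.important_ratio), len(self.important))
        n_ord = min(batch_size - n_imp, len(self.ordinary))
        batch = (random.sample(list(self.important), n_imp)
                 + random.sample(list(self.ordinary), n_ord))
        random.shuffle(batch)
        return batch


def update_critic(batch):
    ...  # placeholder: one gradient step on the centralized Critic


def update_actor(batch):
    ...  # placeholder: one gradient step on each agent's Actor


if __name__ == "__main__":
    buf = ClassifiedReplayBuffer(capacity=10000)
    # Fill with dummy two-agent transitions.
    for t in range(1000):
        rewards = [random.uniform(-1, 1), random.uniform(-1, 1)]
        buf.add(obs=t, actions=(0, 0), rewards=rewards, next_obs=t + 1, done=False)

    # Asynchronous updates (sketch): the Critic learns every step, the Actor
    # only every `actor_delay` steps, so an already-improved Critic guides
    # the Actor update, as the abstract describes.
    actor_delay = 2
    for step in range(100):
        batch = buf.sample(batch_size=64)
        update_critic(batch)
        if step % actor_delay == 0:
            update_actor(batch)
```

Drawing a fixed share of each minibatch from the important class is one simple way to replay high-value samples more often while random sampling within each class still breaks the temporal correlation between consecutive transitions.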