Dual Parallel Policy Iteration With Coupled Policy Improvement
| Published in: | IEEE Transactions on Neural Networks and Learning Systems, Volume 35, Issue 3, pp. 1-13 |
|---|---|
| Main Authors: | , , , |
| Format: | Journal Article |
| Language: | English |
| Publication Details: | United States: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.03.2024 |
| Subject: | |
| ISSN: | 2162-237X, 2162-2388 |
| Online Access: | Get full text |
| Summary: | In this article, a novel coupled policy improvement mechanism is developed for improving policy iteration (PI) algorithms. In contrast to common PI, the developed dual parallel policy iteration (DPPI) with the coupled policy improvement mechanism consists of two parallel PIs. At each PI step, the performances of the two parallel policies are evaluated and the better one is designated the dominant policy. The dominant policy then guides the parallel policy improvement in a soft manner by constraining the Kullback-Leibler (KL) divergence between the dominant policy and the policy to be updated. It is proven that the convergence of DPPI is guaranteed under the designed coupled policy improvement mechanism. Moreover, it is shown that under certain conditions, the *Q*-functions of the two new policies obtained in each parallel policy improvement are larger than those of all previous dominant policies, which is conducive to accelerating the PI process and improving policy learning efficiency. Furthermore, by combining DPPI with the twin delayed deep deterministic (TD3) policy gradient, we propose a reinforcement learning (RL) algorithm: parallel TD3 (PTD3). Experimental results on continuous-action control tasks in the MuJoCo and OpenAI Gym platforms show that the proposed PTD3 outperforms state-of-the-art RL algorithms. |
|---|---|
| DOI: | 10.1109/TNNLS.2022.3202192 |
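The abstract's core mechanism can be illustrated with a toy sketch. This is not the authors' implementation: the two-armed bandit, the penalty weight `beta`, and the exponentiated-gradient update rule are hypothetical choices made only to show the idea of evaluating two parallel policies, designating the better one as dominant, and softly pulling both updates toward it via a KL-divergence term.

```python
# Illustrative sketch (assumptions labeled above) of the DPPI idea from
# the abstract: two parallel policies are evaluated each iteration, the
# better one becomes the "dominant" policy, and both are improved with a
# KL penalty toward the dominant policy.
import numpy as np

q_true = np.array([1.0, 2.0])           # toy two-armed bandit: arm 1 is better

def value(p):
    # Expected reward of a stochastic policy p over the two arms.
    return float(p @ q_true)

def improve(p, dominant, beta=0.5):
    # Soft, KL-constrained improvement: maximize p @ q_true - beta * KL(p, dominant).
    # The maximizer has the standard closed form p' ∝ dominant * exp(q / beta).
    logits = np.log(dominant) + q_true / beta
    new = np.exp(logits - logits.max())  # subtract max for numerical stability
    return new / new.sum()

p1 = np.array([0.9, 0.1])               # two parallel policies with different starts
p2 = np.array([0.6, 0.4])
for _ in range(10):
    dominant = p1 if value(p1) >= value(p2) else p2
    p1, p2 = improve(p1, dominant), improve(p2, dominant)

# Both policies end up concentrated on the better arm, guided by whichever
# of the two was dominant at each step.
print(value(p1), value(p2))
```

The KL term keeps each update close to the current dominant policy rather than copying it outright, which is the "soft" guidance the abstract describes.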