Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors


Bibliographic Details
Published in: IEEE Transactions on Neural Networks and Learning Systems, Vol. 33, No. 11, pp. 6584-6598
Main Authors: Duan, Jingliang; Guan, Yang; Li, Shengbo Eben; Ren, Yangang; Sun, Qi; Cheng, Bo
Format: Journal Article
Language: English
Published: Piscataway, IEEE, 1 November 2022
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN: 2162-237X, 2162-2388
Description
Summary: In reinforcement learning (RL), function approximation errors are known to easily lead to Q-value overestimations, thus greatly reducing policy performance. This article presents a distributional soft actor-critic (DSAC) algorithm, which is an off-policy RL method for the continuous control setting, to improve policy performance by mitigating Q-value overestimations. We first discover in theory that learning a distribution function of state-action returns can effectively mitigate Q-value overestimations because it is capable of adaptively adjusting the update step size of the Q-value function. Then, a distributional soft policy iteration (DSPI) framework is developed by embedding the return distribution function into maximum entropy RL. Finally, we present a deep off-policy actor-critic variant of DSPI, called DSAC, which directly learns a continuous return distribution by keeping the variance of the state-action returns within a reasonable range to address exploding and vanishing gradient problems. We evaluate DSAC on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance.
DOI: 10.1109/TNNLS.2021.3082568