Student-t policy in reinforcement learning to acquire global optimum of robot control

Published in: Applied Intelligence (Dordrecht, Netherlands), Vol. 49, No. 12, pp. 4335-4347
Main Author: Kobayashi, Taisuke
Format: Journal Article
Language: English
Publication Details: New York: Springer US, 01.12.2019 (Springer Nature B.V.)
ISSN: 0924-669X, 1573-7497
Description
Summary: This paper proposes an actor-critic algorithm with a policy parameterized by the Student-t distribution, named the Student-t policy, to improve learning performance, chiefly the ability to reach the global optimum of the task being learned. The actor-critic algorithm is a policy-gradient method in reinforcement learning and is proven to converge to one of the local optima of the policy. To avoid local optima, two abilities are regarded as empirically effective: exploration strong enough to escape them, and conservative learning that does not get trapped in them. The conventional policy parameterized by a normal distribution fundamentally lacks both abilities, and even state-of-the-art methods compensate for them only partially. In contrast, heavy-tailed distributions, including the Student-t distribution, possess an excellent exploration ability known as Lévy flight, which models the efficient foraging behavior of animals. Another property of the heavy tail is robustness to outliers: learning remains conservative and avoids being pulled into local optima even when the agent takes extreme actions. These properties of the Student-t policy increase the probability that the agent reaches the global optimum. Indeed, the Student-t policy outperforms the conventional policy in four simulations: two are difficult to learn quickly without sufficient exploration, and the other two contain local optima.
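
To make the abstract's core idea concrete, the following is a minimal sketch of a policy head that samples actions from a Student-t distribution and evaluates their log-probability for the policy-gradient update, assuming PyTorch's torch.distributions.StudentT. The network layout, the softplus parameterization of the degrees of freedom, and the names used (StudentTPolicy, obs_dim, act_dim) are illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import StudentT

class StudentTPolicy(nn.Module):
    # Hypothetical policy head: outputs location, scale, and degrees of
    # freedom of a Student-t distribution over actions.
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.loc = nn.Linear(hidden, act_dim)
        self.log_scale = nn.Linear(hidden, act_dim)
        self.df_raw = nn.Linear(hidden, act_dim)

    def dist(self, obs):
        h = self.body(obs)
        # Small df -> heavy tails (rare long jumps, Levy-flight-like
        # exploration); df -> infinity recovers the normal distribution.
        df = F.softplus(self.df_raw(h)) + 1.0
        return StudentT(df, self.loc(h), self.log_scale(h).exp())

policy = StudentTPolicy(obs_dim=4, act_dim=1)
obs = torch.randn(1, 4)
pi = policy.dist(obs)
action = pi.rsample()           # occasionally lands far from loc: exploration
log_prob = pi.log_prob(action)  # heavy tails assign extreme actions a
                                # moderate log-density, so their gradients
                                # stay bounded (conservative learning)

Keeping the degrees of freedom above 1 guarantees a finite mean; letting the network learn it allows the policy to interpolate between heavy-tailed exploration (small df) and near-normal behavior (large df).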
DOI: 10.1007/s10489-019-01510-8