Adaptive Inner-reward Shaping in Sparse Reward Games

Detailed Bibliography
Published in: Proceedings of ... International Joint Conference on Neural Networks, pp. 1-8
Main authors: Yang, Dong; Tang, Yuhua
Format: Conference paper
Language: English
Published: IEEE, 01.07.2020
ISSN: 2161-4407
Description
Summary: Reinforcement learning focuses on goal-directed learning from interaction, and the success of its applications depends strongly on how well the reward signal frames the problem and how well it assesses progress in solving it. In many real-world scenarios, however, the agent receives extremely sparse or even no rewards, which makes learning fail and degenerate into ineffective exploration. In psychology, shaping is a method of animal training that reinforces successive approximations of a desired behavior until the complex target behavior is achieved. Inspired by this phenomenon of animal learning, and by reward as a signal in neuroscience, in this paper we address the sparse reward problem by constructing a reward generator that produces inner rewards and guides the agent in learning control policies with deep neural networks. The proposed learning-based reward shaping does not require domain-specific knowledge; rather, it enables the agent to learn, online and jointly with the actual reinforcement learning process, how to generate inner rewards that guide itself in any scenario. To validate its performance on complex sparse reward problems, the proposed approach is evaluated in a challenging scenario, Football Academy in the Google Research Football Environment, a newly released reinforcement learning environment with a physics-based 3D simulator, rather than the maze or grid-world environments commonly used in research, which are not sufficiently challenging. We compare the performance of our inner-reward approach with two reinforcement learning algorithms (PPO and ICM + PPO). Experimental results show that our method improves learning performance in terms of both speed and quality, and also enables the agent to learn generalized skills that transfer to novel scenarios.
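
The abstract describes the method only at a high level: a reward generator, trained jointly with the policy, produces inner rewards that densify the sparse environment signal. The following is a minimal Python/PyTorch sketch of that general idea, not the authors' implementation; the class name RewardGenerator, the network shape, the mixing coefficient beta, and the Google Research Football dimensions (115-dimensional "simple115" observations, 19 discrete actions encoded one-hot) are all illustrative assumptions.

# Hypothetical sketch of learning-based inner-reward shaping; none of
# these names or design choices come from the paper itself.
import torch
import torch.nn as nn

class RewardGenerator(nn.Module):
    """Maps a state-action pair to a scalar inner reward (assumed design)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # Scalar inner reward per (state, action) pair in the batch.
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def shaped_reward(env_reward: torch.Tensor,
                  inner_reward: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    # Augment the sparse environment reward with the learned inner reward.
    # beta is an assumed mixing coefficient; the abstract does not give
    # the paper's actual combination rule.
    return env_reward + beta * inner_reward

# Usage, with dimensions assumed from Google Research Football's
# "simple115" observations and 19-way discrete action set:
gen = RewardGenerator(obs_dim=115, act_dim=19)
obs = torch.randn(32, 115)                                   # batch of states
act = nn.functional.one_hot(torch.randint(0, 19, (32,)), 19).float()
r_inner = gen(obs, act)                                      # (32,) inner rewards
r_total = shaped_reward(torch.zeros(32), r_inner)            # sparse env reward is often 0

In a full training loop, the shaped reward would replace the environment reward in the PPO update, and the generator's parameters would be updated jointly with the policy; the abstract does not specify the generator's training objective, so that step is omitted here.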
DOI: 10.1109/IJCNN48605.2020.9207302