An Improved Sarsa( \lambda ) Reinforcement Learning Algorithm for Wireless Communication Systems

In this article, we provide a novel improved model-free temporal-difference control algorithm, namely, Expected Sarsa(λ), using the average value as an update target and introducing eligibility traces in wireless communication networks. In particular, we construct the update target using the average...

Celý popis

Uloženo v:

Podrobná bibliografie
Vydáno v:	IEEE access Ročník 7; s. 115418 - 115427
Hlavní autoři:	Jiang, Hao, Gui, Renjie, Chen, Zhen, Wu, Liang, Dang, Jian, Zhou, Jie
Médium:	Journal Article
Jazyk:	angličtina
Vydáno:	Piscataway IEEE 2019 The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Témata:	<italic xmlns:ali="http://www.niso.org/schemas/ali/1.0/" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">Q learning Algorithms Communication networks Communications networks Control algorithms Control theory Decision making eligibility traces Machine learning Machine learning algorithms Markov processes Mathematical model Model-free reinforcement learning Numerical models Reinforcement learning Sarsa Wireless communication Wireless communication systems Wireless communications Wireless networks
ISSN:	2169-3536, 2169-3536
On-line přístup:	Získat plný text
Tagy:	Přidat tag Žádné tagy, Buďte první, kdo vytvoří štítek k tomuto záznamu!

Popis
Shrnutí:	In this article, we provide a novel improved model-free temporal-difference control algorithm, namely, Expected Sarsa(λ), using the average value as an update target and introducing eligibility traces in wireless communication networks. In particular, we construct the update target using the average action value of all possible successive actions, and apply eligibility traces to record the historical access of every state action pair, which greatly improve the model's convergence property and learning efficiency. Numerical results demonstrate that the proposed algorithm has the advantage of high learning efficiency and a higher learning-rate tolerance range than Q Learning, Sarsa, Expected Sarsa, and Sarsa(λ) in the tabular case of a finite Markov decision process, thereby providing an efficient solution for the study and design wireless communication networks. This provides an efficient and effective solution to design further artificial intelligent communication networks.
Bibliografie:	ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14
ISSN:	2169-3536 2169-3536
DOI:	10.1109/ACCESS.2019.2935255