Deterministic policy gradient algorithms for semi‐Markov decision processes

Bibliographic Details
Published in:	International journal of intelligent systems, Vol. 37, no. 7, pp. 4008–4019
Main Authors:	Hosseinloo, Ashkan Haji; Dahleh, Munther A.
Format:	Journal Article
Language:	English
Published:	New York: John Wiley & Sons, Inc., 01.07.2022
Subjects:	Algorithms; average reward; deterministic policy; Intelligent systems; Markov analysis; Markov processes; policy gradient theorem; Preventive maintenance; reinforcement learning; SMDP; Theorems
ISSN:	0884-8173, 1098-111X

Description
Summary:	A large class of sequential decision‐making problems under uncertainty, with broad applications from preventive maintenance to event‐triggered control, can be modeled in the framework of semi‐Markov decision processes (SMDPs). Unlike Markov decision processes (MDPs), SMDPs are underexplored in the online and reinforcement learning (RL) settings. In this paper, we extend the well‐known deterministic policy gradient (DPG) theorem in MDPs to SMDPs under the average‐reward criterion. Existing stochastic policy gradient methods not only require, in general, a large number of samples for training, but also suffer from high variance in the gradient estimate when applied to problems with a deterministic optimal policy. Our DPG method can potentially remedy these issues. On the basis of this method, and depending on the choice of critic, different actor–critic algorithms can easily be developed in the RL setup. We present two example actor–critic algorithms. Both employ our policy gradient theorem for their actors but use different critics: one uses a simple SARSA update, while the other uses the same on‐policy update with compatible function approximators. We demonstrate the efficacy of our method both mathematically and via simulations.
DOI:	10.1002/int.22709