Deterministic policy gradient algorithms for semi‐Markov decision processes

Bibliographic Details
Published in: International Journal of Intelligent Systems, Vol. 37, No. 7, pp. 4008–4019
Main Authors: Hosseinloo, Ashkan Haji; Dahleh, Munther A.
Format: Journal Article
Language: English
Published: New York: John Wiley & Sons, Inc., July 2022
Subjects: Algorithms; average reward; deterministic policy; Intelligent systems; Markov analysis; Markov processes; policy gradient theorem; Preventive maintenance; reinforcement learning; SMDP; Theorems
ISSN: 0884-8173 (print); 1098-111X (online)
Online Access: https://onlinelibrary.wiley.com/doi/abs/10.1002%2Fint.22709 ; https://www.proquest.com/docview/2669523005
Abstract: A large class of sequential decision‐making problems under uncertainty, with broad applications from preventive maintenance to event‐triggered control, can be modeled in the framework of semi‐Markov decision processes (SMDPs). Unlike Markov decision processes (MDPs), SMDPs are underexplored in the online and reinforcement learning (RL) settings. In this paper, we extend the well‐known deterministic policy gradient (DPG) theorem in MDPs to SMDPs under the average‐reward criterion. The existing stochastic policy gradient methods not only require, in general, a large number of samples for training, but they also suffer from high variance in the gradient estimation when applied to problems with a deterministic optimal policy. Our DPG method can potentially remedy these issues. On the basis of this method, and depending on the choice of critic, different actor–critic algorithms can easily be developed in the RL setup. We present two example actor–critic algorithms. Both algorithms employ our developed policy gradient theorem for their actors, but use two different critics; one uses a simple SARSA update while the other uses the same on‐policy update but with compatible function approximators. We demonstrate the efficacy of our method both mathematically and via simulations.
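For background (this is standard material, not spelled out in the record itself): the average‐reward objective commonly used for SMDPs, and the MDP form of the deterministic policy gradient theorem (Silver et al., 2014) that the paper extends to the SMDP setting, can be written as

% Average-reward SMDP objective (rewards r_k, sojourn times tau_k) and the
% MDP deterministic policy gradient theorem that the paper generalizes.
\rho(\mu_\theta) \;=\; \lim_{N\to\infty}
  \frac{\mathbb{E}\!\left[\sum_{k=0}^{N-1} r_k\right]}
       {\mathbb{E}\!\left[\sum_{k=0}^{N-1} \tau_k\right]},
\qquad
\nabla_\theta J(\mu_\theta) \;=\;
  \mathbb{E}_{s\sim\rho^{\mu}}\!\left[
    \nabla_\theta \mu_\theta(s)\,
    \nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)}
  \right].

The paper's contribution is the SMDP, average‐reward analogue of the second expression; the exact form is given in the article itself.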
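As a rough illustration of the kind of actor–critic scheme the abstract describes, the sketch below pairs a deterministic linear policy (the actor) with a linear SARSA‐style critic whose average‐reward update is scaled by the sojourn time. The toy environment, feature map, and step sizes are invented for demonstration only and are not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

def env_step(s, a):
    # Toy SMDP transition, invented for illustration: action-dependent sojourn
    # time tau, a cost accrued over the sojourn, and a noisy next state.
    tau = rng.exponential(1.0 + 0.5 * a ** 2)
    r = -((s - a) ** 2) * tau
    s_next = 0.8 * s + 0.1 * rng.standard_normal()
    return r, tau, s_next

def phi(s, a):
    # Hand-coded quadratic features for the critic Q_w(s, a) = w . phi(s, a).
    return np.array([1.0, s, a, s * a, s ** 2, a ** 2])

def grad_a_phi(s, a):
    # Gradient of the critic features with respect to the action.
    return np.array([0.0, 0.0, 1.0, s, 0.0, 2.0 * a])

# Deterministic linear policy mu_theta(s) = theta0 + theta1 * s.
theta = np.zeros(2)
w = np.zeros(6)          # critic weights
rho = 0.0                # running estimate of the average reward per unit time
alpha_w, alpha_theta, alpha_rho = 0.05, 0.005, 0.01

def mu(s):
    return theta[0] + theta[1] * s

def grad_theta_mu(s):
    return np.array([1.0, s])

s = 0.0
a = mu(s)
for step in range(10_000):
    r, tau, s_next = env_step(s, a)
    a_next = mu(s_next)

    # SARSA-style temporal-difference error for an average-reward SMDP:
    # the average-reward term rho is weighted by the sojourn time tau.
    delta = r - rho * tau + w @ phi(s_next, a_next) - w @ phi(s, a)

    # Critic and average-reward updates.
    w += alpha_w * delta * phi(s, a)
    rho += alpha_rho * delta

    # Deterministic policy gradient actor update:
    # theta <- theta + alpha * grad_theta mu(s) * grad_a Q(s, a)|_{a = mu(s)}
    theta += alpha_theta * grad_theta_mu(s) * (w @ grad_a_phi(s, a))

    s, a = s_next, a_next

print("estimated average reward per unit time:", rho)
print("policy parameters:", theta)

Swapping the hand-coded features for compatible function approximators would correspond to the second actor–critic variant mentioned in the abstract; the precise construction is given in the paper.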
Authors:
– Hosseinloo, Ashkan Haji (MIT); ORCID: 0000-0002-1167-1075; email: ashkanhh@mit.edu, hhashkan@gmail.com
– Dahleh, Munther A. (MIT); ORCID: 0000-0002-1470-2148
Copyright: 2021 Wiley Periodicals LLC; 2022 Wiley Periodicals LLC
DOI: 10.1002/int.22709