Deterministic policy gradient algorithms for semi‐Markov decision processes
| Published in: | International Journal of Intelligent Systems, Vol. 37, No. 7, pp. 4008-4019 |
|---|---|
| Main authors: | Hosseinloo, Ashkan Haji; Dahleh, Munther A. |
| Format: | Journal Article |
| Language: | English |
| Published: | New York: John Wiley & Sons, Inc, 01.07.2022 |
| Subjects: | Algorithms; average reward; deterministic policy; intelligent systems; Markov analysis; Markov processes; policy gradient theorem; preventive maintenance; reinforcement learning; SMDP |
| ISSN: | 0884-8173, 1098-111X |
| Online access: | Full text |
| Abstract | A large class of sequential decision‐making problems under uncertainty, with broad applications from preventive maintenance to event‐triggered control, can be modeled in the framework of semi‐Markov decision processes (SMDPs). Unlike Markov decision processes (MDPs), SMDPs are underexplored in the online and reinforcement learning (RL) settings. In this paper, we extend the well‐known deterministic policy gradient (DPG) theorem in MDPs to SMDPs under the average‐reward criterion. The existing stochastic policy gradient methods not only require, in general, a large number of samples for training, but they also suffer from high variance in the gradient estimation when applied to problems with a deterministic optimal policy. Our DPG method can potentially remedy these issues. On the basis of this method and depending on the choice of a critic, different actor–critic algorithms can easily be developed in the RL setup. We present two example actor–critic algorithms. Both algorithms employ our developed policy gradient theorem for their actors, but use two different critics; one uses a simple SARSA update while the other uses the same on‐policy update but with compatible function approximators. We demonstrate the efficacy of our method both mathematically and via simulations. |
|---|---|
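The abstract extends the deterministic policy gradient (DPG) theorem from MDPs to average-reward SMDPs and builds two actor–critic algorithms on it. For orientation, the classical MDP form of the DPG theorem (Silver et al., 2014), which the paper generalizes, can be stated as:

```latex
% Classical DPG theorem for MDPs (Silver et al., 2014); the paper's contribution
% is its extension to semi-Markov decision processes under the average-reward criterion.
\nabla_\theta J(\mu_\theta)
  = \mathbb{E}_{s \sim \rho^{\mu_\theta}}\!\left[
      \nabla_\theta \mu_\theta(s)\,
      \nabla_a Q^{\mu_\theta}(s,a)\big|_{a=\mu_\theta(s)}
    \right]
```

The sketch below is a minimal illustration of the general idea, not the paper's exact algorithm: a deterministic actor updated along $\nabla_\theta \mu_\theta(s)\,\nabla_a Q(s,a)$, paired with a SARSA-style critic in which the average reward is charged per unit of the random sojourn time, as is common in average-reward SMDP formulations. The toy environment `toy_smdp_step`, the linear features, the affine policy, and all step sizes are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(s, a):
    # Hand-picked polynomial features for the linear critic (illustrative only).
    return np.array([1.0, s, a, s * a, a * a])

def mu(theta, s):
    # Deterministic policy, affine in the state: a = theta[0] + theta[1] * s.
    return theta[0] + theta[1] * s

def toy_smdp_step(s, a):
    # Hypothetical SMDP (our own toy choice): the sojourn time tau is random and
    # action-dependent, and the reward accrues over that sojourn.
    tau = 1.0 + rng.exponential(abs(a) + 0.1)
    r = -(s ** 2) * tau - 0.1 * a ** 2
    s_next = 0.8 * s + a + 0.1 * rng.standard_normal()
    return r, tau, s_next

theta = np.zeros(2)      # actor (deterministic policy) parameters
w = np.zeros(5)          # critic parameters, Q(s, a) ~= w . phi(s, a)
rho = 0.0                # running estimate of the average reward per unit time
alpha_w, alpha_theta, alpha_rho = 1e-2, 1e-3, 1e-2

s = 1.0
a = mu(theta, s)
for _ in range(20000):
    r, tau, s_next = toy_smdp_step(s, a)
    a_next = mu(theta, s_next)

    # SARSA-style critic update for the average-reward SMDP setting: the
    # average reward rho is subtracted in proportion to the sojourn time tau.
    delta = r - rho * tau + w @ phi(s_next, a_next) - w @ phi(s, a)
    w += alpha_w * delta * phi(s, a)
    rho += alpha_rho * delta

    # Deterministic-policy-gradient-style actor step:
    # ascend grad_theta mu(s) * grad_a Q(s, a), evaluated at a = mu(s).
    grad_a_Q = w[2] + w[3] * s + 2.0 * w[4] * a   # d/da of w . phi(s, a)
    grad_theta_mu = np.array([1.0, s])            # d mu / d theta
    theta += alpha_theta * grad_theta_mu * grad_a_Q

    s, a = s_next, a_next

print("learned actor parameters:", theta)
print("estimated average reward per unit time:", rho)
```

Swapping the linear critic for a different on-policy critic (for instance one built on compatible function approximators, as in the paper's second algorithm) would change only the critic update; the actor update shape stays the same.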
| Author | Hosseinloo, Ashkan Haji; Dahleh, Munther A. |
| Copyright | 2021 Wiley Periodicals LLC; 2022 Wiley Periodicals LLC. |
| DOI | 10.1002/int.22709 |
| Discipline | Computer Science |
| EISSN | 1098-111X |
| EndPage | 4019 |
| Genre | article |
| ISSN | 0884-8173 |
| IsDoiOpenAccess | false |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 7 |
| Language | English |
| ORCID | 0000-0002-1167-1075 (Hosseinloo); 0000-0002-1470-2148 (Dahleh) |
| PageCount | 12 |
| PublicationDate | July 2022 |
| PublicationPlace | New York |
| PublicationTitle | International journal of intelligent systems |
| PublicationYear | 2022 |
| Publisher | John Wiley & Sons, Inc |
| StartPage | 4008 |
| SubjectTerms | Algorithms; average reward; deterministic policy; Intelligent systems; Markov analysis; Markov processes; policy gradient theorem; Preventive maintenance; reinforcement learning; SMDP; Theorems |
| Title | Deterministic policy gradient algorithms for semi‐Markov decision processes |
| URI | https://onlinelibrary.wiley.com/doi/abs/10.1002%2Fint.22709 https://www.proquest.com/docview/2669523005 |
| Volume | 37 |