Natural actor–critic algorithms
| Published in: | Automatica (Oxford), Vol. 45, No. 11, pp. 2471–2482 |
|---|---|
| Main Authors: | Bhatnagar, Shalabh; Sutton, Richard S.; Ghavamzadeh, Mohammad; Lee, Mark |
| Format: | Journal Article |
| Language: | English |
| Published: | Kidlington: Elsevier Ltd, 01.11.2009 |
| ISSN: | 0005-1098, 1873-2836 |
| Online Access: | https://inria.hal.science/hal-00840470 |
| Abstract | We present four new reinforcement learning algorithms based on actor–critic, natural-gradient and function-approximation ideas, and we provide their convergence proofs. Actor–critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function-approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of special interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor–critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients. Our results extend prior empirical studies of natural actor–critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. We present empirical results verifying the convergence of our algorithms. |
|---|---|
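For background, the "natural gradient" mentioned in the abstract is the ordinary policy gradient preconditioned by the inverse Fisher information matrix of the parameterized policy. The display below states this standard relationship from the natural-gradient literature (Amari 1998; Kakade 2002) as context; it is not quoted from the paper itself:

$$
\tilde{\nabla}_{\theta} J(\theta) \;=\; G(\theta)^{-1}\,\nabla_{\theta} J(\theta),
\qquad
G(\theta) \;=\; \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi_{\theta}}\!\left[ \nabla_{\theta}\log\pi_{\theta}(a \mid s)\, \nabla_{\theta}\log\pi_{\theta}(a \mid s)^{\top} \right].
$$

With a compatible linear approximation of the advantage function, the fitted weight vector coincides with this natural-gradient direction, which is why natural actor–critic updates can move the policy parameters directly along the learned advantage weights.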
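The abstract describes a critic that estimates value-function parameters by temporal difference learning on a fast timescale and an actor that performs (natural) gradient ascent on the policy parameters on a slower timescale. The sketch below is a minimal illustration of that two-timescale structure in the average-reward setting; the toy MDP, the softmax parameterization, and the step sizes are assumptions made for illustration and are not the algorithms analysed in the paper.

```python
import numpy as np

# Minimal two-timescale natural actor-critic sketch on a toy two-state MDP.
# The MDP, features, and step sizes are illustrative assumptions only.

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
n_params = n_states * n_actions

# Hypothetical MDP: P[s, a] is the next-state distribution, R[s, a] the mean reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

theta = np.zeros(n_params)   # actor: softmax policy parameters
v = np.zeros(n_states)       # critic: tabular state-value estimates
w = np.zeros(n_params)       # compatible-feature weights (natural-gradient estimate)
rho = 0.0                    # running estimate of the average reward

def policy(s):
    """Softmax action probabilities in state s."""
    prefs = theta.reshape(n_states, n_actions)[s]
    prefs = prefs - prefs.max()
    p = np.exp(prefs)
    return p / p.sum()

def compatible_features(s, a, pi):
    """grad_theta log pi(a|s) for the softmax parameterization."""
    psi = np.zeros((n_states, n_actions))
    psi[s] = -pi
    psi[s, a] += 1.0
    return psi.ravel()

s = 0
for t in range(100_000):
    alpha, beta = 0.05, 0.002      # fast critic / slow actor step sizes
    pi = policy(s)
    a = rng.choice(n_actions, p=pi)
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]

    # Critic: average-reward TD(0) error, then average-reward and value updates.
    delta = r - rho + v[s_next] - v[s]
    rho += alpha * (r - rho)
    v[s] += alpha * delta

    # Fit the advantage with compatible features; the weight vector w then
    # serves as the natural-gradient direction for the actor.
    psi = compatible_features(s, a, pi)
    w += alpha * (delta - psi @ w) * psi
    theta += beta * w              # actor: natural-gradient ascent (slow timescale)

    s = s_next

print("policy per state:", [np.round(policy(si), 3) for si in range(n_states)])
print("average reward estimate:", round(rho, 3))
```

The step-size constants stand in for the diminishing two-timescale schedules that the convergence analysis actually requires; in this sketch they simply keep the critic updates much faster than the actor updates.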
| Author | Bhatnagar, Shalabh; Sutton, Richard S.; Ghavamzadeh, Mohammad; Lee, Mark |
| Author details | Shalabh Bhatnagar (shalabh@csa.iisc.ernet.in), Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560 012, India; Richard S. Sutton (sutton@cs.ualberta.ca), The RLAI Laboratory, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada T6G 2E8; Mohammad Ghavamzadeh (mohammad.ghavamzadeh@inria.fr), INRIA Lille - Nord Europe, Team SequeL, France; Mark Lee (mlee@cs.ualberta.ca), The RLAI Laboratory, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada T6G 2E8 |
| BackLink | http://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=22121539 (view record in Pascal Francis); https://inria.hal.science/hal-00840470 (view record in HAL) |
| CODEN | ATCAA9 |
| ContentType | Journal Article |
| Copyright | 2009 Elsevier Ltd; 2009 INIST-CNRS; Distributed under a Creative Commons Attribution 4.0 International License |
| DOI | 10.1016/j.automatica.2009.07.008 |
| Discipline | Engineering; Applied Sciences; Computer Science |
| EISSN | 1873-2836 |
| EndPage | 2482 |
| ExternalDocumentID | oai:HAL:hal-00840470v1 22121539 10_1016_j_automatica_2009_07_008 S0005109809003549 |
| ISICitedReferencesCount | 410 |
| ISSN | 0005-1098 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 11 |
| Keywords | Two-timescale stochastic approximation; Temporal difference learning; Approximate dynamic programming; Policy-gradient methods; Actor–critic reinforcement learning algorithms; Function approximation; Natural gradient algorithms; Probabilistic approach; Reinforcement learning; Empirical method; Stochastic approximation; State space method; Parameterization; Variance; Interest; Gradient descent; Value function; Actor-critic reinforcement learning; Dynamic programming; Compatibility; Learning algorithm; Artificial intelligence; Gradient method |
| Language | English |
| License | https://www.elsevier.com/tdm/userlicense/1.0 CC BY 4.0 Distributed under a Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0 |
| OpenAccessLink | https://inria.hal.science/hal-00840470 |
| PQID | 1671301504 |
| PQPubID | 23500 |
| PageCount | 12 |
| PublicationCentury | 2000 |
| PublicationDate | 2009-11-01 |
| PublicationDateYYYYMMDD | 2009-11-01 |
| PublicationDecade | 2000 |
| PublicationPlace | Kidlington |
| PublicationTitle | Automatica (Oxford) |
| PublicationYear | 2009 |
| Publisher | Elsevier Ltd Elsevier |
| References | Abdulla & Bhatnagar (2007). Reinforcement learning based algorithms for average cost Markov decision processes. Discrete Event Dynamic Systems: Theory and Applications, 17(1), 23–52.
Abounadi, Bertsekas, & Borkar (2001). Learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization, 40, 681–698.
Aleksandrov, Sysoyev, & Shemeneva (1968). Stochastic optimization. Engineering Cybernetics, 5, 11–16.
Amari (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.
Bagnell, J., & Schneider, J. (2003). Covariant policy search.
Baird, L. (1993). Advantage updating. Technical report, Wright Laboratory, OH.
Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation.
Barto, Sutton, & Anderson (1983). Neuron-like elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man and Cybernetics, 13, 835–846.
Baxter & Bartlett (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 319–350.
Bellman & Dreyfus (1959). Functional approximations and dynamic programming. Mathematical Tables and Other Aids to Computation, 13, 247–251.
Benveniste, Metivier, & Priouret (1990). Adaptive algorithms and stochastic approximations.
Bertsekas (1999). Nonlinear programming.
Bertsekas & Tsitsiklis (1989). Parallel and distributed computation.
Bertsekas & Tsitsiklis (1996). Neuro-dynamic programming.
Bhatnagar (2005). Adaptive multivariate three-timescale stochastic approximation algorithms for simulation based optimization. ACM Transactions on Modeling and Computer Simulation, 15(1), 74–107.
Bhatnagar (2007). Adaptive Newton-based multivariate smoothed functional algorithms for simulation optimization. ACM Transactions on Modeling and Computer Simulation, 18(1), 2:1–2:35.
Bhatnagar & Kumar (2004). A simultaneous perturbation stochastic approximation based actor–critic algorithm for Markov decision processes. IEEE Transactions on Automatic Control, 49(4), 592–598.
Bhatnagar, Sutton, Ghavamzadeh, & Lee (2008). Incremental natural actor-critic algorithms. Advances in Neural Information Processing Systems, 20, 105–112.
Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., & Lee, M. (2009). Natural actor-critic algorithms. Technical report.
Borkar (1997). Stochastic approximation with two timescales. Systems and Control Letters, 29, 291–294.
Borkar, V. S. (2008). Reinforcement learning – a bridge between numerical methods and Monte-Carlo.
Borkar & Meyn (2000). The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2), 447–469.
Boyan, J. (1999). Least-squares temporal difference learning.
Boyan & Moore (1995). Generalization in reinforcement learning: Safely approximating the value function. Advances in Neural Information Processing Systems, 7, 369–376.
Bradtke & Barto (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22, 33–57.
Brandiere (1998). Some pathological traps for stochastic approximation. SIAM Journal on Control and Optimization, 36, 1293–1314.
Cao & Chen (1997). Perturbation realization, potentials and sensitivity analysis of Markov processes. IEEE Transactions on Automatic Control, 42, 1382–1393.
Crites & Barto (1998). Elevator group control using multiple reinforcement learning agents. Machine Learning, 33, 235–262.
Farahmand, Ghavamzadeh, Szepesvári, & Mannor (2009). Regularized policy iteration. Advances in Neural Information Processing Systems, 21, 441–448.
Ghavamzadeh & Engel (2007). Bayesian policy gradient algorithms. Advances in Neural Information Processing Systems, 19, 457–464.
Ghavamzadeh, M., & Engel, Y. (2007). Bayesian actor-critic algorithms.
Ghavamzadeh, M., & Mahadevan, S. (2003). Hierarchical policy gradient algorithms.
Glynn (1990). Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33, 75–84.
Gordon, G. (1995). Stable function approximation in dynamic programming.
Greensmith, Bartlett, & Baxter (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5, 1471–1530.
Hirsch (1989). Convergent activation dynamics in continuous time networks. Neural Networks, 2, 331–349.
Kakade (2002). A natural policy gradient. Advances in Neural Information Processing Systems, 14.
Kohl, N., & Stone, P. (2004). Policy gradient reinforcement learning for fast quadrupedal locomotion.
Konda & Borkar (1999). Actor–critic like learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization, 38(1), 94–123.
Konda & Tsitsiklis (2003). On actor–critic algorithms. SIAM Journal on Control and Optimization, 42(4), 1143–1166.
Kushner & Clark (1978). Stochastic approximation methods for constrained and unconstrained systems.
Kushner & Yin (1997). Stochastic approximation algorithms and applications.
Lagoudakis & Parr (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4, 1107–1149.
Marbach & Tsitsiklis (2001). Simulation-based optimization of Markov reward processes. IEEE Transactions on Automatic Control, 46, 191–209.
Meyn (2007). Control techniques for complex networks.
Ng, A., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., et al. (2004). Inverted autonomous helicopter flight via reinforcement learning.
Pemantle (1990). Nonconvergence to unstable points in urn models and stochastic approximations. Annals of Probability, 18, 698–712.
Peters & Schaal (2008). Natural actor-critic. Neurocomputing, 71(7–9), 1180–1190.
Peters, J., Vijayakumar, S., & Schaal, S. (2003). Reinforcement learning for humanoid robotics.
Puterman (1994). Markov decision processes: Discrete stochastic dynamic programming.
Richter, Aberdeen, & Yu (2007). Natural actor-critic for road traffic optimization. Advances in Neural Information Processing Systems, 19, 1169–1176.
Rust (1996). Numerical dynamic programming in economics. In Handbook of computational economics (pp. 614–722).
Sutton (1988). Learning to predict by the method of temporal differences. Machine Learning, 3, 9–44.
Sutton (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. Advances in Neural Information Processing Systems, 8, 1038–1044.
Sutton, R. S. (1984). Temporal credit assignment in reinforcement learning. Amherst: University of Massachusetts.
Sutton & Barto (1998). Reinforcement learning: An introduction.
Sutton, McAllester, Singh, & Mansour (2000). Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1057–1063.
Tadic (2001). On the convergence of temporal difference learning with linear function approximation. Machine Learning, 42(3), 241–267.
Tesauro (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38, 58–68.
Tsitsiklis (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16, 185–202.
Tsitsiklis & Van Roy (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 674–690.
Tsitsiklis & Van Roy (1999). Average cost temporal-difference learning. Automatica, 35, 1799–1808.
White (1993). A survey of applications of Markov decision processes. Journal of the Operational Research Society, 44, 1073–1096.
Widrow & Stearns (1985). Adaptive signal processing.
Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256. |
| StartPage | 2471 |
| SubjectTerms | Actor–critic reinforcement learning algorithms; Algorithms; Applied sciences; Approximate dynamic programming; Artificial intelligence; Cognitive science; Computer science; Computer science, control theory, systems; Convergence; Exact sciences and technology; Function approximation; Learning; Natural gradient; Parametrization; Policies; Policy-gradient methods; Reinforcement; Temporal difference learning; Temporal logic; Two-timescale stochastic approximation; Variance |
| Title | Natural actor–critic algorithms |
| URI | https://dx.doi.org/10.1016/j.automatica.2009.07.008 https://www.proquest.com/docview/1671301504 https://inria.hal.science/hal-00840470 |
| Volume | 45 |