Natural actor–critic algorithms

Bibliographic Details
Published in:Automatica (Oxford) Vol. 45; no. 11; pp. 2471 - 2482
Main Authors: Bhatnagar, Shalabh, Sutton, Richard S., Ghavamzadeh, Mohammad, Lee, Mark
Format: Journal Article
Language:English
Published: Kidlington: Elsevier Ltd, 01.11.2009
Elsevier
Subjects:
ISSN:0005-1098, 1873-2836
Abstract We present four new reinforcement learning algorithms based on actor–critic, natural-gradient and function-approximation ideas, and we provide their convergence proofs. Actor–critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function-approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of special interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor–critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients. Our results extend prior empirical studies of natural actor–critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms.
AbstractList We present four new reinforcement learning algorithms based on actor-critic, function approximation, and natural gradient ideas, and we provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of special interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients. Our results extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. We present empirical results verifying the convergence of our algorithms.
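The abstract describes the general update scheme: a critic estimates value-function parameters with temporal difference learning, while an actor adjusts policy parameters along an estimated natural gradient on a slower timescale. The Python sketch below illustrates that generic scheme for the average-reward setting, assuming a linear critic and compatible features psi(s, a); the function names, arguments, and fixed step sizes are illustrative assumptions, and this is not a reproduction of any of the paper's four specific algorithms.

import numpy as np

# Minimal sketch (illustrative only) of one incremental natural actor-critic
# update in the average-reward setting. Assumes a linear critic
# v(s) = v_params . phi(s) and compatible features psi(s, a); phi and psi are
# assumed to return NumPy arrays, and alpha, beta, xi are placeholder step sizes.
def natural_actor_critic_step(s, a, r, s_next, phi, psi,
                              theta, v_params, w, avg_reward,
                              alpha=0.05, beta=0.01, xi=0.1):
    # Running estimate of the average reward.
    avg_reward += xi * (r - avg_reward)
    # Temporal-difference error from the current value-function estimate.
    delta = (r - avg_reward
             + np.dot(v_params, phi(s_next))
             - np.dot(v_params, phi(s)))
    # Critic (faster timescale): TD update of the value-function parameters.
    v_params = v_params + alpha * delta * phi(s)
    # Least-squares fit of delta onto the compatible features; the weight
    # vector w serves as the natural-gradient estimate for the policy.
    psi_sa = psi(s, a)
    w = w + alpha * (delta - np.dot(psi_sa, w)) * psi_sa
    # Actor (slower timescale): move the policy parameters along w.
    theta = theta + beta * w
    return theta, v_params, w, avg_reward

In the paper's two-timescale analysis the critic step sizes decrease faster than the actor's; the fixed alpha, beta, and xi above merely stand in for such schedules.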
Author Bhatnagar, Shalabh
Sutton, Richard S.
Lee, Mark
Ghavamzadeh, Mohammad
Author_xml – sequence: 1
  givenname: Shalabh
  surname: Bhatnagar
  fullname: Bhatnagar, Shalabh
  email: shalabh@csa.iisc.ernet.in
  organization: Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560 012, India
– sequence: 2
  givenname: Richard S.
  surname: Sutton
  fullname: Sutton, Richard S.
  email: sutton@cs.ualberta.ca
  organization: The RLAI Laboratory, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada T6G 2E8
– sequence: 3
  givenname: Mohammad
  surname: Ghavamzadeh
  fullname: Ghavamzadeh, Mohammad
  email: mohammad.ghavamzadeh@inria.fr
  organization: INRIA Lille - Nord Europe, Team SequeL, France
– sequence: 4
  givenname: Mark
  surname: Lee
  fullname: Lee, Mark
  email: mlee@cs.ualberta.ca
  organization: The RLAI Laboratory, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada T6G 2E8
BackLink http://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=22121539 (View record in Pascal Francis)
https://inria.hal.science/hal-00840470 (View record in HAL)
CODEN ATCAA9
CitedBy_id crossref_primary_10_1016_j_neunet_2016_08_003
crossref_primary_10_1109_TAC_2019_2953089
crossref_primary_10_1287_opre_2021_2151
crossref_primary_10_1371_journal_pone_0158722
crossref_primary_10_1016_j_jocs_2024_102421
crossref_primary_10_1109_JIOT_2018_2878435
crossref_primary_10_1016_j_neunet_2009_05_011
crossref_primary_10_1007_s10994_018_5697_1
crossref_primary_10_1080_01691864_2015_1070748
crossref_primary_10_1016_j_segan_2024_101277
crossref_primary_10_3389_fnbot_2017_00058
crossref_primary_10_1080_24751839_2023_2182174
crossref_primary_10_1016_j_visres_2020_07_009
crossref_primary_10_1109_JIOT_2019_2903347
crossref_primary_10_3390_drones9010026
crossref_primary_10_1088_1757_899X_312_1_012018
crossref_primary_10_26599_JICV_2023_9210021
crossref_primary_10_1109_TGCN_2021_3104801
crossref_primary_10_1177_1059712316650265
crossref_primary_10_1109_TCOMM_2022_3220870
crossref_primary_10_1016_j_neunet_2014_01_002
crossref_primary_10_3390_jmse12010063
crossref_primary_10_1016_j_jfranklin_2024_107357
crossref_primary_10_1109_TSG_2017_2667599
crossref_primary_10_1155_2022_3679145
crossref_primary_10_1016_j_automatica_2025_112458
crossref_primary_10_1016_j_future_2023_09_018
crossref_primary_10_1016_j_eswa_2021_115127
crossref_primary_10_1016_j_conengprac_2021_104758
crossref_primary_10_1016_j_trc_2020_102949
crossref_primary_10_1145_3338123
crossref_primary_10_1007_s11704_017_6222_6
crossref_primary_10_1049_iet_its_2019_0317
crossref_primary_10_1016_j_arcontrol_2012_03_004
crossref_primary_10_1016_j_future_2022_11_022
crossref_primary_10_1109_TITS_2023_3335132
crossref_primary_10_1007_s10957_012_9989_5
crossref_primary_10_1109_TAC_2024_3403693
crossref_primary_10_1587_transinf_2017EDP7363
crossref_primary_10_1016_j_neucom_2023_126381
crossref_primary_10_1016_j_ins_2013_08_037
crossref_primary_10_1109_TAC_2023_3243165
crossref_primary_10_1109_TAC_2022_3163085
crossref_primary_10_1007_s10994_012_5313_8
crossref_primary_10_1002_acs_2344
crossref_primary_10_1109_LCSYS_2022_3172242
crossref_primary_10_1007_s00521_012_0865_x
crossref_primary_10_1016_j_knosys_2019_105392
crossref_primary_10_1017_S0269964821000206
crossref_primary_10_1109_TNNLS_2023_3348422
crossref_primary_10_1109_ACCESS_2021_3099071
crossref_primary_10_1109_TITS_2021_3090974
crossref_primary_10_1016_j_robot_2014_11_006
crossref_primary_10_3390_electronics13030484
crossref_primary_10_1016_j_cirpj_2022_11_003
crossref_primary_10_1016_j_eswa_2019_06_066
crossref_primary_10_1007_s12555_020_0923_6
crossref_primary_10_1016_j_enconman_2023_116678
crossref_primary_10_1007_s10626_015_0216_z
crossref_primary_10_1007_s13235_022_00449_9
crossref_primary_10_1016_j_procs_2012_09_130
crossref_primary_10_1016_j_automatica_2025_112395
crossref_primary_10_1016_j_apenergy_2021_118078
crossref_primary_10_1016_j_ins_2011_01_001
crossref_primary_10_1038_s41562_023_01543_7
crossref_primary_10_3390_systems12020038
crossref_primary_10_1109_TII_2012_2209660
crossref_primary_10_1007_s10994_023_06303_2
crossref_primary_10_1109_THMS_2019_2912447
crossref_primary_10_1109_TAC_2016_2616384
crossref_primary_10_1109_TITS_2010_2091408
crossref_primary_10_1016_j_neucom_2022_05_004
crossref_primary_10_1007_s10586_022_03742_9
crossref_primary_10_1109_TCOMM_2015_2415777
crossref_primary_10_1016_j_ejcon_2023_100853
crossref_primary_10_1016_j_automatica_2016_12_014
crossref_primary_10_1109_TITS_2023_3303953
crossref_primary_10_1103_PhysRevE_106_025315
crossref_primary_10_1049_cit2_12202
crossref_primary_10_1145_3320496_3320500
crossref_primary_10_1016_j_sysconle_2016_02_020
crossref_primary_10_1109_TSP_2023_3268475
crossref_primary_10_1109_TEVC_2016_2560139
crossref_primary_10_1007_s40430_024_05134_z
crossref_primary_10_3390_s24103254
crossref_primary_10_1109_TCOMM_2024_3369694
crossref_primary_10_1016_j_knosys_2018_05_033
crossref_primary_10_1109_LCSYS_2023_3288931
crossref_primary_10_1109_TEVC_2015_2415464
crossref_primary_10_1109_TVT_2020_2980861
crossref_primary_10_1088_1757_899X_476_1_012022
crossref_primary_10_1109_TNNLS_2024_3378913
crossref_primary_10_1007_s40305_024_00549_w
crossref_primary_10_1109_TWC_2017_2769644
crossref_primary_10_1016_j_asoc_2025_113463
crossref_primary_10_1080_01691864_2020_1757507
crossref_primary_10_1016_j_ifacol_2019_12_182
crossref_primary_10_1109_TNNLS_2018_2808203
crossref_primary_10_1109_TNNLS_2020_2981377
crossref_primary_10_1109_TNNLS_2023_3317628
crossref_primary_10_3390_drones7070418
crossref_primary_10_1007_s40815_020_00868_z
crossref_primary_10_1109_TIE_2017_2708002
crossref_primary_10_20965_jrm_2010_p0542
crossref_primary_10_1016_j_apenergy_2022_120599
crossref_primary_10_1007_s10994_016_5569_5
crossref_primary_10_1088_2634_4386_ac84fd
crossref_primary_10_1016_j_ejor_2025_08_038
crossref_primary_10_1016_j_energy_2024_132636
crossref_primary_10_1016_j_automatica_2009_07_008
crossref_primary_10_1109_TII_2022_3177415
crossref_primary_10_3390_sym13081335
crossref_primary_10_1137_21M1402303
crossref_primary_10_1049_cth2_12211
crossref_primary_10_1007_s00530_022_00922_w
crossref_primary_10_3390_fi14090256
crossref_primary_10_1016_j_jnca_2023_103639
crossref_primary_10_1016_j_cap_2024_03_003
crossref_primary_10_1016_j_engappai_2020_103525
crossref_primary_10_1371_journal_pcbi_1008973
crossref_primary_10_1007_s13369_023_08245_2
crossref_primary_10_1137_19M1288012
crossref_primary_10_1016_j_arcontrol_2018_09_005
crossref_primary_10_1016_j_eswa_2025_127540
crossref_primary_10_1145_2868723
crossref_primary_10_1109_TNSE_2020_2978856
crossref_primary_10_1007_s12555_021_0734_4
crossref_primary_10_1109_TIE_2022_3189103
crossref_primary_10_1016_j_aei_2023_101889
crossref_primary_10_1137_22M1540156
crossref_primary_10_1016_j_neunet_2012_11_007
crossref_primary_10_1016_j_jpdc_2024_104880
crossref_primary_10_1109_TVCG_2020_3030467
crossref_primary_10_1017_S0269888921000023
crossref_primary_10_1109_ACCESS_2020_3036938
crossref_primary_10_1007_s11424_025_4426_7
crossref_primary_10_1109_MCOM_001_2200223
crossref_primary_10_1109_TCYB_2014_2311578
crossref_primary_10_1007_s11276_014_0762_6
crossref_primary_10_1016_j_patcog_2017_07_031
crossref_primary_10_1016_j_compeleceng_2024_109603
crossref_primary_10_1007_s10489_012_0412_6
crossref_primary_10_1016_j_sysconle_2010_08_013
crossref_primary_10_1109_ACCESS_2022_3213649
crossref_primary_10_1109_TNET_2018_2818468
crossref_primary_10_1016_j_adhoc_2025_103854
crossref_primary_10_1016_j_neunet_2018_10_007
crossref_primary_10_1109_TSMC_2020_3041775
crossref_primary_10_1109_TAI_2024_3452678
crossref_primary_10_1109_TAC_2022_3190032
crossref_primary_10_1038_s41598_025_96201_5
crossref_primary_10_1109_TNNLS_2018_2820019
crossref_primary_10_1007_s11771_022_5193_4
crossref_primary_10_1016_j_cie_2021_107621
crossref_primary_10_1103_PhysRevResearch_2_023230
crossref_primary_10_1007_s00500_023_07817_6
crossref_primary_10_3389_fnbot_2018_00066
crossref_primary_10_1109_TSMCC_2012_2218595
crossref_primary_10_1016_j_ifacol_2020_12_2021
crossref_primary_10_1155_2016_4824072
crossref_primary_10_1016_j_ijleo_2018_09_160
crossref_primary_10_1016_j_neunet_2023_05_018
crossref_primary_10_3390_drones9070484
crossref_primary_10_1016_j_apenergy_2020_115256
crossref_primary_10_1162_neco_a_01004
crossref_primary_10_1109_TNNLS_2018_2806087
crossref_primary_10_1016_j_robot_2018_08_009
crossref_primary_10_1109_TCYB_2015_2478857
crossref_primary_10_1007_s11432_021_3775_4
crossref_primary_10_1016_j_neunet_2023_10_023
crossref_primary_10_1109_TAI_2024_3379109
crossref_primary_10_1016_j_jtbi_2023_111433
crossref_primary_10_1007_s10208_025_09729_3
crossref_primary_10_1016_j_neucom_2011_11_034
crossref_primary_10_1002_aisy_202300692
crossref_primary_10_1007_s10898_018_0698_y
crossref_primary_10_1007_s41870_022_01137_y
crossref_primary_10_1109_TCST_2013_2246866
crossref_primary_10_1109_TITS_2021_3066366
crossref_primary_10_1177_0959651820937085
crossref_primary_10_1016_j_adhoc_2023_103193
crossref_primary_10_1016_j_compchemeng_2021_107382
crossref_primary_10_1073_pnas_1908100117
crossref_primary_10_1002_jcc_27322
crossref_primary_10_3389_fnbot_2019_00049
crossref_primary_10_1049_cit2_12015
crossref_primary_10_1109_TITS_2024_3397700
crossref_primary_10_1109_TAC_2016_2644871
crossref_primary_10_1016_j_sysconle_2011_04_002
crossref_primary_10_1016_j_sysconle_2022_105214
crossref_primary_10_1007_s13198_021_01147_2
crossref_primary_10_1631_FITEE_1900661
crossref_primary_10_1007_s10845_024_02454_8
crossref_primary_10_1016_j_automatica_2015_01_006
crossref_primary_10_1109_TITS_2019_2960872
crossref_primary_10_1109_TSMC_2020_2966631
crossref_primary_10_1007_s10462_021_10061_9
crossref_primary_10_1088_1367_2630_abd7bd
crossref_primary_10_1007_s00521_022_07628_0
crossref_primary_10_1016_j_eswa_2023_120495
Cites_doi 10.1007/BF00115009
10.1145/1044322.1044326
10.1016/S1574-0021(96)01016-7
10.1142/9789814273633_0004
10.1109/9.905687
10.1016/B978-1-55860-377-6.50040-2
10.1613/jair.806
10.1137/S0363012997331639
10.1145/203330.203343
10.1016/j.neucom.2007.11.026
10.1137/S036301299731669X
10.2307/2002797
10.1145/1315575.1315577
10.1007/BF00114723
10.1214/aop/1176990853
10.1109/9.580874
10.1109/ROBOT.2004.1307456
10.1016/B978-1-55860-377-6.50013-X
10.1137/S036301299630759X
10.1016/j.automatica.2009.07.008
10.1023/A:1007609817671
10.1162/089976698300017746
10.1145/84537.84552
10.1007/s10626-006-0003-y
10.1016/0893-6080(89)90018-X
10.1137/S0363012901385691
10.1137/S0363012999361974
10.1057/jors.1993.181
10.1109/TAC.2004.825622
10.1016/S0167-6911(97)90015-3
10.1023/A:1007518724497
10.1145/1273496.1273534
10.1016/S0005-1098(99)00099-0
10.21236/ADA280862
10.1109/9.633827
10.1007/BF00992696
10.1007/BF00993306
ContentType Journal Article
Copyright 2009 Elsevier Ltd
2009 INIST-CNRS
Distributed under a Creative Commons Attribution 4.0 International License
Copyright_xml – notice: 2009 Elsevier Ltd
– notice: 2009 INIST-CNRS
– notice: Distributed under a Creative Commons Attribution 4.0 International License
DBID AAYXX
CITATION
IQODW
7SC
7SP
8FD
JQ2
L7M
L~C
L~D
1XC
VOOES
DOI 10.1016/j.automatica.2009.07.008
DatabaseName CrossRef
Pascal-Francis
Computer and Information Systems Abstracts
Electronics & Communications Abstracts
Technology Research Database
ProQuest Computer Science Collection
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
Hyper Article en Ligne (HAL)
Hyper Article en Ligne (HAL) (Open Access)
DatabaseTitle CrossRef
Technology Research Database
Computer and Information Systems Abstracts – Academic
Electronics & Communications Abstracts
ProQuest Computer Science Collection
Computer and Information Systems Abstracts
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts Professional
DatabaseTitleList

Technology Research Database
DeliveryMethod fulltext_linktorsrc
Discipline Engineering
Applied Sciences
Computer Science
EISSN 1873-2836
EndPage 2482
ExternalDocumentID oai:HAL:hal-00840470v1
22121539
10_1016_j_automatica_2009_07_008
S0005109809003549
ISICitedReferencesCount 410
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000271877200001&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
ISSN 0005-1098
IngestDate Wed Oct 29 06:32:34 EDT 2025
Sun Sep 28 10:11:22 EDT 2025
Mon Jul 21 09:11:51 EDT 2025
Sat Nov 29 01:49:05 EST 2025
Tue Nov 18 21:42:04 EST 2025
Fri Feb 23 02:14:11 EST 2024
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 11
Keywords Two-timescale stochastic approximation
Temporal difference learning
Approximate dynamic programming
Policy-gradient methods
Actor–critic reinforcement learning algorithms
Function approximation
Natural gradient
algorithms
Probabilistic approach
Reinforcement learning
Empirical method
Stochastic approximation
State space method
Parameterization
Variance
Interest
Gradient descent
Value function
Actor-critic reinforcement learning
Dynamic programming
Compatibility
Learning algorithm
Artificial intelligence
Gradient method
Language English
License https://www.elsevier.com/tdm/userlicense/1.0
CC BY 4.0
Distributed under a Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0
LinkModel OpenURL
Notes ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 23
OpenAccessLink https://inria.hal.science/hal-00840470
PQID 1671301504
PQPubID 23500
PageCount 12
ParticipantIDs hal_primary_oai_HAL_hal_00840470v1
proquest_miscellaneous_1671301504
pascalfrancis_primary_22121539
crossref_citationtrail_10_1016_j_automatica_2009_07_008
crossref_primary_10_1016_j_automatica_2009_07_008
elsevier_sciencedirect_doi_10_1016_j_automatica_2009_07_008
PublicationCentury 2000
PublicationDate 2009-11-01
PublicationDateYYYYMMDD 2009-11-01
PublicationDate_xml – month: 11
  year: 2009
  text: 2009-11-01
  day: 01
PublicationDecade 2000
PublicationPlace Kidlington
PublicationPlace_xml – name: Kidlington
PublicationTitle Automatica (Oxford)
PublicationYear 2009
Publisher Elsevier Ltd
Elsevier
Publisher_xml – name: Elsevier Ltd
– name: Elsevier
References pp. 30–37
Sutton, R.S. (1984). Temporal credit assignment in reinforcement learning.
Farahmand, Ghavamzadeh, Szepesvári, Mannor (b29) 2009; 21
Tesauro (b59) 1995; 38
Borkar, V.S. (2008). Reinforcement learning–a bridge between numerical methods and Monte-Carlo
Bellman, Dreyfus (b10) 1959; 13
Brandiere (b25) 1998; 36
Greensmith, Bartlett, Baxter (b35) 2004; 5
Widrow, Stearns (b64) 1985
Peters, Schaal (b49) 2008; 71
Benveniste, Metivier, Priouret (b11) 1990
Baird, L. (1993). Advantage updating.
pp. 49–56
Pemantle (b47) 1990; 18
Aleksandrov, Sysoyev, Shemeneva (b3) 1968; 5
Bertsekas (b12) 1999
Gordon, G. (1995). Stable function approximation in dynamic programming, In
Richter, Aberdeen, Yu (b51) 2007; 19
Rust (b52) 1996
Barto, Sutton, Anderson (b8) 1983; 13
Konda, Tsitsiklis (b40) 2003; 42
Bhatnagar, Sutton, Ghavamzadeh, Lee (b19) 2008; 20
Peters, J., Vijayakumar, S., & Schaal, S. (2003). Reinforcement learning for humanoid robotics, In
Ghavamzadeh, M., & Engel, Y. (2007b). Bayesian actor-critic algorithms, In
pp. 226–233
pp. 2619–2624
Konda, Borkar (b39) 1999; 38
Ng, A., Coates, A., Diel, M., Ganapathi, V., Schulte, J., & Tse, B. et al. (2004). Inverted autonomous helicopter flight via reinforcement learning, In
Kohl, N., & Stone, P. (2004). Policy gradient reinforcement learning for fast quadrupedal locomotion, In
Tsitsiklis (b60) 1994; 16
Meyn (b45) 2007
Abdulla, Bhatnagar (b1) 2007; 17
White (b63) 1993; 44
Baxter, Bartlett (b9) 2001; 15
Puterman (b50) 1994
Bhatnagar (b17) 2007; 18
Lagoudakis, Parr (b43) 2003; 4
Bhatnagar, Kumar (b15) 2004; 49
Sutton (b54) 1988; 3
Tsitsiklis, Van Roy (b61) 1997; 42
Sutton (b55) 1996; 8
Bertsekas, Tsitsiklis (b13) 1989
Abounadi, Bertsekas, Borkar (b2) 2001; 40
Ghavamzadeh, M., & Mahadevan, S. (2003). Hierarchical policy gradient algorithms, In
Sutton, McAllester, Singh, Mansour (b56) 2000; 12
Bradtke, Barto (b26) 1996; 22
Amherst: University of Massachusetts
Kushner, Clark (b41) 1978
Bertsekas, Tsitsiklis (b14) 1996
Amari (b4) 1998; 10
Kakade (b37) 2002; 14
Sutton, Barto (b57) 1998
Tadic (b58) 2001; 42
Tsitsiklis, Van Roy (b62) 1999; 35
]
Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In
Bagnell, J., & Schneider, J. (2003). Covariant policy search. In
pp. 1019–1024
pp. 261–268
Borkar (b20) 1997; 29
Cao, Chen (b27) 1997; 42
Glynn (b33) 1990; 33
Boyan, J. (1999). Least-squares temporal difference learning. In
Borkar, Meyn (b22) 2000; 38
pp. 297–304
Wright Laboratory, OH
Hirsch (b36) 1989; 2
Marbach, Tsitsiklis (b44) 2001; 46
Bhatnagar (b16) 2005; 15
Boyan, Moore (b24) 1995; 7
Crites, Barto (b28) 1998; 33
Williams (b65) 1992; 8
Ghavamzadeh, Engel (b31) 2007; 19
Kushner, Yin (b42) 1997
Bhatnagar, S., Sutton, R.S., Ghavamzadeh, M., & Lee, M. (2009). Natural actor-critic algorithms. Technical Report
Lagoudakis (10.1016/j.automatica.2009.07.008_b43) 2003; 4
Konda (10.1016/j.automatica.2009.07.008_b39) 1999; 38
White (10.1016/j.automatica.2009.07.008_b63) 1993; 44
Bertsekas (10.1016/j.automatica.2009.07.008_b12) 1999
Kakade (10.1016/j.automatica.2009.07.008_b37) 2002; 14
10.1016/j.automatica.2009.07.008_b34
Tsitsiklis (10.1016/j.automatica.2009.07.008_b61) 1997; 42
Glynn (10.1016/j.automatica.2009.07.008_b33) 1990; 33
10.1016/j.automatica.2009.07.008_b30
Pemantle (10.1016/j.automatica.2009.07.008_b47) 1990; 18
Crites (10.1016/j.automatica.2009.07.008_b28) 1998; 33
10.1016/j.automatica.2009.07.008_b32
Boyan (10.1016/j.automatica.2009.07.008_b24) 1995; 7
Peters (10.1016/j.automatica.2009.07.008_b49) 2008; 71
10.1016/j.automatica.2009.07.008_b38
Benveniste (10.1016/j.automatica.2009.07.008_b11) 1990
Tesauro (10.1016/j.automatica.2009.07.008_b59) 1995; 38
Rust (10.1016/j.automatica.2009.07.008_b52) 1996
Hirsch (10.1016/j.automatica.2009.07.008_b36) 1989; 2
Borkar (10.1016/j.automatica.2009.07.008_b22) 2000; 38
Bradtke (10.1016/j.automatica.2009.07.008_b26) 1996; 22
Marbach (10.1016/j.automatica.2009.07.008_b44) 2001; 46
Amari (10.1016/j.automatica.2009.07.008_b4) 1998; 10
Brandiere (10.1016/j.automatica.2009.07.008_b25) 1998; 36
Konda (10.1016/j.automatica.2009.07.008_b40) 2003; 42
10.1016/j.automatica.2009.07.008_b23
Kushner (10.1016/j.automatica.2009.07.008_b42) 1997
Bellman (10.1016/j.automatica.2009.07.008_b10) 1959; 13
Bertsekas (10.1016/j.automatica.2009.07.008_b13) 1989
10.1016/j.automatica.2009.07.008_b21
Cao (10.1016/j.automatica.2009.07.008_b27) 1997; 42
Borkar (10.1016/j.automatica.2009.07.008_b20) 1997; 29
Meyn (10.1016/j.automatica.2009.07.008_b45) 2007
Bhatnagar (10.1016/j.automatica.2009.07.008_b16) 2005; 15
Kushner (10.1016/j.automatica.2009.07.008_b41) 1978
Farahmand (10.1016/j.automatica.2009.07.008_b29) 2009; 21
Bhatnagar (10.1016/j.automatica.2009.07.008_b19) 2008; 20
Sutton (10.1016/j.automatica.2009.07.008_b56) 2000; 12
10.1016/j.automatica.2009.07.008_b53
Aleksandrov (10.1016/j.automatica.2009.07.008_b3) 1968; 5
Barto (10.1016/j.automatica.2009.07.008_b8) 1983; 13
Ghavamzadeh (10.1016/j.automatica.2009.07.008_b31) 2007; 19
Tadic (10.1016/j.automatica.2009.07.008_b58) 2001; 42
Sutton (10.1016/j.automatica.2009.07.008_b55) 1996; 8
Puterman (10.1016/j.automatica.2009.07.008_b50) 1994
Tsitsiklis (10.1016/j.automatica.2009.07.008_b60) 1994; 16
Bhatnagar (10.1016/j.automatica.2009.07.008_b15) 2004; 49
10.1016/j.automatica.2009.07.008_b18
Sutton (10.1016/j.automatica.2009.07.008_b54) 1988; 3
Abounadi (10.1016/j.automatica.2009.07.008_b2) 2001; 40
Widrow (10.1016/j.automatica.2009.07.008_b64) 1985
10.1016/j.automatica.2009.07.008_b46
10.1016/j.automatica.2009.07.008_b5
10.1016/j.automatica.2009.07.008_b48
10.1016/j.automatica.2009.07.008_b6
Bhatnagar (10.1016/j.automatica.2009.07.008_b17) 2007; 18
Sutton (10.1016/j.automatica.2009.07.008_b57) 1998
10.1016/j.automatica.2009.07.008_b7
Richter (10.1016/j.automatica.2009.07.008_b51) 2007; 19
Tsitsiklis (10.1016/j.automatica.2009.07.008_b62) 1999; 35
Abdulla (10.1016/j.automatica.2009.07.008_b1) 2007; 17
Baxter (10.1016/j.automatica.2009.07.008_b9) 2001; 15
Bertsekas (10.1016/j.automatica.2009.07.008_b14) 1996
Williams (10.1016/j.automatica.2009.07.008_b65) 1992; 8
Greensmith (10.1016/j.automatica.2009.07.008_b35) 2004; 5
References_xml – volume: 20
  start-page: 105
  year: 2008
  end-page: 112
  ident: b19
  article-title: Incremental natural actor-critic algorithms
  publication-title: Advances in Neural Information Processing Systems
– reference: (pp. 49–56)
– volume: 49
  start-page: 592
  year: 2004
  end-page: 598
  ident: b15
  article-title: A simultaneous perturbation stochastic approximation based actor–critic algorithm for Markov decision processes
  publication-title: IEEE Transactions on Automatic Control
– volume: 15
  start-page: 74
  year: 2005
  end-page: 107
  ident: b16
  article-title: Adaptive multivariate three-timescale stochastic approximation algorithms for simulation based optimization
  publication-title: ACM Transactions on Modeling and Computer Simulation
– volume: 38
  start-page: 447
  year: 2000
  end-page: 469
  ident: b22
  article-title: The O.D.E. method for convergence of stochastic approximation and reinforcement learning
  publication-title: SIAM Journal on Control and Optimization
– volume: 40
  start-page: 681
  year: 2001
  end-page: 698
  ident: b2
  article-title: Learning algorithms for Markov decision processes
  publication-title: SIAM Journal on Control and Optimization
– volume: 18
  start-page: 698
  year: 1990
  end-page: 712
  ident: b47
  article-title: Nonconvergence to unstable points in urn models and stochastic approximations
  publication-title: Annals of Probability
– year: 1978
  ident: b41
  article-title: Stochastic approximation methods for constrained and unconstrained systems
– reference: (pp. 261–268)
– volume: 14
  year: 2002
  ident: b37
  article-title: A natural policy gradient
  publication-title: Advances in Neural Information Processing Systems
– year: 2007
  ident: b45
  article-title: Control techniques for complex networks
– volume: 36
  start-page: 1293
  year: 1998
  end-page: 1314
  ident: b25
  article-title: Some pathological traps for stochastic approximation
  publication-title: SIAM Journal on Control and Optimization
– volume: 38
  start-page: 94
  year: 1999
  end-page: 123
  ident: b39
  article-title: Actor–critic like learning algorithms for Markov decision processes
  publication-title: SIAM Journal on Control and Optimization
– volume: 42
  start-page: 1143
  year: 2003
  end-page: 1166
  ident: b40
  article-title: On actor–critic algorithms
  publication-title: SIAM Journal on Control and Optimization
– reference: [
– volume: 10
  start-page: 251
  year: 1998
  end-page: 276
  ident: b4
  article-title: Natural gradient works efficiently in learning
  publication-title: Neural Computation
– volume: 8
  start-page: 1038
  year: 1996
  end-page: 1044
  ident: b55
  article-title: Generalization in reinforcement learning: Successful examples using sparse coarse coding
  publication-title: Advances in Neural Information Processing Systems
– volume: 44
  start-page: 1073
  year: 1993
  end-page: 1096
  ident: b63
  article-title: A survey of applications of Markov decision processes
  publication-title: Journal of the Operational Research Society
– reference: (pp. 30–37)
– volume: 19
  start-page: 1169
  year: 2007
  end-page: 1176
  ident: b51
  article-title: Natural actor-critic for road traffic optimization
  publication-title: Advances in Neural Information Processing Systems
– reference: Boyan, J. (1999). Least-squares temporal difference learning. In
– volume: 17
  start-page: 23
  year: 2007
  end-page: 52
  ident: b1
  article-title: Reinforcement learning based algorithms for average cost Markov decision processes
  publication-title: Discrete Event Dynamic Systems: Theory and Applications
– volume: 15
  start-page: 319
  year: 2001
  end-page: 350
  ident: b9
  article-title: Infinite-horizon policy-gradient estimation
  publication-title: Journal of Artificial Intelligence Research
– reference: Ghavamzadeh, M., & Mahadevan, S. (2003). Hierarchical policy gradient algorithms, In
– volume: 5
  start-page: 1471
  year: 2004
  end-page: 1530
  ident: b35
  article-title: Variance reduction techniques for gradient estimates in reinforcement learning
  publication-title: Journal of Machine Learning Research
– year: 1997
  ident: b42
  article-title: Stochastic approximation algorithms and applications
– reference: , Amherst: University of Massachusetts
– reference: (pp. 1019–1024)
– year: 1996
  ident: b14
  article-title: Neuro-dynamic programming
– volume: 2
  start-page: 331
  year: 1989
  end-page: 349
  ident: b36
  article-title: Convergent activation dynamics in continuous time networks
  publication-title: Neural Networks
– volume: 35
  start-page: 1799
  year: 1999
  end-page: 1808
  ident: b62
  article-title: Average cost temporal-difference learning
  publication-title: Automatica
– volume: 4
  start-page: 1107
  year: 2003
  end-page: 1149
  ident: b43
  article-title: Least-squares policy iteration
  publication-title: Journal of Machine Learning Research
– volume: 16
  start-page: 185
  year: 1994
  end-page: 202
  ident: b60
  article-title: Asynchronous stochastic approximation and Q-learning
  publication-title: Machine Learning
– year: 1989
  ident: b13
  article-title: Parallel and distributed computation
– year: 1998
  ident: b57
  article-title: Reinforcement learning: An introduction
– reference: Bagnell, J., & Schneider, J. (2003). Covariant policy search. In
– volume: 21
  start-page: 441
  year: 2009
  end-page: 448
  ident: b29
  article-title: Regularized policy iteration
  publication-title: Advances in Neural Information Processing Systems
– volume: 19
  start-page: 457
  year: 2007
  end-page: 464
  ident: b31
  article-title: Bayesian policy gradient algorithms
  publication-title: Advances in Neural Information Processing Systems
– volume: 3
  start-page: 9
  year: 1988
  end-page: 44
  ident: b54
  article-title: Learning to predict by the method of temporal differences
  publication-title: Machine Learning
– volume: 46
  start-page: 191
  year: 2001
  end-page: 209
  ident: b44
  article-title: Simulation-based optimization of Markov reward processes
  publication-title: IEEE Transactions on Automatic Control
– volume: 42
  start-page: 241
  year: 2001
  end-page: 267
  ident: b58
  article-title: On the convergence of temporal difference learning with linear function approximation
  publication-title: Machine Learning
– volume: 13
  start-page: 835
  year: 1983
  end-page: 846
  ident: b8
  article-title: Neuron-like elements that can solve difficult learning control problems
  publication-title: IEEE Transactions on Systems, Man and Cybernetics
– reference: Ng, A., Coates, A., Diel, M., Ganapathi, V., Schulte, J., & Tse, B. et al. (2004). Inverted autonomous helicopter flight via reinforcement learning, In
– reference: Kohl, N., & Stone, P. (2004). Policy gradient reinforcement learning for fast quadrupedal locomotion, In
– reference: Baird, L. (1993). Advantage updating.
– volume: 71
  start-page: 1180
  year: 2008
  end-page: 1190
  ident: b49
  article-title: Natural actor-critic
  publication-title: Neurocomputing
– volume: 42
  start-page: 1382
  year: 1997
  end-page: 1393
  ident: b27
  article-title: Perturbation realization, potentials and sensitivity analysis of Markov processes
  publication-title: IEEE Transactions on Automatic Control
– volume: 5
  start-page: 11
  year: 1968
  end-page: 16
  ident: b3
  article-title: Stochastic optimization
  publication-title: Engineering Cybernetics
– volume: 13
  start-page: 247
  year: 1959
  end-page: 251
  ident: b10
  article-title: Functional approximations and dynamic programming
  publication-title: Mathematical Tables and Other Aids to Computation
– volume: 29
  start-page: 291
  year: 1997
  end-page: 294
  ident: b20
  article-title: Stochastic approximation with two timescales
  publication-title: Systems and Control Letters
– reference: Peters, J., Vijayakumar, S., & Schaal, S. (2003). Reinforcement learning for humanoid robotics, In
– volume: 33
  start-page: 75
  year: 1990
  end-page: 84
  ident: b33
  article-title: Likelihood ratio gradient estimation for stochastic systems
  publication-title: Communications of the ACM
– reference: Gordon, G. (1995). Stable function approximation in dynamic programming, In
– year: 1999
  ident: b12
  article-title: Nonlinear programming
– reference: ]
– volume: 38
  start-page: 58
  year: 1995
  end-page: 68
  ident: b59
  article-title: Temporal difference learning and TD-Gammon
  publication-title: Communications of the ACM
– year: 1990
  ident: b11
  article-title: Adaptive algorithms and stochastic approximations
– reference: (pp. 2619–2624)
– year: 1985
  ident: b64
  article-title: Adaptive signal processing
– reference: Borkar, V.S. (2008). Reinforcement learning–a bridge between numerical methods and Monte-Carlo,
– volume: 18
  start-page: 2:1
  year: 2007
  end-page: 2:35
  ident: b17
  article-title: Adaptive Newton-based multivariate smoothed functional algorithms for simulation optimization
  publication-title: ACM Transactions on Modeling and Computer Simulation
– reference: Sutton, R.S. (1984). Temporal credit assignment in reinforcement learning.
– volume: 8
  start-page: 229
  year: 1992
  end-page: 256
  ident: b65
  article-title: Simple statistical gradient-following algorithms for connectionist reinforcement learning
  publication-title: Machine Learning
– volume: 7
  start-page: 369
  year: 1995
  end-page: 376
  ident: b24
  article-title: Generalization in reinforcement learning: Safely approximating the value function
  publication-title: Advances in Neural Information Processing Systems
– reference: , Wright Laboratory, OH
– reference: Bhatnagar, S., Sutton, R.S., Ghavamzadeh, M., & Lee, M. (2009). Natural actor-critic algorithms. Technical Report,
– volume: 22
  start-page: 33
  year: 1996
  end-page: 57
  ident: b26
  article-title: Linear least-squares algorithms for temporal difference learning
  publication-title: Machine Learning
– reference: Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In
– reference: (pp. 226–233)
– volume: 33
  start-page: 235
  year: 1998
  end-page: 262
  ident: b28
  article-title: Elevator group control using multiple reinforcement learning agents
  publication-title: Machine Learning
– start-page: 614
  year: 1996
  end-page: 722
  ident: b52
  article-title: Numerical dynamic programming in economics
  publication-title: Handbook of computational economics
– volume: 42
  start-page: 674
  year: 1997
  end-page: 690
  ident: b61
  article-title: An analysis of temporal-difference learning with function approximation
  publication-title: IEEE Transactions on Automatic Control
– volume: 12
  start-page: 1057
  year: 2000
  end-page: 1063
  ident: b56
  article-title: Policy gradient methods for reinforcement learning with function approximation
  publication-title: Advances in Neural Information Processing Systems
– reference: (pp. 297–304)
– year: 1994
  ident: b50
  article-title: Markov decision processes: Discrete stochastic dynamic programming
– reference: Ghavamzadeh, M., & Engel, Y. (2007b). Bayesian actor-critic algorithms, In
– ident: 10.1016/j.automatica.2009.07.008_b53
– volume: 3
  start-page: 9
  year: 1988
  ident: 10.1016/j.automatica.2009.07.008_b54
  article-title: Learning to predict by the method of temporal differences
  publication-title: Machine Learning
  doi: 10.1007/BF00115009
– volume: 15
  start-page: 74
  issue: 1
  year: 2005
  ident: 10.1016/j.automatica.2009.07.008_b16
  article-title: Adaptive multivariate three-timescale stochastic approximation algorithms for simulation based optimization
  publication-title: ACM Transactions on Modeling and Computer Simulation
  doi: 10.1145/1044322.1044326
– volume: 5
  start-page: 11
  year: 1968
  ident: 10.1016/j.automatica.2009.07.008_b3
  article-title: Stochastic optimization
  publication-title: Engineering Cybernetics
– start-page: 614
  year: 1996
  ident: 10.1016/j.automatica.2009.07.008_b52
  article-title: Numerical dynamic programming in economics
  doi: 10.1016/S1574-0021(96)01016-7
– ident: 10.1016/j.automatica.2009.07.008_b21
  doi: 10.1142/9789814273633_0004
– volume: 14
  year: 2002
  ident: 10.1016/j.automatica.2009.07.008_b37
  article-title: A natural policy gradient
  publication-title: Advances in Neural Information Processing Systems
– volume: 46
  start-page: 191
  year: 2001
  ident: 10.1016/j.automatica.2009.07.008_b44
  article-title: Simulation-based optimization of Markov reward processes
  publication-title: IEEE Transactions on Automatic Control
  doi: 10.1109/9.905687
– ident: 10.1016/j.automatica.2009.07.008_b34
  doi: 10.1016/B978-1-55860-377-6.50040-2
– ident: 10.1016/j.automatica.2009.07.008_b30
– volume: 15
  start-page: 319
  year: 2001
  ident: 10.1016/j.automatica.2009.07.008_b9
  article-title: Infinite-horizon policy-gradient estimation
  publication-title: Journal of Artificial Intelligence Research
  doi: 10.1613/jair.806
– volume: 13
  start-page: 835
  year: 1983
  ident: 10.1016/j.automatica.2009.07.008_b8
  article-title: Neuron-like elements that can solve difficult learning control problems
  publication-title: IEEE Transactions on Systems, Man and Cybernetics
– volume: 38
  start-page: 447
  issue: 2
  year: 2000
  ident: 10.1016/j.automatica.2009.07.008_b22
  article-title: The O.D.E. method for convergence of stochastic approximation and reinforcement learning
  publication-title: SIAM Journal on Control and Optimization
  doi: 10.1137/S0363012997331639
– volume: 38
  start-page: 58
  year: 1995
  ident: 10.1016/j.automatica.2009.07.008_b59
  article-title: Temporal difference learning and TD-Gammon
  publication-title: Communications of the ACM
  doi: 10.1145/203330.203343
– volume: 71
  start-page: 1180
  issue: 7–9
  year: 2008
  ident: 10.1016/j.automatica.2009.07.008_b49
  article-title: Natural actor-critic
  publication-title: Neurocomputing
  doi: 10.1016/j.neucom.2007.11.026
– year: 1985
  ident: 10.1016/j.automatica.2009.07.008_b64
– year: 1989
  ident: 10.1016/j.automatica.2009.07.008_b13
– year: 1978
  ident: 10.1016/j.automatica.2009.07.008_b41
– volume: 38
  start-page: 94
  issue: 1
  year: 1999
  ident: 10.1016/j.automatica.2009.07.008_b39
  article-title: Actor–critic like learning algorithms for Markov decision processes
  publication-title: SIAM Journal on Control and Optimization
  doi: 10.1137/S036301299731669X
– ident: 10.1016/j.automatica.2009.07.008_b48
– volume: 13
  start-page: 247
  year: 1959
  ident: 10.1016/j.automatica.2009.07.008_b10
  article-title: Functional approximations and dynamic programming
  publication-title: Mathematical Tables and Other Aids to Computation
  doi: 10.2307/2002797
– volume: 8
  start-page: 1038
  year: 1996
  ident: 10.1016/j.automatica.2009.07.008_b55
  article-title: Generalization in reinforcement learning: Successful examples using sparse coarse coding
  publication-title: Advances in Neural Information Processing Systems
– year: 1997
  ident: 10.1016/j.automatica.2009.07.008_b42
– ident: 10.1016/j.automatica.2009.07.008_b23
– year: 1994
  ident: 10.1016/j.automatica.2009.07.008_b50
– volume: 18
  start-page: 2:1
  issue: 1
  year: 2007
  ident: 10.1016/j.automatica.2009.07.008_b17
  article-title: Adaptive Newton-based multivariate smoothed functional algorithms for simulation optimization
  publication-title: ACM Transactions on Modeling and Computer Simulation
  doi: 10.1145/1315575.1315577
– volume: 22
  start-page: 33
  year: 1996
  ident: 10.1016/j.automatica.2009.07.008_b26
  article-title: Linear least-squares algorithms for temporal difference learning
  publication-title: Machine Learning
  doi: 10.1007/BF00114723
– volume: 18
  start-page: 698
  year: 1990
  ident: 10.1016/j.automatica.2009.07.008_b47
  article-title: Nonconvergence to unstable points in urn models and stochastic approximations
  publication-title: Annals of Probability
  doi: 10.1214/aop/1176990853
– volume: 42
  start-page: 674
  issue: 5
  year: 1997
  ident: 10.1016/j.automatica.2009.07.008_b61
  article-title: An analysis of temporal-difference learning with function approximation
  publication-title: IEEE Transactions on Automatic Control
  doi: 10.1109/9.580874
– ident: 10.1016/j.automatica.2009.07.008_b38
  doi: 10.1109/ROBOT.2004.1307456
– year: 1990
  ident: 10.1016/j.automatica.2009.07.008_b11
– ident: 10.1016/j.automatica.2009.07.008_b6
  doi: 10.1016/B978-1-55860-377-6.50013-X
– volume: 4
  start-page: 1107
  year: 2003
  ident: 10.1016/j.automatica.2009.07.008_b43
  article-title: Least-squares policy iteration
  publication-title: Journal of Machine Learning Research
– year: 1999
  ident: 10.1016/j.automatica.2009.07.008_b12
– volume: 20
  start-page: 105
  year: 2008
  ident: 10.1016/j.automatica.2009.07.008_b19
  article-title: Incremental natural actor-critic algorithms
  publication-title: Advances in Neural Information Processing Systems
– volume: 7
  start-page: 369
  year: 1995
  ident: 10.1016/j.automatica.2009.07.008_b24
  article-title: Generalization in reinforcement learning: Safely approximating the value function
  publication-title: Advances in Neural Information Processing Systems
– volume: 36
  start-page: 1293
  year: 1998
  ident: 10.1016/j.automatica.2009.07.008_b25
  article-title: Some pathological traps for stochastic approximation
  publication-title: SIAM Journal on Control and Optimization
  doi: 10.1137/S036301299630759X
– ident: 10.1016/j.automatica.2009.07.008_b18
  doi: 10.1016/j.automatica.2009.07.008
– year: 1996
  ident: 10.1016/j.automatica.2009.07.008_b14
– volume: 19
  start-page: 457
  year: 2007
  ident: 10.1016/j.automatica.2009.07.008_b31
  article-title: Bayesian policy gradient algorithms
  publication-title: Advances in Neural Information Processing Systems
– volume: 42
  start-page: 241
  issue: 3
  year: 2001
  ident: 10.1016/j.automatica.2009.07.008_b58
  article-title: On the convergence of temporal difference learning with linear function approximation
  publication-title: Machine Learning
  doi: 10.1023/A:1007609817671
– volume: 10
  start-page: 251
  issue: 2
  year: 1998
  ident: 10.1016/j.automatica.2009.07.008_b4
  article-title: Natural gradient works efficiently in learning
  publication-title: Neural Computation
  doi: 10.1162/089976698300017746
– volume: 33
  start-page: 75
  year: 1990
  ident: 10.1016/j.automatica.2009.07.008_b33
  article-title: Likelihood ratio gradient estimation for stochastic systems
  publication-title: Communications of the ACM
  doi: 10.1145/84537.84552
– volume: 17
  start-page: 23
  issue: 1
  year: 2007
  ident: 10.1016/j.automatica.2009.07.008_b1
  article-title: Reinforcement learning based algorithms for average cost Markov decision processes
  publication-title: Discrete Event Dynamic Systems: Theory and Applications
  doi: 10.1007/s10626-006-0003-y
– volume: 5
  start-page: 1471
  year: 2004
  ident: 10.1016/j.automatica.2009.07.008_b35
  article-title: Variance reduction techniques for gradient estimates in reinforcement learning
  publication-title: Journal of Machine Learning Research
– volume: 2
  start-page: 331
  year: 1989
  ident: 10.1016/j.automatica.2009.07.008_b36
  article-title: Convergent activation dynamics in continuous time networks
  publication-title: Neural Networks
  doi: 10.1016/0893-6080(89)90018-X
– volume: 42
  start-page: 1143
  issue: 4
  year: 2003
  ident: 10.1016/j.automatica.2009.07.008_b40
  article-title: On actor–critic algorithms
  publication-title: SIAM Journal on Control and Optimization
  doi: 10.1137/S0363012901385691
– volume: 40
  start-page: 681
  year: 2001
  ident: 10.1016/j.automatica.2009.07.008_b2
  article-title: Learning algorithms for Markov decision processes
  publication-title: SIAM Journal on Control and Optimization
  doi: 10.1137/S0363012999361974
– volume: 44
  start-page: 1073
  year: 1993
  ident: 10.1016/j.automatica.2009.07.008_b63
  article-title: A survey of applications of Markov decision processes
  publication-title: Journal of the Operational Research Society
  doi: 10.1057/jors.1993.181
– year: 1998
  ident: 10.1016/j.automatica.2009.07.008_b57
– volume: 49
  start-page: 592
  issue: 4
  year: 2004
  ident: 10.1016/j.automatica.2009.07.008_b15
  article-title: A simultaneous perturbation stochastic approximation based actor–critic algorithm for Markov decision processes
  publication-title: IEEE Transactions on Automatic Control
  doi: 10.1109/TAC.2004.825622
– volume: 29
  start-page: 291
  year: 1997
  ident: 10.1016/j.automatica.2009.07.008_b20
  article-title: Stochastic approximation with two timescales
  publication-title: Systems and Control Letters
  doi: 10.1016/S0167-6911(97)90015-3
– volume: 33
  start-page: 235
  year: 1998
  ident: 10.1016/j.automatica.2009.07.008_b28
  article-title: Elevator group control using multiple reinforcement learning agents
  publication-title: Machine Learning
  doi: 10.1023/A:1007518724497
– ident: 10.1016/j.automatica.2009.07.008_b46
– volume: 12
  start-page: 1057
  year: 2000
  ident: 10.1016/j.automatica.2009.07.008_b56
  article-title: Policy gradient methods for reinforcement learning with function approximation
  publication-title: Advances in Neural Information Processing Systems
– ident: 10.1016/j.automatica.2009.07.008_b32
  doi: 10.1145/1273496.1273534
– volume: 35
  start-page: 1799
  year: 1999
  ident: 10.1016/j.automatica.2009.07.008_b62
  article-title: Average cost temporal-difference learning
  publication-title: Automatica
  doi: 10.1016/S0005-1098(99)00099-0
– year: 2007
  ident: 10.1016/j.automatica.2009.07.008_b45
– volume: 19
  start-page: 1169
  year: 2007
  ident: 10.1016/j.automatica.2009.07.008_b51
  article-title: Natural actor-critic for road traffic optimization
  publication-title: Advances in Neural Information Processing Systems
– ident: 10.1016/j.automatica.2009.07.008_b5
  doi: 10.21236/ADA280862
– volume: 42
  start-page: 1382
  year: 1997
  ident: 10.1016/j.automatica.2009.07.008_b27
  article-title: Perturbation realization, potentials and sensitivity analysis of Markov processes
  publication-title: IEEE Transactions on Automatic Control
  doi: 10.1109/9.633827
– volume: 21
  start-page: 441
  year: 2009
  ident: 10.1016/j.automatica.2009.07.008_b29
  article-title: Regularized policy iteration
  publication-title: Advances in Neural Information Processing Systems
– volume: 8
  start-page: 229
  year: 1992
  ident: 10.1016/j.automatica.2009.07.008_b65
  article-title: Simple statistical gradient-following algorithms for connectionist reinforcement learning
  publication-title: Machine Learning
  doi: 10.1007/BF00992696
– ident: 10.1016/j.automatica.2009.07.008_b7
– volume: 16
  start-page: 185
  year: 1994
  ident: 10.1016/j.automatica.2009.07.008_b60
  article-title: Asynchronous stochastic approximation and Q-learning
  publication-title: Machine Learning
  doi: 10.1007/BF00993306
SSID ssj0004182
SourceID hal
proquest
pascalfrancis
crossref
elsevier
SourceType Open Access Repository
Aggregation Database
Index Database
Enrichment Source
Publisher
StartPage 2471
SubjectTerms Actor–critic reinforcement learning algorithms
Algorithms
Applied sciences
Approximate dynamic programming
Artificial intelligence
Cognitive science
Computer science
Computer science; control theory; systems
Convergence
Exact sciences and technology
Function approximation
Learning
Natural gradient
Parametrization
Policies
Policy-gradient methods
Reinforcement
Temporal difference learning
Temporal logic
Two-timescale stochastic approximation
Variance
Title Natural actor–critic algorithms
URI https://dx.doi.org/10.1016/j.automatica.2009.07.008
https://www.proquest.com/docview/1671301504
https://inria.hal.science/hal-00840470
Volume 45
WOSCitedRecordID wos000271877200001
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
journalDatabaseRights – providerCode: PRVESC
  databaseName: Elsevier SD Freedom Collection Journals 2021
  customDbUrl:
  eissn: 1873-2836
  dateEnd: 99991231
  omitProxy: false
  ssIdentifier: ssj0004182
  issn: 0005-1098
  databaseCode: AIEXJ
  dateStart: 19950101
  isFulltext: true
  titleUrlDefault: https://www.sciencedirect.com
  providerName: Elsevier
linkProvider Elsevier
openUrl ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Natural+actor%E2%80%93critic+algorithms&rft.jtitle=Automatica+%28Oxford%29&rft.au=Bhatnagar%2C+Shalabh&rft.au=Sutton%2C+Richard+S.&rft.au=Ghavamzadeh%2C+Mohammad&rft.au=Lee%2C+Mark&rft.date=2009-11-01&rft.pub=Elsevier+Ltd&rft.issn=0005-1098&rft.eissn=1873-2836&rft.volume=45&rft.issue=11&rft.spage=2471&rft.epage=2482&rft_id=info:doi/10.1016%2Fj.automatica.2009.07.008&rft.externalDocID=S0005109809003549