Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors

Published in: IEEE Transactions on Neural Networks and Learning Systems, Volume 33, Issue 11, pp. 6584-6598
Main authors: Duan, Jingliang; Guan, Yang; Li, Shengbo Eben; Ren, Yangang; Sun, Qi; Cheng, Bo
Format: Journal Article
Language: English
Published: Piscataway: IEEE, 01.11.2022
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN: 2162-237X; EISSN: 2162-2388
Abstract In reinforcement learning (RL), function approximation errors are known to easily lead to Q-value overestimations, thus greatly reducing policy performance. This article presents a distributional soft actor-critic (DSAC) algorithm, an off-policy RL method for the continuous control setting, which improves policy performance by mitigating Q-value overestimations. We first show theoretically that learning a distribution function of state-action returns can effectively mitigate Q-value overestimations because it adaptively adjusts the update step size of the Q-value function. Then, a distributional soft policy iteration (DSPI) framework is developed by embedding the return distribution function into maximum entropy RL. Finally, we present a deep off-policy actor-critic variant of DSPI, called DSAC, which directly learns a continuous return distribution while keeping the variance of the state-action returns within a reasonable range to address exploding and vanishing gradient problems. We evaluate DSAC on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance.
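The key quantity behind these claims is the soft (maximum-entropy) state-action return whose distribution DSAC learns. As a point of reference, below is a standard reconstruction of that return in generic maximum-entropy RL notation (reward r_t, discount \gamma, entropy temperature \alpha, policy \pi); the paper's exact definitions may differ in detail:

Z^{\pi}(s_t, a_t) \overset{D}{=} r_t + \gamma \bigl( Z^{\pi}(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1} \mid s_{t+1}) \bigr), \qquad a_{t+1} \sim \pi(\cdot \mid s_{t+1}),
Q^{\pi}(s_t, a_t) = \mathbb{E}\bigl[ Z^{\pi}(s_t, a_t) \bigr],

where \overset{D}{=} denotes equality in distribution. Learning the full distribution of Z^{\pi}, rather than only its mean Q^{\pi}, is what lets the effective update step on the Q-value shrink when the predicted return variance is large, which is the overestimation-mitigation mechanism described above.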
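To make the variance-control idea concrete, here is a minimal PyTorch-style sketch (not the authors' implementation) of a critic that outputs a Gaussian return distribution and a loss that clamps the predicted standard deviation before taking the negative log-likelihood of a soft target return; the class and constant names (ReturnCritic, MIN_STD, MAX_STD, critic_loss) are illustrative assumptions.

import torch
import torch.nn as nn

# Illustrative bounds on the predicted return std; the paper's actual
# variance-limiting scheme and constants may differ.
MIN_STD, MAX_STD = 0.1, 10.0

class ReturnCritic(nn.Module):
    """Maps (state, action) to the mean and std of a Gaussian return distribution."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, 2),  # outputs [mean, log_std]
        )

    def forward(self, state, action):
        mean, log_std = self.net(torch.cat([state, action], dim=-1)).chunk(2, dim=-1)
        # Clamp the std so the return variance stays in a reasonable range,
        # avoiding exploding/vanishing gradients in the distributional loss.
        std = torch.clamp(log_std.exp(), MIN_STD, MAX_STD)
        return mean, std

def critic_loss(critic, state, action, target_return):
    # Negative log-likelihood of the soft target return under the predicted
    # Gaussian: a small predicted std yields a large gradient on the mean,
    # a large std a small one, i.e., an adaptive update step size.
    mean, std = critic(state, action)
    dist = torch.distributions.Normal(mean, std)
    return -dist.log_prob(target_return).mean()

In a full training loop, target_return would be the bootstrapped soft return r + \gamma (z' - \alpha \log \pi(a' | s')) computed from a target critic with z' sampled from the target return distribution, and the actor and temperature would be updated as in soft actor-critic.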
Author Duan, Jingliang
Ren, Yangang
Guan, Yang
Li, Shengbo Eben
Sun, Qi
Cheng, Bo
Author_xml – sequence: 1
  givenname: Jingliang
  orcidid: 0000-0002-3697-1576
  surname: Duan
  fullname: Duan, Jingliang
  email: duanjl15@163.com
  organization: State Key Laboratory of Automotive Safety and Energy, School of Vehicle and Mobility, Tsinghua University, Beijing, China
– sequence: 2
  givenname: Yang
  orcidid: 0000-0003-0689-0510
  surname: Guan
  fullname: Guan, Yang
  email: guany17@mails.tsinghua.edu.cn
  organization: State Key Laboratory of Automotive Safety and Energy, School of Vehicle and Mobility, Tsinghua University, Beijing, China
– sequence: 3
  givenname: Shengbo Eben
  orcidid: 0000-0003-4923-3633
  surname: Li
  fullname: Li, Shengbo Eben
  email: lishbo@tsinghua.edu.cn
  organization: State Key Laboratory of Automotive Safety and Energy, School of Vehicle and Mobility, Tsinghua University, Beijing, China
– sequence: 4
  givenname: Yangang
  orcidid: 0000-0002-1173-7230
  surname: Ren
  fullname: Ren, Yangang
  email: ryg18@mails.tsinghua.edu.cn
  organization: State Key Laboratory of Automotive Safety and Energy, School of Vehicle and Mobility, Tsinghua University, Beijing, China
– sequence: 5
  givenname: Qi
  surname: Sun
  fullname: Sun, Qi
  email: qisun@tsinghua.edu.cn
  organization: State Key Laboratory of Automotive Safety and Energy, School of Vehicle and Mobility, Tsinghua University, Beijing, China
– sequence: 6
  givenname: Bo
  surname: Cheng
  fullname: Cheng, Bo
  email: chengbo@tsinghua.edu.cn
  organization: State Key Laboratory of Automotive Safety and Energy, School of Vehicle and Mobility, Tsinghua University, Beijing, China
CODEN ITNNAL
CitedBy_id crossref_primary_10_1109_TNNLS_2024_3401170
crossref_primary_10_1007_s00521_024_09913_6
crossref_primary_10_1109_JSYST_2025_3554840
crossref_primary_10_1109_TITS_2021_3136588
crossref_primary_10_1007_s42154_023_00260_1
crossref_primary_10_1007_s00521_023_09029_3
crossref_primary_10_1109_TKDE_2025_3551147
crossref_primary_10_1007_s13042_025_02812_9
crossref_primary_10_1177_09544070231186841
crossref_primary_10_1109_TNNLS_2024_3511670
crossref_primary_10_1016_j_future_2025_108106
crossref_primary_10_1038_s41598_025_00351_5
crossref_primary_10_1109_TMC_2024_3357218
crossref_primary_10_3390_math12081146
crossref_primary_10_1109_LRA_2024_3518839
crossref_primary_10_1109_JIOT_2025_3575158
crossref_primary_10_1109_TIE_2023_3331074
crossref_primary_10_1109_TGRS_2023_3278491
crossref_primary_10_1109_TNNLS_2023_3264815
crossref_primary_10_1109_TCYB_2025_3542223
crossref_primary_10_1109_TVT_2022_3191490
crossref_primary_10_1109_JSAC_2024_3459086
crossref_primary_10_1109_TNNLS_2022_3215596
crossref_primary_10_1109_TIV_2023_3255264
crossref_primary_10_1109_JIOT_2024_3391296
crossref_primary_10_1016_j_cja_2024_03_008
crossref_primary_10_1109_TVT_2025_3551661
crossref_primary_10_1016_j_eswa_2024_125410
crossref_primary_10_1016_j_phycom_2024_102462
crossref_primary_10_1016_j_apenergy_2025_126282
crossref_primary_10_1002_rnc_7734
crossref_primary_10_1016_j_inffus_2025_103226
crossref_primary_10_1109_TITS_2023_3270887
crossref_primary_10_3934_electreng_2025009
crossref_primary_10_1109_TIV_2024_3432891
crossref_primary_10_1002_int_22928
crossref_primary_10_1016_j_neucom_2024_127755
crossref_primary_10_1016_j_commtr_2023_100096
crossref_primary_10_1109_TAES_2022_3216579
crossref_primary_10_3390_drones8090461
crossref_primary_10_1109_TAI_2023_3328848
crossref_primary_10_1109_TPAMI_2025_3537087
crossref_primary_10_1016_j_ins_2024_120465
crossref_primary_10_1109_TSMC_2023_3277737
crossref_primary_10_1109_TAES_2024_3447617
crossref_primary_10_1093_jcde_qwaf045
crossref_primary_10_1109_TPWRS_2021_3130413
crossref_primary_10_1016_j_agwat_2024_109194
crossref_primary_10_1109_JIOT_2024_3481257
crossref_primary_10_1109_TIV_2024_3446823
crossref_primary_10_1049_itr2_12107
crossref_primary_10_1016_j_commtr_2025_100191
crossref_primary_10_1109_TIV_2023_3348134
crossref_primary_10_3390_buildings15040644
crossref_primary_10_1016_j_engappai_2025_110373
crossref_primary_10_1016_j_ifacol_2021_11_201
crossref_primary_10_1109_MCI_2024_3364428
crossref_primary_10_1109_TITS_2023_3307873
crossref_primary_10_1109_TNNLS_2023_3329513
crossref_primary_10_1109_TASE_2025_3587790
crossref_primary_10_1016_j_hcc_2024_100235
crossref_primary_10_1109_TITS_2023_3329823
crossref_primary_10_1109_TITS_2024_3400227
crossref_primary_10_1016_j_asoc_2025_113079
crossref_primary_10_1109_TMECH_2024_3397297
crossref_primary_10_1016_j_est_2024_114377
crossref_primary_10_1109_TSMC_2024_3516377
crossref_primary_10_1109_JIOT_2024_3395568
crossref_primary_10_1016_j_csbj_2025_04_036
crossref_primary_10_1109_ACCESS_2024_3416179
crossref_primary_10_1016_j_apenergy_2025_126030
crossref_primary_10_1109_TNNLS_2024_3457509
crossref_primary_10_1016_j_apor_2025_104778
crossref_primary_10_1109_TITS_2023_3237568
crossref_primary_10_1109_TITS_2023_3341034
crossref_primary_10_1109_TNNLS_2024_3386225
crossref_primary_10_1016_j_engappai_2024_109158
crossref_primary_10_1016_j_neucom_2025_129666
crossref_primary_10_1016_j_neunet_2023_05_027
crossref_primary_10_1145_3643565
crossref_primary_10_1109_TVT_2022_3212996
crossref_primary_10_1109_TNNLS_2022_3175595
crossref_primary_10_1109_JIOT_2023_3245721
crossref_primary_10_1109_TASE_2025_3590068
crossref_primary_10_1007_s10489_025_06693_x
crossref_primary_10_1109_LRA_2024_3427551
crossref_primary_10_1109_TNNLS_2024_3435406
crossref_primary_10_1109_TNNLS_2023_3302131
crossref_primary_10_1109_TCYB_2023_3323316
crossref_primary_10_1109_TASE_2023_3292388
crossref_primary_10_1016_j_ins_2025_122081
crossref_primary_10_1109_TSTE_2024_3485060
crossref_primary_10_1109_TITS_2023_3348489
crossref_primary_10_1109_TMC_2025_3559099
crossref_primary_10_1007_s10462_024_10739_w
crossref_primary_10_12677_SEA_2023_123052
crossref_primary_10_1016_j_neunet_2025_108018
crossref_primary_10_1109_TAC_2023_3275732
crossref_primary_10_7746_jkros_2024_19_1_092
crossref_primary_10_1109_TNNLS_2024_3443082
crossref_primary_10_1109_JIOT_2025_3578198
crossref_primary_10_1007_s13042_024_02399_7
crossref_primary_10_1016_j_engappai_2024_109726
crossref_primary_10_1109_TIV_2022_3185159
crossref_primary_10_1016_j_trc_2024_104654
crossref_primary_10_1016_j_enconman_2022_115450
crossref_primary_10_1016_j_knosys_2025_114152
crossref_primary_10_1109_TNNLS_2024_3395508
Cites_doi 10.1109/TAC.2019.2912443
10.1038/nature14236
10.1609/aaai.v32i1.11791
10.1609/aaai.v30i1.10295
10.1609/aaai.v33i01.33014504
10.1038/nature24270
10.1049/iet-its.2019.0317
10.1109/ADPRL.2013.6614994
10.1038/s41586-019-1924-6
10.1109/IROS.2012.6386109
10.1038/nature16961
ContentType Journal Article
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2022
Copyright_xml – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2022
DBID 97E
RIA
RIE
AAYXX
CITATION
7QF
7QO
7QP
7QQ
7QR
7SC
7SE
7SP
7SR
7TA
7TB
7TK
7U5
8BQ
8FD
F28
FR3
H8D
JG9
JQ2
KR7
L7M
L~C
L~D
P64
7X8
DOI 10.1109/TNNLS.2021.3082568
DatabaseName IEEE All-Society Periodicals Package (ASPP) 2005-present
IEEE All-Society Periodicals Package (ASPP) 1998-Present
IEEE Electronic Library (IEL)
CrossRef
Aluminium Industry Abstracts
Biotechnology Research Abstracts
Calcium & Calcified Tissue Abstracts
Ceramic Abstracts
Chemoreception Abstracts
Computer and Information Systems Abstracts
Corrosion Abstracts
Electronics & Communications Abstracts
Engineered Materials Abstracts
Materials Business File
Mechanical & Transportation Engineering Abstracts
Neurosciences Abstracts
Solid State and Superconductivity Abstracts
METADEX
Technology Research Database
ANTE: Abstracts in New Technology & Engineering
Engineering Research Database
Aerospace Database
Materials Research Database
ProQuest Computer Science Collection
Civil Engineering Abstracts
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
Biotechnology and BioEngineering Abstracts
MEDLINE - Academic
DatabaseTitle CrossRef
Materials Research Database
Technology Research Database
Computer and Information Systems Abstracts – Academic
Mechanical & Transportation Engineering Abstracts
ProQuest Computer Science Collection
Computer and Information Systems Abstracts
Materials Business File
Aerospace Database
Engineered Materials Abstracts
Biotechnology Research Abstracts
Chemoreception Abstracts
Advanced Technologies Database with Aerospace
ANTE: Abstracts in New Technology & Engineering
Civil Engineering Abstracts
Aluminium Industry Abstracts
Electronics & Communications Abstracts
Ceramic Abstracts
Neurosciences Abstracts
METADEX
Biotechnology and BioEngineering Abstracts
Computer and Information Systems Abstracts Professional
Solid State and Superconductivity Abstracts
Engineering Research Database
Calcium & Calcified Tissue Abstracts
Corrosion Abstracts
MEDLINE - Academic
DatabaseTitleList Materials Research Database

MEDLINE - Academic
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
– sequence: 2
  dbid: 7X8
  name: MEDLINE - Academic
  url: https://search.proquest.com/medline
  sourceTypes: Aggregation Database
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISSN 2162-2388
EndPage 6598
ExternalDocumentID 10_1109_TNNLS_2021_3082568
9448360
Genre orig-research
GrantInformation_xml – fundername: NSF China
  grantid: 51575293; U20A20334
  funderid: 10.13039/501100001809
– fundername: NSF Beijing
  grantid: JQ18010
– fundername: Tsinghua University–Toyota Joint Research Center for AI Technology of Automated Vehicle
  funderid: 10.13039/501100004147
GroupedDBID 0R~
4.4
5VS
6IK
97E
AAJGR
AARMG
AASAJ
AAWTH
ABAZT
ABQJQ
ABVLG
ACIWK
ACPRK
AENEX
AFRAH
AGQYO
AGSQL
AHBIQ
AKJIK
AKQYR
ALMA_UNASSIGNED_HOLDINGS
ATWAV
BEFXN
BFFAM
BGNUA
BKEBE
BPEOZ
EBS
EJD
IFIPE
IPLJI
JAVBF
M43
MS~
O9-
OCL
PQQKQ
RIA
RIE
RNS
AAYXX
CITATION
7QF
7QO
7QP
7QQ
7QR
7SC
7SE
7SP
7SR
7TA
7TB
7TK
7U5
8BQ
8FD
F28
FR3
H8D
JG9
JQ2
KR7
L7M
L~C
L~D
P64
7X8
ID FETCH-LOGICAL-c328t-a94b5eac7b4fcb47693239c23c466fcdf9898d2fe5dff73b3ccf168e66ea970f3
IEDL.DBID RIE
ISICitedReferencesCount 178
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000732356000001&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
ISSN 2162-237X
2162-2388
IngestDate Wed Oct 01 13:47:49 EDT 2025
Mon Jun 30 04:48:55 EDT 2025
Sat Nov 29 01:40:13 EST 2025
Tue Nov 18 21:45:11 EST 2025
Wed Aug 27 02:14:45 EDT 2025
IsPeerReviewed false
IsScholarly true
Issue 11
Language English
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
https://doi.org/10.15223/policy-029
https://doi.org/10.15223/policy-037
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-c328t-a94b5eac7b4fcb47693239c23c466fcdf9898d2fe5dff73b3ccf168e66ea970f3
Notes ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
content type line 23
ORCID 0000-0002-3697-1576
0000-0002-1173-7230
0000-0003-0689-0510
0000-0003-4923-3633
PMID 34101599
PQID 2729636527
PQPubID 85436
PageCount 15
ParticipantIDs ieee_primary_9448360
proquest_journals_2729636527
proquest_miscellaneous_2539527430
crossref_citationtrail_10_1109_TNNLS_2021_3082568
crossref_primary_10_1109_TNNLS_2021_3082568
PublicationCentury 2000
PublicationDate 2022-11-01
PublicationDateYYYYMMDD 2022-11-01
PublicationDate_xml – month: 11
  year: 2022
  text: 2022-11-01
  day: 01
PublicationDecade 2020
PublicationPlace Piscataway
PublicationPlace_xml – name: Piscataway
PublicationTitle IEEE Transactions on Neural Networks and Learning Systems
PublicationTitleAbbrev TNNLS
PublicationYear 2022
Publisher IEEE
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher_xml – name: IEEE
– name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
References ref37
ref11
Haarnoja (ref32)
ref33
ref10
Nachum (ref29)
ref2
Schulman (ref25) 2017
ref1
Sallans (ref30) 2004; 5
ref19
Dabney (ref20)
Barth-Maron (ref23)
O’Donoghue (ref27)
Silver (ref13)
Lillicrap (ref14)
Fox (ref31)
Heess (ref26) 2017
Mnih (ref3)
Kingma (ref40)
Sutton (ref7) 2018
ref22
Schulman (ref28) 2017
Horgan (ref36) 2018
Rowland (ref21)
Thrun (ref9)
Fujimoto (ref12)
Hendrycks (ref39) 2016
Brockman (ref38) 2016
ref8
Bellemare (ref18)
Schulman (ref24)
ref4
Kingma (ref34) 2013
ref5
Watkins (ref6) 1989
Haarnoja (ref16)
Haarnoja (ref17) 2018
Espeholt (ref35)
van Hasselt (ref15)
References_xml – volume-title: arXiv:1707.06347
  year: 2017
  ident: ref25
  article-title: Proximal policy optimization algorithms
– ident: ref11
  doi: 10.1109/TAC.2019.2912443
– volume-title: Proc. 4th Int. Conf. Learn. Represent. (ICLR)
  ident: ref14
  article-title: Continuous control with deep reinforcement learning
– start-page: 29
  volume-title: Proc. Int. Conf. Artif. Intell. Statist. (AISTATS)
  ident: ref21
  article-title: An analysis of categorical distributional reinforcement learning
– volume: 5
  start-page: 1063
  issue: 8
  year: 2004
  ident: ref30
  article-title: Reinforcement learning with factored states and actions
  publication-title: J. Mach. Learn. Res.
– start-page: 255
  volume-title: Proc. Connectionist Models Summer School
  ident: ref9
  article-title: Issues in using function approximation for reinforcement learning
– volume-title: arXiv:1803.00933
  year: 2018
  ident: ref36
  article-title: Distributed prioritized experience replay
– ident: ref1
  doi: 10.1038/nature14236
– start-page: 387
  volume-title: Proc. 31st Int. Conf. Mach. Learn. (ICML)
  ident: ref13
  article-title: Deterministic policy gradient algorithms
– start-page: 2775
  volume-title: Proc. 30th Adv. Neural Inf. Process. Syst. (NeurIPS)
  ident: ref29
  article-title: Bridging the gap between value and policy based reinforcement learning
– ident: ref19
  doi: 10.1609/aaai.v32i1.11791
– ident: ref8
  doi: 10.1609/aaai.v30i1.10295
– volume-title: arXiv:1606.08415
  year: 2016
  ident: ref39
  article-title: Gaussian error linear units (GELUs)
– ident: ref22
  doi: 10.1609/aaai.v33i01.33014504
– start-page: 1889
  volume-title: Proc. 32nd Int. Conf. Mach. Learn. (ICML)
  ident: ref24
  article-title: Trust region policy optimization
– year: 1989
  ident: ref6
  article-title: Learning from delayed rewards
– volume-title: Reinforcement Learning: An Introduction
  year: 2018
  ident: ref7
– volume-title: arXiv:1812.05905
  year: 2018
  ident: ref17
  article-title: Soft actor-critic algorithms and applications
– volume-title: arXiv:1312.6114
  year: 2013
  ident: ref34
  article-title: Auto-encoding variational bayes
– start-page: 1407
  volume-title: Proc. 35th Int. Conf. Mach. Learn. (ICML)
  ident: ref35
  article-title: IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures
– start-page: 1352
  volume-title: Proc. 34th Int. Conf. Mach. Learn. (ICML)
  ident: ref32
  article-title: Reinforcement learning with deep energy-based policies
– start-page: 1587
  volume-title: Proc. 35th Int. Conf. Mach. Learn. (ICML)
  ident: ref12
  article-title: Addressing function approximation error in actor-critic methods
– start-page: 2613
  volume-title: Proc. 23rd Adv. Neural Inf. Process. Syst. (NeurIPS)
  ident: ref15
  article-title: Double Q-learning
– ident: ref4
  doi: 10.1038/nature24270
– start-page: 1861
  volume-title: Proc. 35th Int. Conf. Mach. Learn. (ICML)
  ident: ref16
  article-title: Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor
– volume-title: Proc. 6th Int. Conf. Learn. Represent. (ICLR)
  ident: ref23
  article-title: Distributed distributional deterministic policy gradients
– start-page: 1096
  volume-title: Proc. 35th Int. Conf. Mach. Learn. (ICML)
  ident: ref20
  article-title: Implicit quantile networks for distributional reinforcement learning
– volume-title: arXiv:1707.02286
  year: 2017
  ident: ref26
  article-title: Emergence of locomotion behaviours in rich environments
– volume-title: Proc. 3rd Int. Conf. Learn. Represent. (ICLR)
  ident: ref40
  article-title: Adam: A method for stochastic optimization
– ident: ref5
  doi: 10.1049/iet-its.2019.0317
– ident: ref10
  doi: 10.1109/ADPRL.2013.6614994
– volume-title: Proc. 4th Int. Conf. Learn. Represent. (ICLR)
  ident: ref27
  article-title: Combining policy gradient and Q-learning
– volume-title: arXiv:1606.01540
  year: 2016
  ident: ref38
  article-title: OpenAI gym
– volume-title: arXiv:1704.06440
  year: 2017
  ident: ref28
  article-title: Equivalence between policy gradients and soft Q-learning
– ident: ref33
  doi: 10.1038/s41586-019-1924-6
– ident: ref37
  doi: 10.1109/IROS.2012.6386109
– start-page: 449
  volume-title: Proc. 34th Int. Conf. Mach. Learn. (ICML)
  ident: ref18
  article-title: A distributional perspective on reinforcement learning
– ident: ref2
  doi: 10.1038/nature16961
– start-page: 1928
  volume-title: Proc. 33rd Int. Conf. Mach. Learn. (ICML)
  ident: ref3
  article-title: Asynchronous methods for deep reinforcement learning
– start-page: 202
  volume-title: Proc. 32nd Conf. Uncertainty Artif. Intell. (UAI)
  ident: ref31
  article-title: Taming the noise in reinforcement learning via soft updates
SSID ssj0000605649
Score 2.714012
SourceID proquest
crossref
ieee
SourceType Aggregation Database
Enrichment Source
Index Database
Publisher
StartPage 6584
SubjectTerms Algorithms
Approximation algorithms
Artificial neural networks
Control tasks
Distribution functions
Distributional soft actor–critic (DSAC)
Embedding
Entropy
Errors
Estimation
Iterative methods
Learning
Maximum entropy
overestimation
Probability distribution
Reinforcement
Reinforcement learning
reinforcement learning (RL)
Title Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors
URI https://ieeexplore.ieee.org/document/9448360
https://www.proquest.com/docview/2729636527
https://www.proquest.com/docview/2539527430
Volume 33
WOSCitedRecordID wos000732356000001&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
journalDatabaseRights – providerCode: PRVIEE
  databaseName: IEEE Electronic Library (IEL)
  customDbUrl:
  eissn: 2162-2388
  dateEnd: 99991231
  omitProxy: false
  ssIdentifier: ssj0000605649
  issn: 2162-237X
  databaseCode: RIE
  dateStart: 20120101
  isFulltext: true
  titleUrlDefault: https://ieeexplore.ieee.org/
  providerName: IEEE
linkProvider IEEE
openUrl ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Distributional+Soft+Actor-Critic%3A+Off-Policy+Reinforcement+Learning+for+Addressing+Value+Estimation+Errors&rft.jtitle=IEEE+transaction+on+neural+networks+and+learning+systems&rft.au=Duan%2C+Jingliang&rft.au=Yang%2C+Guan&rft.au=Shengbo+Eben+Li&rft.au=Ren%2C+Yangang&rft.date=2022-11-01&rft.pub=The+Institute+of+Electrical+and+Electronics+Engineers%2C+Inc.+%28IEEE%29&rft.issn=2162-237X&rft.eissn=2162-2388&rft.volume=33&rft.issue=11&rft.spage=6584&rft_id=info:doi/10.1109%2FTNNLS.2021.3082568&rft.externalDBID=NO_FULL_TEXT
thumbnail_l http://covers-cdn.summon.serialssolutions.com/index.aspx?isbn=/lc.gif&issn=2162-237X&client=summon
thumbnail_m http://covers-cdn.summon.serialssolutions.com/index.aspx?isbn=/mc.gif&issn=2162-237X&client=summon
thumbnail_s http://covers-cdn.summon.serialssolutions.com/index.aspx?isbn=/sc.gif&issn=2162-237X&client=summon