Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors
In reinforcement learning (RL), function approximation errors are known to easily lead to the Q-value overestimations, thus greatly reducing policy performance. This article presents a distribu...
| Published in: | IEEE Transactions on Neural Networks and Learning Systems, Vol. 33, No. 11, pp. 6584–6598 |
|---|---|
| Main Authors: | Duan, Jingliang; Guan, Yang; Li, Shengbo Eben; Ren, Yangang; Sun, Qi; Cheng, Bo |
| Format: | Journal Article |
| Language: | English |
| Published: | Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.11.2022 |
| Subjects: | |
| ISSN: | 2162-237X (print), 2162-2388 (electronic) |
| Online Access: | Get full text |
| Abstract | In reinforcement learning (RL), function approximation errors are known to easily lead to the Q-value overestimations, thus greatly reducing policy performance. This article presents a distributional soft actor-critic (DSAC) algorithm, which is an off-policy RL method for continuous control settings, to improve the policy performance by mitigating Q-value overestimations. We first discover in theory that learning a distribution function of state-action returns can effectively mitigate Q-value overestimations because it is capable of adaptively adjusting the update step size of the Q-value function. Then, a distributional soft policy iteration (DSPI) framework is developed by embedding the return distribution function into maximum entropy RL. Finally, we present a deep off-policy actor-critic variant of DSPI, called DSAC, which directly learns a continuous return distribution by keeping the variance of the state-action returns within a reasonable range to address exploding and vanishing gradient problems. We evaluate DSAC on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance. |
|---|---|
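To make the mechanism in the abstract concrete, the sketch below shows one way a Gaussian return-distribution critic could be trained with its standard deviation clamped to a bounded range, which is the ingredient the abstract credits for avoiding exploding and vanishing gradients. This is a minimal illustration under stated assumptions, not the authors' reference implementation: the class name `ReturnDistCritic`, the clamp bounds `LOG_STD_MIN`/`LOG_STD_MAX`, the network sizes, and the simplified sampled target are all illustrative choices.

```python
# Minimal sketch of a distributional critic in the spirit of DSAC: the critic
# outputs a Gaussian over the (soft) state-action return instead of a point
# Q-value, and its standard deviation is clamped so gradients neither explode
# nor vanish. Names, bounds, and the target construction are assumptions made
# for illustration only.
import torch
import torch.nn as nn
from torch.distributions import Normal

LOG_STD_MIN, LOG_STD_MAX = -5.0, 2.0  # assumed clamp range for the return std


class ReturnDistCritic(nn.Module):
    """Maps (state, action) to a Gaussian over the state-action return."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
        )
        self.mean_head = nn.Linear(hidden, 1)
        self.log_std_head = nn.Linear(hidden, 1)

    def forward(self, state, action):
        h = self.body(torch.cat([state, action], dim=-1))
        mean = self.mean_head(h)
        # Keeping the return variance within a reasonable range is the
        # abstract's stated fix for exploding/vanishing gradients.
        log_std = self.log_std_head(h).clamp(LOG_STD_MIN, LOG_STD_MAX)
        return Normal(mean, log_std.exp())


def critic_loss(critic, batch, target_dist, gamma=0.99, alpha=0.2):
    """Negative log-likelihood of a sampled entropy-augmented return target.

    `target_dist` is a target critic's distribution at (s', a'); the target is
    the one-step soft bootstrap y = r + gamma * (1 - done) * (z' - alpha *
    log pi(a'|s')) with z' sampled from it. This is a simplification of the
    paper's update, shown only to convey where the distribution enters.
    """
    s, a, r, done, logp_next = (
        batch["s"], batch["a"], batch["r"], batch["done"], batch["logp_next"]
    )
    with torch.no_grad():
        z_next = target_dist.sample()
        y = r + gamma * (1.0 - done) * (z_next - alpha * logp_next)
    dist = critic(s, a)
    # The gradient of the NLL w.r.t. the mean scales with 1/variance, so the
    # learned variance adaptively modulates the Q-value update step size --
    # the mechanism the abstract credits for mitigating overestimation --
    # and the clamp above keeps that scale factor bounded.
    return -dist.log_prob(y).mean()
```

The full algorithm in the paper differs in several details (for example, target networks and the exact form of the return target); this sketch only illustrates how a clamped-variance Gaussian return model yields an adaptive, bounded critic update.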
| Author | Duan, Jingliang; Ren, Yangang; Guan, Yang; Li, Shengbo Eben; Sun, Qi; Cheng, Bo |
| Author_xml | 1. Jingliang Duan (duanjl15@163.com; ORCID 0000-0002-3697-1576); 2. Yang Guan (guany17@mails.tsinghua.edu.cn; ORCID 0000-0003-0689-0510); 3. Shengbo Eben Li (lishbo@tsinghua.edu.cn; ORCID 0000-0003-4923-3633); 4. Yangang Ren (ryg18@mails.tsinghua.edu.cn; ORCID 0000-0002-1173-7230); 5. Qi Sun (qisun@tsinghua.edu.cn); 6. Bo Cheng (chengbo@tsinghua.edu.cn). All authors: State Key Laboratory of Automotive Safety and Energy, School of Vehicle and Mobility, Tsinghua University, Beijing, China |
| CODEN | ITNNAL |
| ContentType | Journal Article |
| Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2022 |
| Copyright_xml | – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2022 |
| DOI | 10.1109/TNNLS.2021.3082568 |
| DatabaseName | IEEE All-Society Periodicals Package (ASPP) 2005-present IEEE All-Society Periodicals Package (ASPP) 1998-Present IEEE Electronic Library (IEL) CrossRef Aluminium Industry Abstracts Biotechnology Research Abstracts Calcium & Calcified Tissue Abstracts Ceramic Abstracts Chemoreception Abstracts Computer and Information Systems Abstracts Corrosion Abstracts Electronics & Communications Abstracts Engineered Materials Abstracts Materials Business File Mechanical & Transportation Engineering Abstracts Neurosciences Abstracts Solid State and Superconductivity Abstracts METADEX Technology Research Database ANTE: Abstracts in New Technology & Engineering Engineering Research Database Aerospace Database Materials Research Database ProQuest Computer Science Collection Civil Engineering Abstracts Advanced Technologies Database with Aerospace Computer and Information Systems Abstracts Academic Computer and Information Systems Abstracts Professional Biotechnology and BioEngineering Abstracts MEDLINE - Academic |
| DatabaseTitle | CrossRef Materials Research Database Technology Research Database Computer and Information Systems Abstracts – Academic Mechanical & Transportation Engineering Abstracts ProQuest Computer Science Collection Computer and Information Systems Abstracts Materials Business File Aerospace Database Engineered Materials Abstracts Biotechnology Research Abstracts Chemoreception Abstracts Advanced Technologies Database with Aerospace ANTE: Abstracts in New Technology & Engineering Civil Engineering Abstracts Aluminium Industry Abstracts Electronics & Communications Abstracts Ceramic Abstracts Neurosciences Abstracts METADEX Biotechnology and BioEngineering Abstracts Computer and Information Systems Abstracts Professional Solid State and Superconductivity Abstracts Engineering Research Database Calcium & Calcified Tissue Abstracts Corrosion Abstracts MEDLINE - Academic |
| DatabaseTitleList | Materials Research Database MEDLINE - Academic |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher – sequence: 2 dbid: 7X8 name: MEDLINE - Academic url: https://search.proquest.com/medline sourceTypes: Aggregation Database |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Computer Science |
| EISSN | 2162-2388 |
| EndPage | 6598 |
| ExternalDocumentID | 10_1109_TNNLS_2021_3082568 9448360 |
| Genre | orig-research |
| GrantInformation_xml | – fundername: NSF China grantid: 51575293; U20A20334 funderid: 10.13039/501100001809 – fundername: NSF Beijing grantid: JQ18010 – fundername: Tsinghua University–Toyota Joint Research Center for AI Technology of Automated Vehicle funderid: 10.13039/501100004147 |
| ISICitedReferencesCount | 178 |
| ISSN | 2162-237X 2162-2388 |
| IngestDate | Wed Oct 01 13:47:49 EDT 2025 Mon Jun 30 04:48:55 EDT 2025 Sat Nov 29 01:40:13 EST 2025 Tue Nov 18 21:45:11 EST 2025 Wed Aug 27 02:14:45 EDT 2025 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Issue | 11 |
| Language | English |
| License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html https://doi.org/10.15223/policy-029 https://doi.org/10.15223/policy-037 |
| LinkModel | DirectLink |
| Notes | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14 content type line 23 |
| ORCID | 0000-0002-3697-1576 0000-0002-1173-7230 0000-0003-0689-0510 0000-0003-4923-3633 |
| PMID | 34101599 |
| PQID | 2729636527 |
| PQPubID | 85436 |
| PageCount | 15 |
| ParticipantIDs | ieee_primary_9448360 proquest_journals_2729636527 proquest_miscellaneous_2539527430 crossref_citationtrail_10_1109_TNNLS_2021_3082568 crossref_primary_10_1109_TNNLS_2021_3082568 |
| PublicationCentury | 2000 |
| PublicationDate | 2022-11-01 |
| PublicationDateYYYYMMDD | 2022-11-01 |
| PublicationDate_xml | – month: 11 year: 2022 text: 2022-11-01 day: 01 |
| PublicationDecade | 2020 |
| PublicationPlace | Piscataway |
| PublicationPlace_xml | – name: Piscataway |
| PublicationTitle | IEEE Transactions on Neural Networks and Learning Systems |
| PublicationTitleAbbrev | TNNLS |
| PublicationYear | 2022 |
| Publisher | IEEE The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
| Publisher_xml | – name: IEEE – name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
| SourceID | proquest crossref ieee |
| SourceType | Aggregation Database Enrichment Source Index Database Publisher |
| StartPage | 6584 |
| SubjectTerms | Algorithms Approximation algorithms Artificial neural networks Control tasks Distribution functions Distributional soft actor–critic (DSAC) Embedding Entropy Errors Estimation Iterative methods Learning Maximum entropy overestimation Probability distribution Reinforcement Reinforcement learning reinforcement learning (RL) |
| Title | Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors |
| URI | https://ieeexplore.ieee.org/document/9448360 https://www.proquest.com/docview/2729636527 https://www.proquest.com/docview/2539527430 |
| Volume | 33 |
| journalDatabaseRights | – providerCode: PRVIEE databaseName: IEEE Electronic Library (IEL) customDbUrl: eissn: 2162-2388 dateEnd: 99991231 omitProxy: false ssIdentifier: ssj0000605649 issn: 2162-237X databaseCode: RIE dateStart: 20120101 isFulltext: true titleUrlDefault: https://ieeexplore.ieee.org/ providerName: IEEE |