From Show to Tell: A Survey on Deep Learning-Based Image Captioning
Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, large research efforts have been devoted to image captioning, i.e. describing images with syntactically and semantically meaningful sentences. Starting from 2015 the task has generally been addressed...
Saved in:
| Published in: | IEEE Transactions on Pattern Analysis and Machine Intelligence; Volume 45; Issue 1; pp. 539-559 |
|---|---|
| Main Authors: | Stefanini, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Cascianelli, Silvia; Fiameni, Giuseppe; Cucchiara, Rita |
| Format: | Journal Article |
| Language: | English |
| Published: | United States: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.01.2023 |
| ISSN: | 0162-8828, 1939-3539, 2160-9292 |
| Online Access: | Get full text |
| Abstract | Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, large research efforts have been devoted to image captioning, i.e. describing images with syntactically and semantically meaningful sentences. Starting from 2015 the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation. During these years, both components have evolved considerably through the exploitation of object regions, attributes, the introduction of multi-modal connections, fully-attentive approaches, and BERT-like early-fusion strategies. However, regardless of the impressive results, research in image captioning has not reached a conclusive answer yet. This work aims at providing a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics. In this respect, we quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies. Moreover, many variants of the problem and its open challenges are discussed. The final goal of this work is to serve as a tool for understanding the existing literature and highlighting the future directions for a research area where Computer Vision and Natural Language Processing can find an optimal synergy. |
|---|---|
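The abstract describes the dominant post-2015 recipe as a pipeline of a visual encoder and a language model for text generation. Below is a minimal sketch of that encoder-decoder pattern, assuming PyTorch and torchvision; the module sizes, the ResNet-50 backbone, and the LSTM decoder are illustrative assumptions for exposition, not the survey's reference implementation.

```python
# Minimal sketch of the visual-encoder + language-model pipeline described in
# the abstract. All sizes and names are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class CaptioningModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        # Visual encoder: a CNN with its classification head removed, yielding
        # one global feature per image (use pretrained weights in practice).
        backbone = resnet50(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        self.visual_proj = nn.Linear(2048, embed_dim)
        # Language model: an LSTM decoder conditioned on the image feature.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, T) token ids, teacher-forced.
        feats = self.encoder(images).flatten(1)        # (B, 2048)
        visual = self.visual_proj(feats).unsqueeze(1)  # (B, 1, E)
        tokens = self.embed(captions)                  # (B, T, E)
        # Prepend the projected image feature as the first input step.
        inputs = torch.cat([visual, tokens], dim=1)    # (B, T+1, E)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                        # (B, T+1, vocab_size)

model = CaptioningModel(vocab_size=10000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 13, 10000])
```

In the variants the survey covers, the global CNN feature is replaced by detected object regions or Vision Transformer patches, and the LSTM by a fully-attentive Transformer decoder, but the conditioning pattern stays the same.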
| Author | Stefanini, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Cascianelli, Silvia; Fiameni, Giuseppe; Cucchiara, Rita |
| Author_xml | – sequence: 1; givenname: Matteo; surname: Stefanini; fullname: Stefanini, Matteo; email: matteo.stefanini@unimore.it; organization: Department of Engineering "Enzo Ferrari", University of Modena and Reggio Emilia, Modena, Italy
– sequence: 2; givenname: Marcella; orcidid: 0000-0001-9640-9385; surname: Cornia; fullname: Cornia, Marcella; email: marcella.cornia@unimore.it; organization: Department of Engineering "Enzo Ferrari", University of Modena and Reggio Emilia, Modena, Italy
– sequence: 3; givenname: Lorenzo; orcidid: 0000-0001-5125-4957; surname: Baraldi; fullname: Baraldi, Lorenzo; email: lorenzo.baraldi@unimore.it; organization: Department of Engineering "Enzo Ferrari", University of Modena and Reggio Emilia, Modena, Italy
– sequence: 4; givenname: Silvia; orcidid: 0000-0001-7885-6050; surname: Cascianelli; fullname: Cascianelli, Silvia; email: silvia.cascianelli@unimore.it; organization: Department of Engineering "Enzo Ferrari", University of Modena and Reggio Emilia, Modena, Italy
– sequence: 5; givenname: Giuseppe; orcidid: 0000-0001-8687-6609; surname: Fiameni; fullname: Fiameni, Giuseppe; email: gfiameni@nvidia.com; organization: NVIDIA AI Technology Centre, Milan, Italy
– sequence: 6; givenname: Rita; surname: Cucchiara; fullname: Cucchiara, Rita; email: rita.cucchiara@unimore.it; organization: Department of Engineering "Enzo Ferrari", University of Modena and Reggio Emilia, Modena, Italy |
| BackLink | https://www.ncbi.nlm.nih.gov/pubmed/35130142 (View this record in MEDLINE/PubMed) |
| CODEN | ITPIDJ |
| ContentType | Journal Article |
| Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2023 |
| Copyright_xml | – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2023 |
| DBID | 97E RIA RIE AAYXX CITATION CGR CUY CVF ECM EIF NPM 7SC 7SP 8FD JQ2 L7M L~C L~D 7X8 |
| DOI | 10.1109/TPAMI.2022.3148210 |
| DatabaseName | IEEE All-Society Periodicals Package (ASPP) 2005–Present; IEEE All-Society Periodicals Package (ASPP) 1998–Present; IEEE/IET Electronic Library (IEL) (UW System Shared); CrossRef; Medline; MEDLINE; MEDLINE (Ovid); MEDLINE; MEDLINE; PubMed; Computer and Information Systems Abstracts; Electronics & Communications Abstracts; Technology Research Database; ProQuest Computer Science Collection; Advanced Technologies Database with Aerospace; Computer and Information Systems Abstracts Academic; Computer and Information Systems Abstracts Professional; MEDLINE - Academic |
| DatabaseTitle | CrossRef; MEDLINE; Medline Complete; MEDLINE with Full Text; PubMed; MEDLINE (Ovid); Technology Research Database; Computer and Information Systems Abstracts – Academic; Electronics & Communications Abstracts; ProQuest Computer Science Collection; Computer and Information Systems Abstracts; Advanced Technologies Database with Aerospace; Computer and Information Systems Abstracts Professional; MEDLINE - Academic |
| DatabaseTitleList | Technology Research Database MEDLINE - Academic MEDLINE |
| Database_xml | – sequence: 1 dbid: NPM name: PubMed url: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed sourceTypes: Index Database – sequence: 2 dbid: RIE name: IEEE/IET Electronic Library (IEL) (UW System Shared) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher – sequence: 3 dbid: 7X8 name: MEDLINE - Academic url: https://search.proquest.com/medline sourceTypes: Aggregation Database |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Engineering Computer Science |
| EISSN | 2160-9292 1939-3539 |
| EndPage | 559 |
| ExternalDocumentID | 35130142 10_1109_TPAMI_2022_3148210 9706348 |
| Genre | orig-research Research Support, Non-U.S. Gov't Journal Article Review |
| GrantInformation_xml | – fundername: H2020 ICT-48-2020 HumanE-AI-NET – fundername: Fondazione di Modena – fundername: Italian Ministry of Foreign Affairs and International Cooperation – fundername: Artificial Intelligence for Cultural Heritage |
| GroupedDBID | --- -DZ -~X .DC 0R~ 29I 4.4 53G 5GY 6IK 97E AAJGR AARMG AASAJ AAWTH ABAZT ABQJQ ABVLG ACGFO ACGFS ACIWK ACNCT AENEX AGQYO AHBIQ AKJIK AKQYR ALMA_UNASSIGNED_HOLDINGS ASUFR ATWAV BEFXN BFFAM BGNUA BKEBE BPEOZ CS3 DU5 E.L EBS EJD F5P HZ~ IEDLZ IFIPE IPLJI JAVBF LAI M43 MS~ O9- OCL P2P PQQKQ RIA RIE RNS RXW TAE TN5 UHB ~02 AAYXX CITATION 5VS 9M8 AAYOK ABFSI ADRHT AETIX AGSQL AI. AIBXA ALLEH CGR CUY CVF ECM EIF FA8 H~9 IBMZZ ICLAB IFJZH NPM PKN RIC RIG RNI RZB VH1 XJT Z5M 7SC 7SP 8FD JQ2 L7M L~C L~D 7X8 |
| ID | FETCH-LOGICAL-c461t-25cfb6ac8d62fed8c13ff33c2f7f367f356f9a330c96dea812007cad86cb84753 |
| IEDL.DBID | RIE |
| ISICitedReferencesCount | 191 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000899419900033&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| ISSN | 0162-8828 1939-3539 |
| IngestDate | Sun Nov 09 11:17:41 EST 2025 Sun Nov 09 08:59:08 EST 2025 Wed Feb 19 02:23:56 EST 2025 Tue Nov 18 22:17:44 EST 2025 Sat Nov 29 02:58:19 EST 2025 Wed Aug 27 02:14:46 EDT 2025 |
| IsDoiOpenAccess | false |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 1 |
| Language | English |
| License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html https://doi.org/10.15223/policy-029 https://doi.org/10.15223/policy-037 |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-c461t-25cfb6ac8d62fed8c13ff33c2f7f367f356f9a330c96dea812007cad86cb84753 |
| Notes | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14 ObjectType-Article-2 ObjectType-Feature-3 content type line 23 ObjectType-Review-1 |
| ORCID | 0000-0001-8687-6609 0000-0001-7885-6050 0000-0001-9640-9385 0000-0001-5125-4957 |
| OpenAccessLink | https://hdl.handle.net/11380/1258568 |
| PMID | 35130142 |
| PQID | 2747610425 |
| PQPubID | 85458 |
| PageCount | 21 |
| ParticipantIDs | proquest_miscellaneous_2626891261 ieee_primary_9706348 crossref_primary_10_1109_TPAMI_2022_3148210 crossref_citationtrail_10_1109_TPAMI_2022_3148210 pubmed_primary_35130142 proquest_journals_2747610425 |
| PublicationCentury | 2000 |
| PublicationDate | 2023-Jan.-1 2023-1-1 2023-01-00 20230101 |
| PublicationDateYYYYMMDD | 2023-01-01 |
| PublicationDate_xml | – month: 01 year: 2023 text: 2023-Jan.-1 day: 01 |
| PublicationDecade | 2020 |
| PublicationPlace | United States |
| PublicationPlace_xml | – name: United States – name: New York |
| PublicationTitle | IEEE transactions on pattern analysis and machine intelligence |
| PublicationTitleAbbrev | TPAMI |
| PublicationTitleAlternate | IEEE Trans Pattern Anal Mach Intell |
| PublicationYear | 2023 |
| Publisher | IEEE The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
| Publisher_xml | – name: IEEE – name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
| SSID | ssj0014503 |
| Score | 2.7293348 |
| SecondaryResourceType | review_article |
| Snippet | Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, large research efforts have been devoted to image... |
| SourceID | proquest pubmed crossref ieee |
| SourceType | Aggregation Database Index Database Enrichment Source Publisher |
| StartPage | 539 |
| SubjectTerms | Additives Algorithms Benchmarking Coders Computer vision Convolutional neural networks Deep Learning Feature extraction Image captioning Image coding Language Natural Language Processing Sentences survey Task analysis Training vision-and-language Visualization |
| Title | From Show to Tell: A Survey on Deep Learning-Based Image Captioning |
| URI | https://ieeexplore.ieee.org/document/9706348 https://www.ncbi.nlm.nih.gov/pubmed/35130142 https://www.proquest.com/docview/2747610425 https://www.proquest.com/docview/2626891261 |
| Volume | 45 |
| WOSCitedRecordID | wos000899419900033&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| journalDatabaseRights | – providerCode: PRVIEE databaseName: IEEE/IET Electronic Library (IEL) (UW System Shared) customDbUrl: eissn: 2160-9292 dateEnd: 99991231 omitProxy: false ssIdentifier: ssj0014503 issn: 0162-8828 databaseCode: RIE dateStart: 19790101 isFulltext: true titleUrlDefault: https://ieeexplore.ieee.org/ providerName: IEEE |
| link | http://cvtisr.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwlV3fa9RAEB7aIlIfrLa2RmtZwTeNvWSzya5v5-lhQUuhp9xbSHZntXBNjutdxf_e2c0mKKjgQyCQyQ92Znbmm83OB_BC146QpqjjKkNfrRJxzS2BlYKis7EjLXwp-8vH4vxczufqYgteDXthENH_fIav3alfyzet3rhS2akqKKBmchu2iyLv9moNKwaZ8CzIlMGQhxOM6DfIjNTp7GL86YygYJoSQs0kgZxduMtF4tBE-ls88gQrf881fcyZ7v3f1z6A-yG3ZOPOGB7CFjb7sNfzNrDgxvtw75cmhAcwma7aa3b5rf3O1i2b4WLxho3Z5WZ1iz9Y27B3iEsW2rB-jd9S1DPs7JqmITaplqGc-wg-T9_PJh_iQK0Q6yxP1nEqtK3zSkuTpxaN1Am3lnOd2sLynA6RW1VxPtIqN1hRFkAK1ZWRua4pngl-CDtN2-BjYJmyaIUmiG1kpuu60kKJDG3BBXJjbARJP8ClDn3HHf3FovT4Y6RKr5_S6acM-ong5XDPsuu68U_pAzf6g2QY-AiOez2WwTFvSgfCKWOkmSqC58Nlcim3TlI12G5IhkCeVAlhywiOOv0Pz-7N5smf3_kUdh0ffVejOYad9WqDz-COvl1f3axOyG7n8sTb7U98p-TI |
| linkProvider | IEEE |
| linkToHtml | http://cvtisr.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwlV3db9MwED-Ngdh4YLCNLTDASLxBWBLbScxbKVSr6KpJK2hvUeKPDalLqq4d4r_n7DgRSIDEQ6RIcT50H777neP7AbyWlSWkyaqwZNpVq3hYUYNgJcPorEwkuStlf51k02l-cSHONuBtvxdGa-1-PtPv7Klby1eNXNtS2bHIMKCy_A7c5YwlUbtbq18zYNzxIGMOgz6OQKLbIhOJ49nZ4HSMYDBJEKOyHGHONtynPLZ4IvktIjmKlb9nmy7qjHb-73sfwUOfXZJBaw6PYUPXu7DTMTcQ78i78OCXNoR7MBwtm2tyftV8J6uGzPR8_p4MyPl6eat_kKYmH7VeEN-I9TL8gHFPkfE1TkRkWC58QXcfvow-zYYnoSdXCCVL41WYcGmqtJS5ShOjVS5jagylMjGZoSkePDWipDSSIlW6xDwAVSpLlaeywojG6RPYrJtaHwJhwmjDJYJslTNZVaXkgjNtMso1VcoEEHcCLqTvPG4JMOaFQyCRKJx-CqufwusngDf9PYu278Y_R-9Z6fcjveADOOr0WHjXvCksDMecEeeqAF71l9Gp7EpJWetmjWMQ5uUiRnQZwEGr__7Zndk8_fM7X8LWyex0UkzG08_PYNuy07cVmyPYXC3X-jnck7erbzfLF856fwInn-cn |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=From+Show+to+Tell%3A+A+Survey+on+Deep+Learning-Based+Image+Captioning&rft.jtitle=IEEE+transactions+on+pattern+analysis+and+machine+intelligence&rft.au=Stefanini%2C+Matteo&rft.au=Cornia%2C+Marcella&rft.au=Baraldi%2C+Lorenzo&rft.au=Cascianelli%2C+Silvia&rft.date=2023-01-01&rft.pub=The+Institute+of+Electrical+and+Electronics+Engineers%2C+Inc.+%28IEEE%29&rft.issn=0162-8828&rft.eissn=1939-3539&rft.volume=45&rft.issue=1&rft.spage=539&rft_id=info:doi/10.1109%2FTPAMI.2022.3148210&rft.externalDBID=NO_FULL_TEXT |
| thumbnail_l | http://covers-cdn.summon.serialssolutions.com/index.aspx?isbn=/lc.gif&issn=0162-8828&client=summon |
| thumbnail_m | http://covers-cdn.summon.serialssolutions.com/index.aspx?isbn=/mc.gif&issn=0162-8828&client=summon |
| thumbnail_s | http://covers-cdn.summon.serialssolutions.com/index.aspx?isbn=/sc.gif&issn=0162-8828&client=summon |