From Show to Tell: A Survey on Deep Learning-Based Image Captioning

Detailed bibliography
Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 45, Issue 1, pp. 539-559
Main authors: Stefanini, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Cascianelli, Silvia; Fiameni, Giuseppe; Cucchiara, Rita
Format: Journal Article
Language: English
Publication details: United States: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.01.2023
Subject: Additives; Algorithms; Benchmarking; Coders; Computer vision; Convolutional neural networks; Deep Learning; Feature extraction; Image captioning; Image coding; Language; Natural Language Processing; Sentences; survey; Task analysis; Training; vision-and-language; Visualization
ISSN: 0162-8828 (print); 1939-3539, 2160-9292 (electronic)
Online access: https://ieeexplore.ieee.org/document/9706348; https://www.ncbi.nlm.nih.gov/pubmed/35130142; https://www.proquest.com/docview/2747610425; https://www.proquest.com/docview/2626891261
Abstract: Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, large research efforts have been devoted to image captioning, i.e., describing images with syntactically and semantically meaningful sentences. Starting from 2015, the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation. Over these years, both components have evolved considerably through the exploitation of object regions and attributes, the introduction of multi-modal connections, fully-attentive approaches, and BERT-like early-fusion strategies. However, despite the impressive results, research in image captioning has not yet reached a conclusive answer. This work aims to provide a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics. In this respect, we quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies. Moreover, many variants of the problem and its open challenges are discussed. The final goal of this work is to serve as a tool for understanding the existing literature and highlighting future directions for a research area where Computer Vision and Natural Language Processing can find an optimal synergy.
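The encoder-decoder pipeline the abstract refers to can be made concrete with a short sketch. The following is a minimal, illustrative PyTorch implementation of the canonical recipe (a visual encoder conditioning an autoregressive language model, trained with word-level cross-entropy), not the survey's own code; all names and dimensions (VOCAB_SIZE, FEAT_DIM, CaptioningModel, etc.) are assumptions chosen for the example.

```python
# Minimal sketch of the visual-encoder + language-model captioning pipeline.
# Illustrative only; PyTorch is assumed, and all names/dimensions are made up.
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, FEAT_DIM = 10_000, 256, 512, 2048

class CaptioningModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Visual encoding: project precomputed image features (e.g., a
        # global CNN vector or pooled region features) into decoder space.
        self.visual_proj = nn.Linear(FEAT_DIM, HIDDEN_DIM)
        # Language model: word embeddings plus an LSTM decoder.
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.lstm = nn.LSTM(EMBED_DIM, HIDDEN_DIM, batch_first=True)
        self.out = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)

    def forward(self, image_feats, captions):
        # Initialize the decoder state from the image representation.
        h0 = torch.tanh(self.visual_proj(image_feats)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        # Teacher forcing: condition each step on the caption prefix.
        hidden, _ = self.lstm(self.embed(captions), (h0, c0))
        return self.out(hidden)  # per-step vocabulary logits

# Toy usage: a batch of 2 images with captions of length 5.
model = CaptioningModel()
feats = torch.randn(2, FEAT_DIM)
caps = torch.randint(0, VOCAB_SIZE, (2, 5))
logits = model(feats, caps)  # shape: (2, 5, VOCAB_SIZE)
# Word-level cross-entropy, the standard objective before any
# reinforcement-learning fine-tuning; in a real setup the targets
# would be the captions shifted by one token.
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE), caps.reshape(-1))
```

Modern variants replace the LSTM with a Transformer and the global feature with detected object regions or fully-attentive visual backbones, but the overall encoder-plus-language-model structure is the same one the abstract traces from 2015 onward.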
Authors:
1. Stefanini, Matteo (matteo.stefanini@unimore.it), Department of Engineering "Enzo Ferrari", University of Modena and Reggio Emilia, Modena, Italy
2. Cornia, Marcella (ORCID: 0000-0001-9640-9385, marcella.cornia@unimore.it), Department of Engineering "Enzo Ferrari", University of Modena and Reggio Emilia, Modena, Italy
3. Baraldi, Lorenzo (ORCID: 0000-0001-5125-4957, lorenzo.baraldi@unimore.it), Department of Engineering "Enzo Ferrari", University of Modena and Reggio Emilia, Modena, Italy
4. Cascianelli, Silvia (ORCID: 0000-0001-7885-6050, silvia.cascianelli@unimore.it), Department of Engineering "Enzo Ferrari", University of Modena and Reggio Emilia, Modena, Italy
5. Fiameni, Giuseppe (ORCID: 0000-0001-8687-6609, gfiameni@nvidia.com), NVIDIA AI Technology Centre, Milan, Italy
6. Cucchiara, Rita (rita.cucchiara@unimore.it), Department of Engineering "Enzo Ferrari", University of Modena and Reggio Emilia, Modena, Italy
CODEN: ITPIDJ
Copyright: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2023
DOI: 10.1109/TPAMI.2022.3148210
Genre: Original research; Journal Article; Review; Research Support, Non-U.S. Gov't
Funding: H2020 ICT-48-2020 HumanE-AI-NET; Fondazione di Modena; Italian Ministry of Foreign Affairs and International Cooperation; Artificial Intelligence for Cultural Heritage
License: https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html; https://doi.org/10.15223/policy-029; https://doi.org/10.15223/policy-037
Open access link: https://hdl.handle.net/11380/1258568
PMID: 35130142