A survey on automatic image caption generation

Image captioning means automatically generating a caption for an image. As a recently emerged research area, it is attracting more and more attention. To achieve the goal of image captioning, semantic information of images needs to be captured and expressed in natural languages. Connecting both rese...

Published in: Neurocomputing (Amsterdam), Volume 311, pp. 291-304
Main authors: Bai, Shuang; An, Shan
Format: Journal Article
Language: English
Published: Elsevier B.V., 15 October 2018
ISSN:0925-2312, 1872-8286
Abstract Image captioning means automatically generating a caption for an image. As a recently emerged research area, it is attracting increasing attention. To achieve the goal of image captioning, the semantic information of images needs to be captured and expressed in natural language. Connecting the research communities of computer vision and natural language processing, image captioning is a challenging task. Various approaches have been proposed to solve this problem. In this paper, we present a survey of advances in image captioning research. Based on the technique adopted, we classify image captioning approaches into different categories. Representative methods in each category are summarized, and their strengths and limitations are discussed. We first discuss methods used in early work, which are mainly retrieval based and template based. Then, we focus on neural network based methods, which give state-of-the-art results. Neural network based methods are further divided into subcategories according to the specific framework they use, and each subcategory is discussed in detail. After that, state-of-the-art methods are compared on benchmark datasets. Finally, future research directions are discussed.
Author Bai, Shuang
An, Shan
Author_xml – sequence: 1
  givenname: Shuang
  surname: Bai
  fullname: Bai, Shuang
  email: shuangb@bjtu.edu.cn
  organization: School of Electronic and Information Engineering, Beijing Jiaotong University, No.3 Shang Yuan Cun, Hai Dian District, Beijing, China
– sequence: 2
  givenname: Shan
  surname: An
  fullname: An, Shan
  organization: Beijing Jingdong Shangke Information Technology Co., Ltd, Beijing, China
ContentType Journal Article
Copyright 2018
Copyright_xml – notice: 2018
DOI 10.1016/j.neucom.2018.05.080
Discipline Computer Science
EISSN 1872-8286
EndPage 304
ISICitedReferencesCount 132
ISSN 0925-2312
IsPeerReviewed true
IsScholarly true
Keywords Attention mechanism
Deep neural networks
Image captioning
Encoder–decoder framework
Multimodal embedding
Sentence template
Language English
PageCount 14
PublicationCentury 2000
PublicationDate 2018-10-15
PublicationTitle Neurocomputing (Amsterdam)
PublicationYear 2018
Publisher Elsevier B.V
Publisher_xml – name: Elsevier B.V
References_xml – start-page: 1292
  year: 2013
  end-page: 1302
  ident: bib0116
  article-title: Image description using visual dependency representations
  publication-title: Proceedings of the Conference on Empirical Methods in Natural Language Processing
– start-page: 2625
  year: 2015
  end-page: 2634
  ident: bib0034
  article-title: Long-term recurrent convolutional networks for visual recognition and description
  publication-title: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
– start-page: 1
  year: 2016
  end-page: 6
  ident: bib0073
  article-title: Describing images by feeding LSTM with structural words
  publication-title: Proceedings of the IEEE International Conference on Multimedia and Expo
– year: 2012
  ident: bib0083
  article-title: Collective generation of natural image descriptions
  publication-title: Proceedings of the Meeting of the Association for Computational Linguistics
– reference: Y. Wei, W. Xia, J. Huang, B. Ni, J. Dong, Y. Zhao, S. Yan, CNN: single-label to multi-label, arXiv:
– volume: 3
  start-page: 993
  year: 2003
  end-page: 1022
  ident: bib0122
  article-title: Latent Dirichlet allocation
  publication-title: J. Mach. Learn. Res.
– reference: Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: convolutional architecture for fast feature embedding, arXiv:
– volume: 7
  start-page: 1
  year: 2007
  end-page: 29
  ident: bib0001
  article-title: What do we perceive in a glance of a real-world scene?
  publication-title: J. Vis.
– start-page: 2265
  year: 2013
  end-page: 2273
  ident: bib0099
  article-title: Learning word embeddings efficiently with noise-contrastive estimation
  publication-title: Proceedings of the Advances in Neural Information Processing Systems
– reference: K. Greff, R.K. Srivastava, J. KoutnÃk, B.R. Steunebrink, J. Schmidhuber, LSTM: a search space odyssey, arXiv:
– start-page: 1247
  year: 2013
  end-page: 1255
  ident: bib0098
  article-title: Deep canonical correlation analysis
  publication-title: Proceedings of the International Conference on Machine Learning
– reference: A. Tariq, H. Foroosh, A context-driven extractive framework for generating realistic image descriptions, IEEE Trans. Image Process. 26(2).
– volume: 46
  start-page: 875
  year: 2016
  end-page: 885
  ident: bib0022
  article-title: Fine-tuning deep belief networks using harmony search
  publication-title: Appl. Soft Comput.
– start-page: 641
  year: 2007
  end-page: 648
  ident: bib0030
  article-title: Three new graphical models for statistical language modelling
  publication-title: Proceedings of the Twenty Fourth International Conference on Machine Learning
– year: 2015
  ident: bib0035
  article-title: Deep captioning with multimodal recurrent neural networks
  publication-title: Proceedings of the International Conference on Learning Representation
– reference: S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C.L. Zitnick, D. Parikh, VQA: Visual question answering., arXiv:
– start-page: 2042
  year: 2014
  end-page: 2050
  ident: bib0096
  article-title: Convolutional neural network architectures for matching natural language sentences
  publication-title: Proceedings of the Twenty Seventh International Conference on Neural Information Processing Systems
– reference: (2014) 1–14.
– year: 2013
  ident: bib0031
  article-title: Distributed representations of words and phrases and their compositionality
  publication-title: Proceedings of the Advances in Neural Information Processing Systems
– volume: 47
  start-page: 853
  year: 2013
  end-page: 899
  ident: bib0032
  article-title: Framing image description as a ranking task: data, models and evaluation metrics
  publication-title: J. Artif. Intell. Res.
– volume: 31
  start-page: 339
  year: 2008
  end-page: 429
  ident: bib0082
  article-title: Global inference for sentence compression an integer linear programming approach
  publication-title: J. Artif. Intell. Res.
– reference: D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv:
– reference: K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:
– reference: Y. Feng, M. Lapata, Automatic caption generation for news images, IEEE Trans. Pattern Anal. Mach. Intell. 35(4).
– volume: 16
  start-page: 219
  year: 2004
  end-page: 237
  ident: bib0112
  article-title: A feedback model of visual attention
  publication-title: J. Cognit. Neurosci.
– start-page: 3613
  year: 2016
  end-page: 3617
  ident: bib0074
  article-title: Image description through fusion based recurrent multi-modal learning
  publication-title: Proceedings of the IEEE International Conference on Image Processing
– reference: M. Malinowski, M. Fritz, A multi-world approach to question answering about real-world scenes based on uncertain input, in: Proceedings of the Advances in Neural Information Processing Systems, pp. 1682–1690.
– volume: 14
  start-page: 179
  year: 1990
  end-page: 211
  ident: bib0101
  article-title: Finding structure in time
  publication-title: Cognit. Sci.
– start-page: 160
  year: 2008
  end-page: 167
  ident: bib0029
  article-title: A unified architecture for natural language processing: deep neural networks with multitask learning
  publication-title: Proceedings of the Twenty Fifth International Conference on Machine Learning
– year: 2014
  ident: bib0115
  article-title: Recurrent models of visual attention
  publication-title: Proceedings of the Advances in Neural Information Processing Systems
– year: 2011
  ident: bib0087
  article-title: Baby talk: understanding and generating simple image descriptions
  publication-title: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
– start-page: 770
  year: 2016
  end-page: 778
  ident: bib0120
  article-title: Deep residual learning for image recognition
  publication-title: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
– volume: 2
  start-page: 207
  year: 2014
  end-page: 218
  ident: bib0055
  article-title: Grounded compositional semantics for finding and describing images with sentences
  publication-title: Trans. Assoc. Comput. Linguist.
– year: 2014
  ident: bib0106
  article-title: Sequence to sequence learning with neural networks
  publication-title: Proceedings of the Advances in Neural Information Processing Systems
– volume: 104
  start-page: 154
  year: 2013
  end-page: 171
  ident: bib0121
  article-title: Selective search for object recognition
  publication-title: Int. J. Comput. Vis.
– start-page: 449
  year: 2006
  end-page: 454
  ident: bib0093
  article-title: Generating typed dependency parses from phrase structure parses
  publication-title: Proceedings of the LREC
– reference: K. Papineni, S. Roukos, T. Ward, W. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the Meeting on Association for Computational Linguistics, vol. 4 (2002).
– year: 2016
  ident: bib0072
  article-title: Aligning where to see and what to tell: image captioning with region-based attention and scene-specific contexts
  publication-title: IEEE Trans. Pattern Anal. Mach. Intell.
– start-page: 2407
  year: 2015
  end-page: 2415
  ident: bib0065
  article-title: Guiding the long-short term memory model for image caption generation
  publication-title: Proceedings of the IEEE International Conference on Computer Vision
– volume: 521
  start-page: 436
  year: 2015
  end-page: 444
  ident: bib0090
  article-title: Deep learning
  publication-title: Nature
– start-page: 444
  year: 2011
  end-page: 454
  ident: bib0014
  article-title: Corpus-guided sentence generation of natural images
  publication-title: Proceedings of the Conference on Empirical Methods in Natural Language Processing
– volume: 50
  start-page: 171
  year: 2002
  end-page: 184
  ident: bib0011
  article-title: Natural language description of human activities from video images based on concept hierarchy of actions
  publication-title: Int. J. Comput. Vis.
– year: 2013
  ident: bib0105
  article-title: Recurrent continuous translation models
  publication-title: Proceedings of the Conference on Empirical Methods in Natural Language Processing
– volume: 3
  start-page: 1
  year: 2002
  end-page: 48
  ident: bib0079
  article-title: Kernel independent component analysis
  publication-title: J. Mach. Learn. Res.
– year: 2004
  ident: bib0012
  article-title: Automatic generation of natural language descriptions for images
  publication-title: Proceedings of the Recherche d'Information Assistée par Ordinateur
– year: 2004
  ident: bib0124
  article-title: Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics
  publication-title: Proceedings of the Meeting on Association for Computational Linguistics
– year: 2014
  ident: bib0049
  article-title: Nonparametric method for data driven image captioning
  publication-title: Proceedings of the Fifty Second Annual Meeting of the Association for Computational Linguistics
– reference: J. Curran, S. Clark, J. Bos, Linguistically motivated large-scale NLP with CC and boxer, in: Proceedings of the Forty Fifth Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 33–36.
– volume: 35
  start-page: 2854
  year: 2013
  end-page: 2865
  ident: bib0089
  article-title: Phrasal recognition
  publication-title: IEEE Trans. Pattern Anal. Mach. Intell.
– start-page: 1473
  year: 2015
  end-page: 1482
  ident: bib0033
  article-title: From captions to visual concepts and back
  publication-title: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
– start-page: 15
  year: 2010
  end-page: 29
  ident: bib0013
  article-title: Every picture tells a story: Generating sentences from images
  publication-title: Proceedings of the European Conference on Computer Vision
– year: 2005
  ident: bib0088
  article-title: Europarl: a parallel corpus for statistical machine translation
  publication-title: MT Summit
– start-page: 194
  year: 2000
  end-page: 201
  ident: bib0119
  article-title: Trainable methods for surface natural language generation
  publication-title: Proceedings of the North American Chapter of the Association for Computational Linguistics Conference
– volume: 19
  start-page: 61
  year: 1993
  end-page: 74
  ident: bib0086
  article-title: Accurate methods for the statistics of surprise and coincidence
  publication-title: Comput. Linguist.
– volume: 25
  start-page: 2212
  year: 2014
  end-page: 2225
  ident: bib0017
  article-title: Learning deep hierarchical visual feature coding
  publication-title: IEEE Trans. Neural Netw. Learn. Syst.
– start-page: 2623
  year: 2015
  end-page: 2631
  ident: bib0056
  article-title: Multimodal convolutional neural networks for matching image and sentences
  publication-title: Proceedings of the IEEE International Conference on Computer Vision
– start-page: 4259
  year: 2015
  end-page: 4267
  ident: bib0007
  article-title: Mining semantic affordances of visual object categories
  publication-title: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
– start-page: 2668
  year: 2015
  end-page: 2676
  ident: bib0054
  article-title: Common subspace for model and similarity: phrase learning for caption generation from images
  publication-title: Proceedings of the IEEE International Conference on Computer Vision
– volume: 22
  start-page: 39
  year: 1996
  end-page: 71
  ident: bib0118
  article-title: A maximum entropy approach to natural language processing
  publication-title: Comput. Linguist.
– volume: 16
  start-page: 2639
  year: 2004
  end-page: 2664
  ident: bib0080
  article-title: Canonical correlation analysis: an overview with application to learning methods
  publication-title: Neural Comput.
– volume: 5
  year: 2012
  ident: bib0016
  article-title: Choosing linguistics over vision to describe images
  publication-title: Proceedings of the AAAI Conference on Artificial Intelligence
– start-page: 2422
  year: 2015
  end-page: 2431
  ident: bib0062
  article-title: Mind’s eye: a recurrent visual representation for image caption generation
  publication-title: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
– volume: 37
  start-page: 125
  year: 2015
  end-page: 141
  ident: bib0025
  article-title: Feedforward kernel neural networks, generalized least learning machine, and its deep learning with application to image classification
  publication-title: Appl. Soft Comput.
– reference: C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, arXiv:
– start-page: 1419
  year: 2005
  end-page: 1426
  ident: bib0117
  article-title: Multiple instance boosting for object detection
  publication-title: Proceedings of the Advances in Neural Information Processing Systems
– reference: X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollar, C. Zitnick, Microsoft COCO captions: data collection and evaluation server, arXiv:
– start-page: 1143
  year: 2011
  end-page: 1151
  ident: bib0015
  article-title: Im2Text: describing images using 1 million captioned photographs
  publication-title: Proceedings of the Advances in Neural Information Processing Systems
– start-page: 4565
  year: 2016
  end-page: 4574
  ident: bib0130
  article-title: DenseCap: fully convolutional localization networks for dense captioning
  publication-title: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
– volume: 52
  start-page: 1210
  year: 2017
  end-page: 1221
  ident: bib0021
  article-title: Research on point-wise gated deep networks
  publication-title: Appl. Soft Comput.
– volume: 32
  start-page: 1627
  year: 2010
  end-page: 1645
  ident: bib0002
  article-title: Object detection with discriminatively trained part based models
  publication-title: IEEE Trans. Pattern Anal. Mach. Intell.
– reference: C. Farabet, C. Couprie, L. Najman, Y. LeCun, Learning hierarchical features for scene labeling, IEEE Trans. Pattern Anal. Mach. Intell. 35(8).
– start-page: 580
  year: 2014
  end-page: 587
  ident: bib0003
  article-title: Rich feature hierarchies for accurate object detection and semantic segmentation
  publication-title: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
– start-page: 3177
  year: 2011
  end-page: 3184
  ident: bib0006
  article-title: Action recognition from a distributed representation of pose and appearance
  publication-title: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
– reference: T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv:
– volume: 35
  start-page: 1798
  year: 2013
  end-page: 1828
  ident: bib0018
  article-title: Representation learning: a review and new perspectives
  publication-title: IEEE Trans. Pattern Anal. Mach. Intell.
– start-page: 1097
  year: 2012
  end-page: 1105
  ident: bib0008
  article-title: ImageNet classification with deep convolutional neural networks
  publication-title: Proceedings of the Twenty Fifth International Conference on Neural Information Processing Systems
– year: 2015
  ident: bib0114
  article-title: Multiple object recognition with visual attention
  publication-title: Proceedings of the International Conference on Learning Representation
– year: 2015
  ident: bib0047
  article-title: Sequence to sequence – video to text
  publication-title: Proceedings of the International Conference on Computer Vision
– start-page: 434
  year: 2016
  end-page: 441
  ident: bib0071
  article-title: Rich image captioning in the wild
  publication-title: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
– volume: 7
  start-page: 17
  year: 2000
  end-page: 42
  ident: bib0111
  article-title: The dynamic representation of scenes
  publication-title: Vis. Cognit.
– start-page: 228
  year: 2007
  end-page: 231
  ident: bib0125
  article-title: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments
  publication-title: Proceedings of the Second Workshop on Statistical Machine Translation
– volume: 71
  start-page: 279
  year: 2017
  end-page: 287
  ident: bib0026
  article-title: Growing random forest on deep convolutional neural networks for scene categorization
  publication-title: Expert Syst. Appl.
– start-page: 392
  year: 2014
  end-page: 407
  ident: bib0010
  article-title: Multi-scale orderless pooling of deep convolutional activation features
  publication-title: Proceedings of the European Conference on Computer Vision
– reference: K. Cho, B. van Merriënboer, C. Gulcehre, Learning phrase representations using RNN encoder–decoder for statistical machine translation, arXiv:
– year: 2016
  ident: bib0067
  article-title: Variational autoencoder for deep learning of images, labels and captions
  publication-title: Proceedings of the Advances in Neural Information Processing Systems
– start-page: 203
  year: 2016
  end-page: 212
  ident: bib0066
  article-title: What value do explicit high level concepts have in vision to language problems?
  publication-title: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
– volume: 3
  start-page: 1889
  year: 2014
  end-page: 1897
  ident: bib0037
  article-title: Deep fragment embeddings for bidirectional image sentence mapping
  publication-title: Proceedings of the Twenty Seventh Advances in Neural Information Processing Systems (NIPS)
– reference: Y. Bengio, P. Simard, P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Trans. Neural Netw. 5(5).
– start-page: 2533
  year: 2015
  end-page: 2541
  ident: bib0076
  article-title: Learning like a child: fast novel visual concept learning from sentence descriptions of images
  publication-title: Proceedings of the IEEE International Conference on Computer Vision
– year: 2012
  ident: bib0092
  article-title: Building high-level features using large scale unsupervised learning
  publication-title: Proceedings of the International Conference on Machine Learning
– volume: 46
  start-page: 936
  year: 2016
  end-page: 952
  ident: bib0024
  article-title: Hybrid deep neural network model for human action recognition
  publication-title: Appl. Soft Comput.
– start-page: 1
  year: 2016
  end-page: 10
  ident: bib0036
  article-title: Deep compositional captioning: describing novel object categories without paired training data
  publication-title: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
– year: 2012
  ident: bib0053
  article-title: Midge: Generating image descriptions from computer vision detections
  publication-title: Proceedings of the Thirteenth Conference of the European Chapter of the Association for Computational Linguistics
– volume: 45
  start-page: 2673
  year: 1997
  end-page: 2681
  ident: bib0102
  article-title: Bidirectional recurrent neural networks
  publication-title: IEEE Trans. Signal Process.
– year: 2015
  ident: bib0040
  article-title: Ask your neurons: a neural-based approach to answering questions about images
  publication-title: Proceedings of the International Conference on Computer Vision
– start-page: 647
  year: 2014
  end-page: 655
  ident: bib0019
  article-title: DeCAF: a deep convolutional activation feature for generic visual recognition
  publication-title: Proceedings of the Thirty First International Conference on Machine Learning
– year: 2016
  ident: bib0075
  article-title: A parallel-fusion RNN-LSTM architecture for image caption generation
  publication-title: Proceedings of the IEEE International Conference on Image Processing
– start-page: 87
  year: 2016
  end-page: 97
  ident: bib0005
  article-title: Learning attributes equals multi-source domain generalization
  publication-title: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
– volume: 2
  start-page: 351
  year: 2014
  end-page: 362
  ident: bib0050
  article-title: TREETALK: composition and compression of trees for image descriptions
  publication-title: Trans. Assoc. Comput. Linguist.
– start-page: 2361
  year: 2016
  end-page: 2369
  ident: bib0070
  article-title: Review networks for caption generation
  publication-title: Proceedings of the Advances in Neural Information Processing Systems
– reference: H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, W. Xu, Are you talking to a machine? Dataset and methods for multilingual image question answering, in: Proceedings of the Advances in Neural Information Processing Systems, pp. 2296–2304.
– reference: K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, Y. Bengio, Show, attend and tell: neural image caption generation with visual attention, arXiv:
– start-page: 4566
  year: 2015
  end-page: 4575
  ident: bib0126
  article-title: CIDEr: consensus-based image description evaluation
  publication-title: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
– start-page: 1045
  year: 2010
  end-page: 1048
  ident: bib0104
  article-title: Recurrent neural network based language model
  publication-title: Proceedings of the Conference of the International Speech Communication Association
– year: 2014
  ident: bib0046
  article-title: Integrating language and vision to generate natural language descriptions of videos in the wild
  publication-title: Proceedings of the International Conference on Computational Linguistics
– start-page: 487
  year: 2014
  end-page: 495
  ident: bib0009
  article-title: Learning deep features for scene recognition using Places database
  publication-title: Proceedings of the Advances in Neural Information Processing Systems (NIPS)
– reference: S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, K. Saenko, YouTube2Text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition, in: Proceedings of the International Conference on Computer Vision, pp. 2712–2719.
– year: 2015
  ident: bib0058
  article-title: Phrase-based image captioning
  publication-title: Proceedings of the International Conference on Machine Learning
– reference: O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: lessons learned from the 2015 MSCOCO image captioning challenge, IEEE Trans. Pattern Anal. Mach. Intell. 39(4).
– reference: D. Geman, S. Geman, N. Hallonquist, L. Younes, Visual Turing test for computer vision systems, in: Proceedings of the National Academy of Sciences of the United States of America, vol. 112, pp. 3618–3623.
– volume: 35
  start-page: 2891
  year: 2013
  end-page: 2903
  ident: bib0051
  article-title: BabyTalk: understanding and generating simple image descriptions
  publication-title: IEEE Trans. Pattern Anal. Mach. Intell.
– year: 2004
  ident: bib0081
  article-title: A linear programming formulation for global inference in natural language tasks
  publication-title: Proceedings of the Annual Conference on Computational Natural Language Learning
– reference: R. Kiros, R. Salakhutdinov, R. Zemel, Unifying visual-semantic embeddings with multimodal neural language models, arXiv:
– start-page: 3128
  year: 2015
  end-page: 3137
  ident: bib0061
  article-title: Deep visual-semantic alignments for generating image descriptions
  publication-title: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
– start-page: 951
  year: 2009
  end-page: 958
  ident: bib0004
  article-title: Learning to detect unseen object classes by between class attribute transfer
  publication-title: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
– year: 2012
  ident: bib0084
  article-title: Efficient image annotation for automatic sentence generation
  publication-title: Proceedings of the Twentieth ACM International Conference on Multimedia
– reference: N. Kalchbrenner, E. Grefenstette, P. Blunsom, A convolutional neural network for modelling sentences, arXiv:
– start-page: 3156
  year: 2015
  end-page: 3164
  ident: bib0064
  article-title: Show and tell: a neural image caption generator
  publication-title: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
– year: 2014
  ident: bib0059
  article-title: Multimodal neural language models
  publication-title: Proceedings of the International Conference on Machine Learning
– start-page: 4651
  year: 2016
  end-page: 4659
  ident: bib0069
  article-title: Image captioning with semantic attention
  publication-title: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
– reference: D. Lin, An information-theoretic definition of similarity, in: Proceedings of the Fifteenth International Conference on Machine Learning, pp. 296–304.
– volume: 42
  start-page: 145
  year: 2001
  end-page: 175
  ident: bib0085
  article-title: Modeling the shape of the scene: a holistic representation of the spatial envelope
  publication-title: Int. J. Comput. Vis.
– reference: J. Mao, W. Xu, Y. Yang, J. Wang, A.L. Yuille, Explain images with multimodal recurrent neural networks, arXiv:
– start-page: 2121
  year: 2013
  end-page: 2129
  ident: bib0091
  article-title: DeViSE: a deep visual-semantic embedding model
  publication-title: Proceedings of the Twenty Sixth International Conference on Neural Information Processing Systems
– volume: 9
  start-page: 1735
  year: 1997
  end-page: 1780
  ident: bib0107
  article-title: Long short-term memory
  publication-title: Neural Comput.
– start-page: 3441
  year: 2015
  end-page: 3450
  ident: bib0057
  article-title: Deep correlation for matching images and text
  publication-title: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
– year: 2016
  ident: bib0048
  article-title: Improving LSTM-based video description with linguistic knowledge mined from text
  publication-title: Proceedings of the Conference on Empirical Methods in Natural Language Processing
– start-page: 67
  year: 2014
  end-page: 78
  ident: bib0127
  article-title: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions
  publication-title: Proceedings of the Meeting on Association for Computational Linguistics
– volume: 17
  start-page: 1875
  year: 2015
  end-page: 1886
  ident: bib0129
  article-title: Describing multimedia content using attention-based encoder–decoder networks
  publication-title: IEEE Trans. Multimed.
– year: 2011
  ident: bib0052
  article-title: Composing simple image descriptions using web-scale n-grams
  publication-title: Proceedings of the Fifteenth Conference on Computational Natural Language Learning
– ident: 10.1016/j.neucom.2018.05.080_bib0023
  doi: 10.1109/TPAMI.2012.231
– volume: 7
  start-page: 1
  issue: 1
  year: 2007
  ident: 10.1016/j.neucom.2018.05.080_bib0001
  article-title: What do we perceive in a glance of a real-world scene?
  publication-title: J. Vis.
  doi: 10.1167/7.1.10
– volume: 35
  start-page: 2854
  issue: 12
  year: 2013
  ident: 10.1016/j.neucom.2018.05.080_bib0089
  article-title: Phrasal recognition
  publication-title: IEEE Trans. Pattern Anal. Mach. Intell.
  doi: 10.1109/TPAMI.2013.168
– ident: 10.1016/j.neucom.2018.05.080_bib0020
  doi: 10.1145/2647868.2654889
– start-page: 1097
  year: 2012
  ident: 10.1016/j.neucom.2018.05.080_bib0008
  article-title: ImageNet classification with deep convolutional neural networks
– year: 2014
  ident: 10.1016/j.neucom.2018.05.080_bib0046
  article-title: Integrating language and vision to generate natural language descriptions of videos in the wild
– year: 2012
  ident: 10.1016/j.neucom.2018.05.080_bib0084
  article-title: Efficient image annotation for automatic sentence generation
– volume: 50
  start-page: 171
  year: 2002
  ident: 10.1016/j.neucom.2018.05.080_bib0011
  article-title: Natural language description of human activities from video images based on concept hierarchy of actions
  publication-title: Int. J. Comput. Vis.
  doi: 10.1023/A:1020346032608
– year: 2013
  ident: 10.1016/j.neucom.2018.05.080_bib0031
  article-title: Distributed representations of words and phrases and their compositionality
– year: 2015
  ident: 10.1016/j.neucom.2018.05.080_bib0114
  article-title: Multiple object recognition with visual attention
– ident: 10.1016/j.neucom.2018.05.080_bib0078
  doi: 10.3115/1557769.1557781
– volume: 42
  start-page: 145
  issue: 3
  year: 2001
  ident: 10.1016/j.neucom.2018.05.080_bib0085
  article-title: Modeling the shape of the scene: a holistic representation of the spatial envelope
  publication-title: Int. J. Comput. Vis.
  doi: 10.1023/A:1011139631724
– volume: 9
  start-page: 1735
  issue: 8
  year: 1997
  ident: 10.1016/j.neucom.2018.05.080_bib0107
  article-title: Long short-term memory
  publication-title: Neural Comput.
  doi: 10.1162/neco.1997.9.8.1735
– start-page: 1292
  year: 2013
  ident: 10.1016/j.neucom.2018.05.080_bib0116
  article-title: Image description using visual dependency representations
– start-page: 1473
  year: 2015
  ident: 10.1016/j.neucom.2018.05.080_bib0033
  article-title: From captions to visual concepts and back
– year: 2016
  ident: 10.1016/j.neucom.2018.05.080_bib0072
  article-title: Aligning where to see and what to tell: image captioning with region-based attention and scene-specific contexts
  publication-title: IEEE Trans. Pattern Anal. Mach. Intell.
– start-page: 15
  year: 2010
  ident: 10.1016/j.neucom.2018.05.080_bib0013
  article-title: Every picture tells a story: Generating sentences from images
– ident: 10.1016/j.neucom.2018.05.080_bib0123
  doi: 10.3115/1073083.1073135
– start-page: 160
  year: 2008
  ident: 10.1016/j.neucom.2018.05.080_bib0029
  article-title: A unified architecture for natural language processing: deep neural networks with multitask learning
– year: 2004
  ident: 10.1016/j.neucom.2018.05.080_bib0081
  article-title: A linear programming formulation for global inference in natural language tasks
– ident: 10.1016/j.neucom.2018.05.080_bib0103
  doi: 10.1109/72.279181
– year: 2015
  ident: 10.1016/j.neucom.2018.05.080_bib0035
  article-title: Deep captioning with multimodal recurrent neural networks
– ident: 10.1016/j.neucom.2018.05.080_bib0110
– start-page: 3613
  year: 2016
  ident: 10.1016/j.neucom.2018.05.080_bib0074
  article-title: Image description through fusion based recurrent multi-modal learning
– volume: 19
  start-page: 61
  issue: 1
  year: 1993
  ident: 10.1016/j.neucom.2018.05.080_bib0086
  article-title: Accurate methods for the statistics of surprise and coincidence
  publication-title: Comput. Linguist.
– start-page: 951
  year: 2009
  ident: 10.1016/j.neucom.2018.05.080_bib0004
  article-title: Learning to detect unseen object classes by between class attribute transfer
– year: 2011
  ident: 10.1016/j.neucom.2018.05.080_bib0052
  article-title: Composing simple image descriptions using web-scale n-grams
– start-page: 2533
  year: 2015
  ident: 10.1016/j.neucom.2018.05.080_bib0076
  article-title: Learning like a child: fast novel visual concept learning from sentence descriptions of images
– year: 2016
  ident: 10.1016/j.neucom.2018.05.080_bib0067
  article-title: Variational autoencoder for deep learning of images, labels and captions
– year: 2004
  ident: 10.1016/j.neucom.2018.05.080_bib0124
  article-title: Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics
– start-page: 2625
  year: 2015
  ident: 10.1016/j.neucom.2018.05.080_bib0034
  article-title: Long-term recurrent convolutional networks for visual recognition and description
– year: 2005
  ident: 10.1016/j.neucom.2018.05.080_bib0088
  article-title: Europarl: a parallel corpus for statistical machine translation
– start-page: 194
  year: 2000
  ident: 10.1016/j.neucom.2018.05.080_bib0119
  article-title: Trainable methods for surface natural language generation
– ident: 10.1016/j.neucom.2018.05.080_bib0038
  doi: 10.1109/ICCV.2015.279
– start-page: 2361
  year: 2016
  ident: 10.1016/j.neucom.2018.05.080_bib0070
  article-title: Review networks for caption generation
– start-page: 2121
  year: 2013
  ident: 10.1016/j.neucom.2018.05.080_bib0091
  article-title: DeViSE: a deep visual-semantic embedding model
– start-page: 1
  year: 2016
  ident: 10.1016/j.neucom.2018.05.080_bib0073
  article-title: Describing images by feeding LSTM with structural words
– volume: 104
  start-page: 154
  issue: 2
  year: 2013
  ident: 10.1016/j.neucom.2018.05.080_bib0121
  article-title: Selective search for object recognition
  publication-title: Int. J. Comput. Vis.
  doi: 10.1007/s11263-013-0620-5
– volume: 35
  start-page: 1798
  issue: 8
  year: 2013
  ident: 10.1016/j.neucom.2018.05.080_bib0018
  article-title: Representation learning: a review and new perspectives
  publication-title: IEEE Trans. Pattern Anal. Mach. Intell.
  doi: 10.1109/TPAMI.2013.50
– ident: 10.1016/j.neucom.2018.05.080_bib0043
  doi: 10.1109/TPAMI.2012.118
– year: 2012
  ident: 10.1016/j.neucom.2018.05.080_bib0083
  article-title: Collective generation of natural image descriptions
– start-page: 203
  year: 2016
  ident: 10.1016/j.neucom.2018.05.080_bib0066
  article-title: What value do explicit high level concepts have in vision to language problems?
– ident: 10.1016/j.neucom.2018.05.080_bib0113
– start-page: 2407
  year: 2015
  ident: 10.1016/j.neucom.2018.05.080_bib0065
  article-title: Guiding the long-short term memory model for image caption generation
– volume: 31
  start-page: 339
  year: 2008
  ident: 10.1016/j.neucom.2018.05.080_bib0082
  article-title: Global inference for sentence compression: an integer linear programming approach
  publication-title: J. Artif. Intell. Res.
  doi: 10.1613/jair.2433
– volume: 45
  start-page: 2673
  issue: 11
  year: 1997
  ident: 10.1016/j.neucom.2018.05.080_bib0102
  article-title: Bidirectional recurrent neural networks
  publication-title: IEEE Trans. Signal Process.
  doi: 10.1109/78.650093
– start-page: 3156
  year: 2015
  ident: 10.1016/j.neucom.2018.05.080_bib0064
  article-title: Show and tell: a neural image caption generator
– start-page: 2623
  year: 2015
  ident: 10.1016/j.neucom.2018.05.080_bib0056
  article-title: Multimodal convolutional neural networks for matching image and sentences
– year: 2012
  ident: 10.1016/j.neucom.2018.05.080_bib0053
  article-title: Midge: Generating image descriptions from computer vision detections
– year: 2014
  ident: 10.1016/j.neucom.2018.05.080_bib0059
  article-title: Multimodal neural language models
– ident: 10.1016/j.neucom.2018.05.080_bib0039
– ident: 10.1016/j.neucom.2018.05.080_bib0097
  doi: 10.3115/v1/P14-1062
– volume: 2
  start-page: 207
  year: 2014
  ident: 10.1016/j.neucom.2018.05.080_bib0055
  article-title: Grounded compositional semantics for finding and describing images with sentences
  publication-title: Trans. Assoc. Comput. Linguist.
  doi: 10.1162/tacl_a_00177
– volume: 16
  start-page: 2639
  year: 2004
  ident: 10.1016/j.neucom.2018.05.080_bib0080
  article-title: Canonical correlation analysis: an overview with application to learning methods
  publication-title: Neural Comput.
  doi: 10.1162/0899766042321814
– start-page: 1247
  year: 2013
  ident: 10.1016/j.neucom.2018.05.080_bib0098
  article-title: Deep canonical correlation analysis
– start-page: 770
  year: 2016
  ident: 10.1016/j.neucom.2018.05.080_bib0120
  article-title: Deep residual learning for image recognition
– start-page: 4566
  year: 2015
  ident: 10.1016/j.neucom.2018.05.080_bib0126
  article-title: CIDEr: consensus-based image description evaluation
– start-page: 87
  year: 2016
  ident: 10.1016/j.neucom.2018.05.080_bib0005
  article-title: Learning attributes equals multi-source domain generalization
– start-page: 444
  year: 2011
  ident: 10.1016/j.neucom.2018.05.080_bib0014
  article-title: Corpus-guided sentence generation of natural images
– start-page: 3441
  year: 2015
  ident: 10.1016/j.neucom.2018.05.080_bib0057
  article-title: Deep correlation for matching images and text
– year: 2014
  ident: 10.1016/j.neucom.2018.05.080_bib0106
  article-title: Sequence to sequence learning with neural networks
– ident: 10.1016/j.neucom.2018.05.080_bib0128
– volume: 5
  year: 2012
  ident: 10.1016/j.neucom.2018.05.080_bib0016
  article-title: Choosing linguistics over vision to describe images
– start-page: 1143
  year: 2011
  ident: 10.1016/j.neucom.2018.05.080_bib0015
  article-title: Im2Text: describing images using 1 million captioned photographs
– ident: 10.1016/j.neucom.2018.05.080_bib0077
– start-page: 641
  year: 2007
  ident: 10.1016/j.neucom.2018.05.080_bib0030
  article-title: Three new graphical models for statistical language modelling
– ident: 10.1016/j.neucom.2018.05.080_bib0028
  doi: 10.3115/v1/D14-1179
– volume: 47
  start-page: 853
  year: 2013
  ident: 10.1016/j.neucom.2018.05.080_bib0032
  article-title: Framing image description as a ranking task: data, models and evaluation metrics
  publication-title: J. Artif. Intell. Res.
  doi: 10.1613/jair.3994
– volume: 16
  start-page: 219
  issue: 2
  year: 2004
  ident: 10.1016/j.neucom.2018.05.080_bib0112
  article-title: A feedback model of visual attention
  publication-title: J. Cognit. Neurosci.
  doi: 10.1162/089892904322984526
– start-page: 647
  year: 2014
  ident: 10.1016/j.neucom.2018.05.080_bib0019
  article-title: DeCAF: a deep convolutional activation feature for generic visual recognition
– volume: 3
  start-page: 1889
  year: 2014
  ident: 10.1016/j.neucom.2018.05.080_bib0037
  article-title: Deep fragment embeddings for bidirectional image sentence mapping
– volume: 46
  start-page: 875
  year: 2016
  ident: 10.1016/j.neucom.2018.05.080_bib0022
  article-title: Fine-tuning deep belief networks using harmony search
  publication-title: Appl. Soft Comput.
  doi: 10.1016/j.asoc.2015.08.043
– start-page: 392
  year: 2014
  ident: 10.1016/j.neucom.2018.05.080_bib0010
  article-title: Multi-scale orderless pooling of deep convolutional activation features
– start-page: 4651
  year: 2016
  ident: 10.1016/j.neucom.2018.05.080_bib0069
  article-title: Image captioning with semantic attention
– year: 2012
  ident: 10.1016/j.neucom.2018.05.080_bib0092
  article-title: Building high-level features using large scale unsupervised learning
– ident: 10.1016/j.neucom.2018.05.080_bib0063
– start-page: 1
  year: 2016
  ident: 10.1016/j.neucom.2018.05.080_bib0036
  article-title: Deep compositional captioning: describing novel object categories without paired training data
– ident: 10.1016/j.neucom.2018.05.080_bib0095
– volume: 14
  start-page: 179
  issue: 2
  year: 1990
  ident: 10.1016/j.neucom.2018.05.080_bib0101
  article-title: Finding structure in time
  publication-title: Cognit. Sci.
  doi: 10.1207/s15516709cog1402_1
– volume: 32
  start-page: 1627
  issue: 9
  year: 2010
  ident: 10.1016/j.neucom.2018.05.080_bib0002
  article-title: Object detection with discriminatively trained part based models
  publication-title: IEEE Trans. Pattern Anal. Mach. Intell.
  doi: 10.1109/TPAMI.2009.167
– start-page: 67
  year: 2014
  ident: 10.1016/j.neucom.2018.05.080_bib0127
  article-title: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions
– start-page: 4259
  year: 2015
  ident: 10.1016/j.neucom.2018.05.080_bib0007
  article-title: Mining semantic affordances of visual object categories
– volume: 25
  start-page: 2212
  issue: 12
  year: 2014
  ident: 10.1016/j.neucom.2018.05.080_bib0017
  article-title: Learning deep hierarchical visual feature coding
  publication-title: IEEE Trans. Neural Netw. Learn. Syst.
  doi: 10.1109/TNNLS.2014.2307532
– year: 2016
  ident: 10.1016/j.neucom.2018.05.080_bib0075
  article-title: A parallel-fusion RNN-LSTM architecture for image caption generation
– volume: 46
  start-page: 936
  year: 2016
  ident: 10.1016/j.neucom.2018.05.080_bib0024
  article-title: Hybrid deep neural network model for human action recognition
  publication-title: Appl. Soft Comput.
  doi: 10.1016/j.asoc.2015.08.025
– start-page: 3128
  year: 2015
  ident: 10.1016/j.neucom.2018.05.080_bib0061
  article-title: Deep visual-semantic alignments for generating image descriptions
– volume: 7
  start-page: 17
  issue: 1
  year: 2000
  ident: 10.1016/j.neucom.2018.05.080_bib0111
  article-title: The dynamic representation of scenes
  publication-title: Vis. Cognit.
  doi: 10.1080/135062800394667
– start-page: 2422
  year: 2015
  ident: 10.1016/j.neucom.2018.05.080_bib0062
  article-title: Mind’s eye: a recurrent visual representation for image caption generation
– year: 2016
  ident: 10.1016/j.neucom.2018.05.080_bib0048
  article-title: Improving LSTM-based video description with linguistic knowledge mined from text
– volume: 37
  start-page: 125
  year: 2015
  ident: 10.1016/j.neucom.2018.05.080_bib0025
  article-title: Feedforward kernel neural networks, generalized least learning machine, and its deep learning with application to image classification
  publication-title: Appl. Soft Comput.
  doi: 10.1016/j.asoc.2015.07.040
– ident: 10.1016/j.neucom.2018.05.080_bib0060
– year: 2015
  ident: 10.1016/j.neucom.2018.05.080_bib0040
  article-title: Ask your neurons: a neural-based approach to answering questions about images
– ident: 10.1016/j.neucom.2018.05.080_bib0042
  doi: 10.1073/pnas.1422953112
– ident: 10.1016/j.neucom.2018.05.080_bib0045
  doi: 10.1109/ICCV.2013.337
– ident: 10.1016/j.neucom.2018.05.080_bib0068
– ident: 10.1016/j.neucom.2018.05.080_bib0044
  doi: 10.1109/TIP.2016.2628585
– ident: 10.1016/j.neucom.2018.05.080_bib0100
– volume: 71
  start-page: 279
  year: 2017
  ident: 10.1016/j.neucom.2018.05.080_bib0026
  article-title: Growing random forest on deep convolutional neural networks for scene categorization
  publication-title: Expert Syst. Appl.
  doi: 10.1016/j.eswa.2016.10.038
– ident: 10.1016/j.neucom.2018.05.080_bib0108
  doi: 10.1109/TPAMI.2016.2587640
– ident: 10.1016/j.neucom.2018.05.080_bib0094
– start-page: 2668
  year: 2015
  ident: 10.1016/j.neucom.2018.05.080_bib0054
  article-title: Common subspace for model and similarity: phrase learning for caption generation from images
– start-page: 580
  year: 2014
  ident: 10.1016/j.neucom.2018.05.080_bib0003
  article-title: Rich feature hierarchies for accurate object detection and semantic segmentation
– year: 2015
  ident: 10.1016/j.neucom.2018.05.080_bib0047
  article-title: Sequence to sequence – video to text
– start-page: 1419
  year: 2005
  ident: 10.1016/j.neucom.2018.05.080_bib0117
  article-title: Multiple instance boosting for object detection
– year: 2015
  ident: 10.1016/j.neucom.2018.05.080_bib0058
  article-title: Phrase-based image captioning
– start-page: 434
  year: 2016
  ident: 10.1016/j.neucom.2018.05.080_bib0071
  article-title: Rich image captioning in the wild
– start-page: 1045
  year: 2010
  ident: 10.1016/j.neucom.2018.05.080_bib0104
  article-title: Recurrent neural network based language model
– year: 2011
  ident: 10.1016/j.neucom.2018.05.080_bib0087
  article-title: Baby talk: understanding and generating simple image descriptions
– year: 2004
  ident: 10.1016/j.neucom.2018.05.080_bib0012
  article-title: Automatic generation of natural language descriptions for images
– start-page: 3177
  year: 2011
  ident: 10.1016/j.neucom.2018.05.080_bib0006
  article-title: Action recognition from a distributed representation of pose and appearance
– start-page: 2265
  year: 2013
  ident: 10.1016/j.neucom.2018.05.080_bib0099
  article-title: Learning word embeddings efficiently with noise-contrastive estimation
– volume: 35
  start-page: 2891
  issue: 12
  year: 2013
  ident: 10.1016/j.neucom.2018.05.080_bib0051
  article-title: BabyTalk: understanding and generating simple image descriptions
  publication-title: IEEE Trans. Pattern Anal. Mach. Intell.
  doi: 10.1109/TPAMI.2012.162
– start-page: 4565
  year: 2016
  ident: 10.1016/j.neucom.2018.05.080_bib0130
  article-title: DenseCap: fully convolutional localization networks for dense captioning
– ident: 10.1016/j.neucom.2018.05.080_bib0027
– volume: 17
  start-page: 1875
  issue: 11
  year: 2015
  ident: 10.1016/j.neucom.2018.05.080_bib0129
  article-title: Describing multimedia content using attention-based encoder–decoder networks
  publication-title: IEEE Trans. Multimed.
  doi: 10.1109/TMM.2015.2477044
– start-page: 449
  year: 2006
  ident: 10.1016/j.neucom.2018.05.080_bib0093
  article-title: Generating typed dependency parses from phrase structure parses
– year: 2013
  ident: 10.1016/j.neucom.2018.05.080_bib0105
  article-title: Recurrent continuous translation models
– year: 2014
  ident: 10.1016/j.neucom.2018.05.080_bib0049
  article-title: Nonparametric method for data driven image captioning
– volume: 22
  start-page: 39
  issue: 1
  year: 1996
  ident: 10.1016/j.neucom.2018.05.080_bib0118
  article-title: A maximum entropy approach to natural language processing
  publication-title: Comput. Linguist.
– start-page: 228
  year: 2007
  ident: 10.1016/j.neucom.2018.05.080_bib0125
  article-title: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments
– volume: 3
  start-page: 993
  year: 2003
  ident: 10.1016/j.neucom.2018.05.080_bib0122
  article-title: Latent Dirichlet allocation
  publication-title: J. Mach. Learn. Res.
– ident: 10.1016/j.neucom.2018.05.080_bib0041
– volume: 521
  start-page: 436
  issue: 7553
  year: 2015
  ident: 10.1016/j.neucom.2018.05.080_bib0090
  article-title: Deep learning
  publication-title: Nature
  doi: 10.1038/nature14539
– year: 2014
  ident: 10.1016/j.neucom.2018.05.080_bib0115
  article-title: Recurrent models of visual attention
– volume: 2
  start-page: 351
  issue: 10
  year: 2014
  ident: 10.1016/j.neucom.2018.05.080_bib0050
  article-title: TREETALK: composition and compression of trees for image descriptions
  publication-title: Trans. Assoc. Comput. Linguist.
  doi: 10.1162/tacl_a_00188
– start-page: 2042
  year: 2014
  ident: 10.1016/j.neucom.2018.05.080_bib0096
  article-title: Convolutional neural network architectures for matching natural language sentences
– volume: 3
  start-page: 1
  year: 2002
  ident: 10.1016/j.neucom.2018.05.080_bib0079
  article-title: Kernel independent component analysis
  publication-title: J. Mach. Learn. Res.
– start-page: 487
  year: 2014
  ident: 10.1016/j.neucom.2018.05.080_bib0009
  article-title: Learning deep features for scene recognition using Places database
– volume: 52
  start-page: 1210
  year: 2017
  ident: 10.1016/j.neucom.2018.05.080_bib0021
  article-title: Research on point-wise gated deep networks
  publication-title: Appl. Soft Comput.
  doi: 10.1016/j.asoc.2016.08.056
– ident: 10.1016/j.neucom.2018.05.080_bib0109
  doi: 10.1109/TNNLS.2016.2582924
StartPage 291
SubjectTerms Attention mechanism
Deep neural networks
Encoder–decoder framework
Image captioning
Multimodal embedding
Sentence template
Title A survey on automatic image caption generation
URI https://dx.doi.org/10.1016/j.neucom.2018.05.080
Volume 311
WOSCitedRecordID wos000438313100027