A survey on automatic image caption generation
Saved in:
| Published in: | Neurocomputing (Amsterdam), Volume 311, pp. 291–304 |
|---|---|
| Main authors: | Bai, Shuang; An, Shan |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier B.V., 15.10.2018 |
| Subjects: | Attention mechanism; Deep neural networks; Image captioning; Encoder–decoder framework; Multimodal embedding; Sentence template |
| ISSN: | 0925-2312, 1872-8286 |
| Abstract | Image captioning is the task of automatically generating a caption for an image. As a recently emerged research area, it is attracting increasing attention. To achieve the goal of image captioning, the semantic information of images needs to be captured and expressed in natural language. Connecting the research communities of computer vision and natural language processing, image captioning is a challenging task. Various approaches have been proposed to solve this problem. In this paper, we present a survey of advances in image captioning research. Based on the technique adopted, we classify image captioning approaches into different categories. Representative methods in each category are summarized, and their strengths and limitations are discussed. We first discuss methods used in early work, which are mainly retrieval based and template based. Then, we focus on neural network based methods, which give state-of-the-art results. Neural network based methods are further divided into subcategories based on the specific framework they use, and each subcategory is discussed in detail. After that, state-of-the-art methods are compared on benchmark datasets. Finally, future research directions are discussed. |
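As a rough illustration of the encoder–decoder framework named in the abstract and keywords, the sketch below shows a minimal CNN-encoder / LSTM-decoder captioning model. It assumes PyTorch and torchvision; the module layout, vocabulary size, and dimensions are illustrative assumptions, not details taken from the survey.

```python
import torch
import torch.nn as nn
from torchvision import models


class CaptionModel(nn.Module):
    """Minimal CNN-encoder / LSTM-decoder captioner (illustrative only)."""

    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Encoder: a small CNN with its classification head removed; its
        # pooled feature vector is projected into the word-embedding space.
        cnn = models.resnet18()  # randomly initialised backbone for this sketch
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])
        self.project = nn.Linear(cnn.fc.in_features, embed_dim)
        # Decoder: an LSTM language model; the image feature is fed in as
        # the first token of the sequence (teacher forcing during training).
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, 224, 224); captions: (B, T) integer token ids
        feats = self.encoder(images).flatten(1)      # (B, 512)
        feats = self.project(feats).unsqueeze(1)     # (B, 1, embed_dim)
        words = self.embed(captions)                 # (B, T, embed_dim)
        inputs = torch.cat([feats, words], dim=1)    # image feature starts the sequence
        hidden, _ = self.lstm(inputs)                # (B, T+1, hidden_dim)
        return self.out(hidden)                      # next-word logits at every step


# Toy training step on random data: each position predicts the next word.
model = CaptionModel()
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 10000, (2, 12))
logits = model(images, captions)                     # (2, 13, vocab_size)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 10000), captions.reshape(-1))
loss.backward()
```

In the attention-based variants the survey covers, the single pooled image feature is replaced by a set of spatial features that the decoder re-weights at every generation step.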
| Author | Bai, Shuang; An, Shan |
| Author details | Shuang Bai (shuangb@bjtu.edu.cn), School of Electronic and Information Engineering, Beijing Jiaotong University, No.3 Shang Yuan Cun, Hai Dian District, Beijing, China; Shan An, Beijing Jingdong Shangke Information Technology Co., Ltd, Beijing, China |
| ContentType | Journal Article |
| Copyright | 2018 |
| DOI | 10.1016/j.neucom.2018.05.080 |
| DatabaseName | CrossRef |
| DatabaseTitle | CrossRef |
| Discipline | Computer Science |
| EISSN | 1872-8286 |
| EndPage | 304 |
| ExternalDocumentID | 10_1016_j_neucom_2018_05_080 S0925231218306659 |
| ISICitedReferencesCount | 132 |
| ISSN | 0925-2312 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Keywords | Attention mechanism; Deep neural networks; Image captioning; Encoder–decoder framework; Multimodal embedding; Sentence template |
| Language | English |
| LinkModel | OpenURL |
| PageCount | 14 |
| ParticipantIDs | crossref_primary_10_1016_j_neucom_2018_05_080 crossref_citationtrail_10_1016_j_neucom_2018_05_080 elsevier_sciencedirect_doi_10_1016_j_neucom_2018_05_080 |
| PublicationCentury | 2000 |
| PublicationDate | 2018-10-15 |
| PublicationDecade | 2010 |
| PublicationTitle | Neurocomputing (Amsterdam) |
| PublicationYear | 2018 |
| Publisher | Elsevier B.V |
doi: 10.1207/s15516709cog1402_1 – volume: 32 start-page: 1627 issue: 9 year: 2010 ident: 10.1016/j.neucom.2018.05.080_bib0002 article-title: Object detection with discriminatively trained part based models publication-title: IEEE Trans. Pattern Anal. Mach. Intell. doi: 10.1109/TPAMI.2009.167 – start-page: 67 year: 2014 ident: 10.1016/j.neucom.2018.05.080_bib0127 article-title: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions – start-page: 4259 year: 2015 ident: 10.1016/j.neucom.2018.05.080_bib0007 article-title: Mining semantic affordances of visual object categories – volume: 25 start-page: 2212 issue: 12 year: 2014 ident: 10.1016/j.neucom.2018.05.080_bib0017 article-title: Learning deep hierarchical visual feature coding publication-title: IEEE Trans. Neural Netw. Learn. Syst. doi: 10.1109/TNNLS.2014.2307532 – year: 2016 ident: 10.1016/j.neucom.2018.05.080_bib0075 article-title: A parallel-fusion RNN-LSTM architecture for image caption generation – volume: 46 start-page: 936 year: 2016 ident: 10.1016/j.neucom.2018.05.080_bib0024 article-title: Hybrid deep neural network model for human action recognition publication-title: Appl. Soft Comput. doi: 10.1016/j.asoc.2015.08.025 – start-page: 3128 year: 2015 ident: 10.1016/j.neucom.2018.05.080_bib0061 article-title: Deep visual-semantic alignments for generating image descriptions – volume: 7 start-page: 17 issue: 1 year: 2000 ident: 10.1016/j.neucom.2018.05.080_bib0111 article-title: The dynamic representation of scenes publication-title: Vis. Cognit. doi: 10.1080/135062800394667 – start-page: 2422 year: 2015 ident: 10.1016/j.neucom.2018.05.080_bib0062 article-title: Mind’s eye: a recurrent visual representation for image caption generation – year: 2016 ident: 10.1016/j.neucom.2018.05.080_bib0048 article-title: Improving LSTM-based video description with linguistic knowledge mined from text – volume: 37 start-page: 125 year: 2015 ident: 10.1016/j.neucom.2018.05.080_bib0025 article-title: Feedforward kernel neural networks, generalized least learning machine, and its deep learning with application to image classification publication-title: Appl. Soft Comput. doi: 10.1016/j.asoc.2015.07.040 – ident: 10.1016/j.neucom.2018.05.080_bib0060 – year: 2015 ident: 10.1016/j.neucom.2018.05.080_bib0040 article-title: Ask your neurons:a neural-based approach to answering questions about images – ident: 10.1016/j.neucom.2018.05.080_bib0042 doi: 10.1073/pnas.1422953112 – ident: 10.1016/j.neucom.2018.05.080_bib0045 doi: 10.1109/ICCV.2013.337 – ident: 10.1016/j.neucom.2018.05.080_bib0068 – ident: 10.1016/j.neucom.2018.05.080_bib0044 doi: 10.1109/TIP.2016.2628585 – ident: 10.1016/j.neucom.2018.05.080_bib0100 – volume: 71 start-page: 279 year: 2017 ident: 10.1016/j.neucom.2018.05.080_bib0026 article-title: Growing random forest on deep convolutional neural networks for scene categorization publication-title: Expert Syst. Appl. 
doi: 10.1016/j.eswa.2016.10.038 – ident: 10.1016/j.neucom.2018.05.080_bib0108 doi: 10.1109/TPAMI.2016.2587640 – ident: 10.1016/j.neucom.2018.05.080_bib0094 – start-page: 2668 year: 2015 ident: 10.1016/j.neucom.2018.05.080_bib0054 article-title: Common subspace for model and similarity: phrase learning for caption generation from images – start-page: 580 year: 2014 ident: 10.1016/j.neucom.2018.05.080_bib0003 article-title: Rich feature hierarchies for accurate object detection and semantic segmentation – year: 2015 ident: 10.1016/j.neucom.2018.05.080_bib0047 article-title: Sequence to sequence – video to text – start-page: 1419 year: 2005 ident: 10.1016/j.neucom.2018.05.080_bib0117 article-title: Multiple instance boosting for object detection – year: 2015 ident: 10.1016/j.neucom.2018.05.080_bib0058 article-title: Phrase-based image captioning – start-page: 434 year: 2016 ident: 10.1016/j.neucom.2018.05.080_bib0071 article-title: Rich image captioning in the wild – start-page: 1045 year: 2010 ident: 10.1016/j.neucom.2018.05.080_bib0104 article-title: Recurrent neural network based language model – year: 2011 ident: 10.1016/j.neucom.2018.05.080_bib0087 article-title: Baby talk: understanding and generating simple image descriptions – year: 2004 ident: 10.1016/j.neucom.2018.05.080_bib0012 article-title: Automatic generation of natural language descriptions for images – start-page: 3177 year: 2011 ident: 10.1016/j.neucom.2018.05.080_bib0006 article-title: Action recognition from a distributed representation of pose and appearance – start-page: 2265 year: 2013 ident: 10.1016/j.neucom.2018.05.080_bib0099 article-title: Learning word embeddings efficiently with noise-contrastive estimation – volume: 35 start-page: 2891 issue: 12 year: 2013 ident: 10.1016/j.neucom.2018.05.080_bib0051 article-title: BabyTalk: understanding and generating simple image descriptions publication-title: IEEE Trans. Pattern Anal. Mach. Intell. doi: 10.1109/TPAMI.2012.162 – start-page: 4565 year: 2016 ident: 10.1016/j.neucom.2018.05.080_bib0130 article-title: DenseCap: fully convolutional localization networks for dense captioning – ident: 10.1016/j.neucom.2018.05.080_bib0027 – volume: 17 start-page: 1875 issue: 11 year: 2015 ident: 10.1016/j.neucom.2018.05.080_bib0129 article-title: Describing multimedia content using attention-based encoder–decoder networks publication-title: IEEE Trans. Multimed. doi: 10.1109/TMM.2015.2477044 – start-page: 449 year: 2006 ident: 10.1016/j.neucom.2018.05.080_bib0093 article-title: Generating typed dependency parses from phrase structure parses – year: 2013 ident: 10.1016/j.neucom.2018.05.080_bib0105 article-title: Recurrent continuous translation models – year: 2014 ident: 10.1016/j.neucom.2018.05.080_bib0049 article-title: Nonparametric method for data driven image captioning – volume: 22 start-page: 39 issue: 1 year: 1996 ident: 10.1016/j.neucom.2018.05.080_bib0118 article-title: A maximum entropy approach to natural language processing publication-title: Comput. Linguist. – start-page: 228 year: 2007 ident: 10.1016/j.neucom.2018.05.080_bib0125 article-title: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments – volume: 3 start-page: 993 year: 2003 ident: 10.1016/j.neucom.2018.05.080_bib0122 article-title: Latent Dirichlet allocation publication-title: J. Mach. Learn. Res. 
– ident: 10.1016/j.neucom.2018.05.080_bib0041 – volume: 521 start-page: 436 issue: 7553 year: 2015 ident: 10.1016/j.neucom.2018.05.080_bib0090 article-title: Deep learning publication-title: Nature doi: 10.1038/nature14539 – year: 2014 ident: 10.1016/j.neucom.2018.05.080_bib0115 article-title: Recurrent models of visual attention – volume: 2 start-page: 351 issue: 10 year: 2014 ident: 10.1016/j.neucom.2018.05.080_bib0050 article-title: TREETALK: composition and compression of trees for image descriptions publication-title: Trans. Assoc. Comput. Linguist. doi: 10.1162/tacl_a_00188 – start-page: 2042 year: 2014 ident: 10.1016/j.neucom.2018.05.080_bib0096 article-title: Convolutional neural network architectures for matching natural language sentences – volume: 3 start-page: 1 year: 2002 ident: 10.1016/j.neucom.2018.05.080_bib0079 article-title: Kernel independent component analysis publication-title: J. Mach. Learn. Res. – start-page: 487 year: 2014 ident: 10.1016/j.neucom.2018.05.080_bib0009 article-title: Learning deep features for scene recognition using places database – volume: 52 start-page: 1210 year: 2017 ident: 10.1016/j.neucom.2018.05.080_bib0021 article-title: Research on point-wise gated deep networks publication-title: Appl. Soft Comput. doi: 10.1016/j.asoc.2016.08.056 – ident: 10.1016/j.neucom.2018.05.080_bib0109 doi: 10.1109/TNNLS.2016.2582924 |
| StartPage | 291 |
| SubjectTerms | Attention mechanism; Deep neural networks; Encoder–decoder framework; Image captioning; Multimodal embedding; Sentence template |
| Title | A survey on automatic image caption generation |
| URI | https://dx.doi.org/10.1016/j.neucom.2018.05.080 |
| Volume | 311 |
| WOSCitedRecordID | wos000438313100027 |