Empirical autopsy of deep video captioning encoder-decoder architecture
| Published in: | Array (New York), Volume 9; Article 100052 |
|---|---|
| Main authors: | Aafaq, Nayyer; Akhtar, Naveed; Liu, Wei; Mian, Ajmal |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier Inc., 01.03.2021 |
| Subjects: | Video captioning; Encoder-decoder; CNN architecture; Word embeddings; Language model; Recurrent neural networks; Natural language processing; Language and vision; Video to text |
| ISSN: | 2590-0056 |
| Online access: | Get full text |
| Abstract | Contemporary deep learning based video captioning methods adopt an encoder-decoder framework. In the encoder, visual features are extracted with 2D/3D Convolutional Neural Networks (CNNs), and a transformed version of those features is passed to the decoder. The decoder uses word embeddings and a language model to map the visual features to natural language captions. Owing to its composite nature, the encoder-decoder pipeline offers multiple choices for each of its components, e.g., the CNN model, feature transformation, word embedding, and language model. Component selection can have drastic effects on overall video captioning performance, yet the current literature lacks any systematic investigation in this regard. This article fills that gap by providing the first thorough empirical analysis of the role each major component plays in a widely adopted video captioning pipeline. We perform extensive experiments by varying the constituent components of the video captioning framework and quantify the performance gains that are possible through component selection alone. Using the popular MSVD dataset as the test-bed, we demonstrate that substantial performance gains are achievable by careful selection of the constituent components, without major changes to the pipeline itself. These results are expected to provide guiding principles for research in the fast-growing area of video captioning. |
|---|---|
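For orientation, the sketch below illustrates the kind of encoder-decoder pipeline the abstract dissects. It is not the authors' implementation: the class name, layer sizes, and the temporal mean-pooling feature transform are illustrative assumptions. Comments mark the interchangeable components the paper studies (CNN features, feature transformation, word embedding, recurrent language model).

```python
# Minimal encoder-decoder video captioning sketch (PyTorch).
# Hypothetical names and dimensions; a toy stand-in for the pipeline analysed
# in the paper, not the authors' code.
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=300, hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Feature transformation: project pooled 2D/3D-CNN features into the
        # decoder's hidden space (one of the swappable components).
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)
        # Word embeddings (learned here; word2vec/GloVe/fastText are alternatives).
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Language model (LSTM here; GRU or vanilla RNN are drop-in alternatives).
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim) from a frozen 2D/3D CNN.
        # captions:    (batch, seq_len) token ids of the ground-truth caption.
        pooled = frame_feats.mean(dim=1)                      # temporal mean pooling
        h0 = torch.tanh(self.feat_proj(pooled)).unsqueeze(0)  # init decoder state
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions)                            # (batch, seq_len, embed_dim)
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)                               # (batch, seq_len, vocab_size)

if __name__ == "__main__":
    model = VideoCaptioner()
    feats = torch.randn(2, 28, 2048)            # e.g. 28 sampled frames per clip
    caps = torch.randint(0, 10000, (2, 12))     # teacher-forced caption tokens
    print(model(feats, caps).shape)             # torch.Size([2, 12, 10000])
```

Each commented component (CNN backbone, feature transform, embedding, recurrent unit) is a point where the paper's experiments swap alternatives and measure the captioning metrics.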
| ArticleNumber | 100052 |
| Author | Aafaq, Nayyer (ORCID: 0000-0003-2763-2094; nayyer.aafaq@research.uwa.edu.au); Akhtar, Naveed (ORCID: 0000-0003-3406-673X); Liu, Wei; Mian, Ajmal |
| ContentType | Journal Article |
| Copyright | 2020 The Author(s) |
| DOI | 10.1016/j.array.2020.100052 |
| Discipline | Computer Science |
| EISSN | 2590-0056 |
| ISSN | 2590-0056 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Keywords | Recurrent neural networks; Language model; Video to text; CNN architecture; Video captioning; Language and vision; Natural language processing; Encoder-decoder; Word embeddings |
| Language | English |
| License | This is an open access article under the CC BY license. |
| OpenAccessLink | https://doaj.org/article/98c335d3b3d848d58bd2ea8fb159aa88 |
| PublicationDate | March 2021 |
| PublicationTitle | Array (New York) |
| PublicationYear | 2021 |
| Publisher | Elsevier Inc |
| StartPage | 100052 |
| SubjectTerms | CNN architecture; Encoder-decoder; Language and vision; Language model; Natural language processing; Recurrent neural networks; Video captioning; Video to text; Word embeddings |
| Title | Empirical autopsy of deep video captioning encoder-decoder architecture |
| URI | https://dx.doi.org/10.1016/j.array.2020.100052 https://doaj.org/article/98c335d3b3d848d58bd2ea8fb159aa88 |
| Volume | 9 |