Empirical autopsy of deep video captioning encoder-decoder architecture

Bibliographic details
Published in: Array (New York), Volume 9, p. 100052
Main authors: Aafaq, Nayyer; Akhtar, Naveed; Liu, Wei; Mian, Ajmal
Format: Journal Article
Language: English
Published: Elsevier Inc, 1 March 2021
Elsevier
ISSN: 2590-0056
Abstract: Contemporary deep-learning-based video captioning methods adopt an encoder-decoder framework. In the encoder, visual features are extracted with 2D/3D Convolutional Neural Networks (CNNs), and a transformed version of those features is passed to the decoder. The decoder uses word embeddings and a language model to map visual features to natural language captions. Due to its composite nature, the encoder-decoder pipeline allows multiple choices for each of its components, e.g., the choice of CNN model, feature transformation, word embedding, and language model. Component selection can have drastic effects on overall video captioning performance. However, the current literature lacks any systematic investigation in this regard. This article fills this gap by providing the first thorough empirical analysis of the role that each major component plays in a widely adopted video captioning pipeline. We perform extensive experiments by varying the constituent components of the video captioning framework, and quantify the performance gains that are possible through component selection alone. We use the popular MSVD dataset as the test-bed, and demonstrate that substantial performance gains are possible by careful selection of the constituent components without major changes to the pipeline itself. These results are expected to provide guiding principles for research in the fast-growing field of video captioning.
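The component-wise study described in the abstract can be pictured as a grid over one choice per pipeline stage. The following is a minimal sketch only, not the authors' code: the option names (`cnn_2d`, `embedding_a`, `rnn_a`, and so on) are hypothetical placeholders, and mean pooling stands in as one possible feature transformation, since the abstract does not name the specific components that were compared.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Sequence

Vector = List[float]


def mean_pool(frame_features: Sequence[Vector]) -> Vector:
    """Average per-frame CNN features into a single clip-level vector."""
    n_frames = len(frame_features)
    return [sum(column) / n_frames for column in zip(*frame_features)]


@dataclass(frozen=True)
class PipelineConfig:
    """One choice per component of the encoder-decoder pipeline."""
    cnn: str
    transform: str
    embedding: str
    language_model: str


def enumerate_configs(cnns, transforms, embeddings, language_models) -> List[PipelineConfig]:
    """List every combination of component choices (a full grid)."""
    return [
        PipelineConfig(*choice)
        for choice in product(cnns, transforms, embeddings, language_models)
    ]


# Hypothetical option names, for illustration only (not taken from the article).
configs = enumerate_configs(
    cnns=["cnn_2d", "cnn_3d"],
    transforms=["mean_pool"],
    embeddings=["embedding_a", "embedding_b"],
    language_models=["rnn_a", "rnn_b"],
)

# Toy input: two frames, each with a two-dimensional feature vector.
clip_vector = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```

Each `PipelineConfig` would then be trained and scored separately, which is what makes the effect of a single component measurable while the rest of the pipeline stays fixed.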
Article number: 100052
Author details:
– Aafaq, Nayyer (ORCID: 0000-0003-2763-2094; email: nayyer.aafaq@research.uwa.edu.au)
– Akhtar, Naveed (ORCID: 0000-0003-3406-673X)
– Liu, Wei
– Mian, Ajmal
DOI: 10.1016/j.array.2020.100052
Copyright: 2020 The Author(s)
License: This is an open access article under the CC BY license.
Open access link: https://doaj.org/article/98c335d3b3d848d58bd2ea8fb159aa88
Discipline: Computer Science
Keywords: Recurrent neural networks; Language model; Video to text; CNN architecture; Video captioning; Language and vision; Natural language processing; Encoder-decoder; Word embeddings
Databases: ScienceDirect Open Access Titles; Elsevier:ScienceDirect:Open Access; CrossRef; Directory of Open Access Journals (https://www.doaj.org/)
Open access: yes
Peer reviewed: yes
Publication date: March 2021 (2021-03-01)
ISI cited references count: 4
Cited by (Crossref record IDs): crossref_primary_10_1007_s42235_025_00743_3; crossref_primary_10_1007_s00607_024_01334_6; crossref_primary_10_1007_s11042_024_19247_z; crossref_primary_10_1109_TAI_2021_3134190; crossref_primary_10_1007_s11042_023_15933_6; crossref_primary_10_1016_j_eswa_2023_120454; crossref_primary_10_1016_j_patcog_2022_109202