Empirical autopsy of deep video captioning encoder-decoder architecture

Contemporary deep learning based video captioning methods adopt encoder-decoder framework. In encoder, visual features are extracted with 2D/3D Convolutional Neural Networks (CNNs) and a transformed version of those features is passed to the decoder. The decoder uses word embeddings and a language m...

Full description

Saved in:

Bibliographic Details
Published in:	Array (New York) Vol. 9; p. 100052
Main Authors:	Aafaq, Nayyer, Akhtar, Naveed, Liu, Wei, Mian, Ajmal
Format:	Journal Article
Language:	English
Published:	Elsevier Inc 01.03.2021 Elsevier
Subjects:	CNN architecture Encoder-decoder Language and vision Language model Natural language processing Recurrent neural networks Video captioning Video to text Word embeddings Recurrent neural networks Language model Video to text CNN architecture Video captioning Language and vision Natural language processing Encoder-decoder Word embeddings
ISSN:	2590-0056, 2590-0056
Online Access:	Get full text
Tags:	Add Tag No Tags, Be the first to tag this record!

Be the first to leave a comment!