Attention-based Image Caption Generation (ABICG) using an Encoder-Decoder Architecture

Detailed bibliography
Published in: International Conference on Smart Systems and Inventive Technology (Online), pp. 1564-1572
Main authors: Kulkarni, Uday; Tomar, Kushagra; Kalmat, Mayuri; Bandi, Rakshita; Jadhav, Pranav; Meena, Sm
Format: Conference paper
Language: English
Published: IEEE, 23 Jan 2023
ISSN:2832-3017
Description
Summary: Image captioning is used to generate sentence descriptions of the scenes captured in an image or picture. Its applications are vast, yet it is a difficult task for a machine to learn what a human does naturally: the model must be built so that when it reads a scene, it recognizes the content and reproduces a to-the-point caption or description. The generated descriptions must be semantically and syntactically accurate. The availability of Artificial Intelligence (AI) and Machine Learning algorithms, viz. Natural Language Processing (NLP) and Deep Learning (DL), makes the task easier. Although the majority of existing machine-generated captions are valid, they do not focus on the crucial parts of the images, which results in less clear captions. In the proposed paper, an attention mechanism, Bahdanau attention, is used together with an encoder-decoder architecture to produce image captions that are more accurate and detailed. The model uses a pretrained Convolutional Neural Network (CNN), the InceptionV3 architecture, to extract image features, and then a Recurrent Neural Network (RNN), the Gated Recurrent Unit (GRU) architecture, to generate the captions. The model is trained on the Flickr8k dataset, and the captions generated are 10% more accurate than the present state of the art.
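The Bahdanau (additive) attention the abstract refers to scores each encoder feature against the current decoder state, softmaxes the scores into weights, and forms a weighted context vector. A minimal NumPy sketch follows; all dimensions and the random weights are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def bahdanau_attention(enc_features, dec_hidden, W1, W2, v):
    """Additive (Bahdanau) attention over encoder image features.

    enc_features: (num_regions, enc_dim) image features (e.g. a CNN feature grid)
    dec_hidden:   (dec_dim,) current decoder (GRU) hidden state
    W1, W2, v:    learned projections; random here purely for illustration
    Returns the context vector and the attention weights over regions.
    """
    # score = v^T tanh(W1 * enc + W2 * dec), one score per image region
    scores = np.tanh(enc_features @ W1 + dec_hidden @ W2) @ v   # (num_regions,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                    # softmax over regions
    context = weights @ enc_features                            # (enc_dim,) weighted sum
    return context, weights

# Hypothetical sizes: a 64-region feature grid, 256-dim features, 512-dim decoder state.
rng = np.random.default_rng(0)
enc = rng.normal(size=(64, 256))
dec = rng.normal(size=(512,))
W1 = rng.normal(size=(256, 128))
W2 = rng.normal(size=(512, 128))
v = rng.normal(size=(128,))
ctx, w = bahdanau_attention(enc, dec, W1, W2, v)
```

At each decoding step, the context vector is concatenated with the word embedding and fed to the GRU, so the decoder can attend to different image regions for each generated word.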
DOI:10.1109/ICSSIT55814.2023.10061040