Attention based Image Caption Generation (ABICG) using Encoder-Decoder Architecture
Saved in:
| Published in: | International Conference on Smart Systems and Inventive Technology (Online), pp. 1564-1572 |
|---|---|
| Main Authors: | , , , , , |
| Format: | Conference Proceedings |
| Language: | English |
| Published: | IEEE, 23.01.2023 |
| Subjects: | |
| ISSN: | 2832-3017 |
| Online Access: | Full text |
| Abstract: | Image captioning generates sentence-level descriptions of the scenes captured in an image or picture. Its applications are vast, yet it is a difficult task for a machine to learn what a human does naturally: the model must read a scene and reproduce to-the-point captions or descriptions that are semantically and syntactically accurate. Artificial Intelligence (AI) and Machine Learning techniques such as Natural Language Processing (NLP) and Deep Learning (DL) make this task easier. Although the majority of existing machine-generated captions are valid, they do not focus on the crucial parts of the image, which reduces the clarity of the captions. This paper applies the Bahdanau attention mechanism together with an Encoder-Decoder architecture to generate image captions that are more accurate and detailed. A pretrained Convolutional Neural Network (CNN), the InceptionV3 architecture, extracts image features, and a Recurrent Neural Network (RNN) with Gated Recurrent Units (GRU) generates the captions. The model is trained on the Flickr8k dataset, and the captions generated are 10% more accurate than the present state of the art. |
|---|---|
| ISSN: | 2832-3017 |
| DOI: | 10.1109/ICSSIT55814.2023.10061040 |
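
The abstract names a standard pipeline: InceptionV3 features attended over by a Bahdanau attention module that feeds a GRU decoder. Below is a minimal sketch of that pipeline in TensorFlow/Keras, based on the textbook Bahdanau formulation rather than the authors' code; all hyperparameters (`EMBEDDING_DIM`, `UNITS`, `VOCAB_SIZE`) and the `<start>` token id are illustrative assumptions, not values from the paper.

```python
import tensorflow as tf

# Hypothetical hyperparameters for illustration only.
EMBEDDING_DIM = 256
UNITS = 512
VOCAB_SIZE = 5000  # assumed tokenizer vocabulary size for Flickr8k

# Encoder: pretrained InceptionV3 with the classification head removed.
# Its last convolutional output is an 8x8x2048 grid, later flattened to
# 64 spatial regions that the decoder attends over.
base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")
feature_extractor = tf.keras.Model(base.input, base.output)

class BahdanauAttention(tf.keras.layers.Layer):
    """score = V^T tanh(W1*features + W2*hidden), softmaxed over regions."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects image region features
        self.W2 = tf.keras.layers.Dense(units)  # projects previous decoder state
        self.V = tf.keras.layers.Dense(1)       # scalar score per region

    def call(self, features, hidden):
        # features: (batch, 64, 2048); hidden: (batch, units)
        hidden_t = tf.expand_dims(hidden, 1)                 # (batch, 1, units)
        score = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_t)))
        weights = tf.nn.softmax(score, axis=1)               # over the 64 regions
        context = tf.reduce_sum(weights * features, axis=1)  # (batch, 2048)
        return context, weights

class Decoder(tf.keras.Model):
    """One-word-at-a-time GRU decoder conditioned on the attention context."""
    def __init__(self, embedding_dim, units, vocab_size):
        super().__init__()
        self.attention = BahdanauAttention(units)
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True,
                                       return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, word_id, features, hidden):
        # One decoding step: attend over regions, then advance the GRU.
        context, weights = self.attention(features, hidden)
        x = self.embedding(word_id)                          # (batch, 1, emb)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)
        output, state = self.gru(x, initial_state=hidden)
        logits = self.fc(tf.squeeze(output, axis=1))         # (batch, vocab)
        return logits, state, weights

# Illustrative single decoding step (shapes only; tokenizer not shown).
images = tf.random.uniform((4, 299, 299, 3))             # InceptionV3 input size
grid = feature_extractor(images)                         # (4, 8, 8, 2048)
features = tf.reshape(grid, (4, -1, grid.shape[3]))      # (4, 64, 2048)
decoder = Decoder(EMBEDDING_DIM, UNITS, VOCAB_SIZE)
hidden = tf.zeros((4, UNITS))
start = tf.fill((4, 1), 1)                               # assumed <start> token id
logits, hidden, weights = decoder(start, features, hidden)
```

At inference time this step would be repeated, feeding each predicted word back in until an end token is produced; the per-step `weights` are what lets the model "focus on the crucial parts of the image" that the abstract emphasizes.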