Multi-Modal Learning with Joint Image-Text Embeddings and Decoder Networks

Detailed bibliography
Published in: IEEE International Conference on Industrial Cyber Physical Systems (Online), pp. 1-6
Main authors: Chemmanam, Ajai John; Jose, Bijoy A.; Moopan, Asif
Format: Conference paper
Language: English
Published: IEEE, 12 May 2024
ISSN: 2769-3899
Description
Summary: Advances in machine learning and neural networks have transformed natural language processing (NLP) and computer vision (CV) applications. Recent research efforts have begun to bridge the gap between the two domains. In this work, we propose a semi-supervised Multi-Modal Encoder Decoder Network (MMEDN) to capture the relationship between images and textual descriptions, allowing us to generate meaningful descriptions of images and to retrieve images from a database using cross-modality search. The semi-supervised training approach, which combines ground-truth text descriptions with pseudo-text generated by the model's own text decoder, requires far fewer image-text pairs in the training data and allows new raw images to be added for training without manual text labelling. This approach is particularly useful for active learning environments, where labels are expensive and hard to obtain. We show through qualitative evaluations that our model performs well. We applied the model to finding images of a person in large databases and to generating descriptions of people involved in an event for inclusion in an automatically generated report. The model was able to retrieve relevant images and generate accurate descriptions, demonstrating its applicability to practical use cases.
DOI: 10.1109/ICPS59941.2024.10639946
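
The summary describes a semi-supervised scheme that mixes ground-truth captions with pseudo-text produced by the model's own text decoder. Below is a minimal sketch of one way such a training step could be organized, assuming a PyTorch-style model object with hypothetical encode_image, encode_text, decode_text, embed_loss, and caption_loss members; these names, the loss structure, and the pseudo-text weighting are illustrative assumptions, not details taken from the paper.

```python
import torch

def train_step(model, batch, optimizer, pseudo_weight=0.5):
    """One semi-supervised training step (illustrative sketch, not the paper's code).

    Labeled batches carry ground-truth captions; unlabeled batches contain only
    raw images, which are captioned by the model's own text decoder (pseudo-text)
    and then used as training targets.
    """
    images = batch["images"]
    texts = batch.get("texts")  # None for unlabeled (image-only) batches

    image_emb = model.encode_image(images)      # joint-space image embedding

    if texts is None:
        # Unlabeled data: generate pseudo-text with the decoder, without gradients.
        with torch.no_grad():
            texts = model.decode_text(image_emb)
        weight = pseudo_weight                  # down-weight self-labeled targets
    else:
        weight = 1.0

    text_emb = model.encode_text(texts)         # joint-space text embedding

    # Hypothetical losses: align the two embeddings and reconstruct the caption.
    loss = weight * (model.embed_loss(image_emb, text_emb)
                     + model.caption_loss(image_emb, texts))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch, pseudo-text targets are down-weighted relative to ground-truth captions so that self-labeled raw images refine the joint embedding without overwhelming the supervised signal; the actual weighting and loss terms used by the authors are not specified in the record above.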