Multi-Modal Learning with Joint Image-Text Embeddings and Decoder Networks
| Published in: | IEEE International Conference on Industrial Cyber Physical Systems (Online), pp. 1-6 |
|---|---|
| Main authors: | , , |
| Format: | Conference proceedings |
| Language: | English |
| Published: | IEEE, 12.05.2024 |
| Subjects: | |
| ISSN: | 2769-3899 |
| Online access: | Full text |
| Abstract: | Advances in machine learning and neural networks have transformed natural language processing (NLP) and computer vision (CV) applications, and recent research has begun to bridge the gap between the two domains. In this work, we propose a semi-supervised Multi-Modal Encoder-Decoder Network (MMEDN) that captures the relationship between images and textual descriptions, allowing us to generate meaningful descriptions of images and to retrieve images from a database via cross-modality search. The semi-supervised training approach, which combines ground-truth text descriptions with pseudo-text generated by the model's own text decoder, requires far fewer image-text pairs in the training data and allows new raw images to be added for training without manual text labelling. This makes the approach particularly useful for active learning environments, where labels are expensive and hard to obtain. Qualitative evaluations show that the model performs well. We applied it to finding images of a person in large databases and to generating descriptions of people involved in an event for inclusion in an automatically generated report; the model retrieved relevant images and generated accurate descriptions, demonstrating its applicability to practical use cases. |
|---|---|
| DOI: | 10.1109/ICPS59941.2024.10639946 |
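
The record above gives no implementation details, but the cross-modality search the abstract describes is commonly realized by embedding both modalities into one shared vector space and ranking database images by cosine similarity to an encoded text query. The sketch below illustrates only that retrieval step; the embedding dimension, the function names, and the random vectors standing in for encoder outputs are all assumptions, not details from the paper.

```python
# Sketch of cross-modality retrieval in a joint embedding space.
# Assumption (not in the record): trained image and text encoders map both
# modalities into the same d-dimensional space; random vectors stand in here.
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve_images(query_emb: np.ndarray,
                    image_embs: np.ndarray,
                    top_k: int = 5) -> np.ndarray:
    """Indices of the top_k database images most similar to a text query."""
    sims = l2_normalize(image_embs) @ l2_normalize(query_emb)
    return np.argsort(-sims)[:top_k]

rng = np.random.default_rng(0)
image_embs = rng.normal(size=(1000, 256))  # stand-in: encoded database images
query_emb = rng.normal(size=256)           # stand-in: encoded text query
print(retrieve_images(query_emb, image_embs))  # indices of the 5 best matches
```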
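
The abstract's semi-supervised training scheme mixes ground-truth captions for labelled image-text pairs with pseudo-text that the model's own decoder produces for unlabelled images. A minimal sketch of that schedule follows; the component interfaces and the `pseudo_weight` down-weighting of noisy pseudo-labels are hypothetical choices, not taken from the paper.

```python
# Sketch of the semi-supervised schedule: ground-truth captions for labelled
# pairs, decoder-generated pseudo-text for unlabelled images. All components
# and the pseudo_weight discount are assumptions, not details from the paper.
from typing import Callable, Iterable, Tuple

def semi_supervised_epoch(
    labelled: Iterable[Tuple[object, str]],      # (image, ground-truth caption)
    unlabelled: Iterable[object],                # raw images, no captions
    decode_caption: Callable[[object], str],     # current text decoder (stand-in)
    train_step: Callable[[object, str], float],  # one optimizer step, returns loss
    pseudo_weight: float = 0.5,                  # assumed: discount noisy pseudo-text
) -> float:
    """One epoch mixing ground-truth captions with pseudo-text targets."""
    total_loss = 0.0
    for image, caption in labelled:
        total_loss += train_step(image, caption)
    for image in unlabelled:
        pseudo = decode_caption(image)           # pseudo-label from the model itself
        total_loss += pseudo_weight * train_step(image, pseudo)
    return total_loss

# Toy usage with placeholder components:
loss = semi_supervised_epoch(
    labelled=[("img0", "a person walking")],
    unlabelled=["img1", "img2"],
    decode_caption=lambda img: "a person",
    train_step=lambda img, txt: 1.0,
)
print(loss)  # 2.0 with these placeholders
```

Because unlabelled images enter training through decoder-generated captions only, new raw images can be added without manual labelling, which is what makes the scheme attractive for the active learning settings the abstract mentions.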