Multi-Modal Learning with Joint Image-Text Embeddings and Decoder Networks

Detailed description

Bibliographic details
Published in: IEEE International Conference on Industrial Cyber Physical Systems (Online), pp. 1-6
Main authors: Chemmanam, Ajai John; Jose, Bijoy A; Moopan, Asif
Format: Conference proceeding
Language: English
Published: IEEE, 12 May 2024
ISSN: 2769-3899
Online access: Full text
Abstract Advances in machine learning and neural networks have transformed natural language processing (NLP) and computer vision (CV) applications. Recent research efforts have begun to bridge the gap between the two domains. In this work, we propose a semi-supervised Multi-Modal Encoder Decoder Network (MMEDN) to capture the relationship between images and textual descriptions, allowing us to generate meaningful descriptions of images and retrieve images from a database using cross-modality search. The semi-supervised training approach, which combines ground-truth text descriptions with pseudo-text generated by the text decoder within the model, requires far fewer image-text pairs in the training data and can add new raw images to training directly, without manual text labelling. This approach is particularly useful in active learning environments, where labels are expensive and hard to obtain. Qualitative evaluations show that our model performs well. We applied our model to finding images of a person in large databases and to generating descriptions of people involved in an event for inclusion in an automatically generated report. The model retrieved relevant images and generated accurate descriptions, demonstrating its applicability to practical use cases.
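The cross-modality search mentioned in the abstract can be illustrated with a minimal sketch. It assumes that text and images have already been mapped into one shared vector space and that candidates are ranked by cosine similarity; the record does not state which similarity measure the authors use, so that choice is an assumption. The names `cosine_similarity`, `retrieve`, `IMAGE_EMBEDDINGS`, and `QUERY` are hypothetical, and the vectors are hand-written placeholders rather than outputs of MMEDN.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, image_embeddings, top_k=2):
    """Return the ids of the top_k images most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, embedding), image_id)
        for image_id, embedding in image_embeddings.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored[:top_k]]


# Placeholder image embeddings (assumption: a real system would obtain
# these from an image encoder trained jointly with the text encoder).
IMAGE_EMBEDDINGS = {
    "img_a": [0.9, 0.1, 0.0],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [0.7, 0.6, 0.1],
}

# Placeholder embedding of a text query in the same space.
QUERY = [1.0, 0.2, 0.0]

print(retrieve(QUERY, IMAGE_EMBEDDINGS, top_k=2))
```

Because both modalities share one space in this sketch, the same `retrieve` function would also serve image-to-text lookup by swapping which side supplies the query.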
Author Jose, Bijoy A
Moopan, Asif
Chemmanam, Ajai John
Author_xml – sequence: 1
  givenname: Ajai John
  surname: Chemmanam
  fullname: Chemmanam, Ajai John
  email: ajaichemmanam@cusat.ac.in
  organization: Cochin University of Science and Technology, CPS Lab, Department of Electronics, Kerala, India
– sequence: 2
  givenname: Bijoy A
  surname: Jose
  fullname: Jose, Bijoy A
  email: bijoyjose@cusat.ac.in
  organization: Cochin University of Science and Technology, CPS Lab, Department of Computer Science, Kerala, India
– sequence: 3
  givenname: Asif
  surname: Moopan
  fullname: Moopan, Asif
  email: asif@vuelogix.com
  organization: Vuelogix Technologies Pvt. Ltd, Kerala, India
ContentType Conference Proceeding
DOI 10.1109/ICPS59941.2024.10639946
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 9798350363012
EISSN 2769-3899
EndPage 6
ExternalDocumentID 10639946
Genre orig-research
IEDL.DBID RIE
ISICitedReferencesCount 1
IngestDate Wed Aug 27 02:02:36 EDT 2025
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
PageCount 6
ParticipantIDs ieee_primary_10639946
PublicationCentury 2000
PublicationDate 2024-May-12
PublicationDateYYYYMMDD 2024-05-12
PublicationDate_xml – month: 05
  year: 2024
  text: 2024-May-12
  day: 12
PublicationDecade 2020
PublicationTitle IEEE International Conference on Industrial Cyber Physical Systems (Online)
PublicationTitleAbbrev ICPS
PublicationYear 2024
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0003211708
Score 1.8772128
SourceID ieee
SourceType Publisher
StartPage 1
SubjectTerms Computer vision
Cross-modal retrieval
Encoder-decoder architectures
Multi-modal learning
Natural language processing
Training
Training data
Vectors
Visualization
Title Multi-Modal Learning with Joint Image-Text Embeddings and Decoder Networks
URI https://ieeexplore.ieee.org/document/10639946