Multi-Modal Learning with Joint Image-Text Embeddings and Decoder Networks
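| Published in: | IEEE International Conference on Industrial Cyber Physical Systems (Online), pp. 1-6 |
|---|---|
| Main authors: | Ajai John Chemmanam, Bijoy A Jose, Asif Moopan |
| Format: | Conference proceeding |
| Language: | English |
| Published: | IEEE, 12 May 2024 |
| Subjects: | Computer vision; Cross-modal retrieval; Encoder-decoder architectures; Multi-modal learning; Natural language processing; Training; Training data; Vectors; Visualization |
| ISSN: | 2769-3899 |
| Online access: | Full text |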
| Abstract | Advances in machine learning and neural networks have transformed natural language processing (NLP) and computer vision (CV) applications. Recent research efforts have begun to bridge the gap between the two domains. In this work, we propose a semi-supervised Multi-Modal Encoder Decoder Network (MMEDN) to capture the relationship between images and textual descriptions, allowing us to generate meaningful descriptions of images and retrieve images from a database using cross-modality search. The semi-supervised training approach, which combines ground-truth text descriptions with pseudo-text generated by the text decoder within the model, requires far fewer image-text pairs in the training data and can add new raw images directly to training without manual text labelling. This approach is particularly useful for active learning environments, where labels are expensive and hard to obtain. Qualitative evaluations show that our model performs well. We applied our model to find images of a person in large databases and to generate descriptions of people involved in an event for inclusion in an automatically generated report. The model was able to retrieve relevant images and generate accurate descriptions, demonstrating its applicability to more practical use cases. |
|---|---|
| Author | Chemmanam, Ajai John; Jose, Bijoy A; Moopan, Asif |
| Author_xml | 1. Ajai John Chemmanam (ajaichemmanam@cusat.ac.in), Cochin University of Science and Technology, CPS Lab, Department of Electronics, Kerala, India; 2. Bijoy A Jose (bijoyjose@cusat.ac.in), Cochin University of Science and Technology, CPS Lab, Department of Computer Science, Kerala, India; 3. Asif Moopan (asif@vuelogix.com), Vuelogix Technologies Pvt. Ltd, Kerala, India |
| ContentType | Conference Proceeding |
| DOI | 10.1109/ICPS59941.2024.10639946 |
| EISBN | 9798350363012 |
| EISSN | 2769-3899 |
| EndPage | 6 |
| ExternalDocumentID | 10639946 |
| Genre | orig-research |
| ISICitedReferencesCount | 1 |
| IsPeerReviewed | false |
| IsScholarly | false |
| Language | English |
| PageCount | 6 |
| PublicationDate | 2024-May-12 |
| PublicationTitle | IEEE International Conference on Industrial Cyber Physical Systems (Online) |
| PublicationTitleAbbrev | ICPS |
| PublicationYear | 2024 |
| Publisher | IEEE |
| StartPage | 1 |
| SubjectTerms | Computer vision; Cross-modal retrieval; Encoder-decoder architectures; Multi-modal learning; Natural language processing; Training; Training data; Vectors; Visualization |
| Title | Multi-Modal Learning with Joint Image-Text Embeddings and Decoder Networks |
| URI | https://ieeexplore.ieee.org/document/10639946 |
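
The abstract describes retrieving images from a database by cross-modality search over a joint image-text embedding. The following is a minimal sketch of that general idea only, not the authors' implementation: it assumes that some encoder has already mapped images and a text query into the same vector space, and ranks images by cosine similarity. The file names, vectors, and the `retrieve` helper are all hypothetical and chosen purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, image_index, top_k=2):
    """Return the names of the top_k images most similar to the query vector."""
    scored = [(cosine(query_vec, vec), name) for name, vec in image_index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]


# Hypothetical, hand-made embeddings standing in for encoder outputs.
image_index = {
    "person_red_jacket.jpg": [0.9, 0.1, 0.0],
    "person_blue_coat.jpg": [0.1, 0.9, 0.0],
    "empty_street.jpg": [0.0, 0.1, 0.9],
}

# Hypothetical embedding of a text query such as "person in a red jacket".
text_query_vec = [0.8, 0.2, 0.0]

results = retrieve(text_query_vec, image_index)
print(results)
```

In a real system the vectors would come from trained image and text encoders, and a large database would typically use an approximate nearest-neighbour index instead of the exhaustive scan shown here.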