Multihop Multimodal QA using Joint Attentive Training and Hierarchical Attentive Vision Language transformers
| Title: | Multihop Multimodal QA using Joint Attentive Training and Hierarchical Attentive Vision Language transformers |
|---|---|
| Authors: | Dheeraj Pai, Deigant Yadava, João Monteiro, Vinay Nair |
| Publication Year: | 2024 |
| Collection: | KiltHub Research from Carnegie Mellon University |
| Subject Terms: | Knowledge representation and reasoning, Natural language processing, multimodal machine learning (ML), Machine Learning Methods, alignment, Reasoning in Machine Learning, Reasoning, Visual Question Answering (VQA), Knowledge Representation and Machine Learning |
| Description: | In this paper, we address the challenges in Multihop and Multimodal Question Answering (MMQA). Through an analysis of existing MMQA approaches, we identify alignment between multimodal data and reasoning as the bottleneck in MMQA systems. We hypothesize that jointly learning to predict relevant patches from an image along with predicting an answer can prevent models from overfitting by forcing them to learn relationships between image sections and the question, subsequently improving the reasoning process. We provide the details and analysis of three proposed approaches. We show that explicitly learning the alignment between images and text can allow multimodal models to focus on properties such as “color” and “shape” in images for VQA tasks. Our code can be found here: https://github.com/dheerajmpai/blockchainproject. |
| Document Type: | report |
| Language: | English |
| Relation: | https://figshare.com/articles/preprint/Multihop_Multimodal_QA_using_Joint_Attentive_Training_and_Hierarchical_Attentive_Vision_Language_transformers/24920718 |
| DOI: | 10.1184/r1/24920718.v1 |
| Availability: | https://doi.org/10.1184/r1/24920718.v1 https://figshare.com/articles/preprint/Multihop_Multimodal_QA_using_Joint_Attentive_Training_and_Hierarchical_Attentive_Vision_Language_transformers/24920718 |
| Rights: | CC BY 4.0 |
| Accession Number: | edsbas.5E069F6 |
| Database: | BASE |
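The description above hypothesizes that jointly learning to predict relevant image patches alongside the answer improves alignment between image regions and the question. Below is a minimal, hypothetical PyTorch sketch of such a joint objective; the architecture, feature dimensions, head names, and the loss weighting `alpha` are all illustrative assumptions and do not reproduce the authors' implementation.

```python
# Hypothetical sketch of a joint objective: predict the answer AND which image
# patches are relevant to the question. All names and sizes are assumptions.
import torch
import torch.nn as nn


class JointAttentiveVQA(nn.Module):
    def __init__(self, d_model=256, n_answers=1000, n_layers=4, n_heads=8):
        super().__init__()
        # Assumes precomputed 768-d patch features and 768-d question token embeddings.
        self.patch_proj = nn.Linear(768, d_model)
        self.text_proj = nn.Linear(768, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, n_layers)
        self.answer_head = nn.Linear(d_model, n_answers)  # answer classification head
        self.patch_head = nn.Linear(d_model, 1)           # per-patch relevance head

    def forward(self, patch_feats, token_embs):
        # patch_feats: (B, n_patches, 768); token_embs: (B, n_tokens, 768)
        x = torch.cat([self.patch_proj(patch_feats), self.text_proj(token_embs)], dim=1)
        h = self.encoder(x)
        n_patches = patch_feats.size(1)
        answer_logits = self.answer_head(h.mean(dim=1))                # pooled answer prediction
        patch_logits = self.patch_head(h[:, :n_patches]).squeeze(-1)   # relevance score per patch
        return answer_logits, patch_logits


def joint_loss(answer_logits, patch_logits, answer_labels, patch_labels, alpha=0.5):
    # Joint objective: answer cross-entropy plus patch-relevance BCE,
    # mixed by an assumed coefficient alpha.
    ce = nn.functional.cross_entropy(answer_logits, answer_labels)
    bce = nn.functional.binary_cross_entropy_with_logits(patch_logits, patch_labels)
    return ce + alpha * bce
```

The second head forces the shared encoder to attend to question-relevant regions rather than memorizing answer statistics, which is the overfitting-prevention effect the abstract describes; the exact supervision for patch relevance (e.g., annotated or heuristic patch labels) is not specified in this record.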