Multihop Multimodal QA using Joint Attentive Training and Hierarchical Attentive Vision Language transformers
| Title: | Multihop Multimodal QA using Joint Attentive Training and Hierarchical Attentive Vision Language transformers |
|---|---|
| Authors: | Dheeraj Pai, Deigant Yadava, João Monteiro, Vinay Nair |
| Publication Year: | 2024 |
| Collection: | KiltHub Research from Carnegie Mellon University |
| Subject Terms: | Knowledge representation and reasoning, Natural language processing, multimodal machine learning (ML), Machine Learning Methods, alignment, Reasoning in Machine Learning, Reasoning, Visual Question Answering (VQA), Knowledge Representation and Machine Learning |
| Description: | In this paper, we address the challenges in Multihop and Multimodal Question Answering (MMQA). Through an analysis of existing MMQA approaches, we identify alignment between multimodal data and reasoning as the bottleneck in MMQA systems. We hypothesize that jointly learning to predict relevant patches from an image along with predicting an answer can prevent models from overfitting by forcing them to learn relationships between image sections and the question, subsequently improving the reasoning process. We provide the details and analysis of three proposed approaches. We show that explicitly learning the alignment between images and text can allow multimodal models to focus on properties such as “color” and “shape” in images for VQA tasks. Our code can be found here: https://github.com/dheerajmpai/blockchainproject. |
| Document Type: | report |
| Language: | English |
| Relation: | https://figshare.com/articles/preprint/Multihop_Multimodal_QA_using_Joint_Attentive_Training_and_Hierarchical_Attentive_Vision_Language_transformers/24920718 |
| DOI: | 10.1184/r1/24920718.v1 |
| Availability: | https://doi.org/10.1184/r1/24920718.v1 https://figshare.com/articles/preprint/Multihop_Multimodal_QA_using_Joint_Attentive_Training_and_Hierarchical_Attentive_Vision_Language_transformers/24920718 |
| Rights: | CC BY 4.0 |
| Accession Number: | edsbas.5E069F6 |
| Database: | BASE |
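The description above hypothesizes that jointly learning to predict relevant image patches alongside the answer improves alignment between image regions and the question. Below is a minimal, hypothetical PyTorch sketch of such a joint objective; the architecture, feature dimensions, head names, and the loss weighting `alpha` are all illustrative assumptions and do not reproduce the authors' implementation.

```python
# Hypothetical sketch of a joint objective: predict the answer AND which image
# patches are relevant to the question. All names and sizes are assumptions.
import torch
import torch.nn as nn


class JointAttentiveVQA(nn.Module):
    def __init__(self, d_model=256, n_answers=1000, n_layers=4, n_heads=8):
        super().__init__()
        # Assumes precomputed 768-d patch features and 768-d question token embeddings.
        self.patch_proj = nn.Linear(768, d_model)
        self.text_proj = nn.Linear(768, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, n_layers)
        self.answer_head = nn.Linear(d_model, n_answers)  # answer classification head
        self.patch_head = nn.Linear(d_model, 1)           # per-patch relevance head

    def forward(self, patch_feats, token_embs):
        # patch_feats: (B, n_patches, 768); token_embs: (B, n_tokens, 768)
        x = torch.cat([self.patch_proj(patch_feats), self.text_proj(token_embs)], dim=1)
        h = self.encoder(x)
        n_patches = patch_feats.size(1)
        answer_logits = self.answer_head(h.mean(dim=1))                # pooled answer prediction
        patch_logits = self.patch_head(h[:, :n_patches]).squeeze(-1)   # relevance score per patch
        return answer_logits, patch_logits


def joint_loss(answer_logits, patch_logits, answer_labels, patch_labels, alpha=0.5):
    # Joint objective: answer cross-entropy plus patch-relevance BCE,
    # mixed by an assumed coefficient alpha.
    ce = nn.functional.cross_entropy(answer_logits, answer_labels)
    bce = nn.functional.binary_cross_entropy_with_logits(patch_logits, patch_labels)
    return ce + alpha * bce
```

The second head forces the shared encoder to attend to question-relevant regions rather than memorizing answer statistics, which is the overfitting-prevention effect the abstract describes; the exact supervision for patch relevance (e.g., annotated or heuristic patch labels) is not specified in this record.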