Multihop Multimodal QA using Joint Attentive Training and Hierarchical Attentive Vision Language transformers

Saved in:
Bibliographic details
Title: Multihop Multimodal QA using Joint Attentive Training and Hierarchical Attentive Vision Language transformers
Authors: Dheeraj Pai, Deigant Yadava, João Monteiro, Vinay Nair
Publication year: 2024
Collection: KiltHub Research from Carnegie Mellon University
Subjects: Knowledge representation and reasoning, Natural language processing, multimodal machine learning (ML), Machine Learning Methods, alignment, Reasoning in Machine Learning, Reasoning, Visual Question Answering (VQA), Knowledge Representation and Machine Learning
Description: In this paper we address the challenges in Multihop and Multimodal Question Answering (MMQA). Through an analysis of existing MMQA approaches, we identify the alignment between multimodal data and reasoning as the bottleneck in MMQA systems. We hypothesize that jointly learning to predict relevant patches from an image along with predicting an answer can prevent models from overfitting by forcing them to learn relationships between image sections and the question, subsequently improving the reasoning process. In this paper we provide the details and an analysis of three proposed approaches. We show that explicitly learning the alignment between images and text can allow multimodal models to focus on properties such as “color” and “shape” in images for VQA tasks. Our code can be found here: https://github.com/dheerajmpai/blockchainproject.
Document type: report
Language: unknown
Relation: https://figshare.com/articles/preprint/Multihop_Multimodal_QA_using_Joint_Attentive_Training_and_Hierarchical_Attentive_Vision_Language_transformers/24920718
DOI: 10.1184/r1/24920718.v1
Availability: https://doi.org/10.1184/r1/24920718.v1
https://figshare.com/articles/preprint/Multihop_Multimodal_QA_using_Joint_Attentive_Training_and_Hierarchical_Attentive_Vision_Language_transformers/24920718
Rights: CC BY 4.0
Accession number: edsbas.5E069F6
Database: BASE
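As a rough illustration of the joint-training idea described in the abstract (predicting relevant image patches alongside the answer so that the model learns image–question alignment), the sketch below combines an answer-classification loss with a patch-relevance loss. All module names, tensor shapes, and the weighting factor `alpha` are assumptions for illustration only, not the authors' implementation.

```python
# Minimal, hypothetical sketch of joint attentive training for VQA:
# the model predicts an answer and, jointly, which image patches are relevant.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointVQAHead(nn.Module):
    def __init__(self, hidden_dim: int, num_answers: int):
        super().__init__()
        self.answer_head = nn.Linear(hidden_dim, num_answers)  # answer classifier
        self.patch_head = nn.Linear(hidden_dim, 1)              # per-patch relevance score

    def forward(self, question_repr, patch_reprs):
        # question_repr: (batch, hidden_dim) pooled question/fused representation
        # patch_reprs:   (batch, num_patches, hidden_dim) image patch embeddings
        answer_logits = self.answer_head(question_repr)
        patch_logits = self.patch_head(patch_reprs).squeeze(-1)  # (batch, num_patches)
        return answer_logits, patch_logits

def joint_loss(answer_logits, patch_logits, answer_targets, patch_targets, alpha=0.5):
    # Joint objective: answer cross-entropy plus a patch-alignment term that
    # encourages the model to attend to patches relevant to the question.
    loss_answer = F.cross_entropy(answer_logits, answer_targets)
    loss_patch = F.binary_cross_entropy_with_logits(patch_logits, patch_targets)
    return loss_answer + alpha * loss_patch
```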