MRA-Net: Improving VQA Via Multi-Modal Relation Attention Network

Visual Question Answering (VQA) is a task to answer natural language questions tied to the content of visual images. Most recent VQA approaches usually apply attention mechanism to focus on the relevant visual objects and/or consider the relations between objects via off-the-shelf methods in visual...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:IEEE transactions on pattern analysis and machine intelligence Jg. 44; H. 1; S. 318 - 329
Hauptverfasser: Peng, Liang, Yang, Yang, Wang, Zheng, Huang, Zi, Shen, Heng Tao
Format: Journal Article
Sprache:Englisch
Veröffentlicht: United States IEEE 01.01.2022
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Schlagworte:
ISSN:0162-8828, 1939-3539, 2160-9292, 1939-3539
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Visual Question Answering (VQA) is a task to answer natural language questions tied to the content of visual images. Most recent VQA approaches usually apply attention mechanism to focus on the relevant visual objects and/or consider the relations between objects via off-the-shelf methods in visual relation reasoning. However, they still suffer from several drawbacks. First, they mostly model the simple relations between objects, which results in many complicated questions cannot be answered correctly, because of failing to provide sufficient knowledge. Second, they seldom leverage the harmony cooperation of visual appearance feature and relation feature. To solve these problems, we propose a novel end-to-end VQA model, termed Multi-modal Relation Attention Network ( MRA-Net ). The proposed model explores both textual and visual relations to improve performance and interpretability. In specific, we devise 1) a self-guided word relation attention scheme, which explore the latent semantic relations between words; 2) two question-adaptive visual relation attention modules that can extract not only the fine-grained and precise binary relations between objects but also the more sophisticated trinary relations. Both kinds of question-related visual relations provide more and deeper visual semantics, thereby improving the visual reasoning ability of question answering. Furthermore, the proposed model also combines appearance feature with relation feature to reconcile the two types of features effectively. Extensive experiments on five large benchmark datasets, VQA-1.0, VQA-2.0, COCO-QA, VQA-CP v2, and TDIUC, demonstrate that our proposed model outperforms state-of-the-art approaches.
Bibliographie:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
content type line 23
ISSN:0162-8828
1939-3539
2160-9292
1939-3539
DOI:10.1109/TPAMI.2020.3004830