MRA-Net: Improving VQA Via Multi-Modal Relation Attention Network

Detailed Bibliography
Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 44, No. 1, pp. 318-329
Main Authors: Peng, Liang; Yang, Yang; Wang, Zheng; Huang, Zi; Shen, Heng Tao
Format: Journal Article
Language: English
Published: United States: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01 Jan. 2022
ISSN: 0162-8828, 1939-3539, 2160-9292
Description
Summary: Visual Question Answering (VQA) is the task of answering natural language questions tied to the content of visual images. Most recent VQA approaches apply an attention mechanism to focus on the relevant visual objects and/or consider the relations between objects via off-the-shelf methods in visual relation reasoning. However, they still suffer from several drawbacks. First, they mostly model only simple relations between objects, so many complicated questions cannot be answered correctly because sufficient relational knowledge is not provided. Second, they seldom exploit the harmonious cooperation of visual appearance features and relation features. To solve these problems, we propose a novel end-to-end VQA model, termed Multi-modal Relation Attention Network (MRA-Net). The proposed model explores both textual and visual relations to improve performance and interpretability. Specifically, we devise 1) a self-guided word relation attention scheme, which explores the latent semantic relations between words, and 2) two question-adaptive visual relation attention modules that extract not only fine-grained and precise binary relations between objects but also more sophisticated trinary relations. Both kinds of question-related visual relations provide richer and deeper visual semantics, thereby improving the visual reasoning ability of question answering. Furthermore, the proposed model combines appearance features with relation features to reconcile the two types of features effectively. Extensive experiments on five large benchmark datasets, VQA-1.0, VQA-2.0, COCO-QA, VQA-CP v2, and TDIUC, demonstrate that our proposed model outperforms state-of-the-art approaches.
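As a purely illustrative, hypothetical sketch (this record does not include the authors' implementation; every class name, dimension, and fusion choice below is an assumption), a question-adaptive attention over binary object relations of the kind described in the summary could look as follows: object features are formed into pairs, each pair is fused with the question embedding, and a softmax over all pairs yields a question-conditioned relation feature.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PairRelationAttention(nn.Module):
    # Hypothetical sketch, not the MRA-Net implementation: question-guided
    # attention over all ordered object pairs (binary relations).
    def __init__(self, obj_dim=2048, q_dim=1024, hid_dim=512):
        super().__init__()
        self.pair_proj = nn.Linear(2 * obj_dim, hid_dim)  # fuse the two objects of a pair
        self.q_proj = nn.Linear(q_dim, hid_dim)           # project the question embedding
        self.score = nn.Linear(hid_dim, 1)                 # scalar attention logit per pair

    def forward(self, objs, q):
        # objs: (B, N, obj_dim) region features; q: (B, q_dim) question embedding
        B, N, D = objs.shape
        oi = objs.unsqueeze(2).expand(B, N, N, D)          # object i broadcast over j
        oj = objs.unsqueeze(1).expand(B, N, N, D)          # object j broadcast over i
        pairs = torch.cat([oi, oj], dim=-1).reshape(B, N * N, 2 * D)
        h = torch.tanh(self.pair_proj(pairs) + self.q_proj(q).unsqueeze(1))
        attn = F.softmax(self.score(h), dim=1)             # weights over all N*N pairs
        rel = (attn * self.pair_proj(pairs)).sum(dim=1)    # question-conditioned relation feature
        return rel, attn

A trinary variant would analogously score object triples, and the resulting relation feature would be combined with the attended appearance feature before answer prediction, as the summary describes.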
DOI: 10.1109/TPAMI.2020.3004830