A visual question answering method based on task decomposition


Bibliographic Details
Published in: PLoS ONE, Vol. 20, Issue 11, p. e0336623
Main authors: Cong, Yao; Mo, Hongwei
Format: Journal Article
Language: English
Published: Public Library of Science (PLoS), United States, 13 Nov 2025
ISSN: 1932-6203
Online access: Full text
Description
Abstract: Visual question answering (VQA) is an interdisciplinary task spanning computer vision and natural language processing that evaluates a model's visual reasoning ability, requiring the integration of image information extraction and natural language understanding. Testing on professional benchmarks that control for potential bias shows that VQA based on task decomposition is a promising approach: compared with traditional VQA methods that rely only on multimodal fusion, it offers interpretability at the program execution stage and reduced dependence on data bias. Task-decomposition VQA methods decompose the task by parsing natural language, usually with sequence-to-sequence networks, which struggle with flexible and varied natural language and thus often fail to decompose the task accurately. To address this issue, we propose a Graph-to-Sequence Task Decomposition Network (Graph2Seq-TDN), which uses semantic structural information from natural language to guide the task decomposition process and improve parsing accuracy. In terms of reasoning execution, in addition to the original symbolic execution, we propose a reasoning executor to enhance execution performance. We conducted validation on four datasets: CLEVR, CLEVR-Human, CLEVR-CoGenT, and GQA. The experimental results show that our model outperforms the comparison models in answering accuracy, program accuracy, and training cost at the same accuracy.
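To make the task-decomposition idea concrete, the following is a minimal illustrative sketch, not the paper's implementation: a CLEVR-style question is parsed into a sequence of program modules, which a symbolic executor then runs step by step over a scene representation. All names (the scene fields, module names, and the example program) are hypothetical.

```python
# Hypothetical symbolic scene: each object is a set of attributes.
SCENE = [
    {"id": 0, "color": "red", "shape": "cube", "size": "large"},
    {"id": 1, "color": "blue", "shape": "sphere", "size": "small"},
    {"id": 2, "color": "red", "shape": "sphere", "size": "small"},
]

# Program a task-decomposition parser might emit for
# "How many red objects are there?" (illustrative, not Graph2Seq-TDN output).
PROGRAM = [
    ("scene", None),          # start from all objects
    ("filter_color", "red"),  # keep only red objects
    ("count", None),          # answer: number of remaining objects
]

def execute(program, scene):
    """Run each module on the intermediate result in order.

    Every intermediate state can be inspected, which is the source of
    the interpretability that task-decomposition VQA offers.
    """
    state = None
    for op, arg in program:
        if op == "scene":
            state = list(scene)
        elif op == "filter_color":
            state = [obj for obj in state if obj["color"] == arg]
        elif op == "count":
            state = len(state)
        else:
            raise ValueError(f"unknown module: {op}")
    return state

answer = execute(PROGRAM, SCENE)  # two red objects in this scene
```

The executable program makes each reasoning step explicit, whereas a pure multimodal-fusion model would map the question and image directly to an answer with no inspectable intermediate states.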
DOI:10.1371/journal.pone.0336623