Open-ended remote sensing visual question answering with transformers


Detailed Description

Bibliographic Details
Published in: International Journal of Remote Sensing, Vol. 43, No. 18, pp. 6809-6823
Main Authors: Al Rahhal, Mohamad M., Bazi, Yakoub, Alsaleh, Sara O., Al-Razgan, Muna, Mekhalfi, Mohamed Lamine, Al Zuair, Mansour, Alajlan, Naif
Format: Journal Article
Language: English
Published: London: Taylor & Francis Ltd, 17 September 2022
Subjects:
ISSN: 0143-1161, 1366-5901
Online Access: Full text
Description
Abstract: Visual question answering (VQA) has recently been attracting attention in remote sensing. However, existing solutions remain limited in that current VQA datasets address closed-ended question-answer queries, which do not necessarily reflect real open-ended scenarios. In this paper, we propose a new dataset, named VQA-TextRS, that was built manually with human annotations and covers various forms of open-ended question-answer pairs. We further propose an encoder-decoder architecture based on transformers, whose self-attention mechanism enables relational learning across different positions of the same sequence without the recurrence operations of typical sequence models. Specifically, we employ vision and natural language processing (NLP) transformers to draw visual and textual cues from the image and the corresponding question, respectively. A transformer decoder then fuses the two modalities through its cross-attention mechanism, and the fused vectors drive the answer-generation process that produces the final output. We demonstrate that plausible results can be obtained in open-ended VQA; for instance, the proposed architecture scores an accuracy of 84.01% on questions related to the presence of objects in the query images.
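
The abstract describes, at a high level, a two-stream encoder-decoder design: a vision transformer encodes the image, a text transformer encodes the question, and a transformer decoder cross-attends over the fused representations to generate the answer token by token. The paper's exact configuration is not given in this record, so the following is only a minimal PyTorch sketch of that general pattern; the class name, layer counts, dimensions, vocabulary size, and the concatenation-based fusion of image and question memories are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class OpenEndedVQASketch(nn.Module):
        """Hypothetical minimal encoder-decoder VQA model (all sizes are placeholders)."""
        def __init__(self, vocab_size=10000, d_model=256, nhead=8):
            super().__init__()
            # Visual stream: project precomputed patch features (e.g. from a
            # pretrained vision transformer) into the shared model space.
            self.patch_proj = nn.Linear(768, d_model)
            # Textual stream: embed question tokens, then self-attention encoding.
            # Positional embeddings are omitted here for brevity.
            self.tok_embed = nn.Embedding(vocab_size, d_model)
            enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.question_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
            # Decoder: cross-attends over the fused image+question memory
            # while generating the answer autoregressively.
            dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
            self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, patch_feats, question_ids, answer_ids):
            vis = self.patch_proj(patch_feats)                          # (B, P, d)
            txt = self.question_encoder(self.tok_embed(question_ids))  # (B, Q, d)
            # One plausible fusion: concatenate both streams as decoder memory,
            # so cross-attention can attend jointly to visual and textual cues.
            memory = torch.cat([vis, txt], dim=1)                       # (B, P+Q, d)
            tgt = self.tok_embed(answer_ids)                            # (B, A, d)
            causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
            out = self.decoder(tgt, memory, tgt_mask=causal)
            return self.lm_head(out)                                    # (B, A, vocab)

    # Toy usage: random tensors stand in for ViT patch features and token ids.
    model = OpenEndedVQASketch()
    logits = model(torch.randn(2, 196, 768),
                   torch.randint(0, 10000, (2, 12)),
                   torch.randint(0, 10000, (2, 8)))
    print(logits.shape)  # torch.Size([2, 8, 10000])

In this reading, "fusion" happens inside the decoder's cross-attention over the concatenated memory; other fusion strategies (e.g. co-attention between the two encoders before decoding) would fit the abstract equally well.
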
DOI: 10.1080/01431161.2022.2145583