Interactions Guided Generative Adversarial Network for unsupervised image captioning

Bibliographic details
Published in: Neurocomputing (Amsterdam), Volume 417, pp. 419-431
Main authors: Cao, Shan; An, Gaoyun; Zheng, Zhenxing; Ruan, Qiuqi
Format: Journal Article
Language: English
Publication details: Elsevier B.V., 05.12.2020
ISSN: 0925-2312, 1872-8286
Description
Summary:
• ResNet with a new Multi-scale module and adaptive Channel attention is proposed.
• A Mutual Attention Network is proposed to reason about interactions among objects.
• Information on object-object interactions is used in adversarial generation.
• Image-sentence alignment is enforced by cycle consistency.
• An effective unsupervised image captioning model, IGGAN, is proposed.

Most current image captioning models that have achieved great success depend heavily on manually labeled image-caption pairs. However, acquiring large-scale paired data is expensive and time-consuming. In this paper, we propose the Interactions Guided Generative Adversarial Network (IGGAN) for unsupervised image captioning, which combines multi-scale feature representation with object-object interactions. To obtain a robust feature representation, the image is encoded by a ResNet with a new Multi-scale module and adaptive Channel attention (RMCNet). Moreover, information on object-object interactions is extracted by our Mutual Attention Network (MAN) and then used during adversarial generation, which improves the plausibility of the generated sentences. To encourage each sentence to be semantically consistent with the image, IGGAN uses cycle consistency so that the image and the generated sentence reconstruct each other. Our model can generate sentences without any manually labeled image-caption pairs. Experimental results show that it achieves promising performance on the MSCOCO image captioning dataset, and ablation studies validate the effectiveness of the proposed modules.
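The abstract does not include code; as a hedged illustration only, the following minimal PyTorch sketch shows one common way to realize an adaptive channel-attention gate of the kind the RMCNet description suggests (squeeze-and-excitation style). The class name, reduction ratio, and placement are assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class AdaptiveChannelAttention(nn.Module):
    # Sketch of a channel-attention gate; names and ratios are assumed, not from the paper.
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # squeeze: one global value per channel
        self.fc = nn.Sequential(                 # excitation: learn per-channel weights
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                             # reweight the encoder's feature channels

# Hypothetical usage: gate the output feature map of a multi-scale ResNet stage.
feats = torch.randn(2, 256, 14, 14)
out = AdaptiveChannelAttention(256)(feats)       # same shape, channels reweighted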
DOI: 10.1016/j.neucom.2020.08.019