Interactions Guided Generative Adversarial Network for unsupervised image captioning

Detailed Description

Bibliographic Details
Published in: Neurocomputing (Amsterdam), Vol. 417, pp. 419-431
Main Authors: Cao, Shan; An, Gaoyun; Zheng, Zhenxing; Ruan, Qiuqi
Format: Journal Article
Language: English
Published: Elsevier B.V., 05.12.2020
Subjects:
ISSN: 0925-2312, 1872-8286
Online Access: Full text
Description
Abstract:
• ResNet with a new Multi-scale module and adaptive Channel attention is proposed.
• A Mutual Attention Network is proposed to reason about interactions among objects.
• The information on object-object interactions is adopted in adversarial generation.
• The alignment between the image and sentence is performed by cycle consistency.
• An effective unsupervised image captioning model, IGGAN, is proposed.
Most current image captioning models that have achieved great success depend heavily on manually labeled image-caption pairs. However, acquiring large-scale paired data is expensive and time-consuming. In this paper, we propose the Interactions Guided Generative Adversarial Network (IGGAN) for unsupervised image captioning, which jointly exploits multi-scale feature representation and object-object interactions. To obtain a robust feature representation, the image is encoded by ResNet with a new Multi-scale module and adaptive Channel attention (RMCNet). Moreover, information on object-object interactions is extracted by our Mutual Attention Network (MAN) and then adopted in the process of adversarial generation, which enhances the rationality of the generated sentences. To encourage the sentence to be semantically consistent with the image, IGGAN uses the image and the generated sentence to reconstruct each other via cycle consistency. Our model can generate sentences without any manually labeled image-caption pairs. Experimental results show that it achieves promising performance on the MSCOCO image captioning dataset, and the ablation studies validate the effectiveness of the proposed modules.
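The abstract does not spell out the training objective, but as a hypothetical sketch of how the described pieces could fit together (the loss terms and weights below are assumptions for illustration, not taken from the paper), the adversarial caption-generation loss might be combined with two cycle-consistency terms:

\mathcal{L}_{\mathrm{IGGAN}} = \mathcal{L}_{\mathrm{adv}} + \lambda_{i}\,\mathcal{L}_{\mathrm{img\text{-}cyc}} + \lambda_{s}\,\mathcal{L}_{\mathrm{sen\text{-}cyc}}

where \mathcal{L}_{\mathrm{img\text{-}cyc}} would penalize the distance between the original image features and those reconstructed from the generated sentence, and \mathcal{L}_{\mathrm{sen\text{-}cyc}} would do the same for a sentence reconstructed from the image representation it induces.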
DOI: 10.1016/j.neucom.2020.08.019