Interactions Guided Generative Adversarial Network for unsupervised image captioning



Bibliographic Details
Published in: Neurocomputing (Amsterdam), Vol. 417, pp. 419-431
Main Authors: Cao, Shan, An, Gaoyun, Zheng, Zhenxing, Ruan, Qiuqi
Format: Journal Article
Language: English
Published: Elsevier B.V., 05.12.2020
ISSN: 0925-2312, 1872-8286
Description
Summary:
Highlights:
• ResNet with a new Multi-scale module and adaptive Channel attention is proposed.
• A Mutual Attention Network is proposed to reason about interactions among objects.
• The information on object-object interactions is adopted in adversarial generation.
• The alignment between the image and the sentence is performed by cycle consistency.
• An effective unsupervised image captioning model, IGGAN, is proposed.

Abstract: Most current image captioning models that have achieved great success depend heavily on manually labeled image-caption pairs. However, acquiring large-scale paired data is expensive and time-consuming. In this paper, we propose the Interactions Guided Generative Adversarial Network (IGGAN) for unsupervised image captioning, which combines multi-scale feature representation with object-object interactions. To obtain a robust feature representation, the image is encoded by ResNet with a new Multi-scale module and adaptive Channel attention (RMCNet). Moreover, information on object-object interactions is extracted by our Mutual Attention Network (MAN) and then used in the adversarial generation process, which enhances the rationality of the generated sentences. To encourage the sentence to be semantically consistent with the image, IGGAN uses the image and the generated sentence to reconstruct each other via cycle consistency. The proposed model can generate sentences without any manually labeled image-caption pairs. Experimental results show that it achieves quite promising performance on the MSCOCO image captioning dataset, and ablation studies validate the effectiveness of the proposed modules.
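The adaptive channel attention described in the abstract can be pictured as a learned per-channel gate applied to ResNet feature maps. The sketch below is a minimal squeeze-and-excitation style illustration of that idea only; it is not the authors' exact RMCNet module, and the PyTorch framing, layer sizes, and reduction ratio are all assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Illustrative squeeze-and-excitation style channel gate (assumed design)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)        # global spatial average per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                           # per-channel weight in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                # reweight the feature channels

# Example: gate a batch of ResNet-style feature maps.
feats = torch.randn(2, 256, 14, 14)
print(ChannelAttention(256)(feats).shape)           # torch.Size([2, 256, 14, 14])
```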
DOI: 10.1016/j.neucom.2020.08.019