Adaptive Latent Graph Representation Learning for Image-Text Matching

Bibliographic details
Published in:	IEEE Transactions on Image Processing, Vol. 32, p. 1
Main authors:	Tian, Mengxiao; Wu, Xinxiao; Jia, Yunde
Format:	Journal Article
Language:	English
Published:	United States, IEEE, 01.01.2023, The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects:	Adaptation models; Embedding; Feature extraction; Graph representations; graph variational autoencoder; Graphical representations; Image edge detection; Image-text matching; latent representation learning; Learning; Matching; Representation learning; Semantics; Task analysis; Visualization
ISSN:	1057-7149, 1941-0042

Description
Summary:	Image-text matching is a challenging task due to the modality gap. Many recent methods focus on modeling entity relationships to learn a common embedding space of image and text. However, these methods suffer from distractions of entity relationships such as irrelevant visual regions in an image and noisy textual words in a text. In this paper, we propose an adaptive latent graph representation learning method to reduce the distractions of entity relationships for image-text matching. Specifically, we use an improved graph variational autoencoder to separate the distracting factors and latent factor of relationships and jointly learn latent textual graph representations, latent visual graph representations, and a visual-textual graph embedding space. We also introduce an adaptive cross-attention mechanism to perform feature attending on the latent graph representations across images and texts, thus further narrowing the modality gap to boost the matching performance. Extensive experiments on two public datasets, Flickr30K and COCO, show the effectiveness of our method.
DOI:	10.1109/TIP.2022.3229631