HiCoS-Net: hierarchical cross-modal graph learning with dynamic attention for hard negative-aware image-text matching

Bibliographic Details
Published in:	Journal of King Saud University. Computer and information sciences Vol. 37; no. 9; pp. 281 - 30
Main Authors:	Feng, Dingcheng, Luo, Ning, Zhang, Shudong, Zhou, Lijuan, Wei, Bing
Format:	Journal Article
Language:	English
Published:	Cham Springer International Publishing 01.11.2025 Springer Nature B.V Springer
Subjects:	Adaptation; Associations; Attention; Computer Imaging; Computer Science; Database Management; Dynamic graph attention mechanism; Effectiveness; Embedding; Embedding similarity graph; Hard negative samples; HiCoS-Net; Hierarchical inter-modal semantic network; Knowledge; Learning; Machine Learning; Matching; Neural networks; Original Paper; Pattern Recognition and Graphics; Semantics; Software Engineering/Programming and Operating Systems; Systems and Data Security; Theory of Computation; Vision
ISSN:	1319-1578, 2213-1248

Description
Summary:	Fine-grained image-text matching, which is pivotal to multimodal intelligence, has advanced semantic correspondence inference through inter-modal region-word aggregation. Despite its efficacy, this approach cannot accommodate the semantic associations of hard negative samples. For example, it fails to leverage knowledge shared across multiple samples on similar topics, which leaves it poorly equipped to distinguish hard negative samples. This study posits that establishing relationships between samples facilitates learning the semantic associations among them, which in turn makes it possible to identify subtle differences between hard negative samples and so improves the learned embeddings. The paper proposes HiCoS-Net, a novel hierarchical inter-modal semantic network that learns robust embeddings through local-to-sample semantic interaction propagation. Specifically, at the local level, a dynamic graph attention mechanism is designed to achieve fine-grained region-word interactions; at the sample level, an embedding similarity graph is constructed by combining the relational mapping matrix with the semantic matching matrix to explicitly model the topological associations and semantic coupling strengths of inter-modal samples. Extensive experiments validate the advantages of HiCoS-Net, which achieves state-of-the-art image-text matching results on the public benchmark datasets Flickr30K and MS-COCO.
DOI:	10.1007/s44443-025-00313-x
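
The summary describes two levels of interaction (region-word attention at the local level, and a sample-level embedding similarity graph) but gives no formulas. The NumPy sketch below is only an illustration of those two ideas under stated assumptions; it is not the authors' implementation. The function names, the cosine similarity, the softmax `temperature`, and the top-`k` neighbor mask (used here as a stand-in for the "relational mapping matrix") are all hypothetical choices made for this example.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def region_word_attention(regions, words, temperature=0.1):
    """Local level (assumed form): each word attends over image regions.

    regions: (R, D) region features; words: (W, D) word features.
    Returns word-conditioned visual context (W, D) and weights (W, R).
    """
    scores = l2_normalize(words) @ l2_normalize(regions).T  # (W, R) cosine scores
    weights = softmax(scores / temperature, axis=1)         # each row sums to 1
    return weights @ regions, weights

def sample_similarity_graph(img_emb, txt_emb, k=2):
    """Sample level (assumed form): weighted image-to-text adjacency.

    A cosine similarity matrix stands in for the semantic matching matrix,
    and a binary top-k mask per image stands in for the relational mapping
    matrix; their elementwise product gives the edge weights.
    """
    sim = l2_normalize(img_emb) @ l2_normalize(txt_emb).T   # (N, N)
    top_k = np.argsort(-sim, axis=1)[:, :k]                 # k most similar texts per image
    mask = np.zeros_like(sim)
    np.put_along_axis(mask, top_k, 1.0, axis=1)
    return mask * sim

# Toy usage with random features (illustrative shapes only).
rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 8))   # 5 regions, 8-dim features
words = rng.normal(size=(3, 8))     # 3 words
context, weights = region_word_attention(regions, words)

img_emb = rng.normal(size=(4, 8))   # 4 image embeddings
txt_emb = rng.normal(size=(4, 8))   # 4 text embeddings
graph = sample_similarity_graph(img_emb, txt_emb, k=2)
```

In a sketch like this, the non-matching texts that survive the top-`k` mask for an image are natural candidates for hard negatives, which is one plausible way to read the sample-level modeling the summary describes.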