HiCoS-Net: hierarchical cross-modal graph learning with dynamic attention for hard negative-aware image-text matching
| Published in: | Journal of King Saud University. Computer and information sciences Vol. 37; no. 9; pp. 281 - 30 |
|---|---|
| Main Authors: | , , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Cham: Springer International Publishing, 01.11.2025 (Springer Nature B.V, Springer) |
| Subjects: | |
| ISSN: | 1319-1578, 2213-1248 |
| Summary: | Fine-grained image-text matching, which is pivotal to multimodal intelligence, has advanced semantic correspondence inference through inter-modal region-word aggregation. Despite its efficacy, this approach cannot accommodate the semantic associations of hard negative samples. For example, it fails to leverage knowledge shared across multiple samples on analogous topics, which leaves it poorly equipped to differentiate hard negative samples. This study posits that establishing sample relationships facilitates learning the semantic associations between different samples, which in turn enables effective identification of subtle differences between hard negative samples and enhances the overall embedding process. The paper proposes HiCoS-Net, a novel hierarchical inter-modal semantic network that learns robust embeddings through local-to-sample semantic interaction propagation. Specifically, at the local level, a dynamic graph attention mechanism is designed to achieve fine-grained region-lexicon interactions; at the sample level, an embedding similarity graph is constructed by combining the relational mapping matrix with the semantic matching matrix to explicitly model the topological associations and semantic coupling strengths of inter-modal samples. Extensive experiments validate the advantages of HiCoS-Net, which achieves state-of-the-art image-text matching results on the public benchmark datasets Flickr30K and MS-COCO. |
| DOI: | 10.1007/s44443-025-00313-x |
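
The sample-level step described in the summary (combining a relational mapping matrix with a semantic matching matrix to form an embedding similarity graph) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the record does not specify how either matrix is built, so the sketch assumes cosine similarity for the semantic matching matrix and a k-nearest-neighbour mask for the relational mapping matrix. The function names `build_similarity_graph` and `propagate` and the parameters `k` and `alpha` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def build_similarity_graph(img_emb, txt_emb, k=2):
    """Return (sim, rel) for a batch of image and text embeddings.

    sim: semantic matching matrix, assumed here to be cosine similarity.
    rel: relational mapping matrix, assumed here to be a 0/1 mask that keeps
         each image's k most similar texts (the graph topology).
    """
    sim = l2_normalize(img_emb) @ l2_normalize(txt_emb).T
    topk = np.argsort(-sim, axis=1)[:, :k]
    rel = np.zeros_like(sim)
    np.put_along_axis(rel, topk, 1.0, axis=1)
    return sim, rel


def propagate(img_emb, txt_emb, sim, rel, alpha=0.5):
    """One hypothetical propagation step over the retained edges.

    Edge weights are a row-wise softmax of sim restricted to edges where
    rel == 1; each image embedding is then mixed with the weighted average
    of its neighbouring text embeddings.
    """
    masked = np.where(rel > 0, sim, -np.inf)
    masked = masked - masked.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(masked)  # exp(-inf) == 0, so dropped edges get no weight
    weights = weights / weights.sum(axis=1, keepdims=True)
    return (1.0 - alpha) * img_emb + alpha * (weights @ txt_emb)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(4, 8))  # 4 image embeddings of dimension 8
    texts = rng.normal(size=(5, 8))   # 5 text embeddings of dimension 8
    sim, rel = build_similarity_graph(images, texts, k=2)
    updated = propagate(images, texts, sim, rel, alpha=0.5)
    print(sim.shape, rel.sum(axis=1), updated.shape)
```

In this sketch the mask carries the topological association and the similarity values carry the coupling strength, so near-duplicate samples on the same topic influence each other's embeddings; how HiCoS-Net actually couples the two matrices is described only in the full paper.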