HiCoS-Net: hierarchical cross-modal graph learning with dynamic attention for hard negative-aware image-text matching

Detailed bibliography
Published in: Journal of King Saud University. Computer and information sciences, Vol. 37; no. 9; pp. 281 - 30
Main authors: Feng, Dingcheng, Luo, Ning, Zhang, Shudong, Zhou, Lijuan, Wei, Bing
Format: Journal Article
Language: English
Published: Cham Springer International Publishing 01.11.2025
Springer Nature B.V
Springer
ISSN: 1319-1578, 2213-1248
Description
Summary: Fine-grained image-text matching, which is pivotal to multimodal intelligence, has advanced semantic correspondence inference through inter-modal region-word aggregation. Despite its efficacy, this approach cannot accommodate the semantic associations of hard negative samples. For example, it fails to leverage knowledge shared across multiple samples on analogous topics, which leaves it poorly equipped to differentiate hard negative samples. This study posits that establishing sample relationships facilitates the learning of semantic associations between different samples, which in turn enables the effective identification of subtle differences between hard negative samples and thereby enhances the overall embedding process. The paper proposes HiCoS-Net, a novel hierarchical inter-modal semantic network that learns robust embeddings through local-to-sample semantic interaction propagation. Specifically, at the local level, a dynamic graph attention mechanism is designed to achieve fine-grained region-lexicon interactions; at the sample level, an embedding similarity graph is constructed by combining the relational mapping matrix with the semantic matching matrix to explicitly model the topological associations and semantic coupling strengths of inter-modal samples. Extensive experiments validate the advantages of HiCoS-Net, which achieves state-of-the-art image-text matching results on the public benchmark datasets Flickr30K and MS-COCO.
DOI: 10.1007/s44443-025-00313-x
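
The summary describes the sample-level step only in outline: an embedding similarity graph built by combining a relational mapping matrix with a semantic matching matrix. The record does not give the paper's actual formulation, so the following is only a minimal illustrative sketch under stated assumptions: cosine similarity stands in for the semantic matching matrix, a top-k neighbour mask stands in for the relational mapping matrix, and the two are combined by elementwise product. All function names, the choice of k, and the toy embeddings are hypothetical and are not taken from HiCoS-Net.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_matching_matrix(image_embs, text_embs):
    """S[i][j] = cosine similarity of image i and text j (assumed score)."""
    return [[cosine(img, txt) for txt in text_embs] for img in image_embs]


def relational_mapping_matrix(sim, k):
    """Assumed relation mask: R[i][j] = 1.0 if text j is among the k texts
    most similar to image i, else 0.0."""
    rel = []
    for row in sim:
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        rel.append([1.0 if j in top else 0.0 for j in range(len(row))])
    return rel


def similarity_graph(sim, rel):
    """Edge weight = relation mask times matching score, elementwise."""
    return [[r * s for r, s in zip(rrow, srow)] for rrow, srow in zip(rel, sim)]


def hardest_negative(graph, i):
    """Index of the strongest positive-weight edge from image i to a
    non-matching text (j != i), or None if no such edge exists.
    Assumes image i is paired with text i."""
    candidates = [(w, j) for j, w in enumerate(graph[i]) if j != i and w > 0.0]
    return max(candidates)[1] if candidates else None


# Toy batch: image i is paired with text i; samples 0 and 1 are near-duplicates.
image_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
text_embs = [[1.0, 0.1], [0.8, 0.2], [0.1, 1.0]]

sim = semantic_matching_matrix(image_embs, text_embs)
rel = relational_mapping_matrix(sim, k=2)
graph = similarity_graph(sim, rel)

for i in range(len(image_embs)):
    print(i, hardest_negative(graph, i))
```

In this toy batch the two near-duplicate samples end up as each other's hardest negatives, which is the kind of cross-sample relationship the summary says the sample-level graph is meant to expose.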