LLM-powered scene graph representation learning for image retrieval via visual triplet-based graph transformation


Bibliographic details
Published in: Expert Systems with Applications, Volume 286, p. 127926
Main authors: Jeong, Soohwan; Park, Jongmin; Choi, Mingyu; Kwon, Yongjin; Lim, Sungsu
Format: Journal Article
Language: English
Published: Elsevier Ltd, 15 August 2025
ISSN:0957-4174
Description
Summary:
• Image retrieval system leveraging LLM-powered high-level visual context.
• Convert a scene graph into a visual triplet-based graph with triplets as nodes.
• Graph embedding reflects the importance of visual triplets via an attention mechanism.
• VTGT achieves superior image retrieval performance compared to baselines.

A scene graph represents the relational information between objects within an image, conveying its inherent semantic content. Current image retrieval methods, which use images as queries to find similar ones, typically rely on visual content or basic structural similarities between scene graphs. These methods therefore use only surface-level information and overlook the high-level semantic information embedded in the scene graph. In this study, we leverage visual triplet units, each consisting of a subject, a relation, and an object from the scene graph, to capture high-level semantics more effectively. To enhance the triplets, we incorporate extensive knowledge from large language models (LLMs). We propose Visual Triplet-based Graph Transformation (VTGT), a framework that transforms the scene graph into a visual triplet-based graph in which the triplets serve as the nodes. This transformed graph is then processed by a graph neural network (GNN) to learn an optimal scene graph representation. Experimental results in image retrieval demonstrate the superior performance of our approach, driven by the LLM-powered visual triplet-based graph representation.
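
The transformation described in the summary, from a scene graph to a graph whose nodes are triplets, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The summary does not say how triplet nodes are connected, so the rule used here (two triplets are linked when they share a subject or object) is an assumption, and the function name `scene_graph_to_triplet_graph` and the example triplets are hypothetical. The LLM enhancement and the GNN with attention are not covered.

```python
from itertools import combinations


def scene_graph_to_triplet_graph(triplets):
    """Turn (subject, relation, object) triplets into a triplet-based graph.

    Each distinct triplet becomes one node. ASSUMPTION (not stated in the
    summary): two triplet nodes are connected when they share an entity,
    i.e. a subject or an object.
    """
    # Deduplicate triplets while preserving their original order.
    nodes = list(dict.fromkeys(triplets))
    edges = set()
    for i, j in combinations(range(len(nodes)), 2):
        subj_i, _, obj_i = nodes[i]
        subj_j, _, obj_j = nodes[j]
        # Link the two triplet nodes if their entity sets overlap.
        if {subj_i, obj_i} & {subj_j, obj_j}:
            edges.add((i, j))
    return nodes, sorted(edges)


# Hypothetical scene graph for an image of a man riding a horse.
scene_graph = [
    ("man", "riding", "horse"),
    ("horse", "on", "grass"),
    ("man", "wearing", "hat"),
]
nodes, edges = scene_graph_to_triplet_graph(scene_graph)
print(nodes)
print(edges)
```

In this sketch the edges are index pairs into `nodes`, a form that can be passed to a GNN library as an edge list once each triplet node has been given a feature vector.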
DOI:10.1016/j.eswa.2025.127926