P^2M: Progressive Perspective Mining for Referring Video Object Segmentation


Bibliographic Details
Published in: IEEE Transactions on Multimedia, pp. 1-12
Main authors: Wang, Yihan, Sun, Baoli, Ma, Xinzhu, Ge, Hongwei, Fan, Jiulin, Li, Haojie
Format: Journal Article
Language: English
Published: IEEE 2025
ISSN: 1520-9210, 1941-0077
Description
Summary: Referring video object segmentation (RVOS) aims to segment the object instances referred to by linguistic expressions in video frames. Prevailing approaches mainly rely on simplistic fusion strategies in which textual features interact directly with video features, without considering the impact of textual semantics at different levels. These coarse-grained fusion strategies hinder the model's ability to perceive changes in object appearance and movement, resulting in performance degradation. To mitigate this issue, we propose a Progressive Perspective Mining (P^2M) framework, which leverages a coarse-to-fine perspective to mine latent information from text and video, enabling precise segmentation of referred objects. P^2M consists of two key components: Progressive Vision-Language Interaction (PVLI) and Vision-Language Synergistic Fusion (VLSF). Specifically, PVLI leverages language features at the subject, word, and sentence levels to mine textual information, enabling progressive interaction with video features within an integrated representational space. Concurrently, VLSF focuses on generating semantically rich object queries for segmentation by employing slot attention mechanisms to mine and integrate relevant visual features with linguistic semantics. Furthermore, we introduce two query optimization losses: (1) the Matching Optimization Loss constrains the best-matching queries to be consistent between the frame level and the video level, effectively preventing the tracked target's queries from drifting along the temporal dimension during inference; (2) the Vision-Language Semantic Alignment Loss performs word-by-word matching between the object queries and the expression, aligning the multi-modal joint space and enhancing the framework's understanding of the textual description. We conducted extensive experiments on the RVOS task, achieving new state-of-the-art results across all benchmarks, demonstrating the effectiveness of P^2M.
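
For illustration, the sketch below shows a minimal PyTorch implementation of slot attention (Locatello et al., 2020), the general mechanism the abstract says VLSF employs to distill visual features into object queries. The module layout, the learned slot initialization, and all hyperparameters are assumptions made for this sketch; the paper's exact VLSF design is not specified in this record.

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Minimal slot attention (Locatello et al., 2020).

    Illustrative stand-in for how VLSF might distill fused vision-language
    tokens into object queries; not the paper's exact design.
    """

    def __init__(self, num_slots: int, dim: int, iters: int = 3):
        super().__init__()
        self.iters = iters
        self.scale = dim ** -0.5
        # Simplification: learned slot initializations instead of the
        # original paper's Gaussian-sampled slots.
        self.slots_init = nn.Parameter(torch.randn(1, num_slots, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, D) fused vision-language tokens
        B, _, D = feats.shape
        feats = self.norm_in(feats)
        k, v = self.to_k(feats), self.to_v(feats)
        slots = self.slots_init.expand(B, -1, -1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))               # (B, K, D)
            logits = torch.einsum('bnd,bkd->bnk', k, q) * self.scale
            attn = logits.softmax(dim=-1) + 1e-8                # slots compete per token
            attn = attn / attn.sum(dim=1, keepdim=True)         # weighted mean over tokens
            updates = torch.einsum('bnk,bnd->bkd', attn, v)     # (B, K, D)
            slots = self.gru(updates.reshape(-1, D),
                             slots.reshape(-1, D)).view(B, -1, D)
        return slots  # (B, K, D) object queries for the segmentation head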
DOI: 10.1109/TMM.2025.3618539