LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation

Referring video object segmentation (RVOS) aims to segment the target instance referred by a given text expression in a video clip. The text expression normally contains sophisticated description of the instance's appearance, action, and relation with others. It is therefore rather difficult f...

Bibliographic details
Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 14001-14010
Main authors: Yuan, Linfeng; Shi, Miaojing; Yue, Zijie; Chen, Qijun
Format: Conference paper
Language: English
Published: IEEE, 16 June 2024
ISSN: 1063-6919
Online access: Get full text
Abstract Referring video object segmentation (RVOS) aims to segment the target instance referred to by a given text expression in a video clip. The text expression normally contains a sophisticated description of the instance's appearance, action, and relation with others. It is therefore rather difficult for an RVOS model to capture all these attributes correspondingly in the video; in fact, the model often favours the action- and relation-related visual attributes of the instance. This can end up with a partial or even incorrect mask prediction of the target instance. We tackle this problem by taking a subject-centric short text expression from the original long text expression. The short one retains only the appearance-related information of the target instance so that we can use it to focus the model's attention on the instance's appearance. We let the model make joint predictions using both long and short text expressions, and insert a long-short cross-attention module to let the joint features interact and a long-short predictions intersection loss to regulate the joint predictions. Besides the improvement on the linguistic part, we also introduce a forward-backward visual consistency loss, which utilizes optical flows to warp visual features between the annotated frames and their temporal neighbors for consistency. We build our method on top of two state-of-the-art pipelines. Extensive experiments on A2D-Sentences, Refer-YouTube-VOS, JHMDB-Sentences and Refer-DAVIS17 show impressive improvements of our method. Code is available.
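The abstract names a long-short predictions intersection loss but does not give its formula. As a rough illustration only, the sketch below assumes one plausible reading: the mask predicted from the long expression should fall inside the mask predicted from the short expression, so the loss penalizes long-text mask area that the short-text mask does not cover. The function name `soft_intersection_loss`, the flat-list mask layout, and the `eps` smoothing term are all hypothetical and are not taken from the paper.

```python
def soft_intersection_loss(long_probs, short_probs, eps=1e-6):
    """Illustrative (assumed) loss: 1 - |long AND short| / |long|.

    long_probs, short_probs: flat lists of per-pixel foreground
    probabilities in [0, 1], predicted from the long and the short
    text expression respectively. Not the paper's exact formulation.
    """
    if len(long_probs) != len(short_probs):
        raise ValueError("masks must have the same number of pixels")
    # Soft intersection: product of the two probabilities per pixel.
    intersection = sum(l * s for l, s in zip(long_probs, short_probs))
    # Soft area of the long-text prediction.
    long_area = sum(long_probs)
    # Close to 0 when the long mask lies inside the short mask,
    # close to 1 when the two masks do not overlap at all.
    return 1.0 - (intersection + eps) / (long_area + eps)


# Long mask fully covered by the short mask: loss is near zero.
contained = soft_intersection_loss([1, 1, 0, 0], [1, 1, 1, 0])
# Disjoint masks: loss is near one.
disjoint = soft_intersection_loss([1, 1, 0, 0], [0, 0, 1, 1])
```

Under this assumed form the loss is asymmetric on purpose: a short expression that keeps only appearance information may match several instances, so extra area in the short-text mask is not penalized.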
Author_xml – sequence: 1
  givenname: Linfeng
  surname: Yuan
  fullname: Yuan, Linfeng
  email: linfengyuan1997@gmail.com
  organization: College of Electronic and Information Engineering, Tongji University
– sequence: 2
  givenname: Miaojing
  surname: Shi
  fullname: Shi, Miaojing
  email: mshi@tongji.edu.cn
  organization: College of Electronic and Information Engineering, Tongji University
– sequence: 3
  givenname: Zijie
  surname: Yue
  fullname: Yue, Zijie
  email: zijie@tongji.edu.cn
  organization: College of Electronic and Information Engineering, Tongji University
– sequence: 4
  givenname: Qijun
  surname: Chen
  fullname: Chen, Qijun
  email: qjchen@tongji.edu.cn
  organization: College of Electronic and Information Engineering, Tongji University
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/CVPR52733.2024.01328
Discipline Applied Sciences
EISBN 9798350353006
EISSN 1063-6919
EndPage 14010
ExternalDocumentID 10655864
Genre orig-research
ISICitedReferencesCount 5
IsPeerReviewed false
IsScholarly true
Language English
PageCount 10
PublicationCentury 2000
PublicationDate 2024-June-16
PublicationDateYYYYMMDD 2024-06-16
PublicationDate_xml – month: 06
  year: 2024
  text: 2024-June-16
  day: 16
PublicationDecade 2020
PublicationTitle Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online)
PublicationTitleAbbrev CVPR
PublicationYear 2024
Publisher IEEE
Publisher_xml – name: IEEE
StartPage 14001
SubjectTerms Codes
Computer vision
Object segmentation
Optical losses
Pipelines
Predictive models
Visualization
Title LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation
URI https://ieeexplore.ieee.org/document/10655864