LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation
| Published in: | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 14001-14010 |
|---|---|
| Main authors: | Yuan, Linfeng; Shi, Miaojing; Yue, Zijie; Chen, Qijun |
| Format: | Conference paper |
| Language: | English |
| Publication details: | IEEE, 16 June 2024 |
| ISSN: | 1063-6919 |
| Abstract | Referring video object segmentation (RVOS) aims to segment the target instance referred to by a given text expression in a video clip. The text expression normally contains a sophisticated description of the instance's appearance, action, and relation with others. It is therefore rather difficult for an RVOS model to capture all these attributes correspondingly in the video; in fact, the model often favours the action- and relation-related visual attributes of the instance. This can result in a partial or even incorrect mask prediction for the target instance. We tackle this problem by taking a subject-centric short text expression from the original long text expression. The short one retains only the appearance-related information of the target instance, so that we can use it to focus the model's attention on the instance's appearance. We let the model make joint predictions using both long and short text expressions, insert a long-short cross-attention module so that the joint features interact, and add a long-short predictions intersection loss to regulate the joint predictions. Besides the improvement on the linguistic part, we also introduce a forward-backward visual consistency loss, which utilizes optical flows to warp visual features between the annotated frames and their temporal neighbors for consistency. We build our method on top of two state-of-the-art pipelines. Extensive experiments on A2D-Sentences, Refer-YouTube-VOS, JHMDB-Sentences and Refer-DAVIS17 show impressive improvements of our method. Code is available here. |
|---|---|
| Author | Yue, Zijie; Chen, Qijun; Yuan, Linfeng; Shi, Miaojing |
| Author_xml | – sequence: 1 givenname: Linfeng surname: Yuan fullname: Yuan, Linfeng email: linfengyuan1997@gmail.com organization: College of Electronic and Information Engineering, Tongji University – sequence: 2 givenname: Miaojing surname: Shi fullname: Shi, Miaojing email: mshi@tongji.edu.cn organization: College of Electronic and Information Engineering, Tongji University – sequence: 3 givenname: Zijie surname: Yue fullname: Yue, Zijie email: zijie@tongji.edu.cn organization: College of Electronic and Information Engineering, Tongji University – sequence: 4 givenname: Qijun surname: Chen fullname: Chen, Qijun email: qjchen@tongji.edu.cn organization: College of Electronic and Information Engineering, Tongji University |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DBID | 6IE 6IH CBEJK RIE RIO |
| DOI | 10.1109/CVPR52733.2024.01328 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Proceedings Order Plan (POP) 1998-present by volume; IEEE Xplore All Conference Proceedings; IEEE Electronic Library (IEL); IEEE Proceedings Order Plans (POP) 1998-present |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Applied Sciences |
| EISBN | 9798350353006 |
| EISSN | 1063-6919 |
| EndPage | 14010 |
| ExternalDocumentID | 10655864 |
| Genre | orig-research |
| GroupedDBID | 6IE 6IH 6IL 6IN AAWTH ABLEC ADZIZ ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IEGSK IJVOP OCL RIE RIL RIO |
| ID | FETCH-LOGICAL-i176t-c94b8bf4ab5a5f14baf98ff92bf464e2327c723a01263e8d3ffc43c370dd95373 |
| IEDL.DBID | RIE |
| ISICitedReferencesCount | 5 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001342442405035&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| IngestDate | Wed Aug 27 02:00:48 EDT 2025 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-i176t-c94b8bf4ab5a5f14baf98ff92bf464e2327c723a01263e8d3ffc43c370dd95373 |
| PageCount | 10 |
| ParticipantIDs | ieee_primary_10655864 |
| PublicationCentury | 2000 |
| PublicationDate | 2024-June-16 |
| PublicationDateYYYYMMDD | 2024-06-16 |
| PublicationDate_xml | – month: 06 year: 2024 text: 2024-June-16 day: 16 |
| PublicationDecade | 2020 |
| PublicationTitle | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) |
| PublicationTitleAbbrev | CVPR |
| PublicationYear | 2024 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| SSID | ssj0003211698 |
| Score | 2.403033 |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 14001 |
| SubjectTerms | Codes; Computer vision; Object segmentation; Optical losses; Pipelines; Predictive models; Visualization |
| Title | LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation |
| URI | https://ieeexplore.ieee.org/document/10655864 |
| WOSCitedRecordID | wos001342442405035&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| hasFullText | 1 |
| inHoldings | 1 |
| linkProvider | IEEE |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=Proceedings+%28IEEE+Computer+Society+Conference+on+Computer+Vision+and+Pattern+Recognition.+Online%29&rft.atitle=LoSh%3A+Long-Short+Text+Joint+Prediction+Network+for+Referring+Video+Object+Segmentation&rft.au=Yuan%2C+Linfeng&rft.au=Shi%2C+Miaojing&rft.au=Yue%2C+Zijie&rft.au=Chen%2C+Qijun&rft.date=2024-06-16&rft.pub=IEEE&rft.eissn=1063-6919&rft.spage=14001&rft.epage=14010&rft_id=info:doi/10.1109%2FCVPR52733.2024.01328&rft.externalDocID=10655864 |