A Simple Framework for Text-Supervised Semantic Segmentation

Text-supervised semantic segmentation is a novel research topic that allows semantic segments to emerge with image-text contrasting. However, pioneering methods could be subject to specifically designed network architectures. This paper shows that a vanilla contrastive language-image pretraining (CL...

Bibliographic Details
Published in:	Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 7071-7080
Main Authors:	Yi, Muyang; Cui, Quan; Wu, Hao; Yang, Cheng; Yoshie, Osamu; Lu, Hongtao
Format:	Conference paper
Language:	English, Japanese
Published:	IEEE, 2023-06-01
Subjects:	Codes; Computer vision; Location awareness; Network architecture; Semantic segmentation; Semantics; Vision, language, and reasoning; Visualization
ISSN:	1063-6919

Abstract	Text-supervised semantic segmentation is a novel research topic that allows semantic segments to emerge with image-text contrasting. However, pioneering methods could be subject to specifically designed network architectures. This paper shows that a vanilla contrastive language-image pretraining (CLIP) model is an effective text-supervised semantic segmentor by itself. First, we reveal that a vanilla CLIP is inferior to localization and segmentation due to its optimization being driven by densely aligning visual and language representations. Second, we propose the locality-driven alignment (LoDA) to address the problem, where CLIP optimization is driven by sparsely aligning local representations. Third, we propose a simple segmentation (SimSeg) framework. LoDA and SimSeg jointly ameliorate a vanilla CLIP to produce impressive semantic segmentation results. Our method outperforms previous state-of-the-art methods on PASCAL VOC 2012, PASCAL Context and COCO datasets by large margins. Code and models are available at github.com/muyangyi/SimSeg.
Author	Yang, Cheng; Yoshie, Osamu; Lu, Hongtao; Cui, Quan; Wu, Hao; Yi, Muyang
Author_xml	– sequence: 1 givenname: Muyang surname: Yi fullname: Yi, Muyang organization: AI Institute, Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, Department of Computer Science and Engineering – sequence: 2 givenname: Quan surname: Cui fullname: Cui, Quan organization: Waseda University – sequence: 3 givenname: Hao surname: Wu fullname: Wu, Hao organization: ByteDance Inc – sequence: 4 givenname: Cheng surname: Yang fullname: Yang, Cheng organization: ByteDance Inc – sequence: 5 givenname: Osamu surname: Yoshie fullname: Yoshie, Osamu organization: Waseda University – sequence: 6 givenname: Hongtao surname: Lu fullname: Lu, Hongtao organization: AI Institute, Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, Department of Computer Science and Engineering
CODEN	IEEPAD
ContentType	Conference Proceeding
DBID	6IE 6IH CBEJK RIE RIO
DOI	10.1109/CVPR52729.2023.00683
DatabaseName	IEEE Electronic Library (IEL) Conference Proceedings IEEE Proceedings Order Plan (POP) 1998-present by volume IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP) 1998-present
DatabaseTitleList
Database_xml	– sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher
DeliveryMethod	fulltext_linktorsrc
Discipline	Applied Sciences
EISBN	9798350301298
EISSN	1063-6919
EndPage	7080
ExternalDocumentID	10203335
Genre	orig-research
GrantInformation_xml	– fundername: NSFC grantid: 62176155 funderid: 10.13039/501100001807
GroupedDBID	6IE 6IH 6IL 6IN AAWTH ABLEC ADZIZ ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IEGSK IJVOP OCL RIE RIL RIO
ID	FETCH-LOGICAL-i299t-93e7d21e5c4f8a319c020c9542537421f5d4497fdf974b6a7821ef54cf469e4b3
IEDL.DBID	RIE
ISICitedReferencesCount	29
IngestDate	Wed Aug 27 02:56:33 EDT 2025
IsPeerReviewed	false
IsScholarly	true
Language	English Japanese
LinkModel	DirectLink
MergedId	FETCHMERGED-LOGICAL-i299t-93e7d21e5c4f8a319c020c9542537421f5d4497fdf974b6a7821ef54cf469e4b3
PageCount	10
ParticipantIDs	ieee_primary_10203335
PublicationCentury	2000
PublicationDate	2023-06-01
PublicationDateYYYYMMDD	2023-06-01
PublicationDate_xml	– month: 06 year: 2023 text: 2023-06-01 day: 01
PublicationDecade	2020
PublicationTitle	Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online)
PublicationTitleAbbrev	CVPR
PublicationYear	2023
Publisher	IEEE
Publisher_xml	– name: IEEE
SSID	ssj0003211698
Score	2.3920157
Snippet	Text-supervised semantic segmentation is a novel research topic that allows semantic segments to emerge with image-text contrasting. However, pioneering...
SourceID	ieee
SourceType	Publisher
StartPage	7071
SubjectTerms	Codes; Computer vision; Location awareness; Network architecture; Semantic segmentation; Semantics; Vision, language, and reasoning; Visualization
Title	A Simple Framework for Text-Supervised Semantic Segmentation
URI	https://ieeexplore.ieee.org/document/10203335
hasFullText	1
inHoldings	1
isFullTextHit
isPrint
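
Illustrative note on the abstract: the sketch below is not the authors' code (that is at github.com/muyangyi/SimSeg). It only shows, under stated assumptions, the kind of symmetric image-text contrastive objective that CLIP-style pretraining uses, together with two ways of pooling token features before alignment. Mean pooling stands in for "dense" alignment and per-channel max pooling stands in for "sparse, local" alignment; that mapping, the function names, the temperature of 0.07, and the toy numbers are all assumptions made for illustration and may differ from LoDA as published.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def mean_pool(tokens):
    """Dense pooling (assumed stand-in): every token contributes to each channel."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def max_pool(tokens):
    """Sparse pooling (assumed stand-in): each channel keeps only its strongest token."""
    return [max(col) for col in zip(*tokens)]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss; image i and text i form the positive pair."""
    img = [l2_normalize(v) for v in image_feats]
    txt = [l2_normalize(v) for v in text_feats]
    # Cosine similarities scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
              for i in img]
    n = len(logits)
    # Image-to-text direction: softmax over each row.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: softmax over each column.
    cols = [list(c) for c in zip(*logits)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy batch (hypothetical values): 2 images x 3 patch tokens x 2 channels,
# and 2 captions x 2 word tokens x 2 channels.
patches = [[[0.9, 0.0], [0.1, 0.1], [0.0, 0.2]],
           [[0.0, 0.8], [0.1, 0.0], [0.2, 0.1]]]
words = [[[1.0, 0.0], [0.2, 0.1]],
         [[0.0, 1.0], [0.1, 0.3]]]

loss_dense = contrastive_loss([mean_pool(p) for p in patches],
                              [mean_pool(w) for w in words])
loss_sparse = contrastive_loss([max_pool(p) for p in patches],
                               [max_pool(w) for w in words])
```

In this toy setup the only difference between the two losses is the pooling step, which makes it easy to swap in another pooling rule and compare. Real models would compute the token features with image and text encoders and train on large batches.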