A Simple Framework for Text-Supervised Semantic Segmentation

Detailed bibliography
Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 7071-7080
Main authors: Yi, Muyang; Cui, Quan; Wu, Hao; Yang, Cheng; Yoshie, Osamu; Lu, Hongtao
Format: Conference paper
Language: English, Japanese
Publication details: IEEE, 2023-06-01
Subjects:
ISSN: 1063-6919
Abstract Text-supervised semantic segmentation is a novel research topic that allows semantic segments to emerge with image-text contrasting. However, pioneering methods could be subject to specifically designed network architectures. This paper shows that a vanilla contrastive language-image pretraining (CLIP) model is an effective text-supervised semantic segmentor by itself. First, we reveal that a vanilla CLIP is inferior to localization and segmentation due to its optimization being driven by densely aligning visual and language representations. Second, we propose the locality-driven alignment (LoDA) to address the problem, where CLIP optimization is driven by sparsely aligning local representations. Third, we propose a simple segmentation (SimSeg) framework. LoDA and SimSeg jointly ameliorate a vanilla CLIP to produce impressive semantic segmentation results. Our method outperforms previous state-of-the-art methods on PASCAL VOC 2012, PASCAL Context and COCO datasets by large margins. Code and models are available at github.com/muyangyi/SimSeg.
Author Yang, Cheng
Yoshie, Osamu
Lu, Hongtao
Cui, Quan
Wu, Hao
Yi, Muyang
Author_xml – sequence: 1
  givenname: Muyang
  surname: Yi
  fullname: Yi, Muyang
  organization: AI Institute, Shanghai Jiao Tong University,MoE Key Lab of Artificial Intelligence,Department of Computer Science and Engineering
– sequence: 2
  givenname: Quan
  surname: Cui
  fullname: Cui, Quan
  organization: Waseda University
– sequence: 3
  givenname: Hao
  surname: Wu
  fullname: Wu, Hao
  organization: ByteDance Inc
– sequence: 4
  givenname: Cheng
  surname: Yang
  fullname: Yang, Cheng
  organization: ByteDance Inc
– sequence: 5
  givenname: Osamu
  surname: Yoshie
  fullname: Yoshie, Osamu
  organization: Waseda University
– sequence: 6
  givenname: Hongtao
  surname: Lu
  fullname: Lu, Hongtao
  organization: AI Institute, Shanghai Jiao Tong University,MoE Key Lab of Artificial Intelligence,Department of Computer Science and Engineering
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/CVPR52729.2023.00683
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Applied Sciences
EISBN 9798350301298
EISSN 1063-6919
EndPage 7080
ExternalDocumentID 10203335
Genre orig-research
GrantInformation_xml – fundername: NSFC
  grantid: 62176155
  funderid: 10.13039/501100001807
IEDL.DBID RIE
ISICitedReferencesCount 29
IngestDate Wed Aug 27 02:56:33 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
Japanese
LinkModel DirectLink
PageCount 10
ParticipantIDs ieee_primary_10203335
PublicationCentury 2000
PublicationDate 2023-06-01
PublicationDateYYYYMMDD 2023-06-01
PublicationDate_xml – month: 06
  year: 2023
  text: 2023-06-01
  day: 01
PublicationDecade 2020
PublicationTitle Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online)
PublicationTitleAbbrev CVPR
PublicationYear 2023
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 7071
SubjectTerms Codes
Computer vision
Location awareness
Network architecture
Semantic segmentation
Semantics
Vision, language, and reasoning
Visualization
Title A Simple Framework for Text-Supervised Semantic Segmentation
URI https://ieeexplore.ieee.org/document/10203335