CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation


Bibliographic details
Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 15305-15314
Main authors: Lin, Yuqi, Chen, Minghao, Wang, Wenxiao, Wu, Boxi, Li, Ke, Lin, Binbin, Liu, Haifeng, He, Xiaofei
Format: Conference paper
Language: English
Publication details: IEEE, 01.01.2023
Subject:
ISSN: 1063-6919
Abstract Weakly supervised semantic segmentation (WSSS) with image-level labels is a challenging task. Mainstream approaches follow a multi-stage framework and suffer from high training costs. In this paper, we explore the potential of Contrastive Language-Image Pre-training models (CLIP) to localize different categories with only image-level labels and without further training. To efficiently generate high-quality segmentation masks from CLIP, we propose a novel WSSS framework called CLIP-ES. Our framework improves all three stages of WSSS with special designs for CLIP: 1) We introduce the softmax function into GradCAM and exploit the zero-shot ability of CLIP to suppress the confusion caused by non-target classes and backgrounds. Meanwhile, to take full advantage of CLIP, we re-explore text inputs under the WSSS setting and customize two text-driven strategies: sharpness-based prompt selection and synonym fusion. 2) To simplify the stage of CAM refinement, we propose a real-time class-aware attention-based affinity (CAA) module based on the inherent multi-head self-attention (MHSA) in CLIP-ViTs. 3) When training the final segmentation model with the masks generated by CLIP, we introduce a confidence-guided loss (CGL) that focuses on confident regions. Our CLIP-ES achieves SOTA performance on Pascal VOC 2012 and MS COCO 2014 while taking only 10% of the time of previous methods for pseudo mask generation. Code is available at https://github.com/linyq2117/CLIP-ES.
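As an illustration of step 1 in the abstract, the following is a minimal sketch of why a softmax score can suppress non-target classes in a gradient-based map such as GradCAM. It is not taken from the CLIP-ES code; the function names `softmax` and `softmax_score_grad` and the example logits are assumptions made for illustration. The only fact it relies on is the standard softmax derivative: for a score s_c = softmax(y)_c, the gradient with respect to logit y_k is s_c(1 - s_c) when k = c and -s_c s_k otherwise.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_score_grad(logits, target):
    """Gradient of softmax(logits)[target] with respect to every logit.

    The entry for the target class is positive; every other entry is
    negative, so evidence for competing classes lowers the target score.
    """
    probs = softmax(logits)
    return [
        probs[target] * ((1.0 if k == target else 0.0) - probs[k])
        for k in range(len(logits))
    ]

# Hypothetical logits for three classes; class 0 is the target.
logits = [2.0, 1.0, 0.5]
grad = softmax_score_grad(logits, target=0)
```

With a raw logit as the score, the gradient with respect to the other class logits would be zero, so they would not influence the map at all. With the softmax score, those gradients are negative, which is one plausible reading of how confusion from non-target classes and backgrounds is suppressed.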
Author Lin, Yuqi
Wang, Wenxiao
Wu, Boxi
Li, Ke
Lin, Binbin
He, Xiaofei
Chen, Minghao
Liu, Haifeng
Author_xml – sequence: 1
  givenname: Yuqi
  surname: Lin
  fullname: Lin, Yuqi
  email: linyq5566@gmail.com
  organization: College of Computer Science, Zhejiang University,State Key Lab of CAD&CG
– sequence: 2
  givenname: Minghao
  surname: Chen
  fullname: Chen, Minghao
  email: minghaochen01@gmail.com
  organization: College of Computer Science, Zhejiang University,State Key Lab of CAD&CG
– sequence: 3
  givenname: Wenxiao
  surname: Wang
  fullname: Wang, Wenxiao
  organization: School of Software Technology, Zhejiang University
– sequence: 4
  givenname: Boxi
  surname: Wu
  fullname: Wu, Boxi
  organization: School of Software Technology, Zhejiang University
– sequence: 5
  givenname: Ke
  surname: Li
  fullname: Li, Ke
  organization: Fullong Technology
– sequence: 6
  givenname: Binbin
  surname: Lin
  fullname: Lin, Binbin
  organization: School of Software Technology, Zhejiang University
– sequence: 7
  givenname: Haifeng
  surname: Liu
  fullname: Liu, Haifeng
  organization: College of Computer Science, Zhejiang University,State Key Lab of CAD&CG
– sequence: 8
  givenname: Xiaofei
  surname: He
  fullname: He, Xiaofei
  organization: College of Computer Science, Zhejiang University,State Key Lab of CAD&CG
CODEN IEEPAD
ContentType Conference Proceeding
DBID 6IE
6IH
CBEJK
RIE
RIO
DOI 10.1109/CVPR52729.2023.01469
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Applied Sciences
EISBN 9798350301298
EISSN 1063-6919
EndPage 15314
ExternalDocumentID 10203854
Genre orig-research
GroupedDBID 6IE
6IH
6IL
6IN
AAWTH
ABLEC
ADZIZ
ALMA_UNASSIGNED_HOLDINGS
BEFXN
BFFAM
BGNUA
BKEBE
BPEOZ
CBEJK
CHZPO
IEGSK
IJVOP
OCL
RIE
RIL
RIO
IEDL.DBID RIE
ISICitedReferencesCount 137
IngestDate Wed Aug 27 02:56:33 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
PageCount 10
ParticipantIDs ieee_primary_10203854
PublicationCentury 2000
PublicationDate 2023-01-01
PublicationDateYYYYMMDD 2023-01-01
PublicationDate_xml – month: 01
  year: 2023
  text: 2023-01-01
  day: 01
PublicationDecade 2020
PublicationTitle Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online)
PublicationTitleAbbrev CVPR
PublicationYear 2023
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0003211698
Score 2.6303906
SourceID ieee
SourceType Publisher
StartPage 15305
SubjectTerms Codes
Computer vision
Costs
grouping and shape analysis
Pattern recognition
Real-time systems
Segmentation
Semantic segmentation
Training
Title CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation
URI https://ieeexplore.ieee.org/document/10203854
hasFullText 1
inHoldings 1
isFullTextHit
isPrint