CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation
Weakly supervised semantic segmentation (WSSS) with image-level labels is a challenging task. Mainstream approaches follow a multi-stage framework and suffer from high training costs. In this paper, we explore the potential of Contrastive Language-Image Pre-training models (CLIP) to localize differe...
| Published in: | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 15305-15314 |
|---|---|
| Main authors: | Lin, Yuqi; Chen, Minghao; Wang, Wenxiao; Wu, Boxi; Li, Ke; Lin, Binbin; Liu, Haifeng; He, Xiaofei |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 2023-01-01 |
| Topics: | |
| ISSN: | 1063-6919 |
| Online access: | Get full text |
| Abstract | Weakly supervised semantic segmentation (WSSS) with image-level labels is a challenging task. Mainstream approaches follow a multi-stage framework and suffer from high training costs. In this paper, we explore the potential of Contrastive Language-Image Pre-training models (CLIP) to localize different categories with only image-level labels and without further training. To efficiently generate high-quality segmentation masks from CLIP, we propose a novel WSSS framework called CLIP-ES. Our framework improves all three stages of WSSS with special designs for CLIP: 1) We introduce the softmax function into GradCAM and exploit the zero-shot ability of CLIP to suppress the confusion caused by non-target classes and backgrounds. Meanwhile, to take full advantage of CLIP, we re-explore text inputs under the WSSS setting and customize two text-driven strategies: sharpness-based prompt selection and synonym fusion. 2) To simplify the stage of CAM refinement, we propose a real-time class-aware attention-based affinity (CAA) module based on the inherent multi-head self-attention (MHSA) in CLIP-ViTs. 3) When training the final segmentation model with the masks generated by CLIP, we introduce a confidence-guided loss (CGL) that focuses on confident regions. Our CLIP-ES achieves SOTA performance on Pascal VOC 2012 and MS COCO 2014 while taking only 10% of the time of previous methods for pseudo mask generation. Code is available at https://github.com/linyq2117/CLIP-ES. |
|---|---|
| Author | Lin, Yuqi; Wang, Wenxiao; Wu, Boxi; Li, Ke; Lin, Binbin; He, Xiaofei; Chen, Minghao; Liu, Haifeng |
| Author_xml | – sequence: 1 givenname: Yuqi surname: Lin fullname: Lin, Yuqi email: linyq5566@gmail.com organization: College of Computer Science, Zhejiang University,State Key Lab of CAD&CG – sequence: 2 givenname: Minghao surname: Chen fullname: Chen, Minghao email: minghaochen01@gmail.com organization: College of Computer Science, Zhejiang University,State Key Lab of CAD&CG – sequence: 3 givenname: Wenxiao surname: Wang fullname: Wang, Wenxiao organization: School of Software Technology, Zhejiang University – sequence: 4 givenname: Boxi surname: Wu fullname: Wu, Boxi organization: School of Software Technology, Zhejiang University – sequence: 5 givenname: Ke surname: Li fullname: Li, Ke organization: Fullong Technology – sequence: 6 givenname: Binbin surname: Lin fullname: Lin, Binbin organization: School of Software Technology, Zhejiang University – sequence: 7 givenname: Haifeng surname: Liu fullname: Liu, Haifeng organization: College of Computer Science, Zhejiang University,State Key Lab of CAD&CG – sequence: 8 givenname: Xiaofei surname: He fullname: He, Xiaofei organization: College of Computer Science, Zhejiang University,State Key Lab of CAD&CG |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DBID | 6IE 6IH CBEJK RIE RIO |
| DOI | 10.1109/CVPR52729.2023.01469 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Proceedings Order Plan (POP) 1998-present by volume; IEEE Xplore All Conference Proceedings; IEEE Electronic Library (IEL); IEEE Proceedings Order Plans (POP) 1998-present |
| DatabaseTitleList | |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE Xplore url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Applied Sciences |
| EISBN | 9798350301298 |
| EISSN | 1063-6919 |
| EndPage | 15314 |
| ExternalDocumentID | 10203854 |
| Genre | orig-research |
| GroupedDBID | 6IE 6IH 6IL 6IN AAWTH ABLEC ADZIZ ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IEGSK IJVOP OCL RIE RIL RIO |
| ID | FETCH-LOGICAL-i270t-4abf44ffc439aad51afa420a74a76b5707f6fe9309913c0eab35acbc62b4fd9d3 |
| IEDL.DBID | RIE |
| ISICitedReferencesCount | 137 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001062522107060&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| IngestDate | Wed Aug 27 02:56:33 EDT 2025 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-i270t-4abf44ffc439aad51afa420a74a76b5707f6fe9309913c0eab35acbc62b4fd9d3 |
| PageCount | 10 |
| ParticipantIDs | ieee_primary_10203854 |
| PublicationCentury | 2000 |
| PublicationDate | 2023-01-01 |
| PublicationDateYYYYMMDD | 2023-01-01 |
| PublicationDate_xml | – month: 01 year: 2023 text: 2023-01-01 day: 01 |
| PublicationDecade | 2020 |
| PublicationTitle | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) |
| PublicationTitleAbbrev | CVPR |
| PublicationYear | 2023 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| SSID | ssj0003211698 |
| Score | 2.6304424 |
| Snippet | Weakly supervised semantic segmentation (WSSS) with image-level labels is a challenging task. Mainstream approaches follow a multi-stage framework and suffer... |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 15305 |
| SubjectTerms | Codes; Computer vision; Costs; grouping and shape analysis; Pattern recognition; Real-time systems; Segmentation; Semantic segmentation; Training |
| Title | CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation |
| URI | https://ieeexplore.ieee.org/document/10203854 |
| WOSCitedRecordID | wos001062522107060&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| link | http://cvtisr.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwlV07T8MwELZoxcDEq4i3PLCm5OHYMVtVWjFVFS3QrfLjXEWUtEpaJP49tpsWMTCwRRku0p3u_N0533cI3UHGMmMoCSKIZUCyVNmcEyzQqaBRxKSFJMYvm2CDQTaZ8GFNVvdcGADwP59B2z36u3y9UGs3KrMZHruLLNJADcbohqy1G6gktpWhPKvpcVHI77uvw-c0tuix7XaEt51MCv-1RMWfIf3Df379CLV-2Hh4uDtnjtEeFCfosIaPuE7O6hTNurZLx3mFO_NqgUWBe14ewhrFI5h58c3yAXfw2DW7j6Urc7hTS4pji13xG4j3-RcerZeugFTOOHxYz-dqa8CHsYVe-r1x9ymo9ygEeczCVUCENIQYoyz4EEKnkTCCxKFgRDAqUxYyQw3wxILFKFEhCJmkQklFY0mM5jo5Q81iUcA5woanAqhmmbJATMkkC4mm2jrA2rSVw1yglnPcdLmRyphufXb5x_srdOBis5lpXKPmqlzDDdpXn6u8Km99gL8BamSoOw |
| linkProvider | IEEE |
| linkToHtml | http://cvtisr.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwlV1LT8JAEN4omugJHxjf7sFrsY_t7tYbQQxGJERQuZF9kkYspAUT_727S8F48OCt6WGazGRmv5nt9w0A14oSqjVGXqBC7iEaC5NzjHgyZjgICDeQRLtlE6TbpcNh0ivJ6o4Lo5RyP5-pun10d_lyKhZ2VGYyPLQXWWgTbNnVWSVdaz1SiUwzgxNaEuQCP7lpvvae49Dgx7rdEl63QinJrzUq7hS5r_7z-3ug9sPHg731SbMPNlR2AKolgIRlehaHYNw0fTpMC9iYFFPIMthyAhHGKOyrsZPfzG9hAw5su3uX20IHG6WoODToFb4p9j75gv3FzJaQwhpXH8b3qVgZcIGsgZf71qDZ9spNCl4aEn_uIcY1QloLAz8Yk3HANEOhzwhiBPOY-ERjrZLIwMUgEr5iPIqZ4AKHHGmZyOgIVLJppo4B1EnMFJaECgPFBI-ojySWxgHGpqkd-gTUrONGs6VYxmjls9M_3l-BnfbgqTPqPHQfz8CujdNywnEOKvN8oS7Atvicp0V-6YL9DVrnq4Q |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=Proceedings+%28IEEE+Computer+Society+Conference+on+Computer+Vision+and+Pattern+Recognition.+Online%29&rft.atitle=CLIP+is+Also+an+Efficient+Segmenter%3A+A+Text-Driven+Approach+for+Weakly+Supervised+Semantic+Segmentation&rft.au=Lin%2C+Yuqi&rft.au=Chen%2C+Minghao&rft.au=Wang%2C+Wenxiao&rft.au=Wu%2C+Boxi&rft.date=2023-01-01&rft.pub=IEEE&rft.eissn=1063-6919&rft.spage=15305&rft.epage=15314&rft_id=info:doi/10.1109%2FCVPR52729.2023.01469&rft.externalDocID=10203854 |
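
The abstract states only that the confidence-guided loss (CGL) focuses on confident regions of the CLIP-generated masks; it gives no formula. One plausible reading, shown here as a minimal sketch and not as the authors' implementation, is a per-pixel cross-entropy averaged only over pixels whose pseudo-label confidence reaches a cutoff. The function name `confidence_guided_loss`, the `confidence` map, and the `threshold` value are all assumptions made for illustration.

```python
import numpy as np


def confidence_guided_loss(logits, pseudo_labels, confidence, threshold=0.5):
    """Cross-entropy averaged over confident pixels only (illustrative sketch).

    logits:        (H, W, C) float array of segmentation scores.
    pseudo_labels: (H, W) integer array of class indices from the pseudo mask.
    confidence:    (H, W) float array in [0, 1]; hypothetical per-pixel confidence.
    threshold:     hypothetical cutoff; pixels below it are ignored.
    """
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # Log-probability assigned to each pixel's pseudo label.
    picked = np.take_along_axis(log_probs, pseudo_labels[..., None], axis=-1)[..., 0]

    # Keep only pixels considered confident.
    mask = confidence >= threshold
    if not mask.any():
        return 0.0  # no confident pixel contributes to the loss
    return float(-picked[mask].mean())
```

With this formulation, a low-confidence pixel whose pseudo label is wrong contributes nothing to the loss, which is the behaviour the abstract's wording suggests. The repository linked in the abstract is the place to check how the paper actually defines CGL.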