Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP
Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training. Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, t...
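
The two-stage recipe summarized above (class-agnostic mask proposals, then classification of each masked region against text) can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_image_encoder` and the identity-matrix text embeddings are hypothetical stand-ins for CLIP's image and text encoders, the masks are hand-made rather than produced by a proposal network, and the `temperature` value is an arbitrary assumption.

```python
import numpy as np

def mask_image(image, mask):
    """Zero out every pixel outside the binary mask, leaving a 'blank' background."""
    return image * mask[..., None]

def l2_normalize(x, eps=1e-8):
    """Scale vectors to unit length along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def toy_image_encoder(image):
    # Hypothetical stand-in for a CLIP image encoder: the mean RGB colour.
    return image.reshape(-1, 3).mean(axis=0)

def classify_masked_regions(image, masks, text_embeddings, image_encoder,
                            temperature=0.01):
    """Stage two: score each masked region against every text embedding."""
    text = l2_normalize(np.asarray(text_embeddings, dtype=np.float64))
    labels, probs = [], []
    for mask in masks:
        region = mask_image(image, mask.astype(image.dtype))
        embedding = l2_normalize(image_encoder(region))
        logits = text @ embedding / temperature   # scaled cosine similarity
        logits = logits - logits.max()            # numerical stability
        p = np.exp(logits) / np.exp(logits).sum() # softmax over class names
        labels.append(int(p.argmax()))
        probs.append(p)
    return labels, np.stack(probs)

# Toy image: left half pure red, right half pure green.
image = np.zeros((4, 4, 3))
image[:, :2, 0] = 1.0
image[:, 2:, 1] = 1.0

# Stage one is assumed to have produced these two mask proposals.
left = np.zeros((4, 4), dtype=bool)
left[:, :2] = True
right = ~left

# Hypothetical text embeddings for the prompts "red", "green", "blue".
text_embeddings = np.eye(3)

labels, probs = classify_masked_regions(
    image, [left, right], text_embeddings, toy_image_encoder
)
```

In a real pipeline the stand-in encoder would be replaced by a pre-trained vision-language model. The zeroed-out background produced by `mask_image` corresponds to the "blank" areas of masked images mentioned in the abstract.
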
| Published in: | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) pp. 7061-7070 |
|---|---|
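| Main authors: | Liang, Feng; Wu, Bichen; Dai, Xiaoliang; Li, Kunpeng; Zhao, Yinan; Zhang, Hang; Zhang, Peizhao; Vajda, Peter; Marculescu, Diana |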
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 1 June 2023 |
| Subjects: | |
| ISSN: | 1063-6919 |
| Online access: | Get full text |
| Abstract | Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training. Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions. We identify the performance bottleneck of this paradigm to be the pre-trained CLIP model, since it does not perform well on masked images. To address this, we propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions. We collect training data by mining an existing image-caption dataset (e.g., COCO Captions), using CLIP to match masked image regions to nouns in the image captions. Compared with the more precise and manually annotated segmentation labels with fixed classes (e.g., COCO-Stuff), we find our noisy but diverse dataset can better retain CLIP's generalization ability. Along with finetuning the entire model, we utilize the "blank" areas in masked images using a method we dub mask prompt tuning. Experiments demonstrate mask prompt tuning brings significant improvement without modifying any weights of CLIP, and it can further improve a fully finetuned model. In particular, when trained on COCO and evaluated on ADE20K-150, our best model achieves 29.6% mIoU, which is +8.5% higher than the previous state-of-the-art. For the first time, open-vocabulary generalist models match the performance of supervised specialist models in 2017 without dataset specific adaptations. |
|---|---|
| Author | Vajda, Peter; Liang, Feng; Zhao, Yinan; Zhang, Peizhao; Wu, Bichen; Zhang, Hang; Li, Kunpeng; Dai, Xiaoliang; Marculescu, Diana |
| Author_xml | – sequence: 1 givenname: Feng surname: Liang fullname: Liang, Feng email: jeffliang@utexas.edu organization: The University of Texas at Austin – sequence: 2 givenname: Bichen surname: Wu fullname: Wu, Bichen email: wbc@meta.com organization: Meta Reality Labs – sequence: 3 givenname: Xiaoliang surname: Dai fullname: Dai, Xiaoliang organization: Meta Reality Labs – sequence: 4 givenname: Kunpeng surname: Li fullname: Li, Kunpeng organization: Meta Reality Labs – sequence: 5 givenname: Yinan surname: Zhao fullname: Zhao, Yinan organization: Meta Reality Labs – sequence: 6 givenname: Hang surname: Zhang fullname: Zhang, Hang organization: Cruise – sequence: 7 givenname: Peizhao surname: Zhang fullname: Zhang, Peizhao email: stzpz@meta.com organization: Meta Reality Labs – sequence: 8 givenname: Peter surname: Vajda fullname: Vajda, Peter email: vajdap@meta.com organization: Meta Reality Labs – sequence: 9 givenname: Diana surname: Marculescu fullname: Marculescu, Diana email: dianam@utexas.edu organization: The University of Texas at Austin |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DBID | 6IE 6IH CBEJK RIE RIO |
| DOI | 10.1109/CVPR52729.2023.00682 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Proceedings Order Plan (POP) 1998-present by volume; IEEE Xplore All Conference Proceedings; IEEE Electronic Library (IEL); IEEE Proceedings Order Plans (POP) 1998-present |
| DatabaseTitleList | |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Applied Sciences |
| EISBN | 9798350301298 |
| EISSN | 1063-6919 |
| EndPage | 7070 |
| ExternalDocumentID | 10205125 |
| Genre | orig-research |
| GroupedDBID | 6IE 6IH 6IL 6IN AAWTH ABLEC ADZIZ ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IEGSK IJVOP OCL RIE RIL RIO |
| ID | FETCH-LOGICAL-i204t-abb0e847e4e821729b7644fa74676f70ac801ef16f2b2c143833f444001c36243 |
| IEDL.DBID | RIE |
| ISICitedReferencesCount | 253 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001058542607040&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| IngestDate | Wed Aug 27 02:56:33 EDT 2025 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-i204t-abb0e847e4e821729b7644fa74676f70ac801ef16f2b2c143833f444001c36243 |
| PageCount | 10 |
| ParticipantIDs | ieee_primary_10205125 |
| PublicationCentury | 2000 |
| PublicationDate | 2023-June |
| PublicationDateYYYYMMDD | 2023-06-01 |
| PublicationDate_xml | – month: 06 year: 2023 text: 2023-June |
| PublicationDecade | 2020 |
| PublicationTitle | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) |
| PublicationTitleAbbrev | CVPR |
| PublicationYear | 2023 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| SSID | ssj0003211698 |
| Score | 2.6809673 |
| Snippet | Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during... |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 7061 |
| SubjectTerms | Adaptation models; Computational modeling; Proposals; Semantic segmentation; Semantics; Training; Training data; Vision, language, and reasoning |
| Title | Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP |
| URI | https://ieeexplore.ieee.org/document/10205125 |
| WOSCitedRecordID | wos001058542607040&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| link | http://cvtisr.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwlV09b8IwED0V1KET_aDqtzx0DU0cB9tSN1TUoUWoH4gN2eZcoYqAIFTqv-85SWHq0M3KkEiXnN67y717ALculSpGKkt0sA4QVulIaZdELsHMUAHhZbVn9kkOBmo81sNarF5qYRCxHD7DTjiW__KnC7cJrTLKcE7fEM8a0JBSVmKtbUMlpVKmq1Utj0tifdcbDV8yTuyxEzzCwwRXuW5vZ6JSYki_9c-nH0J7p8Zjwy3OHMEe5sfQqukjq5NzfQL3YTgkGhE42TBb-s1ecU5xmzk6fMxrjVHOQueVPZv1Z2SmZkmEk_Woum_De__hrfcY1eYI0YzHooiMtTEStKBAFUymtJVEbbwJ9iFdL2PjCHvQJ13PLXfB5DxNvRCUsokj0BLpKTTzRY5nwJRRaIXU3GZTYk9Gm0Sgo9sj8SOl_Tm0QzQmy2r_xeQ3EBd_XL-EgxDwaqDqCprFaoPXsO--itl6dVO-tR8DYJXQ |
| linkProvider | IEEE |
| linkToHtml | http://cvtisr.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwlV1NT8JAEJ0omugJPzB-24PXxXa7pbuJNyLBCIQoEm5kd5kaYiiEgon_3tm2wsmDt00PbTLt5L2ZzpsHcG_DWPpIZYly1gHCSMWksgGzAUaaCogkLvbMduJeT45Gql-K1XMtDCLmw2dYd8f8X_5kbteuVUYZzukb4tEu7EVC8KCQa21aKiEVMw0lS4Fc4KuH5rD_GnHij3XnEu5muPKFe1sblRxFWtV_Pv8Ials9ntffIM0x7GB6AtWSQHpleman8OjGQ9iQ4Mm46dJv7w1nFLmppcPHrFQZpZ7rvXpdnX0yPdELopxek-r7Gry3ngbNNivtEdiU-2LFtDE-ErigQOlsppSJidwk2hmINJLY15bQB5OgkXDDrbM5D8NECErawBJsifAMKuk8xXPwpJZoRKy4iSbEn7TSgUBLt0diSFIlF1Bz0Rgvig0Y499AXP5x_Q4O2oNuZ9x57r1cwaELfjFedQ2V1XKNN7Bvv1bTbHmbv8EfAbSZFw |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=Proceedings+%28IEEE+Computer+Society+Conference+on+Computer+Vision+and+Pattern+Recognition.+Online%29&rft.atitle=Open-Vocabulary+Semantic+Segmentation+with+Mask-adapted+CLIP&rft.au=Liang%2C+Feng&rft.au=Wu%2C+Bichen&rft.au=Dai%2C+Xiaoliang&rft.au=Li%2C+Kunpeng&rft.date=2023-06-01&rft.pub=IEEE&rft.eissn=1063-6919&rft.spage=7061&rft.epage=7070&rft_id=info:doi/10.1109%2FCVPR52729.2023.00682&rft.externalDocID=10205125 |
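
The data-collection step mentioned in the abstract (using CLIP to match masked image regions to nouns in the image captions) might look roughly like the sketch below. Several things here are assumptions rather than details from the record: the function name `match_nouns_to_regions` is hypothetical, the embeddings are taken as precomputed toy vectors instead of real CLIP outputs, and assigning each noun to its single most similar region by cosine similarity is only one plausible matching rule.

```python
import numpy as np

def match_nouns_to_regions(region_embeddings, noun_embeddings, nouns):
    """Pair each caption noun with its most similar masked region.

    Returns a list of (noun, region_index, cosine_similarity) tuples.
    """
    regions = np.asarray(region_embeddings, dtype=np.float64)
    texts = np.asarray(noun_embeddings, dtype=np.float64)

    # Normalise so that the dot product equals cosine similarity.
    regions = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    texts = texts / np.linalg.norm(texts, axis=1, keepdims=True)

    similarity = texts @ regions.T          # shape: (num_nouns, num_regions)
    best_region = similarity.argmax(axis=1) # assumed rule: best region per noun

    return [
        (noun, int(idx), float(similarity[row, idx]))
        for row, (noun, idx) in enumerate(zip(nouns, best_region))
    ]

# Toy embeddings standing in for encoder outputs (illustrative values only).
region_embeddings = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8], [0.1, 0.9, 0.1]]
noun_embeddings = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

pairs = match_nouns_to_regions(region_embeddings, noun_embeddings,
                               ["dog", "frisbee"])
```

Each resulting (noun, region) pair would then serve as one noisy training example of the kind the abstract contrasts with manually annotated, fixed-class segmentation labels.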