Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP

Detailed bibliography
Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 7061-7070
Main authors: Liang, Feng; Wu, Bichen; Dai, Xiaoliang; Li, Kunpeng; Zhao, Yinan; Zhang, Hang; Zhang, Peizhao; Vajda, Peter; Marculescu, Diana
Medium: Conference paper
Language: English
Published: IEEE, June 2023
Topics: Adaptation models; Computational modeling; Proposals; Semantic segmentation; Semantics; Training; Training data; Vision, language, and reasoning
ISSN: 1063-6919
Abstract: Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training. Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions. We identify the performance bottleneck of this paradigm to be the pre-trained CLIP model, since it does not perform well on masked images. To address this, we propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions. We collect training data by mining an existing image-caption dataset (e.g., COCO Captions), using CLIP to match masked image regions to nouns in the image captions. Compared with the more precise and manually annotated segmentation labels with fixed classes (e.g., COCO-Stuff), we find our noisy but diverse dataset can better retain CLIP's generalization ability. Along with finetuning the entire model, we utilize the "blank" areas in masked images using a method we dub mask prompt tuning. Experiments demonstrate mask prompt tuning brings significant improvement without modifying any weights of CLIP, and it can further improve a fully finetuned model. In particular, when trained on COCO and evaluated on ADE20K-150, our best model achieves 29.6% mIoU, which is +8.5% higher than the previous state-of-the-art. For the first time, open-vocabulary generalist models match the performance of supervised specialist models in 2017 without dataset-specific adaptations.
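
The two-stage paradigm described in the abstract can be made concrete with a short sketch: class-agnostic mask proposals are converted to masked images and scored against text embeddings of candidate class names with an off-the-shelf CLIP model. This is a minimal illustration built on the public OpenAI clip package; the helper classify_masked_regions, the prompt template, and the mask format are illustrative assumptions, not the authors' implementation.

    import numpy as np
    import torch
    import clip  # OpenAI CLIP package (github.com/openai/CLIP)
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def classify_masked_regions(image, masks, class_names):
        """Assign an open-vocabulary class to each class-agnostic mask proposal."""
        # Encode the candidate class names once; any prompt template works here.
        tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
        with torch.no_grad():
            text_feat = model.encode_text(tokens)
            text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

        labels = []
        pixels = np.array(image)  # H x W x 3 uint8
        for mask in masks:  # each mask: H x W boolean array from a proposal network
            # Zero out pixels outside the region; this "masked image" is exactly
            # the kind of input the abstract says vanilla CLIP handles poorly.
            masked = (pixels * mask[..., None]).astype(np.uint8)
            crop = preprocess(Image.fromarray(masked)).unsqueeze(0).to(device)
            with torch.no_grad():
                img_feat = model.encode_image(crop)
                img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            sims = (img_feat @ text_feat.T).squeeze(0)  # cosine similarities
            labels.append(class_names[int(sims.argmax())])
        return labels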
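The mask prompt tuning mentioned in the abstract keeps CLIP frozen and instead learns tokens that stand in for the "blank" patches of a masked image. Below is a minimal sketch of that idea, assuming the patch embeddings of a ViT-style encoder are available; the module name, shapes, and insertion point are assumptions for illustration, not the paper's code.

    import torch
    import torch.nn as nn

    class MaskPromptTuning(nn.Module):
        """Learnable tokens substituted for fully masked (blank) patch positions."""

        def __init__(self, num_patches: int, embed_dim: int):
            super().__init__()
            # One learnable prompt per patch position; these are the only
            # parameters that train, so CLIP's own weights stay untouched.
            self.mask_prompts = nn.Parameter(torch.empty(num_patches, embed_dim))
            nn.init.normal_(self.mask_prompts, std=0.02)

        def forward(self, patch_tokens: torch.Tensor, blank: torch.Tensor) -> torch.Tensor:
            # patch_tokens: (B, N, D) patch embeddings from the frozen ViT stem.
            # blank: (B, N) boolean, True where a patch falls entirely outside
            # the region mask and would otherwise carry no signal.
            prompts = self.mask_prompts.unsqueeze(0).expand_as(patch_tokens)
            return torch.where(blank.unsqueeze(-1), prompts, patch_tokens)

Because only mask_prompts receives gradients, this matches the abstract's claim that mask prompt tuning brings improvement "without modifying any weights of CLIP", and the same substitution can also be applied on top of a fully finetuned encoder.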
Authors and affiliations:
– Feng Liang, The University of Texas at Austin (jeffliang@utexas.edu)
– Bichen Wu, Meta Reality Labs (wbc@meta.com)
– Xiaoliang Dai, Meta Reality Labs
– Kunpeng Li, Meta Reality Labs
– Yinan Zhao, Meta Reality Labs
– Hang Zhang, Cruise
– Peizhao Zhang, Meta Reality Labs (stzpz@meta.com)
– Peter Vajda, Meta Reality Labs (vajdap@meta.com)
– Diana Marculescu, The University of Texas at Austin (dianam@utexas.edu)
CODEN: IEEPAD
DOI: 10.1109/CVPR52729.2023.00682
EISBN: 9798350301298
EISSN: 1063-6919
Pages: 7061-7070 (10 pages)
IEEE Xplore document ID: 10205125
Abbreviated publication title: CVPR
Web of Science citing articles: 253
Full text: https://ieeexplore.ieee.org/document/10205125