Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP


Saved in:
Detailed bibliography
Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 7061-7070
Main authors: Liang, Feng; Wu, Bichen; Dai, Xiaoliang; Li, Kunpeng; Zhao, Yinan; Zhang, Hang; Zhang, Peizhao; Vajda, Peter; Marculescu, Diana
Format: Conference paper
Language: English
Publication details: IEEE, 01.06.2023
Subjects: Adaptation models; Computational modeling; Proposals; Semantic segmentation; Semantics; Training; Training data; Vision, language, and reasoning
ISSN: 1063-6919
Online access: Get full text
Abstract: Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training. Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions. We identify the performance bottleneck of this paradigm to be the pre-trained CLIP model, since it does not perform well on masked images. To address this, we propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions. We collect training data by mining an existing image-caption dataset (e.g., COCO Captions), using CLIP to match masked image regions to nouns in the image captions. Compared with the more precise and manually annotated segmentation labels with fixed classes (e.g., COCO-Stuff), we find our noisy but diverse dataset can better retain CLIP's generalization ability. Along with finetuning the entire model, we utilize the "blank" areas in masked images using a method we dub mask prompt tuning. Experiments demonstrate mask prompt tuning brings significant improvement without modifying any weights of CLIP, and it can further improve a fully finetuned model. In particular, when trained on COCO and evaluated on ADE20K-150, our best model achieves 29.6% mIoU, which is +8.5% higher than the previous state-of-the-art. For the first time, open-vocabulary generalist models match the performance of supervised specialist models in 2017 without dataset-specific adaptations.
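The two-stage recipe the abstract analyzes can be made concrete with a short sketch. The snippet below is a minimal illustration assuming PyTorch and OpenAI's clip package (the helper name, prompt template, and masking rule are our assumptions, not the authors' released code): pixels outside a class-agnostic mask proposal are zeroed and the masked image is scored against open-vocabulary class names with a frozen CLIP. This is exactly the step the paper identifies as the bottleneck, since CLIP is not pre-trained on such masked images.

    # Sketch: classify one masked region with frozen CLIP (hypothetical helper).
    # Assumes: pip install torch numpy pillow git+https://github.com/openai/CLIP.git
    import numpy as np
    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def classify_masked_region(image, mask, class_names):
        # image: PIL RGB image; mask: (H, W) boolean class-agnostic proposal
        pixels = np.array(image).copy()
        pixels[~mask] = 0  # the "blank" areas CLIP never saw during pre-training
        masked = preprocess(Image.fromarray(pixels)).unsqueeze(0).to(device)
        prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
        with torch.no_grad():
            img = model.encode_image(masked)
            txt = model.encode_text(prompts)
            img = img / img.norm(dim=-1, keepdim=True)
            txt = txt / txt.norm(dim=-1, keepdim=True)
        return (100.0 * img @ txt.T).softmax(dim=-1)  # (1, len(class_names))

Mask prompt tuning, the abstract's second idea, can be sketched in the same spirit: fully masked patches carry no information, so learnable prompt tokens stand in for their patch embeddings while every CLIP weight stays frozen. The tensor shapes and the hard-replacement rule below are illustrative assumptions, not the paper's exact formulation.

    # Hypothetical sketch of mask prompt tuning: learnable tokens replace the
    # patch embeddings of blank (fully masked) patches; CLIP itself is frozen.
    class MaskPromptTuning(torch.nn.Module):
        def __init__(self, num_patches, embed_dim):
            super().__init__()
            self.mask_prompt = torch.nn.Parameter(torch.zeros(num_patches, embed_dim))

        def forward(self, patch_tokens, patch_is_blank):
            # patch_tokens: (B, N, D) from CLIP's visual patch embedding
            # patch_is_blank: (B, N) True where the proposal mask zeroed a patch
            blank = patch_is_blank.unsqueeze(-1).to(patch_tokens.dtype)
            return patch_tokens * (1.0 - blank) + self.mask_prompt * blank

Because only mask_prompt receives gradients, a setup like this is consistent with the abstract's claim that mask prompt tuning brings improvement "without modifying any weights of CLIP".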
Author details:
1. Liang, Feng (jeffliang@utexas.edu), The University of Texas at Austin
2. Wu, Bichen (wbc@meta.com), Meta Reality Labs
3. Dai, Xiaoliang, Meta Reality Labs
4. Li, Kunpeng, Meta Reality Labs
5. Zhao, Yinan, Meta Reality Labs
6. Zhang, Hang, Cruise
7. Zhang, Peizhao (stzpz@meta.com), Meta Reality Labs
8. Vajda, Peter (vajdap@meta.com), Meta Reality Labs
9. Marculescu, Diana (dianam@utexas.edu), The University of Texas at Austin
CODEN: IEEPAD
DOI: 10.1109/CVPR52729.2023.00682
EISBN: 9798350301298
EISSN: 1063-6919
Pages: 7061-7070 (10 pages)
IEEE Xplore document ID: 10205125
Genre: original research
Web of Science citing articles: 253
Publication date: 2023-06-01
Abbreviated publication title: CVPR
URI: https://ieeexplore.ieee.org/document/10205125