Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP


Saved in:
Detailed bibliography
Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 7061-7070
Main authors: Liang, Feng; Wu, Bichen; Dai, Xiaoliang; Li, Kunpeng; Zhao, Yinan; Zhang, Hang; Zhang, Peizhao; Vajda, Peter; Marculescu, Diana
Format: Conference paper
Language: English
Publication details: IEEE, 01.06.2023
Subjects: Adaptation models; Computational modeling; Proposals; Semantic segmentation; Semantics; Training; Training data; Vision, language, and reasoning
ISSN: 1063-6919
Online access: Get full text
Abstract: Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training. Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions. We identify the performance bottleneck of this paradigm to be the pre-trained CLIP model, since it does not perform well on masked images. To address this, we propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions. We collect training data by mining an existing image-caption dataset (e.g., COCO Captions), using CLIP to match masked image regions to nouns in the image captions. Compared with the more precise and manually annotated segmentation labels with fixed classes (e.g., COCO-Stuff), we find our noisy but diverse dataset can better retain CLIP's generalization ability. Along with finetuning the entire model, we utilize the "blank" areas in masked images using a method we dub mask prompt tuning. Experiments demonstrate mask prompt tuning brings significant improvement without modifying any weights of CLIP, and it can further improve a fully finetuned model. In particular, when trained on COCO and evaluated on ADE20K-150, our best model achieves 29.6% mIoU, which is +8.5% higher than the previous state-of-the-art. For the first time, open-vocabulary generalist models match the performance of supervised specialist models in 2017 without dataset-specific adaptations.
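The two-stage recipe the abstract analyzes can be made concrete with a short sketch. The snippet below is a minimal illustration assuming PyTorch and OpenAI's clip package (the helper name, prompt template, and masking rule are our assumptions, not the authors' released code): pixels outside a class-agnostic mask proposal are zeroed and the masked image is scored against open-vocabulary class names with a frozen CLIP. This is exactly the step the paper identifies as the bottleneck, since CLIP is not pre-trained on such masked images.

    # Sketch: classify one masked region with frozen CLIP (hypothetical helper).
    # Assumes: pip install torch numpy pillow git+https://github.com/openai/CLIP.git
    import numpy as np
    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def classify_masked_region(image, mask, class_names):
        # image: PIL RGB image; mask: (H, W) boolean class-agnostic proposal
        pixels = np.array(image).copy()
        pixels[~mask] = 0  # the "blank" areas CLIP never saw during pre-training
        masked = preprocess(Image.fromarray(pixels)).unsqueeze(0).to(device)
        prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
        with torch.no_grad():
            img = model.encode_image(masked)
            txt = model.encode_text(prompts)
            img = img / img.norm(dim=-1, keepdim=True)
            txt = txt / txt.norm(dim=-1, keepdim=True)
        return (100.0 * img @ txt.T).softmax(dim=-1)  # (1, len(class_names))

Mask prompt tuning, the abstract's second idea, can be sketched in the same spirit: fully masked patches carry no information, so learnable prompt tokens stand in for their patch embeddings while every CLIP weight stays frozen. The tensor shapes and the hard-replacement rule below are illustrative assumptions, not the paper's exact formulation.

    # Hypothetical sketch of mask prompt tuning: learnable tokens replace the
    # patch embeddings of blank (fully masked) patches; CLIP itself is frozen.
    class MaskPromptTuning(torch.nn.Module):
        def __init__(self, num_patches, embed_dim):
            super().__init__()
            self.mask_prompt = torch.nn.Parameter(torch.zeros(num_patches, embed_dim))

        def forward(self, patch_tokens, patch_is_blank):
            # patch_tokens: (B, N, D) from CLIP's visual patch embedding
            # patch_is_blank: (B, N) True where the proposal mask zeroed a patch
            blank = patch_is_blank.unsqueeze(-1).to(patch_tokens.dtype)
            return patch_tokens * (1.0 - blank) + self.mask_prompt * blank

Because only mask_prompt receives gradients, a setup like this is consistent with the abstract's claim that mask prompt tuning brings improvement "without modifying any weights of CLIP".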
Author details:
1. Liang, Feng (jeffliang@utexas.edu), The University of Texas at Austin
2. Wu, Bichen (wbc@meta.com), Meta Reality Labs
3. Dai, Xiaoliang, Meta Reality Labs
4. Li, Kunpeng, Meta Reality Labs
5. Zhao, Yinan, Meta Reality Labs
6. Zhang, Hang, Cruise
7. Zhang, Peizhao (stzpz@meta.com), Meta Reality Labs
8. Vajda, Peter (vajdap@meta.com), Meta Reality Labs
9. Marculescu, Diana (dianam@utexas.edu), The University of Texas at Austin
CODEN: IEEPAD
DOI: 10.1109/CVPR52729.2023.00682
EISBN: 9798350301298
EISSN: 1063-6919
Pages: 7061-7070 (10 pages)
IEEE Xplore document ID: 10205125
Genre: original research
Web of Science citing articles: 253
Publication date: 2023-06-01
Abbreviated publication title: CVPR
URI: https://ieeexplore.ieee.org/document/10205125