PointCLIP: Point Cloud Understanding by CLIP

Recently, zero-shot and few-shot learning via Contrastive Vision-Language Pre-training (CLIP) has shown inspiring performance on 2D visual recognition by learning to match images with their corresponding texts in open-vocabulary settings. However, it remains underexplored whether CLIP,...

Bibliographic details
Published in:	Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 8542-8552
Main authors:	Zhang, Renrui; Guo, Ziyu; Zhang, Wei; Li, Kunchang; Miao, Xupeng; Cui, Bin; Qiao, Yu; Gao, Peng; Li, Hongsheng
Format:	Conference proceeding
Language:	English
Published:	IEEE, 1 June 2022
Subject terms:	3D from multi-view and sensors; Transfer/low-shot/long-tail learning; Vision + language; Fuses; Image recognition; Knowledge engineering; Point cloud compression; Three-dimensional displays; Training; Visualization
ISSN:	1063-6919
Online access:	Full text

Abstract	Recently, zero-shot and few-shot learning via Contrastive Vision-Language Pre-training (CLIP) has shown inspiring performance on 2D visual recognition by learning to match images with their corresponding texts in open-vocabulary settings. However, it remains underexplored whether CLIP, pre-trained on large-scale image-text pairs in 2D, can be generalized to 3D recognition. In this paper, we show that such a setting is feasible by proposing PointCLIP, which aligns CLIP-encoded point clouds with 3D category texts. Specifically, we encode a point cloud by projecting it onto multi-view depth maps and aggregate the view-wise zero-shot predictions in an end-to-end manner, which achieves efficient knowledge transfer from 2D to 3D. We further design an inter-view adapter to better extract the global feature and adaptively fuse the 3D few-shot knowledge into CLIP pre-trained in 2D. By fine-tuning only the adapter under few-shot settings, the performance of PointCLIP can be largely improved. In addition, we observe that PointCLIP and classical 3D-supervised networks hold complementary knowledge. Via a simple ensemble during inference, PointCLIP brings favorable performance gains over state-of-the-art 3D networks. Therefore, PointCLIP is a promising alternative for effective 3D point cloud understanding in low-data regimes with marginal resource cost. We conduct thorough experiments on ModelNet10, ModelNet40 and ScanObjectNN to demonstrate the effectiveness of PointCLIP. Code is available at https://github.com/ZrrSkywalker/PointCLIP.
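The two steps the abstract names, projecting a point cloud onto multi-view depth maps and aggregating the view-wise predictions, can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the authors' implementation (see their repository for that). Everything specific here is an assumption made for the example: the orthographic projection, rotation about the y-axis, the function names, the map resolution, the camera distance, and the view weights. CLIP's visual and text encoders are not included; feature vectors are simply passed in as plain lists.

```python
import math


def rotate_y(point, angle):
    """Rotate a 3D point (x, y, z) about the y-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)


def project_depth_map(points, angle, size=8, camera_distance=2.0):
    """Orthographically project points (assumed inside the unit ball)
    onto a size x size depth map seen from one view.

    Each pixel keeps the depth of the nearest point; empty pixels stay 0.0.
    """
    depth = [[0.0] * size for _ in range(size)]
    for p in points:
        x, y, z = rotate_y(p, angle)
        col = min(size - 1, max(0, int((x + 1.0) / 2.0 * size)))
        row = min(size - 1, max(0, int((1.0 - y) / 2.0 * size)))
        d = camera_distance + z  # distance to a camera at z = -camera_distance
        if depth[row][col] == 0.0 or d < depth[row][col]:
            depth[row][col] = d
    return depth


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def view_logits(image_feature, text_features):
    """Cosine similarity between one view's feature and each class text feature."""
    f = normalize(image_feature)
    return [sum(a * b for a, b in zip(f, normalize(t))) for t in text_features]


def aggregate_logits(per_view_logits, view_weights):
    """Weighted sum of per-view class logits (weights are hypothetical)."""
    num_classes = len(per_view_logits[0])
    return [
        sum(w * logits[k] for w, logits in zip(view_weights, per_view_logits))
        for k in range(num_classes)
    ]


def predict(per_view_logits, view_weights):
    """Index of the class with the highest aggregated logit."""
    total = aggregate_logits(per_view_logits, view_weights)
    return max(range(len(total)), key=lambda k: total[k])
```

In a full pipeline, each depth map returned by `project_depth_map` would be passed through an image encoder to obtain the per-view feature consumed by `view_logits`; that encoder is deliberately left out of this sketch.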
Author affiliations:
1. Zhang, Renrui (Shanghai AI Laboratory; zhangrenrui@pjlab.org.cn)
2. Guo, Ziyu (Peking University, School of CS and Key Lab of HCST)
3. Zhang, Wei (Shanghai AI Laboratory)
4. Li, Kunchang (Shanghai AI Laboratory)
5. Miao, Xupeng (Peking University, School of CS and Key Lab of HCST)
6. Cui, Bin (Peking University, School of CS and Key Lab of HCST)
7. Qiao, Yu (Shanghai AI Laboratory)
8. Gao, Peng (Shanghai AI Laboratory; gaopeng@pjlab.org.cn)
9. Li, Hongsheng (The Chinese University of Hong Kong, CUHK-SenseTime Joint Laboratory; hsli@ee.cuhk.edu.hk)
CODEN	IEEPAD
DOI	10.1109/CVPR52688.2022.00836
Discipline	Applied Sciences
EISBN	1665469463 9781665469463
EISSN	1063-6919
Genre	orig-research
ISICitedReferencesCount	303
IsPeerReviewed	false
IsScholarly	true
PageCount	11
PublicationTitleAbbrev	CVPR
URI	https://ieeexplore.ieee.org/document/9878980