PointCLIP: Point Cloud Understanding by CLIP

Bibliographic Details
Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 8542 - 8552
Main Authors: Zhang, Renrui, Guo, Ziyu, Zhang, Wei, Li, Kunchang, Miao, Xupeng, Cui, Bin, Qiao, Yu, Gao, Peng, Li, Hongsheng
Format: Conference Proceeding
Language: English
Published: IEEE, 01.06.2022
Subjects: 3D from multi-view and sensors; Transfer/low-shot/long-tail learning; Vision + language
ISSN: 1063-6919
Online Access: https://ieeexplore.ieee.org/document/9878980
Abstract Recently, zero-shot and few-shot learning via Contrastive Vision-Language Pre-training (CLIP), which learns to match images with their corresponding texts in open-vocabulary settings, have shown inspiring performance on 2D visual recognition. However, it remains underexplored whether CLIP, pre-trained on large-scale image-text pairs in 2D, can be generalized to 3D recognition. In this paper, we identify that such a setting is feasible by proposing PointCLIP, which conducts alignment between CLIP-encoded point clouds and 3D category texts. Specifically, we encode a point cloud by projecting it onto multi-view depth maps and aggregate the view-wise zero-shot predictions in an end-to-end manner, which achieves efficient knowledge transfer from 2D to 3D. We further design an inter-view adapter to better extract the global feature and adaptively fuse the few-shot 3D knowledge into CLIP pre-trained in 2D. By just fine-tuning the adapter under few-shot settings, the performance of PointCLIP can be largely improved. In addition, we observe a complementary property between the knowledge of PointCLIP and that of classical 3D-supervised networks: via simple ensembling during inference, PointCLIP contributes favorable performance gains over state-of-the-art 3D networks. Therefore, PointCLIP is a promising alternative for effective 3D point cloud understanding in low-data regimes with marginal resource cost. We conduct thorough experiments on ModelNet10, ModelNet40 and ScanObjectNN to demonstrate the effectiveness of PointCLIP. Code is available at https://github.com/ZrrSkywalker/PointCLIP.
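To make the pipeline described in the abstract concrete, below is a minimal, hedged sketch of PointCLIP-style zero-shot inference: rasterize the point cloud into several depth-map views, encode each view with CLIP's image encoder, score the views against text embeddings of the category names, and average the view-wise predictions. The orthographic projection, prompt template, view count, and uniform view weighting are simplifying assumptions for illustration, not the authors' exact implementation (see the official repository for that); the inter-view adapter used for few-shot fine-tuning is omitted.

# Sketch of PointCLIP-style zero-shot classification (assumptions noted above).
import math
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git


def rotate_y(points, angle):
    # Rotate an (N, 3) cloud around the y-axis to obtain another viewpoint.
    c, s = math.cos(angle), math.sin(angle)
    rot = torch.tensor([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return points @ rot.T


def depth_map(points, resolution=224):
    # Simplified orthographic rasterization: x/y pick the pixel, z (shifted to
    # be positive) is written as intensity. Points are assumed in [-1, 1]^3.
    img = torch.zeros(resolution, resolution)
    xy = ((points[:, :2] + 1) / 2 * (resolution - 1)).long().clamp(0, resolution - 1)
    img[xy[:, 1], xy[:, 0]] = points[:, 2] + 1.0
    return img


@torch.no_grad()
def zero_shot_classify(points, class_names, num_views=6):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load("ViT-B/32", device=device)

    # Text branch: one prompt per category (the template is an assumption).
    tokens = clip.tokenize([f"point cloud depth map of a {c}." for c in class_names])
    text_feat = model.encode_text(tokens.to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    # Image branch: one depth map per view, tiled to 3 channels for CLIP
    # (CLIP's usual image normalization is skipped here for brevity).
    views = [
        depth_map(rotate_y(points, 2 * math.pi * i / num_views)).expand(3, -1, -1)
        for i in range(num_views)
    ]
    view_feat = model.encode_image(torch.stack(views).to(device))
    view_feat = view_feat / view_feat.norm(dim=-1, keepdim=True)

    # Aggregate view-wise zero-shot logits (uniform weights here; the paper
    # weighs views individually).
    logits = 100.0 * view_feat @ text_feat.t()        # (num_views, num_classes)
    return logits.mean(dim=0).softmax(dim=-1)         # (num_classes,)


# Example: a random cloud scored against a few ModelNet-style labels.
print(zero_shot_classify(torch.rand(1024, 3) * 2 - 1, ["airplane", "chair", "lamp"]))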
Author Qiao, Yu
Li, Kunchang
Zhang, Wei
Gao, Peng
Guo, Ziyu
Cui, Bin
Zhang, Renrui
Miao, Xupeng
Li, Hongsheng
Author_xml – sequence: 1
  givenname: Renrui
  surname: Zhang
  fullname: Zhang, Renrui
  email: zhangrenrui@pjlab.org.cn
  organization: Shanghai AI Laboratory
– sequence: 2
  givenname: Ziyu
  surname: Guo
  fullname: Guo, Ziyu
  organization: Peking University,School of CS and Key Lab of HCST
– sequence: 3
  givenname: Wei
  surname: Zhang
  fullname: Zhang, Wei
  organization: Shanghai AI Laboratory
– sequence: 4
  givenname: Kunchang
  surname: Li
  fullname: Li, Kunchang
  organization: Shanghai AI Laboratory
– sequence: 5
  givenname: Xupeng
  surname: Miao
  fullname: Miao, Xupeng
  organization: Peking University,School of CS and Key Lab of HCST
– sequence: 6
  givenname: Bin
  surname: Cui
  fullname: Cui, Bin
  organization: Peking University,School of CS and Key Lab of HCST
– sequence: 7
  givenname: Yu
  surname: Qiao
  fullname: Qiao, Yu
  organization: Shanghai AI Laboratory
– sequence: 8
  givenname: Peng
  surname: Gao
  fullname: Gao, Peng
  email: gaopeng@pjlab.org.cn
  organization: Shanghai AI Laboratory
– sequence: 9
  givenname: Hongsheng
  surname: Li
  fullname: Li, Hongsheng
  email: hsli@ee.cuhk.edu.hk
  organization: The Chinese University of Hong Kong,CUHK-SenseTime Joint Laboratory
CODEN IEEPAD
ContentType Conference Proceeding
DBID 6IE
6IH
CBEJK
RIE
RIO
DOI 10.1109/CVPR52688.2022.00836
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Applied Sciences
EISBN 1665469463
9781665469463
EISSN 1063-6919
EndPage 8552
ExternalDocumentID 9878980
Genre orig-research
ISICitedReferencesCount 303
IngestDate Wed Jun 25 06:01:13 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
PageCount 11
ParticipantIDs ieee_primary_9878980
PublicationCentury 2000
PublicationDate 2022-June
PublicationDateYYYYMMDD 2022-06-01
PublicationDate_xml – month: 06
  year: 2022
  text: 2022-June
PublicationDecade 2020
PublicationTitle Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online)
PublicationTitleAbbrev CVPR
PublicationYear 2022
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 8542
SubjectTerms 3D from multi-view and sensors; Transfer/low-shot/long-tail learning; Vision + language
Fuses
Image recognition
Knowledge engineering
Point cloud compression
Three-dimensional displays
Training
Visualization
Title PointCLIP: Point Cloud Understanding by CLIP
URI https://ieeexplore.ieee.org/document/9878980