PointCLIP: Point Cloud Understanding by CLIP


Bibliographic Details
Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 8542-8552
Main authors: Zhang, Renrui; Guo, Ziyu; Zhang, Wei; Li, Kunchang; Miao, Xupeng; Cui, Bin; Qiao, Yu; Gao, Peng; Li, Hongsheng
Format: Conference Proceeding
Language: English
Published: IEEE, 1 June 2022
Subjects: 3D from multi-view and sensors; Transfer/low-shot/long-tail learning; Vision + language
ISSN: 1063-6919
Online Access: Full text
Abstract Recently, zero-shot and few-shot learning via Contrastive Vision-Language Pre-training (CLIP), which learns to match images with their corresponding texts in open-vocabulary settings, have shown inspiring performance on 2D visual recognition. However, it remains underexplored whether CLIP, pre-trained on large-scale image-text pairs in 2D, can be generalized to 3D recognition. In this paper, we show that such a setting is feasible by proposing PointCLIP, which conducts alignment between CLIP-encoded point clouds and 3D category texts. Specifically, we encode a point cloud by projecting it onto multi-view depth maps and aggregate the view-wise zero-shot predictions in an end-to-end manner, which achieves efficient knowledge transfer from 2D to 3D. We further design an inter-view adapter to better extract the global feature and adaptively fuse the 3D few-shot knowledge into CLIP pre-trained in 2D. By just fine-tuning the adapter under few-shot settings, the performance of PointCLIP can be largely improved. In addition, we observe that the knowledge of PointCLIP is complementary to that of classical 3D-supervised networks: via a simple ensemble during inference, PointCLIP yields favorable performance gains over state-of-the-art 3D networks. Therefore, PointCLIP is a promising alternative for effective 3D point cloud understanding in low-data regimes with marginal resource cost. We conduct thorough experiments on ModelNet10, ModelNet40 and ScanObjectNN to demonstrate the effectiveness of PointCLIP. Code is available at https://github.com/ZrrSkywalker/PointCLIP.
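For illustration, the following minimal Python sketch shows the zero-shot pipeline the abstract describes: render a point cloud into several depth maps, encode each view with CLIP's image encoder, compare against text embeddings of the category names, and average the per-view predictions. It assumes PyTorch and the OpenAI CLIP package; the orthographic projection, view angles, and prompt template below are simplified placeholders, not the authors' exact implementation.

# Minimal zero-shot sketch: multi-view depth projection + CLIP text matching.
# Simplified for illustration; not the authors' exact projection or prompts.
import numpy as np
import torch
import clip                      # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

def depth_map(points, azimuth_deg, size=224):
    # Rotate the (N, 3) point cloud around the vertical axis and project it
    # orthographically onto an image plane, keeping the nearest depth per pixel.
    theta = np.deg2rad(azimuth_deg)
    rot = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                    [0.0, 1.0, 0.0],
                    [-np.sin(theta), 0.0, np.cos(theta)]])
    pts = points @ rot.T
    xy = pts[:, :2]
    xy = (xy - xy.min(0)) / (xy.max(0) - xy.min(0) + 1e-6)
    px = np.clip((xy * (size - 1)).astype(int), 0, size - 1)
    z = pts[:, 2] - pts[:, 2].min()
    z = 1.0 - z / (z.max() + 1e-6)                  # closer points rendered brighter
    depth = np.zeros((size, size), dtype=np.float32)
    for (x, y), d in zip(px, z):
        depth[size - 1 - y, x] = max(depth[size - 1 - y, x], d)
    return Image.fromarray((depth * 255).astype(np.uint8)).convert("RGB")

@torch.no_grad()
def zero_shot_classify(points, class_names, views=(0, 90, 180, 270)):
    # points: (N, 3) NumPy array; class_names: list of category strings.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)
    tokens = clip.tokenize([f"point cloud depth map of a {c}." for c in class_names]).to(device)
    text_feat = model.encode_text(tokens)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    logits = 0.0
    for az in views:                                 # aggregate view-wise predictions
        image = preprocess(depth_map(points, az)).unsqueeze(0).to(device)
        img_feat = model.encode_image(image)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        logits = logits + 100.0 * img_feat @ text_feat.T
    return (logits / len(views)).softmax(dim=-1)     # (1, num_classes) probabilities

In the paper, the view-wise logits are combined with per-view weights and, under few-shot settings, an inter-view adapter fuses the multi-view features before matching against the text embeddings; the equal-weight average above is only the simplest form of that aggregation.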
Author Zhang, Renrui (zhangrenrui@pjlab.org.cn), Shanghai AI Laboratory
Guo, Ziyu, Peking University, School of CS and Key Lab of HCST
Zhang, Wei, Shanghai AI Laboratory
Li, Kunchang, Shanghai AI Laboratory
Miao, Xupeng, Peking University, School of CS and Key Lab of HCST
Cui, Bin, Peking University, School of CS and Key Lab of HCST
Qiao, Yu, Shanghai AI Laboratory
Gao, Peng (gaopeng@pjlab.org.cn), Shanghai AI Laboratory
Li, Hongsheng (hsli@ee.cuhk.edu.hk), The Chinese University of Hong Kong, CUHK-SenseTime Joint Laboratory
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/CVPR52688.2022.00836
Discipline Applied Sciences
EISBN 1665469463
9781665469463
EISSN 1063-6919
EndPage 8552
ExternalDocumentID 9878980
Genre orig-research
Language English
PageCount 11
PublicationDate 2022-06-01
PublicationTitle Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online)
PublicationTitleAbbrev CVPR
PublicationYear 2022
Publisher IEEE
StartPage 8542
SubjectTerms 3D from multi-view and sensors; Transfer/low-shot/long-tail learning; Vision + language
Fuses
Image recognition
Knowledge engineering
Point cloud compression
Three-dimensional displays
Training
Visualization
Title PointCLIP: Point Cloud Understanding by CLIP
URI https://ieeexplore.ieee.org/document/9878980