PonderV2: Improved 3D Representation With a Universal Pre-Training Paradigm

In contrast to numerous NLP and 2D vision foundational models, training a 3D foundational model poses considerably greater challenges. This is primarily due to the inherent data variability and diversity of downstream tasks. In this paper, we introduce a novel universal 3D pre-training framework des...

Celý popis

Uložené v:

Podrobná bibliografia
Vydané v:	IEEE transactions on pattern analysis and machine intelligence Ročník 47; číslo 8; s. 6550 - 6565
Hlavní autori:	Zhu, Haoyi, Yang, Honghui, Wu, Xiaoyang, Huang, Di, Zhang, Sha, He, Xianglong, Zhao, Hengshuang, Shen, Chunhua, Qiao, Yu, He, Tong, Ouyang, Wanli
Médium:	Journal Article
Jazyk:	English
Vydavateľské údaje:	United States IEEE 01.08.2025
Predmet:	3D pre-training 3D vision Autoencoders Benchmark testing foundation model Geometry Image reconstruction LiDAR multi-view image Neural radiance field neural rendering point cloud Point cloud compression Rendering (computer graphics) RGB-D image Solid modeling Three-dimensional displays Training
ISSN:	0162-8828, 1939-3539, 2160-9292, 1939-3539
On-line prístup:	Získať plný text
Tagy:	Pridať tag Žiadne tagy, Buďte prvý, kto otaguje tento záznam!

Popis
Shrnutí:	In contrast to numerous NLP and 2D vision foundational models, training a 3D foundational model poses considerably greater challenges. This is primarily due to the inherent data variability and diversity of downstream tasks. In this paper, we introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representations. Considering that informative 3D features should encode rich geometry and appearance cues that can be utilized to render realistic images, we propose to learn 3D representations by differentiable neural rendering. We train a 3D backbone with a volumetric neural renderer by comparing the rendered with the real images. Notably, our pre-trained encoder can be seamlessly applied to various downstream tasks. These tasks include semantic challenges like 3D detection and segmentation, which involve scene understanding, and non-semantic tasks like 3D reconstruction and image synthesis, which focus on geometry and visuals. They span both indoor and outdoor scenarios. We also illustrate the capability of pre-training a 2D backbone using the proposed methodology, surpassing conventional pre-training methods by a large margin. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
Bibliografia:	ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 23
ISSN:	0162-8828 1939-3539 2160-9292 1939-3539
DOI:	10.1109/TPAMI.2025.3561598