Laser: Efficient Language-Guided Segmentation in Neural Radiance Fields

Bibliographic details
Published in:	IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 47, No. 5, pp. 3922-3934
Main authors:	Miao, Xingyu; Duan, Haoran; Bai, Yang; Shah, Tejal; Song, Jun; Long, Yang; Ranjan, Rajiv; Shao, Ling
Format:	Journal Article
Language:	English
Published:	United States: IEEE, 1 May 2025
Subjects:	3D segmentation; Accuracy; CLIP; Feature extraction; Image segmentation; NeRF; Neural radiance field; Rendering (computer graphics); Semantics; Solid modeling; Three-dimensional displays; Training; Visualization
ISSN:	0162-8828, 1939-3539, 2160-9292

Description
Summary:	In this work, we propose a method that leverages CLIP feature distillation, achieving efficient 3D segmentation through language guidance. Unlike previous methods that rely on multi-scale CLIP features and are limited by processing speed and storage requirements, our approach aims to streamline the workflow by directly and effectively distilling dense CLIP features, thereby achieving precise segmentation of 3D scenes using text. To achieve this, we introduce an adapter module and mitigate the noise issue in the dense CLIP feature distillation process through a self-cross-training strategy. Moreover, to enhance the accuracy of segmentation edges, this work presents a low-rank transient query attention mechanism. To ensure the consistency of segmentation for similar colors under different viewpoints, we convert the segmentation task into a classification task through label volume, which significantly improves the consistency of segmentation in color-similar areas. We also propose a simplified text augmentation strategy to alleviate the issue of ambiguity in the correspondence between CLIP features and text. Extensive experimental results show that our method surpasses current state-of-the-art technologies in both training speed and performance.
DOI:	10.1109/TPAMI.2025.3535916