Visual Programming for Zero-Shot Open-Vocabulary 3D Visual Grounding
| Published in: | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 20623-20633 |
|---|---|
| Format: | Conference Proceeding |
| Language: | English |
| Published: | IEEE, 16.06.2024 |
| ISSN: | 1063-6919 |
| Summary: | 3D Visual Grounding (3DVG) aims at localizing 3D objects based on textual descriptions. Conventional supervised methods for 3DVG often necessitate extensive annotations and a predefined vocabulary, which can be restrictive. To address this issue, we propose a novel visual programming approach for zero-shot open-vocabulary 3DVG, leveraging the capabilities of large language models (LLMs). Our approach begins with a unique dialog-based method, engaging with LLMs to establish a foundational understanding of zero-shot 3DVG. Building on this, we design a visual program that consists of three types of modules, i.e., view-independent, view-dependent, and functional modules. These modules, specifically tailored for 3D scenarios, work collaboratively to perform complex reasoning and inference. Furthermore, we develop an innovative language-object correlation module to extend the scope of existing 3D object detectors into open-vocabulary scenarios. Extensive experiments demonstrate that our zero-shot approach can outperform some supervised baselines, marking a significant stride towards effective 3DVG. Code is available at https://curryyuan.github.io/ZSVG3D. |
|---|---|
| ISSN: | 1063-6919 |
| DOI: | 10.1109/CVPR52733.2024.01949 |
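
For intuition, here is a minimal sketch of how such a module-based visual program might compose for a query like "the chair closest to the window". The module names and interfaces (`Box3D`, `loc`, `closest`) are illustrative assumptions, not the paper's actual API; they only demonstrate the idea of an LLM compiling a query into view-independent modules (category lookup) and functional modules (spatial selection).

```python
# Hypothetical sketch of a modular "visual program" for zero-shot 3DVG.
# Module names and signatures are assumptions for illustration only.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box3D:
    """Axis-aligned 3D bounding box: center (x, y, z) and size (w, l, h)."""
    center: Tuple[float, float, float]
    size: Tuple[float, float, float]


def loc(boxes: List[Box3D], labels: List[str], category: str) -> List[Box3D]:
    """View-independent module: keep candidate boxes whose open-vocabulary
    label (e.g., from a language-object correlation step) matches the query."""
    return [b for b, lab in zip(boxes, labels) if lab == category]


def closest(targets: List[Box3D], anchors: List[Box3D]) -> Box3D:
    """Functional module: pick the target box nearest to any anchor box."""
    def dist(a: Box3D, b: Box3D) -> float:
        return sum((pa - pb) ** 2 for pa, pb in zip(a.center, b.center)) ** 0.5
    return min(targets, key=lambda t: min(dist(t, a) for a in anchors))


# "the chair closest to the window" might compile (via an LLM) into:
#   target = closest(loc('chair'), loc('window'))
boxes = [Box3D((0, 0, 0), (1, 1, 1)), Box3D((3, 0, 0), (1, 1, 1)),
         Box3D((4, 0, 0), (0.5, 2, 1))]
labels = ["chair", "chair", "window"]
result = closest(loc(boxes, labels, "chair"), loc(boxes, labels, "window"))
print(result)  # the chair at (3, 0, 0), which lies nearer to the window
```

In the framework described by the abstract, view-dependent modules (e.g., "left of") would additionally condition on the observer's viewpoint, which the toy example above omits.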