FaVoR: Features via Voxel Rendering for Camera Relocalization

Bibliographic Details
Title: FaVoR: Features via Voxel Rendering for Camera Relocalization
Authors: Polizzi, Vincenzo, Cannici, Marco, Scaramuzza, Davide, Kelly, Jonathan
Contributors: University of Zurich
Source: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 44-53
Publication Status: Published (preprint available on arXiv)
Publisher Information: IEEE, 2025.
Publication Year: 2025
Subject Terms: FOS: Computer and information sciences, 1709 Human-Computer Interaction, Computer Science - Robotics, 1707 Computer Vision and Pattern Recognition, 10009 Department of Informatics, Computer Vision and Pattern Recognition (cs.CV), 1706 Computer Science Applications, Computer Science - Computer Vision and Pattern Recognition, 2741 Radiology, Nuclear Medicine and Imaging, 1702 Artificial Intelligence, 000 Computer science, knowledge & systems, Robotics (cs.RO), 2611 Modeling and Simulation
Description: Camera relocalization methods range from dense image alignment to direct camera pose regression from a query image. Among these, sparse feature matching stands out as an efficient, versatile, and generally lightweight approach with numerous applications. However, feature-based methods often struggle with significant viewpoint and appearance changes, leading to matching failures and inaccurate pose estimates. To overcome this limitation, we propose a novel approach that leverages a globally sparse yet locally dense 3D representation of 2D features. By tracking and triangulating landmarks over a sequence of frames, we construct a sparse voxel map optimized to render image patch descriptors observed during tracking. Given an initial pose estimate, we first synthesize descriptors from the voxels using volumetric rendering and then perform feature matching to estimate the camera pose. This methodology enables the generation of descriptors for unseen views, enhancing robustness to view changes. We extensively evaluate our method on the 7-Scenes and Cambridge Landmarks datasets. Our results show that our method significantly outperforms existing state-of-the-art feature representation techniques in indoor environments, achieving up to a 39% improvement in median translation error. Additionally, our approach yields comparable results to other methods for outdoor scenarios while maintaining lower memory and computational costs.
In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, Arizona, USA, Feb. 28-Mar. 4, 2025
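
The pipeline in the abstract (track and triangulate landmarks, fit a sparse voxel map that renders patch descriptors, then volume-render descriptors for a candidate pose and match them) can be illustrated compactly. Below is a minimal sketch, not the authors' implementation: it volume-renders a feature descriptor along one camera ray by alpha-compositing per-sample densities and descriptor values, which is the core rendering step the method applies per landmark. All function names, array shapes, and the toy Gaussian "landmark" density are illustrative assumptions.

    import numpy as np

    def render_descriptor(ray_o, ray_d, query_fn, n_samples=64, near=0.1, far=4.0):
        """Volume-render a D-dim descriptor along one ray.

        query_fn(points) -> (sigma, feat): densities (N,) and descriptors (N, D)
        sampled from a voxel grid (e.g., by trilinear interpolation).
        """
        t = np.linspace(near, far, n_samples)                  # sample depths along the ray
        pts = ray_o[None, :] + t[:, None] * ray_d[None, :]     # (N, 3) sample points
        sigma, feat = query_fn(pts)
        delta = np.diff(t, append=t[-1] + (t[-1] - t[-2]))     # inter-sample distances
        alpha = 1.0 - np.exp(-sigma * delta)                   # per-sample opacity
        trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1] + 1e-10]))
        w = alpha * trans                                      # standard volume-rendering weights
        return (w[:, None] * feat).sum(axis=0)                 # composited descriptor

    # Toy voxel query: a single Gaussian "landmark" at the origin carrying a fixed
    # 128-dim descriptor; this stands in for a learned sparse voxel grid.
    rng = np.random.default_rng(0)
    landmark_desc = rng.standard_normal(128)

    def toy_query(pts):
        d2 = (pts ** 2).sum(axis=1)
        sigma = 50.0 * np.exp(-d2 / 0.01)          # density peaked at the landmark
        feat = np.tile(landmark_desc, (len(pts), 1))
        return sigma, feat

    desc = render_descriptor(np.array([0.0, 0.0, -2.0]),
                             np.array([0.0, 0.0, 1.0]), toy_query)
    # Cosine similarity between the rendered and the stored descriptor (close to 1 here).
    print(desc @ landmark_desc / (np.linalg.norm(desc) * np.linalg.norm(landmark_desc) + 1e-12))

In the full method, descriptors rendered this way for an initial pose estimate are matched against descriptors extracted from the query image; the resulting 2D-3D correspondences then yield the camera pose (typically via a PnP solver with RANSAC, an assumption here since the abstract does not name the solver).
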
Document Type: Article; Conference object; Other literature type
File Description: WACV25_Polizzi.pdf - application/pdf
DOI: 10.1109/wacv61041.2025.00015
DOI: 10.48550/arxiv.2409.07571
DOI: 10.5167/uzh-278765
Access URL: http://arxiv.org/abs/2409.07571
https://www.zora.uzh.ch/id/eprint/278765/
https://doi.org/10.5167/uzh-278765
Rights: STM Policy #29
arXiv Non-Exclusive Distribution
CC BY
Accession Number: edsair.doi.dedup.....24f1190344930165762bcfda099a7506
Database: OpenAIRE