DeHi: A Decoupled Hierarchical Architecture for Unaligned Ground-to-Aerial Geo-Localization

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, Vol. 34, no. 3, pp. 1927-1940
Main Authors: Wang, Teng; Li, Jiawen; Sun, Changyin
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.03.2024
ISSN: 1051-8215, 1558-2205
Description
Summary: Ground-to-aerial (G2A) geo-localization remains extremely challenging because of the drastic appearance and geometry differences between ground and aerial views, especially when their relative orientation is unknown. This paper addresses unaligned G2A geo-localization, where the query ground-level image is not perfectly orientation-aligned with the reference aerial imagery. We cast the problem as a metric embedding task and propose a decoupled hierarchical (DeHi) architecture that progressively learns meaningful multi-grained features. Specifically, DeHi first uses a CNN to extract high-level semantic features, and then introduces a novel orthogonally factorized transformer model, consisting of part-level and global transformer encoders, to learn part-level and global feature descriptors sequentially. To enhance representation power, cross-level connections enrich the part-level and global descriptors with CNN features, and the pooled part-level descriptor is combined with the global descriptor to construct the final query representation. Furthermore, this decoupled hierarchical architecture allows multi-level deep supervision to be incorporated: two part-level losses and one cross-level loss complement the widely used global retrieval loss. Extensive experiments on standard benchmark datasets show significant gains in recall over the previous state of the art. Notably, under random orientation misalignments, DeHi improves top-1 recall from 78.59% to 82.38% (+3.79 percentage points) on CVUSA and from 72.91% to 77.94% (+5.03 percentage points) on CVACT. DeHi also maintains competitive inference efficiency with fewer parameters than existing transformer-based methods.
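The summary reports results as top-1 recall in a metric embedding setting, where each ground-level query is matched to the nearest aerial reference in a shared descriptor space. The following is a minimal sketch of how such a recall@k figure is typically computed, not the authors' evaluation code. The use of cosine similarity, the function names (`l2_normalize`, `cosine`, `recall_at_k`), and the toy two-dimensional descriptors are all assumptions made for illustration; the record does not specify them.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two descriptors (assumed similarity measure)."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def recall_at_k(queries, references, ground_truth, k=1):
    """Fraction of queries whose true reference index is among the top-k matches.

    queries      -- list of ground-view descriptors
    references   -- list of aerial-view descriptors
    ground_truth -- ground_truth[i] is the index of the correct reference for query i
    """
    hits = 0
    for query, true_index in zip(queries, ground_truth):
        # Rank all reference indices by descending similarity to the query.
        ranked = sorted(
            range(len(references)),
            key=lambda i: cosine(query, references[i]),
            reverse=True,
        )
        if true_index in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Hypothetical toy descriptors; real descriptors would come from a trained model.
ground_descriptors = [[1.0, 0.1], [0.0, 1.0], [0.9, 0.8]]
aerial_descriptors = [[0.9, 0.0], [0.1, 1.0], [1.0, -1.0]]
true_matches = [0, 1, 2]

recall_top1 = recall_at_k(ground_descriptors, aerial_descriptors, true_matches, k=1)
recall_top3 = recall_at_k(ground_descriptors, aerial_descriptors, true_matches, k=3)
```

In this toy setup the third query is deliberately placed far from its true reference, so it counts as a miss at k=1 and only becomes a hit once k covers all three references.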
DOI: 10.1109/TCSVT.2023.3293514