CGFusionFormer: Exploring Compact Spatial Representation for Robust 3D Human Pose Estimation with Low Computation Complexity



Bibliographic Details
Published in: Sensors (Basel, Switzerland), Vol. 25, No. 19, p. 6052
Main Authors: Lu, Tao; Wang, Hongtao; Xiao, Degui
Format: Journal Article
Language: English
Published: Switzerland: MDPI AG, 01.10.2025
ISSN: 1424-8220
Online Access: Full text
Description
Abstract: Transformer-based 2D-to-3D lifting methods have demonstrated outstanding performance in 3D human pose estimation from 2D pose sequences. However, they still encounter challenges with the relatively poor quality of 2D joints and substantial computational costs. In this paper, we propose CGFusionFormer to address these problems. We propose a compact spatial representation (CSR) to robustly generate local spatial multihypothesis features from part of the 2D pose sequence. Specifically, CSR models spatial constraints based on body parts and incorporates 2D Gaussian filters and nonparametric reduction to improve spatial features against low-quality 2D poses and reduce the computational cost of subsequent temporal encoding. We design a residual-based Hybrid Adaptive Fusion module that combines multihypothesis features with global frequency-domain features to accurately estimate the 3D human pose with minimal computational cost. We realize CGFusionFormer with a PoseFormer-like transformer backbone. Extensive experiments on the challenging Human3.6M and MPI-INF-3DHP benchmarks show that our method outperforms prior transformer-based variants in short receptive fields and achieves a superior accuracy–efficiency trade-off. On Human3.6M (sequence length 27, 3 input frames), it achieves 47.6 mm Mean Per Joint Position Error (MPJPE) at only 71.3 MFLOPs, representing about a 40 percent reduction in computation compared with PoseFormerV2 while attaining better accuracy. On MPI-INF-3DHP (81-frame sequences), it reaches 97.9 Percentage of Correct Keypoints (PCK), 78.5 Area Under the Curve (AUC), and 27.2 mm MPJPE, matching the best PCK and achieving the lowest MPJPE among the compared methods under the same setting.
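The abstract names two ideas that can be illustrated in miniature: Gaussian filtering to make noisy 2D keypoints more robust, and a residual-based fusion of a local branch with a global branch. The sketch below is not the paper's implementation; the function names, array shapes `(frames, joints, 2)`, and the scalar gate `alpha` are illustrative assumptions, and a plain temporal 1D Gaussian stands in for the paper's 2D Gaussian filters.

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    """Normalized 1D Gaussian kernel of length 2 * radius + 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smooth_pose_sequence(poses, sigma=1.0):
    """Smooth a noisy 2D pose sequence along the time axis.

    poses: array of shape (frames, joints, 2) of detected 2D keypoints.
    Returns an array of the same shape; edge frames are handled by
    reflective padding so the sequence length is preserved.
    """
    radius = max(1, int(3 * sigma))
    k = gaussian_kernel1d(sigma, radius)
    padded = np.pad(poses, ((radius, radius), (0, 0), (0, 0)), mode="reflect")
    out = np.zeros_like(poses, dtype=float)
    for t in range(poses.shape[0]):
        window = padded[t : t + 2 * radius + 1]          # (2*radius+1, joints, 2)
        out[t] = np.tensordot(k, window, axes=(0, 0))    # weighted average over time
    return out

def residual_adaptive_fusion(local_feat, global_feat, alpha):
    """Toy stand-in for a residual-based adaptive fusion: blend the
    local and global branches with a gate alpha in [0, 1], then add a
    skip connection to the local branch (hypothetical formulation)."""
    blend = alpha * local_feat + (1 - alpha) * global_feat
    return local_feat + blend
```

In this toy setup, an outlier detection in a single frame is spread over its neighbors and attenuated, which is the intuition behind filtering low-quality 2D poses before temporal encoding; the fusion gate decides, per call, how much the global frequency-domain branch contributes on top of the local spatial features.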
DOI: 10.3390/s25196052