CGFusionFormer: Exploring Compact Spatial Representation for Robust 3D Human Pose Estimation with Low Computation Complexity
| Published in: | Sensors (Basel, Switzerland), Vol. 25, Issue 19, p. 6052 |
|---|---|
| Main authors: | , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Switzerland: MDPI AG, 01.10.2025 |
| Subjects: | |
| ISSN: | 1424-8220 |
| Online access: | Full text |
| Abstract: | Transformer-based 2D-to-3D lifting methods have demonstrated outstanding performance in 3D human pose estimation from 2D pose sequences. However, they still struggle with the relatively poor quality of 2D joints and substantial computational costs. In this paper, we propose CGFusionFormer to address these problems. We introduce a compact spatial representation (CSR) that robustly generates local spatial multihypothesis features from part of the 2D pose sequence. Specifically, CSR models spatial constraints based on body parts and incorporates 2D Gaussian filters and nonparametric reduction to strengthen spatial features against low-quality 2D poses and to reduce the computational cost of subsequent temporal encoding. We design a residual-based Hybrid Adaptive Fusion module that combines the multihypothesis features with global frequency-domain features to estimate the 3D human pose accurately at minimal computational cost. We realize CGFusionFormer with a PoseFormer-like transformer backbone. Extensive experiments on the challenging Human3.6M and MPI-INF-3DHP benchmarks show that our method outperforms prior transformer-based variants under short receptive fields and achieves a superior accuracy–efficiency trade-off. On Human3.6M (sequence length 27, 3 input frames), it achieves 47.6 mm Mean Per Joint Position Error (MPJPE) at only 71.3 MFLOPs, an approximately 40% reduction in computation compared with PoseFormerV2 while attaining better accuracy. On MPI-INF-3DHP (81-frame sequences), it reaches 97.9 Percentage of Correct Keypoints (PCK), 78.5 Area Under the Curve (AUC), and 27.2 mm MPJPE, matching the best PCK and achieving the lowest MPJPE among the compared methods under the same setting. |
| DOI: | 10.3390/s25196052 |
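
As a rough illustration of the two CSR ingredients the abstract names, Gaussian filtering of noisy 2D poses and a nonparametric reduction that shrinks the sequence before temporal encoding, here is a minimal sketch. It is not the paper's implementation: the `(T, J, 2)` layout, filtering along the temporal axis, the `sigma` value, and average pooling as the reduction (the helper `csr_like_preprocess` and its `stride` parameter) are all hypothetical choices for illustration.

```python
# Minimal sketch of a CSR-like preprocessing step (assumptions, not the
# authors' code): smooth noisy 2D joint trajectories with a Gaussian filter,
# then shrink the sequence with a parameter-free average-pooling reduction
# so the subsequent temporal encoder processes fewer frames.
import numpy as np
from scipy.ndimage import gaussian_filter1d

def csr_like_preprocess(pose_seq: np.ndarray, sigma: float = 1.0,
                        stride: int = 3) -> np.ndarray:
    """pose_seq: (T, J, 2) array of T frames, J joints, (x, y) coordinates."""
    # Smooth each joint coordinate along the temporal axis to suppress
    # detector jitter in low-quality 2D poses (a stand-in for the paper's
    # 2D Gaussian filtering step; the axis and sigma are assumptions).
    smoothed = gaussian_filter1d(pose_seq, sigma=sigma, axis=0)
    # Nonparametric reduction: average-pool non-overlapping windows of
    # `stride` frames, cutting the tokens the temporal encoder must handle.
    T = smoothed.shape[0] - smoothed.shape[0] % stride
    return smoothed[:T].reshape(T // stride, stride,
                                *pose_seq.shape[1:]).mean(axis=1)

# Example: a 27-frame sequence of 17 joints reduces to 9 pooled frames.
poses = np.random.rand(27, 17, 2).astype(np.float32)
print(csr_like_preprocess(poses).shape)  # (9, 17, 2)
```

Pooling by a factor of 3 only loosely mirrors the 27-frame sequence with 3 input frames quoted for Human3.6M; the actual CSR additionally models body-part constraints and produces multihypothesis features, which plain pooling does not capture.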