Spatial–Temporal Heatmap Masked Autoencoder for Skeleton-Based Action Recognition

Skeleton representation learning offers substantial advantages for action recognition by encoding intricate motion details and spatial–temporal dependencies among joints. However, fully supervised approaches necessitate large amounts of annotated data, which are often labor-intensive and costly to a...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Sensors (Basel, Switzerland) Jg. 25; H. 10; S. 3146
Hauptverfasser: Bian, Cunling, Yang, Yang, Wang, Tao, Lu, Weigang
Format: Journal Article
Sprache:Englisch
Veröffentlicht: Switzerland MDPI AG 16.05.2025
MDPI
Schlagworte:
ISSN:1424-8220, 1424-8220
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Skeleton representation learning offers substantial advantages for action recognition by encoding intricate motion details and spatial–temporal dependencies among joints. However, fully supervised approaches necessitate large amounts of annotated data, which are often labor-intensive and costly to acquire. In this work, we propose the Spatial–Temporal Heatmap Masked Autoencoder (STH-MAE), a novel self-supervised framework tailored for skeleton-based action recognition. Unlike coordinate-based methods, STH-MAE adopts heatmap volumes as its primary representation, mitigating noise inherent in pose estimation while capitalizing on advances in Vision Transformers. The framework constructs a spatial–temporal heatmap (STH) by aggregating 2D joint heatmaps across both spatial and temporal axes. This STH is partitioned into non-overlapping patches to facilitate local feature learning, with a masking strategy applied to randomly conceal portions of the input. During pre-training, a Vision Transformer-based autoencoder equipped with a lightweight prediction head reconstructs the masked regions, fostering the extraction of robust and transferable skeletal representations. Comprehensive experiments on the NTU RGB+D 60 and NTU RGB+D 120 benchmarks demonstrate the superiority of STH-MAE, achieving state-of-the-art performance under multiple evaluation protocols.
Bibliographie:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
content type line 23
ISSN:1424-8220
1424-8220
DOI:10.3390/s25103146