Spatial–Temporal Heatmap Masked Autoencoder for Skeleton-Based Action Recognition

Bibliographic details
Published in:	Sensors (Basel, Switzerland) Vol. 25; No. 10; p. 3146
Main authors:	Bian, Cunling; Yang, Yang; Wang, Tao; Lu, Weigang
Format:	Journal Article
Language:	English
Published:	Switzerland: MDPI AG, 16 May 2025
Subjects:	Artificial intelligence; Comparative analysis; Computer vision; Human acts; Human behavior; Identification and classification; Machine learning; Machine vision; masked autoencoder; Methods; Neural networks; self-supervised learning; Semantics; skeleton-based action recognition; spatial–temporal heatmap; visual transformer; China
ISSN:	1424-8220
Online access:	Full text

Description
Abstract:	Skeleton representation learning offers substantial advantages for action recognition by encoding intricate motion details and spatial–temporal dependencies among joints. However, fully supervised approaches necessitate large amounts of annotated data, which are often labor-intensive and costly to acquire. In this work, we propose the Spatial–Temporal Heatmap Masked Autoencoder (STH-MAE), a novel self-supervised framework tailored for skeleton-based action recognition. Unlike coordinate-based methods, STH-MAE adopts heatmap volumes as its primary representation, mitigating noise inherent in pose estimation while capitalizing on advances in Vision Transformers. The framework constructs a spatial–temporal heatmap (STH) by aggregating 2D joint heatmaps across both spatial and temporal axes. This STH is partitioned into non-overlapping patches to facilitate local feature learning, with a masking strategy applied to randomly conceal portions of the input. During pre-training, a Vision Transformer-based autoencoder equipped with a lightweight prediction head reconstructs the masked regions, fostering the extraction of robust and transferable skeletal representations. Comprehensive experiments on the NTU RGB+D 60 and NTU RGB+D 120 benchmarks demonstrate the superiority of STH-MAE, achieving state-of-the-art performance under multiple evaluation protocols.
DOI:	10.3390/s25103146
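
Illustrative sketch (not from the article): the abstract describes building a heatmap volume from 2D joint heatmaps, cutting it into non-overlapping patches, and randomly masking some of them before reconstruction. The NumPy code below shows one way such input preparation could look. Everything specific in it is an assumption made for illustration, since the abstract does not state these details: the Gaussian heatmap rendering, the per-frame max over joints used as the aggregation, the patch size, the 75% mask ratio, and the function names `joint_heatmaps`, `patchify`, and `random_mask`. The Vision Transformer encoder and prediction head are omitted.

```python
import numpy as np


def joint_heatmaps(keypoints, height, width, sigma=2.0):
    """Render Gaussian heatmaps from 2D joints.

    keypoints: array of shape (T, J, 2) holding (x, y) pixel coordinates.
    Returns an array of shape (T, J, height, width).
    """
    ys = np.arange(height)[:, None]            # (H, 1)
    xs = np.arange(width)[None, :]             # (1, W)
    x = keypoints[..., 0][..., None, None]     # (T, J, 1, 1)
    y = keypoints[..., 1][..., None, None]     # (T, J, 1, 1)
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))


def patchify(volume, pt, ph, pw):
    """Split a (T, H, W) volume into non-overlapping (pt, ph, pw) patches.

    Returns an array of shape (num_patches, pt * ph * pw).
    """
    T, H, W = volume.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    v = volume.reshape(T // pt, pt, H // ph, ph, W // pw, pw)
    v = v.transpose(0, 2, 4, 1, 3, 5)          # patch grid first, patch contents last
    return v.reshape(-1, pt * ph * pw)


def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean vector where True marks a patch hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask


rng = np.random.default_rng(0)
T, J, H, W = 8, 17, 32, 32                     # assumed toy sizes
keypoints = rng.uniform(0, 32, size=(T, J, 2)) # random stand-in for estimated poses

heatmaps = joint_heatmaps(keypoints, H, W)     # (T, J, H, W)
volume = heatmaps.max(axis=1)                  # assumed aggregation over joints -> (T, H, W)
patches = patchify(volume, pt=2, ph=8, pw=8)   # (64, 128)
mask = random_mask(len(patches), 0.75, rng)    # assumed mask ratio
visible = patches[~mask]                       # patches an encoder would see
targets = patches[mask]                        # patches a decoder would reconstruct
```

In a setup like this, only `visible` would be fed to the encoder, and a reconstruction loss would be computed against `targets`; the article itself should be consulted for the actual aggregation scheme and hyperparameters.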