View-Invariant Skeleton Action Representation Learning via Motion Retargeting

Detailed Bibliography
Published in: International Journal of Computer Vision, Vol. 132, No. 7, pp. 2351–2366
Main authors: Yang, Di; Wang, Yaohui; Dantcheva, Antitza; Garattoni, Lorenzo; Francesca, Gianpiero; Brémond, François
Format: Journal Article
Language: English
Published: New York: Springer US, 01.07.2024
ISSN: 0920-5691, 1573-1405
Description
Summary: Current self-supervised approaches for skeleton action representation learning often focus on constrained scenarios, where videos and skeleton data are recorded in laboratory settings. When dealing with estimated skeleton data in real-world videos, such methods perform poorly due to the large variations across subjects and camera viewpoints. To address this issue, we introduce ViA, a novel View-Invariant Autoencoder for self-supervised skeleton action representation learning. ViA leverages motion retargeting between different human performers as a pretext task in order to disentangle the latent action-specific ‘Motion’ features on top of the visual representation of a 2D or 3D skeleton sequence. Such ‘Motion’ features are invariant to skeleton geometry and camera view and allow ViA to facilitate both cross-subject and cross-view action classification tasks. We conduct a study focusing on transfer learning for skeleton-based action recognition with self-supervised pre-training on real-world data (e.g., Posetics). Our results show that skeleton representations learned from ViA are generic enough to improve upon state-of-the-art action classification accuracy, not only on 3D laboratory datasets such as NTU-RGB+D 60 and NTU-RGB+D 120, but also on real-world datasets where only 2D data are accurately estimated, e.g., Toyota Smarthome, UAV-Human and Penn Action. Code and models will be publicly available at https://walker-a11y.github.io/ViA-project.
DOI: 10.1007/s11263-023-01967-8
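
The summary above describes ViA's pretext task only at a high level. As a concrete illustration, below is a minimal, hypothetical PyTorch sketch of a skeleton autoencoder whose latent space is split into an action-specific "motion" code and a performer-specific "character" code, trained with a reconstruction term plus a motion-consistency term on retargeted sequences. This is not the authors' implementation: the names (SkeletonAutoencoder, pretext_loss), the GRU encoders, the dimensions, and the loss layout are all illustrative assumptions.

```python
# Hypothetical sketch of the disentangling idea only (not the ViA code release).
import torch
import torch.nn as nn


class SkeletonAutoencoder(nn.Module):
    def __init__(self, num_joints=17, coords=2, motion_dim=128, char_dim=64):
        super().__init__()
        in_dim = num_joints * coords
        # Separate encoders for the time-varying motion and the static performer/view style.
        self.motion_enc = nn.GRU(in_dim, motion_dim, batch_first=True)
        self.char_enc = nn.GRU(in_dim, char_dim, batch_first=True)
        # Decoder maps the concatenated codes back to per-frame joint coordinates.
        self.decoder = nn.GRU(motion_dim + char_dim, 256, batch_first=True)
        self.out = nn.Linear(256, in_dim)

    def encode(self, x):
        # x: (batch, frames, joints * coords), flattened estimated skeletons
        motion, _ = self.motion_enc(x)   # per-frame motion code
        _, char = self.char_enc(x)       # final hidden state as sequence-level character code
        return motion, char[-1]          # (batch, frames, motion_dim), (batch, char_dim)

    def decode(self, motion, char):
        # Broadcast the static character code over time and decode to poses.
        char_seq = char.unsqueeze(1).expand(-1, motion.size(1), -1)
        h, _ = self.decoder(torch.cat([motion, char_seq], dim=-1))
        return self.out(h)

    def retarget(self, x_src, x_tgt):
        # Drive the target performer's skeleton with the source motion.
        motion_src, _ = self.encode(x_src)
        _, char_tgt = self.encode(x_tgt)
        return self.decode(motion_src, char_tgt)


def pretext_loss(model, x_a, x_b):
    """Illustrative self-supervised objective: reconstruct a sequence from its own
    codes, and require the retargeted sequence to carry the same motion code."""
    motion_a, char_a = model.encode(x_a)
    recon_a = model.decode(motion_a, char_a)
    retargeted = model.retarget(x_a, x_b)          # motion of A on the skeleton of B
    motion_r, _ = model.encode(retargeted)
    mse = nn.MSELoss()
    return mse(recon_a, x_a) + mse(motion_r, motion_a.detach())


if __name__ == "__main__":
    # Toy usage with random 2D skeleton sequences (4 clips, 32 frames, 17 joints).
    model = SkeletonAutoencoder(num_joints=17, coords=2)
    x_a = torch.randn(4, 32, 17 * 2)
    x_b = torch.randn(4, 32, 17 * 2)
    loss = pretext_loss(model, x_a, x_b)
    loss.backward()
```

In such a setup the motion code is what would be reused downstream for cross-subject and cross-view action classification; the retargeting path only serves to push skeleton-geometry and viewpoint information out of that code and into the character branch.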