Masked Autoencoders for Spatial-Temporal Relationship in Video-Based Group Activity Recognition
Group Activity Recognition (GAR) is a challenging problem involving several intricacies. The core of GAR lies in delving into spatiotemporal features to generate appropriate scene representations. Previous methods, however, either feature a complex framework requiring individual action labels or nee...
Saved in:
| Published in: | IEEE access Vol. 12; pp. 132084 - 132095 |
|---|---|
| Main Authors: | , , |
| Format: | Journal Article |
| Language: | English |
| Published: |
Piscataway
IEEE
2024
The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
| Subjects: | |
| ISSN: | 2169-3536, 2169-3536 |
| Online Access: | Get full text |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| Summary: | Group Activity Recognition (GAR) is a challenging problem involving several intricacies. The core of GAR lies in delving into spatiotemporal features to generate appropriate scene representations. Previous methods, however, either feature a complex framework requiring individual action labels or need more adequate modelling of spatial and temporal features. To address these concerns, we propose a masking strategy for learning task-specific GAR scene representations through reconstruction. Furthermore, we elucidate how this methodology can effectively capture task-specific spatiotemporal features. In particular, three notable findings emerge from our framework: 1) GAR is simplified, eliminating the need for individual action labels; 2) the generation of target-specific spatiotemporal features yields favourable outcomes for various datasets; and 3) this method demonstrates effectiveness even for datasets with a small number of videos, highlighting its capability with limited training data. Further, the existing GAR datasets have fewer videos per class and only a few actors are considered, restricting the existing model from being generalised effectively. To this aim, we introduce 923 videos for a crime activity named IITP Hostage, which contains two categories, hostage and non-hostage. To our knowledge, this is the first attempt to recognize crime-based activities in GAR. Our framework achieves MCA of 96.8%, 97.0%, 97.0% on Collective Activity Dataset (CAD), new CAD, extended CAD datasets and 84.3%, 95.6%, 96.78% for IITP Hostage, hostage+CAD and subset of UCF crime datasets. The hostage and non-hostage scenarios introduce additional complexity, making it more challenging for the model to accurately recognize the activities compared to hostage+CAD and other datasets. This observation underscores the necessity to delve deeper into the complexity of GAR activities. |
|---|---|
| Bibliography: | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14 |
| ISSN: | 2169-3536 2169-3536 |
| DOI: | 10.1109/ACCESS.2024.3457024 |