Self-supervised temporal autoencoder for egocentric action segmentation
Saved in:
| Published in: | Engineering applications of artificial intelligence, Vol. 126, p. 107092 |
|---|---|
| Main authors: | , , , , , |
| Format: | Journal Article |
| Language: | English |
| Publication details: | Elsevier Ltd, 01.11.2023 |
| Subject: | |
| ISSN: | 0952-1976, 1873-6769 |
| Online access: | Get full text |
| Summary: | Given an egocentric video, action temporal segmentation aims to temporally segment the video into basic units, each depicting an action. As the camera is constantly moving, some important objects may disappear in consecutive frames and cause an abrupt change in the visual content. Recent works fail to handle this condition in the absence of abundant manually annotated frames. In this study, we propose a temporal-aware clustering method for egocentric action temporal segmentation: a self-supervised temporal autoencoder (SSTAE). Instead of directly learning visual features, the SSTAE is implemented by encoding the preceding target frame and predicting subsequent frames in the temporal relationship domain, which takes into account local temporal consistency. Our proposed algorithm is guided by the reconstruction and prediction losses. Consequently, local temporal contexts are naturally integrated into the feature representation, and a clustering step is performed. Experiments on three egocentric datasets demonstrate that our proposed approach outperforms state-of-the-art methods by 7.57% in clustering Accuracy (ACC), 8.17% in Normalized Mutual Information (NMI), and 8.6% in Adjusted Rand Index (ARI).
• This study focuses on the unsupervised temporal segmentation problem in egocentric video. • A self-supervised learning model that leverages the temporal relationship domain is introduced into unsupervised egocentric temporal segmentation. • A novel similarity metric between frames effectively improves feature representational ability. • Experimental results on three egocentric datasets show the superiority of the proposed method. |
|---|---|
| ISSN: | 0952-1976, 1873-6769 |
| DOI: | 10.1016/j.engappai.2023.107092 |
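The core idea the abstract describes — encoding a target frame and predicting the temporal-relationship vectors of its subsequent frames, with a combined reconstruction and prediction loss, before a final clustering step — can be sketched roughly as below. This is a minimal illustration, not the paper's architecture: the linear encoder/decoder/predictor weights, the cosine-similarity form of the "temporal relationship domain", and all array shapes are assumptions.

```python
import numpy as np

def temporal_relationship(frame, window):
    """Cosine similarity between one frame's features and each frame in a
    local temporal window (an assumed stand-in for the paper's
    'temporal relationship domain')."""
    f = frame / (np.linalg.norm(frame) + 1e-8)
    w = window / (np.linalg.norm(window, axis=1, keepdims=True) + 1e-8)
    return w @ f  # shape: (len(window),)

def sstae_losses(frames, t, k, enc, dec, pred):
    """Combined reconstruction + prediction loss for target frame t.

    frames : (T, d) array of per-frame features
    enc    : (h, d) encoder weights (hypothetical linear sketch)
    dec    : (d, h) decoder weights
    pred   : (k, h) predictor mapping the latent code to the
             temporal-relationship vector of the next k frames
    """
    x = frames[t]
    z = enc @ x                         # encode the target frame
    recon = dec @ z                     # reconstruct it
    loss_rec = np.mean((recon - x) ** 2)

    # predict the relationship of x to the k subsequent frames
    target_rel = temporal_relationship(x, frames[t + 1:t + 1 + k])
    pred_rel = pred @ z
    loss_pred = np.mean((pred_rel - target_rel) ** 2)

    return loss_rec + loss_pred
```

After training such a model on these two losses, the latent codes `z` would carry local temporal context and could be grouped with an off-the-shelf clustering algorithm (e.g. k-means) to obtain the action segments evaluated by ACC/NMI/ARI in the abstract.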