TCLR: Temporal contrastive learning for video representation


Saved in:
Detailed bibliography
Published in: Computer Vision and Image Understanding, Vol. 219, p. 103406
Main authors: Dave, Ishan; Gupta, Rohit; Rizve, Mamshad Nayeem; Shah, Mubarak
Format: Journal Article
Language: English
Published: Elsevier Inc., 01.06.2022
ISSN: 1077-3142, 1090-235X
Online access: Get full text
Abstract
Contrastive learning has nearly closed the gap between supervised and self-supervised learning of image representations, and has also been explored for videos. However, prior work on contrastive learning for video data has not explored the effect of explicitly encouraging the features to be distinct across the temporal dimension. We develop a new temporal contrastive learning framework consisting of two novel losses to improve upon existing contrastive self-supervised video representation learning methods. The local–local temporal contrastive loss adds the task of discriminating between non-overlapping clips from the same video, whereas the global–local temporal contrastive loss aims to discriminate between timesteps of the feature map of an input clip in order to increase the temporal diversity of the learned features. Our proposed temporal contrastive learning framework achieves significant improvement over state-of-the-art results in various downstream video understanding tasks such as action recognition, limited-label action classification, and nearest-neighbor video retrieval on multiple video datasets and backbones. We also demonstrate significant improvement in fine-grained action classification for visually similar classes. With the commonly used 3D ResNet-18 architecture pretrained on UCF101, we achieve 82.4% top-1 accuracy (+5.1% over the previous best) on UCF101 and 52.9% (+5.4%) on HMDB51 action classification, and 56.2% Top-1 Recall (+11.7%) on UCF101 nearest-neighbor video retrieval. Code released at https://github.com/DAVEISHAN/TCLR.
Highlights
• TCLR is a contrastive learning framework for video understanding tasks.
• Explicitly enforces within-instance temporal feature variation without pretext tasks.
• Proposes novel local–local and global–local temporal contrastive losses.
• Significantly outperforms state-of-the-art pre-training on video understanding tasks.
• Uses a fine-grained action classification task for evaluating learned representations.
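The abstract describes the local–local loss only at a high level: non-overlapping clips of one video are the instances to discriminate, with the matching timestep under a second augmented view as the positive. As a rough NumPy illustration of that idea (an InfoNCE-style objective over T clip features per view; function names, tensor shapes, and the temperature value are illustrative assumptions, not taken from the released TCLR code):

```python
import numpy as np

def cosine_sim(a, b):
    # Row-wise cosine similarity matrix between two (T, D) feature sets.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def local_local_loss(view_a, view_b, temperature=0.1):
    """Sketch of a local-local temporal contrastive loss.

    view_a, view_b: (T, D) features of T non-overlapping clips from the
    same video under two augmentations. Clip t of view_a is positive with
    clip t of view_b; clips at other timesteps act as negatives.
    """
    sim = cosine_sim(view_a, view_b) / temperature          # (T, T) logits
    # Numerically stable log-softmax over each row's candidates.
    logits = sim - sim.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (matching timestep) as the target.
    return -np.mean(np.diag(log_prob))
```

Minimizing this loss pushes same-timestep features together and different-timestep features apart, which is the "temporal distinctness" the abstract argues prior work did not enforce; the global–local loss applies the same contrast between a clip's pooled feature and the timesteps of its feature map.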
ArticleNumber 103406
Author Dave, Ishan (ORCID: 0000-0001-9920-6970; email: ishandave@knights.ucf.edu)
Gupta, Rohit
Rizve, Mamshad Nayeem (ORCID: 0000-0001-5378-1697)
Shah, Mubarak (ORCID: 0000-0001-6172-5572)
ContentType Journal Article
Copyright 2022 The Author(s)
DOI 10.1016/j.cviu.2022.103406
Discipline Applied Sciences
Engineering
Computer Science
EISSN 1090-235X
ExternalDocumentID 10_1016_j_cviu_2022_103406
S1077314222000376
ISICitedReferencesCount 99
ISSN 1077-3142
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Keywords Action Recognition
Self-Supervised Learning
Video Representation
MSC: 68T07, 68T30, 68T45
Language English
License This is an open access article under the CC BY-NC-ND license.
ORCID 0000-0001-5378-1697
0000-0001-9920-6970
0000-0001-6172-5572
OpenAccessLink https://dx.doi.org/10.1016/j.cviu.2022.103406
PublicationDate June 2022
PublicationTitle Computer vision and image understanding
PublicationYear 2022
Publisher Elsevier Inc
References Li, Y., Li, Y., Vasconcelos, N., 2018. Resound: Towards action recognition without representation bias. In: Proceedings of the European Conference on Computer Vision. ECCV, pp. 513–528.
Huo, Ding, Lu, Lu, Xiang, Wen, Huang, Jiang, Zhang, Tang, Huang, Luo (b27) 2021
Xue, Ji, Zhang, Cao (b64) 2020; 88
Yang, Xu, Dai, Zhou (b65) 2020
Knights, Harwood, Ward, Vanderkop, Mackenzie-Ross, Moghadam (b33) 2021
Sun, Baradel, Murphy, Schmid (b50) 2019
Lengyel, Bruintjes, Rios, Kayhan, Zambrano, Tomen, van Gemert (b36) 2022
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T., 2011. HMDB: a large video database for human motion recognition. In: Proceedings of the International Conference on Computer Vision. ICCV.
Dave, Biyani, Clark, Gupta, Rawat, Shah (b14) 2021
Miech, A., Alayrac, J.-B., Smaira, L., Laptev, I., Sivic, J., Zisserman, A., 2020b. End-to-end learning of visual representations from uncurated instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9879–9889.
Diba, Fayyaz, Sharma, Paluri, Gall, Stiefelhagen, Van Gool (b16) 2020
Feichtenhofer, C., Fan, H., Malik, J., He, K., 2019. Slowfast networks for video recognition. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 6202–6211.
Han, Xie, Zisserman (b24) 2020
Fernando, B., Bilen, H., Gavves, E., Gould, S., 2017. Self-supervised video representation learning with odd-one-out networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3636–3645.
Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K., 2018. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In: Proceedings of the European Conference on Computer Vision. ECCV, pp. 305–321.
Tao, L., Wang, X., Yamasaki, T., 2020. Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework. In: Proceedings of the 28th ACM International Conference on Multimedia. pp. 2193–2201.
Wang, J., Gao, Y., Li, K., Jiang, X., Guo, X., Ji, R., Sun, X., 2021. Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion. In: The AAAI Conference on Artificial Intelligence. AAAI.
Qian, R., Meng, T., Gong, B., Yang, M.-H., Wang, H., Belongie, S., Cui, Y., 2021a. Spatiotemporal contrastive video representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6964–6974.
Behrmann, N., Gall, J., Noroozi, M., 2021. Unsupervised Video Representation Learning by Bidirectional Feature Prediction. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1670–1679.
Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M., 2018. A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6450–6459.
Carreira, J., Zisserman, A., 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6299–6308.
Wang, Jiao, Bao, He, Liu, Liu (b59) 2021
Jenni, S., Jin, H., 2021. Time-equivariant contrastive video representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9970–9980.
Wei, D., Lim, J.J., Zisserman, A., Freeman, W.T., 2018. Learning and using the arrow of time. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8052–8060.
Choi, J., Gao, C., Messou, J.C., Huang, J.-B., 2019. Why Can’t I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition. In: Advances in Neural Information Processing Systems. pp. 853–865.
Kataoka, Wakamiya, Hara, Satoh (b31) 2020
Lorre, G., Rabarisoa, J., Orcesi, A., Ainouz, S., Canu, S., 2020. Temporal Contrastive Pretraining for Video Action Recognition. In: The IEEE Winter Conference on Applications of Computer Vision. pp. 662–670.
Oord, Li, Vinyals (b43) 2018
Zhuang, C., She, T., Andonian, A., Mark, M.S., Yamins, D., 2020. Unsupervised learning from video with deep neural embeddings. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9563–9572.
Benaim, S., Ephrat, A., Lang, O., Mosseri, I., Freeman, W.T., Rubinstein, M., Irani, M., Dekel, T., 2020. SpeedNet: Learning the Speediness in Videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9922–9931.
Wang, J., Jiao, J., Bao, L., He, S., Liu, Y., Liu, W., 2019. Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4006–4015.
Yao, Y., Liu, C., Luo, D., Zhou, Y., Ye, Q., 2020a. Video Playback Rate Perception for Self-Supervised Spatio-Temporal Representation Learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6548–6557.
Gutmann, M., Hyvärinen, A., 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. pp. 297–304.
Caron, Misra, Mairal, Goyal, Bojanowski, Joulin (b8) 2020
Pan, T., Song, Y., Yang, T., Jiang, W., Liu, W., 2021. Videomoco: Contrastive video representation learning with temporally adversarial examples. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11205–11214.
Alwassel, Mahajan, Korbar, Torresani, Ghanem, Tran (b3) 2020
Wang, J., Jiao, J., Liu, Y.-H., 2020. Self-supervised Video Representation Learning by Pace Prediction. In: The European Conference on Computer Vision. ECCV.
Misra, Zitnick, Hebert (b42) 2016
Soomro, Zamir, Shah (b49) 2012
Hara, K., Kataoka, H., Satoh, Y., 2018. Towards Good Practice for Action Recognition with Spatiotemporal 3D Convolutions. In: 2018 24th International Conference on Pattern Recognition. ICPR, pp. 2516–2521.
Miech, A., Alayrac, J.-B., Smaira, L., Laptev, I., Sivic, J., Zisserman, A., 2020a. End-to-end learning of visual representations from uncurated instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9879–9889.
Qian, R., Li, Y., Liu, H., See, J., Ding, S., Liu, X., Li, D., Lin, W., 2021b. Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization. In: Proceedings of the International Conference on Computer Vision. ICCV.
Luo, D., Liu, C., Zhou, Y., Yang, D., Ma, C., Ye, Q., Wang, W., 2020. Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33. pp. 11701–11708.
Tokmakov, Hebert, Schmid (b54) 2020
Han, T., Xie, W., Zisserman, A., 2019. Video Representation Learning by Dense Predictive Coding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops.
Cho, Kim, Chang, Hwang (b12) 2021; 9
Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M., 2015. Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4489–4497.
Xu, D., Xiao, J., Zhao, Z., Shao, J., Xie, D., Zhuang, Y., 2019. Self-supervised spatiotemporal learning via video clip order prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 10334–10343.
Bachman, P., Hjelm, R.D., Buchwalter, W., 2019. Learning representations by maximizing mutual information across views. In: Advances in Neural Information Processing Systems. pp. 15535–15545.
Yao, T., Zhang, Y., Qiu, Z., Pan, Y., Mei, T., 2021. Seco: Exploring sequence supervision for unsupervised representation learning. In: AAAI. 2, p. 7.
Jing, Yang, Liu, Tian (b30) 2018
Lee, H.-Y., Huang, J.-B., Singh, M., Yang, M.-H., 2017. Unsupervised representation learning by sorting sequences. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 667–676.
Chen, T., Kornblith, S., Norouzi, M., Hinton, G., 2020. A simple framework for contrastive learning of visual representations. In: ICML.
Patrick, Asano, Kuznetsova, Fong, Henriques, Zweig, Vedaldi (b45) 2021
Ahsan, Madhok, Essa (b2) 2019
He, K., Fan, H., Wu, Y., Xie, S., Girshick, R., 2020. Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9729–9738.
Feichtenhofer, C., Fan, H., Xiong, B., Girshick, R., He, K., 2021. A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. CVPR, pp. 3299–3309.
Shao, Liu, Li (b48) 2021
Devon Hjelm, Bachman (b15) 2020
Suzuki, T., Itazuri, T., Hara, K., Kataoka, H., 2018. Learning Spatiotemporal 3D Convolution with Video Order Self-supervision. In: Proceedings of the European Conference on Computer Vision. ECCV.
Han, Xie, Zisserman (b23) 2020
Jenni, S., Meishvili, G., Favaro, P., 2020. Video Representation Learning by Recognizing Temporal Transformations. In: The European Conference on Computer Vision. ECCV.
Bai, Fan, Misra, Venkatesh, Lu, Zhou, Yu, Chandra, Yuille (b5) 2020
Gavrilyuk, K., Jain, M., Karmanov, I., Snoek, C.G., 2021. Motion-Augmented Self-Training for Video Recognition at Smaller Scale. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10429–10438.
Tian, Che, Bao, Zhai, Gao (b53) 2020
Chen, P., Huang, D., He, D., Long, X., Zeng, R., Wen, S., Tan, M., Gan, C., 2021. RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning. In: The AAAI Conference on Artificial Intelligence. AAAI.
Kim, D., Cho, D., Kweon, I.S., 2019. Self-supervised video representation learning with space-time cubic puzzles. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33. pp. 8545–8552.
Afouras, T., Owens, A., Chung, J.S., Zis
Han (10.1016/j.cviu.2022.103406_b24) 2020
Shao (10.1016/j.cviu.2022.103406_b48) 2021
10.1016/j.cviu.2022.103406_b18
10.1016/j.cviu.2022.103406_b17
Alwassel (10.1016/j.cviu.2022.103406_b3) 2020
10.1016/j.cviu.2022.103406_b19
Diba (10.1016/j.cviu.2022.103406_b16) 2020
Caron (10.1016/j.cviu.2022.103406_b8) 2020
Cho (10.1016/j.cviu.2022.103406_b12) 2021; 9
10.1016/j.cviu.2022.103406_b52
Xue (10.1016/j.cviu.2022.103406_b64) 2020; 88
10.1016/j.cviu.2022.103406_b51
Tokmakov (10.1016/j.cviu.2022.103406_b54) 2020
10.1016/j.cviu.2022.103406_b58
10.1016/j.cviu.2022.103406_b13
10.1016/j.cviu.2022.103406_b57
10.1016/j.cviu.2022.103406_b10
Yang (10.1016/j.cviu.2022.103406_b65) 2020
10.1016/j.cviu.2022.103406_b56
10.1016/j.cviu.2022.103406_b11
10.1016/j.cviu.2022.103406_b55
10.1016/j.cviu.2022.103406_b29
10.1016/j.cviu.2022.103406_b28
Lengyel (10.1016/j.cviu.2022.103406_b36) 2022
Patrick (10.1016/j.cviu.2022.103406_b45) 2021
Wang (10.1016/j.cviu.2022.103406_b59) 2021
Han (10.1016/j.cviu.2022.103406_b23) 2020
Tian (10.1016/j.cviu.2022.103406_b53) 2020
10.1016/j.cviu.2022.103406_b61
Oord (10.1016/j.cviu.2022.103406_b43) 2018
10.1016/j.cviu.2022.103406_b60
10.1016/j.cviu.2022.103406_b63
10.1016/j.cviu.2022.103406_b62
Sun (10.1016/j.cviu.2022.103406_b50) 2019
10.1016/j.cviu.2022.103406_b25
10.1016/j.cviu.2022.103406_b68
10.1016/j.cviu.2022.103406_b26
Knights (10.1016/j.cviu.2022.103406_b33) 2021
10.1016/j.cviu.2022.103406_b21
10.1016/j.cviu.2022.103406_b20
10.1016/j.cviu.2022.103406_b67
10.1016/j.cviu.2022.103406_b22
10.1016/j.cviu.2022.103406_b66
10.1016/j.cviu.2022.103406_b39
Jing (10.1016/j.cviu.2022.103406_b30) 2018
Soomro (10.1016/j.cviu.2022.103406_b49) 2012
Devon Hjelm (10.1016/j.cviu.2022.103406_b15) 2020
10.1016/j.cviu.2022.103406_b35
10.1016/j.cviu.2022.103406_b38
10.1016/j.cviu.2022.103406_b37
Ahsan (10.1016/j.cviu.2022.103406_b2) 2019
10.1016/j.cviu.2022.103406_b32
10.1016/j.cviu.2022.103406_b34
Bai (10.1016/j.cviu.2022.103406_b5) 2020
Huo (10.1016/j.cviu.2022.103406_b27) 2021
Misra (10.1016/j.cviu.2022.103406_b42) 2016
10.1016/j.cviu.2022.103406_b41
10.1016/j.cviu.2022.103406_b40
10.1016/j.cviu.2022.103406_b1
Dave (10.1016/j.cviu.2022.103406_b14) 2021
10.1016/j.cviu.2022.103406_b47
Kataoka (10.1016/j.cviu.2022.103406_b31) 2020
10.1016/j.cviu.2022.103406_b46
10.1016/j.cviu.2022.103406_b4
10.1016/j.cviu.2022.103406_b6
10.1016/j.cviu.2022.103406_b7
10.1016/j.cviu.2022.103406_b9
10.1016/j.cviu.2022.103406_b44
References_xml – reference: Qian, R., Li, Y., Liu, H., See, J., Ding, S., Liu, X., Li, D., Lin, W., 2021b. Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization. In: Proceedings of the International Conference on Computer Vision. ICCV.
– start-page: 9912
  year: 2020
  end-page: 9924
  ident: b8
  article-title: Unsupervised learning of visual features by contrasting cluster assignments
  publication-title: Advances in Neural Information Processing Systems, vol. 33
– year: 2020
  ident: b65
  article-title: Video representation learning with visual tempo consistency
– reference: Lorre, G., Rabarisoa, J., Orcesi, A., Ainouz, S., Canu, S., 2020. Temporal Contrastive Pretraining for Video Action Recognition. In: The IEEE Winter Conference on Applications of Computer Vision. pp. 662–670.
– year: 2021
  ident: b48
  article-title: Self-supervised temporal learning
– year: 2020
  ident: b31
  article-title: Would mega-scale datasets further enhance spatiotemporal 3D cnns?
– year: 2018
  ident: b43
  article-title: Representation learning with contrastive predictive coding
– reference: Yao, Y., Liu, C., Luo, D., Zhou, Y., Ye, Q., 2020a. Video Playback Rate Perception for Self-Supervised Spatio-Temporal Representation Learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6548–6557.
– reference: Jenni, S., Meishvili, G., Favaro, P., 2020. Video Representation Learning by Recognizing Temporal Transformations. In: The European Conference on Computer Vision. ECCV.
– reference: Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T., 2011. HMDB: a large video database for human motion recognition. In: Proceedings of the International Conference on Computer Vision. ICCV.
– start-page: 9758
  year: 2020
  end-page: 9770
  ident: b3
  article-title: Self-supervised learning by cross-modal audio-video clustering
  publication-title: Advances in Neural Information Processing Systems, vol. 33
– reference: Gavrilyuk, K., Jain, M., Karmanov, I., Snoek, C.G., 2021. Motion-Augmented Self-Training for Video Recognition at Smaller Scale. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10429–10438.
– reference: Behrmann, N., Gall, J., Noroozi, M., 2021. Unsupervised Video Representation Learning by Bidirectional Feature Prediction. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1670–1679.
– reference: Li, Y., Li, Y., Vasconcelos, N., 2018. Resound: Towards action recognition without representation bias. In: Proceedings of the European Conference on Computer Vision. ECCV, pp. 513–528.
– reference: Lee, H.-Y., Huang, J.-B., Singh, M., Yang, M.-H., 2017. Unsupervised representation learning by sorting sequences. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 667–676.
– reference: Suzuki, T., Itazuri, T., Hara, K., Kataoka, H., 2018. Learning Spatiotemporal 3D Convolution with Video Order Self-supervision. In: Proceedings of the European Conference on Computer Vision. ECCV.
– reference: Yao, T., Zhang, Y., Qiu, Z., Pan, Y., Mei, T., 2021. Seco: Exploring sequence supervision for unsupervised representation learning. In: AAAI. 2, p. 7.
– year: 2012
  ident: b49
  article-title: UCF101: A dataset of 101 human actions classes from videos in the wild
– year: 2018
  ident: b30
  article-title: Self-supervised spatiotemporal feature learning via video rotation prediction
– reference: Tao, L., Wang, X., Yamasaki, T., 2020. Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework. In: Proceedings of the 28th ACM International Conference on Multimedia. pp. 2193–2201.
– start-page: 71
  year: 2020
  end-page: 89
  ident: b53
  article-title: Self-supervised motion representation via scattering local motion cues
  publication-title: Computer Vision–ECCV 2020: 16th European Conference
– volume: 9
  start-page: 79562
  year: 2021
  end-page: 79571
  ident: b12
  article-title: Self-supervised visual learning by variable playback speeds prediction of a video
  publication-title: IEEE Access
– year: 2020
  ident: b15
  article-title: Representation learning with video deep InfoMax
– year: 2021
  ident: b59
  article-title: Self-supervised video representation learning by uncovering spatio-temporal statistics
  publication-title: IEEE Trans. Pattern Anal. Mach. Intell.
– year: 2021
  ident: b14
  article-title: “Knights”: first place submission for vipriors21 action recognition challenge at iccv 2021
  publication-title: arXiv preprint arXiv:2110.07758
– start-page: 8914
  year: 2021
  end-page: 8921
  ident: b33
  article-title: Temporally coherent embeddings for self-supervised video representation learning
  publication-title: 2020 25th International Conference on Pattern Recognition (ICPR)
– year: 2019
  ident: b50
  article-title: Learning video representations using contrastive bidirectional transformer
– start-page: 404
  year: 2020
  end-page: 421
  ident: b54
  article-title: Unsupervised learning of video representations via dense trajectory clustering
  publication-title: European Conference on Computer Vision
– reference: Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M., 2018. A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6450–6459.
– reference: Luo, D., Liu, C., Zhou, Y., Yang, D., Ma, C., Ye, Q., Wang, W., 2020. Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33. pp. 11701–11708.
– reference: Pan, T., Song, Y., Yang, T., Jiang, W., Liu, W., 2021. Videomoco: Contrastive video representation learning with temporally adversarial examples. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11205–11214.
– reference: Benaim, S., Ephrat, A., Lang, O., Mosseri, I., Freeman, W.T., Rubinstein, M., Irani, M., Dekel, T., 2020. SpeedNet: Learning the Speediness in Videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9922–9931.
– start-page: 312
  year: 2020
  end-page: 329
  ident: b23
  article-title: Memory-augmented dense predictive coding for video representation learning
  publication-title: Computer Vision–ECCV 2020: 16th European Conference
– reference: Chen, T., Kornblith, S., Norouzi, M., Hinton, G., 2020. A simple framework for contrastive learning of visual representations. In: ICML.
– year: 2021
  ident: b45
  article-title: Multi-modal self-supervision from generalized data transformations
– reference: Zhuang, C., She, T., Andonian, A., Mark, M.S., Yamins, D., 2020. Unsupervised learning from video with deep neural embeddings. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9563–9572.
– reference: Fernando, B., Bilen, H., Gavves, E., Gould, S., 2017. Self-supervised video representation learning with odd-one-out networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3636–3645.
– reference: Wei, D., Lim, J.J., Zisserman, A., Freeman, W.T., 2018. Learning and using the arrow of time. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8052–8060.
– reference: Han, T., Xie, W., Zisserman, A., 2019. Video Representation Learning by Dense Predictive Coding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops.
– reference: Miech, A., Alayrac, J.-B., Smaira, L., Laptev, I., Sivic, J., Zisserman, A., 2020a. End-to-end learning of visual representations from uncurated instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9879–9889.
– reference: Jenni, S., Jin, H., 2021. Time-equivariant contrastive video representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9970–9980.
– reference: Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K., 2018. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In: Proceedings of the European Conference on Computer Vision. ECCV, pp. 305–321.
– reference: Chen, P., Huang, D., He, D., Long, X., Zeng, R., Wen, S., Tan, M., Gan, C., 2021. RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning. In: The AAAI Conference on Artificial Intelligence. AAAI.
– reference: Bachman, P., Hjelm, R.D., Buchwalter, W., 2019. Learning representations by maximizing mutual information across views. In: Advances in Neural Information Processing Systems. pp. 15535–15545.
– reference: Feichtenhofer, C., Fan, H., Malik, J., He, K., 2019. Slowfast networks for video recognition. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 6202–6211.
– reference: He, K., Fan, H., Wu, Y., Xie, S., Girshick, R., 2020. Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9729–9738.
– reference: Kim, D., Cho, D., Kweon, I.S., 2019. Self-supervised video representation learning with space-time cubic puzzles. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33. pp. 8545–8552.
– volume: 88
  year: 2020
  ident: b64
  article-title: Self-supervised video representation learning by maximizing mutual information
  publication-title: Signal Process., Image Commun.
  doi: 10.1016/j.image.2020.115967
– year: 2022
  ident: b36
  article-title: Vipriors 2: visual inductive priors for data-efficient deep learning challenges
  publication-title: arXiv preprint arXiv:2201.08625
– start-page: 179
  year: 2019
  end-page: 189
  ident: b2
  article-title: Video jigsaw: Unsupervised learning of spatiotemporal context for video action recognition
  publication-title: 2019 IEEE Winter Conference on Applications of Computer Vision
– year: 2020
  ident: b5
  article-title: Can temporal information help with contrastive self-supervised learning?
– reference: Wang, J., Jiao, J., Liu, Y.-H., 2020. Self-supervised Video Representation Learning by Pace Prediction. In: The European Conference on Computer Vision. ECCV.
– start-page: 593
  year: 2020
  end-page: 610
  ident: b16
  article-title: Large scale holistic video understanding
  publication-title: European Conference on Computer Vision
– reference: Carreira, J., Zisserman, A., 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6299–6308.
– reference: Choi, J., Gao, C., Messou, J.C., Huang, J.-B., 2019. Why Can’t I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition. In: Advances in Neural Information Processing Systems. pp. 853–865.
– start-page: 527
  year: 2016
  end-page: 544
  ident: b42
  article-title: Shuffle and learn: unsupervised learning using temporal order verification
  publication-title: European Conference on Computer Vision
– reference: Wang, J., Gao, Y., Li, K., Jiang, X., Guo, X., Ji, R., Sun, X., 2021. Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion. In: The AAAI Conference on Artificial Intelligence. AAAI.
– reference: Wang, J., Jiao, J., Bao, L., He, S., Liu, Y., Liu, W., 2019. Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4006–4015.
– reference: Hara, K., Kataoka, H., Satoh, Y., 2018. Towards Good Practice for Action Recognition with Spatiotemporal 3D Convolutions. In: 2018 24th International Conference on Pattern Recognition. ICPR, pp. 2516–2521.
– reference: Feichtenhofer, C., Fan, H., Xiong, B., Girshick, R., He, K., 2021. A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. CVPR, pp. 3299–3309.
– reference: Gutmann, M., Hyvärinen, A., 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. pp. 297–304.
– reference: Qian, R., Meng, T., Gong, B., Yang, M.-H., Wang, H., Belongie, S., Cui, Y., 2021a. Spatiotemporal contrastive video representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6964–6974.
– reference: Xu, D., Xiao, J., Zhao, Z., Shao, J., Xie, D., Zhuang, Y., 2019. Self-supervised spatiotemporal learning via video clip order prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 10334–10343.
– start-page: 5679
  year: 2020
  end-page: 5690
  ident: b24
  article-title: Self-supervised co-training for video representation learning
  publication-title: Advances in Neural Information Processing Systems, vol. 33
– reference: Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M., 2015. Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4489–4497.
– year: 2021
  ident: b27
  article-title: Self-supervised video representation learning with constrained spatiotemporal jigsaw
– reference: Afouras, T., Owens, A., Chung, J.S., Zisserman, A., 2020. Self-supervised learning of audio-visual objects from video. In: The European Conference on Computer Vision. ECCV.
– ident: 10.1016/j.cviu.2022.103406_b1
  doi: 10.1007/978-3-030-58523-5_13
– ident: 10.1016/j.cviu.2022.103406_b58
  doi: 10.1109/CVPR.2019.00413
– ident: 10.1016/j.cviu.2022.103406_b61
  doi: 10.1109/CVPR.2018.00840
– ident: 10.1016/j.cviu.2022.103406_b67
  doi: 10.1609/aaai.v35i12.17274
– ident: 10.1016/j.cviu.2022.103406_b68
  doi: 10.1109/CVPR42600.2020.00958
– start-page: 9758
  year: 2020
  ident: 10.1016/j.cviu.2022.103406_b3
  article-title: Self-supervised learning by cross-modal audio-video clustering
– ident: 10.1016/j.cviu.2022.103406_b4
– ident: 10.1016/j.cviu.2022.103406_b29
  doi: 10.1007/978-3-030-58604-1_26
– start-page: 404
  year: 2020
  ident: 10.1016/j.cviu.2022.103406_b54
  article-title: Unsupervised learning of video representations via dense trajectory clustering
– year: 2018
  ident: 10.1016/j.cviu.2022.103406_b30
– ident: 10.1016/j.cviu.2022.103406_b60
  doi: 10.1007/978-3-030-58520-4_30
– ident: 10.1016/j.cviu.2022.103406_b6
  doi: 10.1109/WACV48630.2021.00171
– ident: 10.1016/j.cviu.2022.103406_b28
  doi: 10.1109/ICCV48922.2021.00982
– ident: 10.1016/j.cviu.2022.103406_b17
  doi: 10.1109/ICCV.2019.00630
– year: 2020
  ident: 10.1016/j.cviu.2022.103406_b65
– ident: 10.1016/j.cviu.2022.103406_b19
  doi: 10.1109/CVPR.2017.607
– start-page: 8914
  year: 2021
  ident: 10.1016/j.cviu.2022.103406_b33
  article-title: Temporally coherent embeddings for self-supervised video representation learning
– ident: 10.1016/j.cviu.2022.103406_b40
  doi: 10.1109/CVPR42600.2020.00990
– ident: 10.1016/j.cviu.2022.103406_b46
  doi: 10.1109/ICCV48922.2021.00789
– ident: 10.1016/j.cviu.2022.103406_b35
  doi: 10.1109/ICCV.2017.79
– year: 2012
  ident: 10.1016/j.cviu.2022.103406_b49
– ident: 10.1016/j.cviu.2022.103406_b13
– ident: 10.1016/j.cviu.2022.103406_b62
  doi: 10.1007/978-3-030-01267-0_19
– ident: 10.1016/j.cviu.2022.103406_b52
  doi: 10.1145/3394171.3413694
– ident: 10.1016/j.cviu.2022.103406_b55
  doi: 10.1109/ICCV.2015.510
– ident: 10.1016/j.cviu.2022.103406_b57
  doi: 10.1609/aaai.v35i11.17215
– ident: 10.1016/j.cviu.2022.103406_b9
  doi: 10.1109/CVPR.2017.502
– ident: 10.1016/j.cviu.2022.103406_b10
  doi: 10.1609/aaai.v35i2.16189
– ident: 10.1016/j.cviu.2022.103406_b7
  doi: 10.1109/CVPR42600.2020.00994
– start-page: 71
  year: 2020
  ident: 10.1016/j.cviu.2022.103406_b53
  article-title: Self-supervised motion representation via scattering local motion cues
– year: 2018
  ident: 10.1016/j.cviu.2022.103406_b43
– ident: 10.1016/j.cviu.2022.103406_b38
  doi: 10.1109/WACV45572.2020.9093278
– ident: 10.1016/j.cviu.2022.103406_b25
  doi: 10.1109/ICPR.2018.8546325
– volume: 9
  start-page: 79562
  year: 2021
  ident: 10.1016/j.cviu.2022.103406_b12
  article-title: Self-supervised visual learning by variable playback speeds prediction of a video
  publication-title: IEEE Access
  doi: 10.1109/ACCESS.2021.3084840
– year: 2020
  ident: 10.1016/j.cviu.2022.103406_b15
– ident: 10.1016/j.cviu.2022.103406_b47
  doi: 10.1109/CVPR46437.2021.00689
– ident: 10.1016/j.cviu.2022.103406_b66
  doi: 10.1109/CVPR42600.2020.00658
– ident: 10.1016/j.cviu.2022.103406_b20
  doi: 10.1109/ICCV48922.2021.01026
– ident: 10.1016/j.cviu.2022.103406_b26
  doi: 10.1109/CVPR42600.2020.00975
– year: 2021
  ident: 10.1016/j.cviu.2022.103406_b14
  article-title: “Knights”: first place submission for vipriors21 action recognition challenge at iccv 2021
  publication-title: arXiv preprint arXiv:2110.07758
– ident: 10.1016/j.cviu.2022.103406_b34
  doi: 10.1109/ICCV.2011.6126543
– ident: 10.1016/j.cviu.2022.103406_b44
  doi: 10.1109/CVPR46437.2021.01105
– ident: 10.1016/j.cviu.2022.103406_b18
  doi: 10.1109/CVPR46437.2021.00331
– start-page: 9912
  year: 2020
  ident: 10.1016/j.cviu.2022.103406_b8
  article-title: Unsupervised learning of visual features by contrasting cluster assignments
– ident: 10.1016/j.cviu.2022.103406_b22
  doi: 10.1109/ICCVW.2019.00186
– year: 2019
  ident: 10.1016/j.cviu.2022.103406_b50
– ident: 10.1016/j.cviu.2022.103406_b39
  doi: 10.1609/aaai.v34i07.6840
– ident: 10.1016/j.cviu.2022.103406_b41
  doi: 10.1109/CVPR42600.2020.00990
– year: 2021
  ident: 10.1016/j.cviu.2022.103406_b48
– year: 2020
  ident: 10.1016/j.cviu.2022.103406_b31
– ident: 10.1016/j.cviu.2022.103406_b56
  doi: 10.1109/CVPR.2018.00675
– ident: 10.1016/j.cviu.2022.103406_b11
– ident: 10.1016/j.cviu.2022.103406_b37
  doi: 10.1007/978-3-030-01231-1_32
– year: 2021
  ident: 10.1016/j.cviu.2022.103406_b59
  article-title: Self-supervised video representation learning by uncovering spatio-temporal statistics
  publication-title: IEEE Trans. Pattern Anal. Mach. Intell.
– ident: 10.1016/j.cviu.2022.103406_b32
  doi: 10.1609/aaai.v33i01.33018545
– ident: 10.1016/j.cviu.2022.103406_b51
  doi: 10.1007/978-3-030-11012-3_45
– ident: 10.1016/j.cviu.2022.103406_b21
– ident: 10.1016/j.cviu.2022.103406_b63
  doi: 10.1109/CVPR.2019.01058
StartPage 103406
SubjectTerms Action Recognition
Self-Supervised Learning
Video Representation
Title TCLR: Temporal contrastive learning for video representation
URI https://dx.doi.org/10.1016/j.cviu.2022.103406
Volume 219