TCLR: Temporal contrastive learning for video representation
| Published in: | Computer vision and image understanding, Volume 219, Article 103406 |
|---|---|
| Main authors: | Dave, Ishan; Gupta, Rohit; Rizve, Mamshad Nayeem; Shah, Mubarak |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier Inc, 01.06.2022 |
| ISSN: | 1077-3142, 1090-235X |
| Online access: | Get full text |
| Abstract | Contrastive learning has nearly closed the gap between supervised and self-supervised learning of image representations, and has also been explored for videos. However, prior work on contrastive learning for video data has not explored the effect of explicitly encouraging the features to be distinct across the temporal dimension. We develop a new temporal contrastive learning framework consisting of two novel losses to improve upon existing contrastive self-supervised video representation learning methods. The local–local temporal contrastive loss adds the task of discriminating between non-overlapping clips from the same video, whereas the global–local temporal contrastive loss aims to discriminate between timesteps of the feature map of an input clip in order to increase the temporal diversity of the learned features. Our proposed temporal contrastive learning framework achieves significant improvement over the state-of-the-art results in various downstream video understanding tasks such as action recognition, limited-label action classification, and nearest-neighbor video retrieval on multiple video datasets and backbones. We also demonstrate significant improvement in fine-grained action classification for visually similar classes. With the commonly used 3D ResNet-18 architecture with UCF101 pretraining, we achieve 82.4% (+5.1% increase over the previous best) top-1 accuracy on UCF101 and 52.9% (+5.4% increase) on HMDB51 action classification, and 56.2% (+11.7% increase) Top-1 Recall on UCF101 nearest neighbor video retrieval. Code released at https://github.com/DAVEISHAN/TCLR.
• TCLR is a contrastive learning framework for video understanding tasks.
• Explicitly enforces within-instance temporal feature variation without pretext tasks.
• Proposes novel local–local and global–local temporal contrastive losses.
• Significantly outperforms state-of-the-art pre-training on video understanding tasks.
• Uses fine-grained action classification task for evaluating learned representations. |
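The two losses summarized in the abstract are InfoNCE-style contrastive objectives: a softmax cross-entropy over a similarity matrix whose diagonal holds the positive pairs. The following is a minimal, hypothetical sketch (not the authors' released code; the function name, shapes, and temperature value are assumptions) of such an objective:

```python
import numpy as np

def temporal_info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: row i of `positives` is the positive match
    for row i of `anchors`; every other row acts as a negative.

    anchors, positives: (N, D) feature arrays.
    """
    # L2-normalize so the dot product is cosine similarity
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    # Row-wise log-softmax (max-shifted for numerical stability);
    # the correct class for row i is column i.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

Under this sketch, the local–local loss would pair embeddings of non-overlapping clips of the same video under different augmentations (clips from other timesteps and videos as negatives), while the global–local loss would pair timesteps of a clip's global feature map with the corresponding local clip features.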
|---|---|
| ArticleNumber | 103406 |
| Author | Dave, Ishan; Gupta, Rohit; Rizve, Mamshad Nayeem; Shah, Mubarak |
| Author details | Dave, Ishan (ishandave@knights.ucf.edu, ORCID 0000-0001-9920-6970); Gupta, Rohit; Rizve, Mamshad Nayeem (ORCID 0000-0001-5378-1697); Shah, Mubarak (ORCID 0000-0001-6172-5572) |
| ContentType | Journal Article |
| Copyright | 2022 The Author(s) |
| DOI | 10.1016/j.cviu.2022.103406 |
| Discipline | Applied Sciences; Engineering; Computer Science |
| EISSN | 1090-235X |
| ISICitedReferencesCount | 99 |
| ISSN | 1077-3142 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Keywords | Action Recognition; Self-Supervised Learning; Video Representation; MSC: 68T07, 68T30, 68T45 |
| Language | English |
| License | This is an open access article under the CC BY-NC-ND license. |
| ORCID | 0000-0001-5378-1697 0000-0001-9920-6970 0000-0001-6172-5572 |
| OpenAccessLink | https://dx.doi.org/10.1016/j.cviu.2022.103406 |
| PublicationDate | June 2022 |
| PublicationDateYYYYMMDD | 2022-06-01 |
| PublicationTitle | Computer vision and image understanding |
| PublicationYear | 2022 |
| Publisher | Elsevier Inc |
In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4489–4497. – year: 2021 ident: b27 article-title: Self-supervised video representation learning with constrained spatiotemporal jigsaw – reference: Afouras, T., Owens, A., Chung, J.S., Zisserman, A., 2020. Self-supervised learning of audio-visual objects from video. In: The European Conference on Computer Vision. ECCV. – ident: 10.1016/j.cviu.2022.103406_b1 doi: 10.1007/978-3-030-58523-5_13 – ident: 10.1016/j.cviu.2022.103406_b58 doi: 10.1109/CVPR.2019.00413 – ident: 10.1016/j.cviu.2022.103406_b61 doi: 10.1109/CVPR.2018.00840 – ident: 10.1016/j.cviu.2022.103406_b67 doi: 10.1609/aaai.v35i12.17274 – ident: 10.1016/j.cviu.2022.103406_b68 doi: 10.1109/CVPR42600.2020.00958 – start-page: 9758 year: 2020 ident: 10.1016/j.cviu.2022.103406_b3 article-title: Self-supervised learning by cross-modal audio-video clustering – ident: 10.1016/j.cviu.2022.103406_b4 – ident: 10.1016/j.cviu.2022.103406_b29 doi: 10.1007/978-3-030-58604-1_26 – year: 2022 ident: 10.1016/j.cviu.2022.103406_b36 article-title: Vipriors 2: visual inductive priors for data-efficient deep learning challenges publication-title: arXiv preprint arXiv:2201.08625 – start-page: 404 year: 2020 ident: 10.1016/j.cviu.2022.103406_b54 article-title: Unsupervised learning of video representations via dense trajectory clustering – year: 2018 ident: 10.1016/j.cviu.2022.103406_b30 – ident: 10.1016/j.cviu.2022.103406_b60 doi: 10.1007/978-3-030-58520-4_30 – ident: 10.1016/j.cviu.2022.103406_b6 doi: 10.1109/WACV48630.2021.00171 – ident: 10.1016/j.cviu.2022.103406_b28 doi: 10.1109/ICCV48922.2021.00982 – ident: 10.1016/j.cviu.2022.103406_b17 doi: 10.1109/ICCV.2019.00630 – year: 2021 ident: 10.1016/j.cviu.2022.103406_b27 – year: 2020 ident: 10.1016/j.cviu.2022.103406_b65 – ident: 10.1016/j.cviu.2022.103406_b19 doi: 10.1109/CVPR.2017.607 – start-page: 8914 year: 2021 ident: 10.1016/j.cviu.2022.103406_b33 article-title: Temporally coherent embeddings for 
self-supervised video representation learning – start-page: 312 year: 2020 ident: 10.1016/j.cviu.2022.103406_b23 article-title: Memory-augmented dense predictive coding for video representation learning – ident: 10.1016/j.cviu.2022.103406_b40 doi: 10.1109/CVPR42600.2020.00990 – ident: 10.1016/j.cviu.2022.103406_b46 doi: 10.1109/ICCV48922.2021.00789 – ident: 10.1016/j.cviu.2022.103406_b35 doi: 10.1109/ICCV.2017.79 – year: 2012 ident: 10.1016/j.cviu.2022.103406_b49 – year: 2020 ident: 10.1016/j.cviu.2022.103406_b5 – ident: 10.1016/j.cviu.2022.103406_b13 – ident: 10.1016/j.cviu.2022.103406_b62 doi: 10.1007/978-3-030-01267-0_19 – ident: 10.1016/j.cviu.2022.103406_b52 doi: 10.1145/3394171.3413694 – ident: 10.1016/j.cviu.2022.103406_b55 doi: 10.1109/ICCV.2015.510 – ident: 10.1016/j.cviu.2022.103406_b57 doi: 10.1609/aaai.v35i11.17215 – ident: 10.1016/j.cviu.2022.103406_b9 doi: 10.1109/CVPR.2017.502 – ident: 10.1016/j.cviu.2022.103406_b10 doi: 10.1609/aaai.v35i2.16189 – start-page: 5679 year: 2020 ident: 10.1016/j.cviu.2022.103406_b24 article-title: Self-supervised co-training for video representation learning – ident: 10.1016/j.cviu.2022.103406_b7 doi: 10.1109/CVPR42600.2020.00994 – start-page: 71 year: 2020 ident: 10.1016/j.cviu.2022.103406_b53 article-title: Self-supervised motion representation via scattering local motion cues – year: 2018 ident: 10.1016/j.cviu.2022.103406_b43 – ident: 10.1016/j.cviu.2022.103406_b38 doi: 10.1109/WACV45572.2020.9093278 – ident: 10.1016/j.cviu.2022.103406_b25 doi: 10.1109/ICPR.2018.8546325 – volume: 9 start-page: 79562 year: 2021 ident: 10.1016/j.cviu.2022.103406_b12 article-title: Self-supervised visual learning by variable playback speeds prediction of a video publication-title: IEEE Access doi: 10.1109/ACCESS.2021.3084840 – year: 2020 ident: 10.1016/j.cviu.2022.103406_b15 – ident: 10.1016/j.cviu.2022.103406_b47 doi: 10.1109/CVPR46437.2021.00689 – ident: 10.1016/j.cviu.2022.103406_b66 doi: 10.1109/CVPR42600.2020.00658 – start-page: 593 
year: 2020 ident: 10.1016/j.cviu.2022.103406_b16 article-title: Large scale holistic video understanding – ident: 10.1016/j.cviu.2022.103406_b20 doi: 10.1109/ICCV48922.2021.01026 – ident: 10.1016/j.cviu.2022.103406_b26 doi: 10.1109/CVPR42600.2020.00975 – year: 2021 ident: 10.1016/j.cviu.2022.103406_b14 article-title: “Knights”: first place submission for vipriors21 action recognition challenge at iccv 2021 publication-title: arXiv preprint arXiv:2110.07758 – ident: 10.1016/j.cviu.2022.103406_b34 doi: 10.1109/ICCV.2011.6126543 – ident: 10.1016/j.cviu.2022.103406_b44 doi: 10.1109/CVPR46437.2021.01105 – start-page: 179 year: 2019 ident: 10.1016/j.cviu.2022.103406_b2 article-title: Video jigsaw: Unsupervised learning of spatiotemporal context for video action recognition – ident: 10.1016/j.cviu.2022.103406_b18 doi: 10.1109/CVPR46437.2021.00331 – start-page: 9912 year: 2020 ident: 10.1016/j.cviu.2022.103406_b8 article-title: Unsupervised learning of visual features by contrasting cluster assignments – ident: 10.1016/j.cviu.2022.103406_b22 doi: 10.1109/ICCVW.2019.00186 – year: 2019 ident: 10.1016/j.cviu.2022.103406_b50 – ident: 10.1016/j.cviu.2022.103406_b39 doi: 10.1609/aaai.v34i07.6840 – ident: 10.1016/j.cviu.2022.103406_b41 doi: 10.1109/CVPR42600.2020.00990 – year: 2021 ident: 10.1016/j.cviu.2022.103406_b48 – year: 2020 ident: 10.1016/j.cviu.2022.103406_b31 – ident: 10.1016/j.cviu.2022.103406_b56 doi: 10.1109/CVPR.2018.00675 – ident: 10.1016/j.cviu.2022.103406_b11 – ident: 10.1016/j.cviu.2022.103406_b37 doi: 10.1007/978-3-030-01231-1_32 – year: 2021 ident: 10.1016/j.cviu.2022.103406_b45 – year: 2021 ident: 10.1016/j.cviu.2022.103406_b59 article-title: Self-supervised video representation learning by uncovering spatio-temporal statistics publication-title: IEEE Trans. Pattern Anal. Mach. Intell. 
– ident: 10.1016/j.cviu.2022.103406_b32 doi: 10.1609/aaai.v33i01.33018545 – volume: 88 year: 2020 ident: 10.1016/j.cviu.2022.103406_b64 article-title: Self-supervised video representation learning by maximizing mutual information publication-title: Signal Process., Image Commun. doi: 10.1016/j.image.2020.115967 – ident: 10.1016/j.cviu.2022.103406_b51 doi: 10.1007/978-3-030-11012-3_45 – ident: 10.1016/j.cviu.2022.103406_b21 – start-page: 527 year: 2016 ident: 10.1016/j.cviu.2022.103406_b42 article-title: Shuffle and learn: unsupervised learning using temporal order verification – ident: 10.1016/j.cviu.2022.103406_b63 doi: 10.1109/CVPR.2019.01058 |
| StartPage | 103406 |
| SubjectTerms | Action Recognition; Self-Supervised Learning; Video Representation |
| Title | TCLR: Temporal contrastive learning for video representation |
| URI | https://dx.doi.org/10.1016/j.cviu.2022.103406 |
| Volume | 219 |
| WOSCitedRecordID | wos000793292400002 |
| openUrl | ctx_ver=Z39.88-2004; ctx_enc=info:ofi/enc:UTF-8; rfr_id=info:sid/summon.serialssolutions.com; rft_val_fmt=info:ofi/fmt:kev:mtx:journal; rft.genre=article; rft.atitle=TCLR: Temporal contrastive learning for video representation; rft.jtitle=Computer vision and image understanding; rft.au=Dave, Ishan; Gupta, Rohit; Rizve, Mamshad Nayeem; Shah, Mubarak; rft.date=2022-06-01; rft.pub=Elsevier Inc; rft.issn=1077-3142; rft.eissn=1090-235X; rft.volume=219; rft_id=info:doi/10.1016/j.cviu.2022.103406; rft.externalDocID=S1077314222000376 |
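For anyone consuming this record programmatically: the openUrl field is a Z39.88-2004 OpenURL ContextObject in key/encoded-value (KEV) form, i.e. a percent-encoded query string with repeatable keys such as `rft.au`. A minimal sketch of decoding it with Python's standard `urllib.parse`; the query string below is an excerpt of the record's actual value, not the full field:

```python
from urllib.parse import parse_qs

# Excerpt of this record's OpenURL (Z39.88-2004 KEV) query string.
openurl = (
    "ctx_ver=Z39.88-2004"
    "&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal"
    "&rft.genre=article"
    "&rft.atitle=TCLR%3A+Temporal+contrastive+learning+for+video+representation"
    "&rft.jtitle=Computer+vision+and+image+understanding"
    "&rft.au=Dave%2C+Ishan&rft.au=Gupta%2C+Rohit"
    "&rft.date=2022-06-01&rft.issn=1077-3142&rft.volume=219"
)

# parse_qs percent-decodes values, treats '+' as space, and collects
# repeated keys (like the multiple rft.au author entries) into lists.
fields = parse_qs(openurl)

print(fields["rft.atitle"][0])
# → TCLR: Temporal contrastive learning for video representation
print(fields["rft.au"])
# → ['Dave, Ishan', 'Gupta, Rohit']
```

Because every value in the result of `parse_qs` is a list, single-valued fields like `rft.atitle` are read with `[0]`, while genuinely repeatable fields like `rft.au` can be used as-is.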