HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition
| Published in: | Information Fusion, Vol. 108, p. 102382 |
|---|---|
| Main Authors: | Sun, Licai; Lian, Zheng; Liu, Bin; Tao, Jianhua |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier B.V., 01.08.2024 |
| ISSN: | 1566-2535, 1872-6305 |
| Abstract | Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in recent years for its critical role in creating emotion-aware intelligent machines. Previous efforts in this area are dominated by the supervised learning paradigm. Despite significant progress, supervised learning is meeting its bottleneck due to the longstanding data scarcity issue in AVER. Motivated by recent advances in self-supervised learning, we propose Hierarchical Contrastive Masked Autoencoder (HiCMAE), a novel self-supervised framework that leverages large-scale self-supervised pre-training on vast unlabeled audio-visual data to promote the advancement of AVER. Following prior arts in self-supervised audio-visual representation learning, HiCMAE adopts two primary forms of self-supervision for pre-training, namely masked data modeling and contrastive learning. Unlike them which focus exclusively on top-layer representations while neglecting explicit guidance of intermediate layers, HiCMAE develops a three-pronged strategy to foster hierarchical audio-visual feature learning and improve the overall quality of learned representations. Firstly, it incorporates hierarchical skip connections between the encoder and decoder to encourage intermediate layers to learn more meaningful representations and bolster masked audio-visual reconstruction. Secondly, hierarchical cross-modal contrastive learning is also exerted on intermediate representations to narrow the audio-visual modality gap progressively and facilitate subsequent cross-modal fusion. Finally, during downstream fine-tuning, HiCMAE employs hierarchical feature fusion to comprehensively integrate multi-level features from different layers. To verify the effectiveness of HiCMAE, we conduct extensive experiments on 9 datasets covering both categorical and dimensional AVER tasks. Experimental results show that our method significantly outperforms state-of-the-art supervised and self-supervised audio-visual methods, which indicates that HiCMAE is a powerful audio-visual emotion representation learner. Codes and models are publicly available at https://github.com/sunlicai/HiCMAE. |
|---|---|
| Highlights | • A novel self-supervised framework is proposed for audio-visual emotion recognition. • A three-pronged strategy is introduced to foster hierarchical feature learning. • The proposed method achieves state-of-the-art performance on 9 datasets. |
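The abstract describes three architectural components: hierarchical encoder–decoder skip connections, cross-modal contrastive learning applied to intermediate layers, and hierarchical feature fusion during fine-tuning. The authors' actual implementation is in the linked GitHub repository; the block below is only a minimal, hypothetical PyTorch-style sketch of those three ideas, with all class, function, and parameter names invented for illustration rather than taken from the HiCMAE codebase.

```python
# Hypothetical sketch (not the authors' code) of the three ideas named in the
# abstract: hierarchical encoder-decoder skips, layer-wise cross-modal
# contrastive losses, and multi-level feature fusion for fine-tuning.
import torch
import torch.nn.functional as F
from torch import nn


def info_nce(a, v, temperature=0.07):
    """Symmetric InfoNCE between audio and visual clip embeddings of shape (B, D)."""
    a = F.normalize(a, dim=-1)
    v = F.normalize(v, dim=-1)
    logits = a @ v.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


class ToyHierarchicalAVModel(nn.Module):
    """Per-modality Transformer encoders whose intermediate layers feed both a
    layer-wise contrastive objective and (via skip projections) a shared decoder."""

    def __init__(self, dim=256, depth=4, heads=4):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.audio_layers = nn.ModuleList(make_layer() for _ in range(depth))
        self.video_layers = nn.ModuleList(make_layer() for _ in range(depth))
        # Skip projections route intermediate features toward a reconstruction decoder.
        self.skips = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))
        # Learnable weights for hierarchical feature fusion at fine-tuning time.
        self.fusion_weights = nn.Parameter(torch.zeros(depth))

    def forward(self, audio_tokens, video_tokens):
        a, v = audio_tokens, video_tokens
        audio_feats, video_feats, skip_feats = [], [], []
        for la, lv, skip in zip(self.audio_layers, self.video_layers, self.skips):
            a, v = la(a), lv(v)
            audio_feats.append(a)
            video_feats.append(v)
            skip_feats.append(skip(torch.cat([a, v], dim=1)))  # passed to the decoder
        return audio_feats, video_feats, skip_feats

    def hierarchical_contrastive_loss(self, audio_feats, video_feats):
        # Contrastive alignment at every intermediate layer, not only the top one,
        # using mean-pooled clip-level embeddings per layer.
        losses = [info_nce(a.mean(dim=1), v.mean(dim=1))
                  for a, v in zip(audio_feats, video_feats)]
        return torch.stack(losses).mean()

    def fused_representation(self, audio_feats, video_feats):
        # Downstream fine-tuning: weighted fusion of multi-level audio-visual features.
        w = torch.softmax(self.fusion_weights, dim=0)
        per_layer = [torch.cat([a.mean(dim=1), v.mean(dim=1)], dim=-1)
                     for a, v in zip(audio_feats, video_feats)]
        return sum(wi * f for wi, f in zip(w, per_layer))


if __name__ == "__main__":
    model = ToyHierarchicalAVModel()
    audio = torch.randn(2, 50, 256)   # (batch, audio patch tokens, dim)
    video = torch.randn(2, 196, 256)  # (batch, video patch tokens, dim)
    a_feats, v_feats, _ = model(audio, video)
    loss = model.hierarchical_contrastive_loss(a_feats, v_feats)
    emb = model.fused_representation(a_feats, v_feats)
    print(loss.item(), emb.shape)
```

In this sketch the masked-reconstruction branch is only indicated by the skip projections; a full pre-training setup would also mask input patches and add a decoder and reconstruction loss, as described in the abstract.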
| ArticleNumber | 102382 |
| Author | Sun, Licai; Lian, Zheng; Liu, Bin; Tao, Jianhua |
| Author_xml | – sequence: 1; givenname: Licai; orcidid: 0000-0002-7944-3458; surname: Sun; fullname: Sun, Licai; email: sunlicai2019@ia.ac.cn; organization: School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
– sequence: 2; givenname: Zheng; surname: Lian; fullname: Lian, Zheng; email: lianzheng2016@ia.ac.cn; organization: School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
– sequence: 3; givenname: Bin; surname: Liu; fullname: Liu, Bin; email: liubin@nlpr.ia.ac.cn; organization: School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
– sequence: 4; givenname: Jianhua; surname: Tao; fullname: Tao, Jianhua; email: jhtao@tsinghua.edu.cn; organization: Department of Automation, Tsinghua University, Beijing, China |
| CitedBy_id | crossref_primary_10_1016_j_inffus_2025_103576 crossref_primary_10_1049_itr2_70009 crossref_primary_10_1016_j_neucom_2025_130020 crossref_primary_10_1016_j_inffus_2025_103640 crossref_primary_10_1007_s00530_024_01551_1 crossref_primary_10_1109_TIM_2025_3578178 crossref_primary_10_1007_s10462_025_11228_4 crossref_primary_10_1109_TAFFC_2025_3555406 crossref_primary_10_3390_electronics14050978 crossref_primary_10_1007_s00371_025_03818_8 crossref_primary_10_1016_j_engappai_2025_110007 |
| Cites_doi | 10.1080/026999300402745 10.1145/3423327.3423672 10.1109/CVPR.2018.00745 10.1109/TAFFC.2021.3101563 10.1109/TASLP.2021.3122291 10.1109/CVPR.2018.00685 10.1145/3475957.3484456 10.1109/TPAMI.2007.1110 10.1109/TCSVT.2017.2719043 10.1109/ICCV51070.2023.01479 10.1371/journal.pone.0196391 10.1007/978-3-030-01231-1_39 10.1109/CVPR46437.2021.01229 10.1109/TASLP.2021.3049898 10.1109/TPAMI.2008.52 10.18653/v1/D16-1044 10.1016/j.patrec.2022.07.012 10.1109/CVPR52729.2023.00211 10.1109/ICCV51070.2023.00494 10.1109/CVPR46437.2021.00084 10.1145/3581783.3612836 10.1017/ATSIP.2014.11 10.1109/TIP.2021.3093397 10.1109/ICCV.2017.73 10.1109/ICCV.2017.74 10.21437/Interspeech.2018-1929 10.1145/3503161.3547865 10.1109/TASLP.2020.3030497 10.1145/3581783.3612365 10.1109/CVPR.2018.00675 10.1145/3581783.3612286 10.21437/Interspeech.2013-56 10.1109/CVPR.2016.90 10.34133/icomputing.0076 10.1016/j.patcog.2023.109368 10.1109/TAFFC.2022.3216993 10.1109/TAFFC.2014.2336244 10.1609/aaai.v37i11.26541 10.1109/JSTSP.2022.3188113 10.1145/3503161.3548190 10.1145/3581783.3613459 10.1109/CVPR52688.2022.01553 10.1007/s10579-008-9076-6 10.1109/JSTSP.2017.2764438 10.1109/TAFFC.2016.2515617 10.1109/CVPRW56347.2022.00261 10.1145/2993148.2997632 10.1016/j.ins.2022.03.062 10.1109/TAFFC.2015.2457417 10.18653/v1/D17-1115 10.1109/CVPR52688.2022.02025 10.1109/TAFFC.2023.3258900 10.1109/CVPR52729.2023.01722 10.1109/ICCV.2015.510 10.1145/3133944.3133949 10.1145/3394171.3413620 10.1145/3474085.3475292 |
| ContentType | Journal Article |
| Copyright | 2024 Elsevier B.V. |
| Copyright_xml | – notice: 2024 Elsevier B.V. |
| DBID | AAYXX CITATION |
| DOI | 10.1016/j.inffus.2024.102382 |
| DatabaseName | CrossRef |
| DatabaseTitle | CrossRef |
| DatabaseTitleList | |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Mathematics |
| EISSN | 1872-6305 |
| ExternalDocumentID | 10_1016_j_inffus_2024_102382 S156625352400160X |
| ISICitedReferencesCount | 19 |
| ISSN | 1566-2535 |
| IngestDate | Tue Nov 18 22:36:59 EST 2025 Sat Nov 29 06:25:34 EST 2025 Sat May 11 15:33:31 EDT 2024 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Keywords | Audio-Visual Emotion Recognition; Masked autoencoder; Self-supervised learning; Contrastive learning |
| Language | English |
| LinkModel | OpenURL |
| ORCID | 0000-0002-7944-3458 |
| ParticipantIDs | crossref_citationtrail_10_1016_j_inffus_2024_102382 crossref_primary_10_1016_j_inffus_2024_102382 elsevier_sciencedirect_doi_10_1016_j_inffus_2024_102382 |
| PublicationCentury | 2000 |
| PublicationDate | August 2024 |
| PublicationDateYYYYMMDD | 2024-08-01 |
| PublicationDate_xml | – month: 08 year: 2024 text: August 2024 |
| PublicationDecade | 2020 |
| PublicationTitle | Information Fusion |
| PublicationYear | 2024 |
| Publisher | Elsevier B.V |
| Publisher_xml | – name: Elsevier B.V |
| Snippet | Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in recent years for its critical role in creating emotion-aware intelligent machines.... |
| SourceID | crossref; elsevier |
| SourceType | Enrichment Source; Index Database; Publisher |
| StartPage | 102382 |
| SubjectTerms | Audio-Visual Emotion Recognition; Contrastive learning; Masked autoencoder; Self-supervised learning |
| Title | HiCMAE: Hierarchical Contrastive Masked Autoencoder for self-supervised Audio-Visual Emotion Recognition |
| URI | https://dx.doi.org/10.1016/j.inffus.2024.102382 |
| Volume | 108 |
| WOSCitedRecordID | wos001220967900001 |