HiCMAE: Hierarchical Contrastive Masked Autoencoder for self-supervised Audio-Visual Emotion Recognition

Bibliographic Details
Published in: Information Fusion, Vol. 108, Article 102382
Main Authors: Sun, Licai; Lian, Zheng; Liu, Bin; Tao, Jianhua
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.08.2024
ISSN: 1566-2535 (print); 1872-6305 (online)
Online Access: https://doi.org/10.1016/j.inffus.2024.102382
Abstract: Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in recent years for its critical role in creating emotion-aware intelligent machines. Previous efforts in this area are dominated by the supervised learning paradigm. Despite significant progress, supervised learning is meeting its bottleneck due to the longstanding data-scarcity issue in AVER. Motivated by recent advances in self-supervised learning, we propose Hierarchical Contrastive Masked Autoencoder (HiCMAE), a novel self-supervised framework that leverages large-scale pre-training on vast unlabeled audio-visual data to promote the advancement of AVER. Following prior art in self-supervised audio-visual representation learning, HiCMAE adopts two primary forms of self-supervision for pre-training, namely masked data modeling and contrastive learning. Unlike prior methods, which focus exclusively on top-layer representations while neglecting explicit guidance of intermediate layers, HiCMAE develops a three-pronged strategy to foster hierarchical audio-visual feature learning and improve the overall quality of the learned representations. First, it incorporates hierarchical skip connections between the encoder and decoder to encourage intermediate layers to learn more meaningful representations and to bolster masked audio-visual reconstruction. Second, hierarchical cross-modal contrastive learning is applied to intermediate representations to progressively narrow the audio-visual modality gap and facilitate subsequent cross-modal fusion. Finally, during downstream fine-tuning, HiCMAE employs hierarchical feature fusion to comprehensively integrate multi-level features from different layers. To verify the effectiveness of HiCMAE, we conduct extensive experiments on 9 datasets covering both categorical and dimensional AVER tasks. Experimental results show that our method significantly outperforms state-of-the-art supervised and self-supervised audio-visual methods, indicating that HiCMAE is a powerful audio-visual emotion representation learner. Code and models are publicly available at https://github.com/sunlicai/HiCMAE.

Highlights:
• A novel self-supervised framework is proposed for audio-visual emotion recognition.
• A three-pronged strategy is introduced to foster hierarchical feature learning.
• The proposed method achieves state-of-the-art performance on 9 datasets.
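To make the three-pronged strategy concrete, the sketch below illustrates the second component, hierarchical cross-modal contrastive learning: a symmetric InfoNCE loss is applied between pooled audio and visual features at several matched intermediate encoder layers, not only at the top. This is a minimal hypothetical illustration under assumed tensor shapes, not the authors' implementation (see the repository linked above for that); the mean-pooling, temperature value, and uniform layer weighting are all assumptions.

```python
# Hypothetical sketch of hierarchical cross-modal contrastive learning.
# Assumes each modality encoder exposes a list of intermediate features
# of shape (batch, tokens, dim), one entry per selected layer.
import torch
import torch.nn.functional as F

def info_nce(a, v, temperature=0.07):
    """Symmetric InfoNCE between audio and visual clip embeddings.

    a, v: (batch, dim) L2-normalized embeddings; matched indices in the
    batch are positives, all other pairings serve as negatives.
    """
    logits = a @ v.t() / temperature                 # (batch, batch)
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def hierarchical_contrastive_loss(audio_layers, visual_layers):
    """Apply the contrastive objective at every intermediate layer pair."""
    loss = 0.0
    for a_tokens, v_tokens in zip(audio_layers, visual_layers):
        a = F.normalize(a_tokens.mean(dim=1), dim=-1)  # pool tokens
        v = F.normalize(v_tokens.mean(dim=1), dim=-1)
        loss = loss + info_nce(a, v)
    return loss / len(audio_layers)
```

During pre-training, a term like this would be added to the masked reconstruction loss; because every selected layer contributes, the audio-visual modality gap is narrowed progressively with depth rather than only at the final layer.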
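In the same hedged spirit, the hierarchical feature fusion used during downstream fine-tuning can be sketched as a small head that pools multi-level features from both encoders, mixes them with learnable per-layer weights, and classifies the concatenated result. The module name, shapes, and softmax-based weighting below are illustrative assumptions, not the paper's exact fusion design.

```python
# Hypothetical sketch of hierarchical feature fusion for fine-tuning:
# per-layer features from both encoders are pooled, mixed with learnable
# layer weights, concatenated across modalities, and classified.
import torch
import torch.nn as nn

class HierarchicalFusionHead(nn.Module):
    def __init__(self, num_layers, dim, num_classes):
        super().__init__()
        # One learnable mixing weight per layer and per modality.
        self.audio_weights = nn.Parameter(torch.zeros(num_layers))
        self.visual_weights = nn.Parameter(torch.zeros(num_layers))
        self.classifier = nn.Linear(2 * dim, num_classes)

    @staticmethod
    def _mix(layer_feats, weights):
        # layer_feats: list of (batch, tokens, dim) -> (batch, dim)
        pooled = torch.stack([f.mean(dim=1) for f in layer_feats])
        w = torch.softmax(weights, dim=0).view(-1, 1, 1)
        return (w * pooled).sum(dim=0)

    def forward(self, audio_layers, visual_layers):
        a = self._mix(audio_layers, self.audio_weights)
        v = self._mix(visual_layers, self.visual_weights)
        return self.classifier(torch.cat([a, v], dim=-1))
```

As a usage sketch, with hypothetical 12-layer encoders producing 512-dimensional features, `HierarchicalFusionHead(num_layers=12, dim=512, num_classes=7)` maps the two lists of per-layer features to 7 emotion logits.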
Article Number: 102382
Authors and Affiliations:
– Licai Sun (sunlicai2019@ia.ac.cn; ORCID 0000-0002-7944-3458), School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
– Zheng Lian (lianzheng2016@ia.ac.cn), School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
– Bin Liu (liubin@nlpr.ia.ac.cn), School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
– Jianhua Tao (jhtao@tsinghua.edu.cn), Department of Automation, Tsinghua University, Beijing, China
Copyright: 2024 Elsevier B.V.
DOI: 10.1016/j.inffus.2024.102382
Keywords: Audio-Visual Emotion Recognition; Masked autoencoder; Self-supervised learning; Contrastive learning
– start-page: 10347
  year: 2021
  ident: 10.1016/j.inffus.2024.102382_b80
  article-title: Training data-efficient image transformers & distillation through attention
– volume: 138
  year: 2023
  ident: 10.1016/j.inffus.2024.102382_b85
  article-title: Expression snippet transformer for robust video-based facial expression recognition
  publication-title: Pattern Recognit.
  doi: 10.1016/j.patcog.2023.109368
– year: 2021
  ident: 10.1016/j.inffus.2024.102382_b96
– start-page: 1
  year: 2023
  ident: 10.1016/j.inffus.2024.102382_b105
  article-title: Learning cross-modal audiovisual representations with ladder networks for emotion recognition
– volume: 13
  start-page: 2156
  issue: 04
  year: 2022
  ident: 10.1016/j.inffus.2024.102382_b101
  article-title: Robust audiovisual emotion recognition: Aligning modalities, capturing temporal information, and handling missing features
  publication-title: IEEE Trans. Affect. Comput.
  doi: 10.1109/TAFFC.2022.3216993
– start-page: 131
  year: 2017
  ident: 10.1016/j.inffus.2024.102382_b41
  article-title: CNN architectures for large-scale audio classification
– volume: 5
  start-page: 377
  issue: 4
  year: 2014
  ident: 10.1016/j.inffus.2024.102382_b28
  article-title: Crema-d: Crowd-sourced emotional multimodal actors dataset
  publication-title: IEEE Trans. Affect. Comput.
  doi: 10.1109/TAFFC.2014.2336244
– start-page: 112
  year: 2018
  ident: 10.1016/j.inffus.2024.102382_b71
  article-title: Multimodal speech emotion recognition using audio and text
– start-page: 234
  year: 2015
  ident: 10.1016/j.inffus.2024.102382_b26
  article-title: U-net: Convolutional networks for biomedical image segmentation
– ident: 10.1016/j.inffus.2024.102382_b15
  doi: 10.1609/aaai.v37i11.26541
– volume: 16
  start-page: 1505
  issue: 6
  year: 2022
  ident: 10.1016/j.inffus.2024.102382_b73
  article-title: Wavlm: Large-scale self-supervised pre-training for full stack speech processing
  publication-title: IEEE J. Sel. Top. Sign. Proces.
  doi: 10.1109/JSTSP.2022.3188113
– ident: 10.1016/j.inffus.2024.102382_b5
  doi: 10.1145/3503161.3548190
– year: 2020
  ident: 10.1016/j.inffus.2024.102382_b95
– year: 2015
  ident: 10.1016/j.inffus.2024.102382_b113
  article-title: Deep face recognition
– start-page: 3362
  year: 2020
  ident: 10.1016/j.inffus.2024.102382_b78
  article-title: Attentive modality hopping mechanism for speech emotion recognition
– start-page: 6182
  year: 2022
  ident: 10.1016/j.inffus.2024.102382_b91
  article-title: Wenetspeech: A 10000+ hours multi-domain mandarin corpus for speech recognition
– start-page: 1
  year: 2023
  ident: 10.1016/j.inffus.2024.102382_b97
  article-title: Exploring wav2vec 2.0 fine tuning for improved speech emotion recognition
– ident: 10.1016/j.inffus.2024.102382_b99
  doi: 10.1145/3581783.3613459
– ident: 10.1016/j.inffus.2024.102382_b106
– ident: 10.1016/j.inffus.2024.102382_b17
  doi: 10.1109/CVPR52688.2022.01553
– year: 2022
  ident: 10.1016/j.inffus.2024.102382_b14
  article-title: The biases of pre-trained language models: An empirical study on prompt-based sentiment analysis and emotion detection
  publication-title: IEEE Trans. Affect. Comput.
– year: 2020
  ident: 10.1016/j.inffus.2024.102382_b75
– year: 1988
  ident: 10.1016/j.inffus.2024.102382_b2
– volume: 42
  start-page: 335
  year: 2008
  ident: 10.1016/j.inffus.2024.102382_b70
  article-title: IEMOCAP: Interactive emotional dyadic motion capture database
  publication-title: Lang. Resour. Eval.
  doi: 10.1007/s10579-008-9076-6
– volume: vol. 2019
  start-page: 6558
  year: 2019
  ident: 10.1016/j.inffus.2024.102382_b51
  article-title: Multimodal transformer for unaligned multimodal language sequences
– volume: 11
  start-page: 1301
  issue: 8
  year: 2017
  ident: 10.1016/j.inffus.2024.102382_b8
  article-title: End-to-end multimodal emotion recognition using deep neural networks
  publication-title: IEEE J. Sel. Top. Signal Process.
  doi: 10.1109/JSTSP.2017.2764438
– volume: 8
  start-page: 67
  issue: 1
  year: 2016
  ident: 10.1016/j.inffus.2024.102382_b68
  article-title: MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception
  publication-title: IEEE Trans. Affect. Comput.
  doi: 10.1109/TAFFC.2016.2515617
– year: 2022
  ident: 10.1016/j.inffus.2024.102382_b86
– ident: 10.1016/j.inffus.2024.102382_b112
– start-page: 2822
  year: 2022
  ident: 10.1016/j.inffus.2024.102382_b110
  article-title: Self-attention fusion for audiovisual emotion recognition with incomplete data
– ident: 10.1016/j.inffus.2024.102382_b39
  doi: 10.1109/CVPRW56347.2022.00261
– volume: 35
  start-page: 28708
  year: 2022
  ident: 10.1016/j.inffus.2024.102382_b65
  article-title: Masked autoencoders that listen
  publication-title: Adv. Neural Inf. Process. Syst.
– start-page: 4693
  year: 2022
  ident: 10.1016/j.inffus.2024.102382_b72
  article-title: Is cross-attention preferable to self-attention for multi-modal emotion recognition?
– ident: 10.1016/j.inffus.2024.102382_b35
  doi: 10.1145/2993148.2997632
– year: 2022
  ident: 10.1016/j.inffus.2024.102382_b87
– volume: 598
  start-page: 182
  year: 2022
  ident: 10.1016/j.inffus.2024.102382_b84
  article-title: Clip-aware expressive feature learning for video-based facial expression recognition
  publication-title: Inform. Sci.
  doi: 10.1016/j.ins.2022.03.062
– ident: 10.1016/j.inffus.2024.102382_b77
– volume: 7
  start-page: 190
  issue: 2
  year: 2015
  ident: 10.1016/j.inffus.2024.102382_b32
  article-title: The geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing
  publication-title: IEEE Trans. Affect. Comput.
  doi: 10.1109/TAFFC.2015.2457417
– ident: 10.1016/j.inffus.2024.102382_b103
  doi: 10.18653/v1/D17-1115
– ident: 10.1016/j.inffus.2024.102382_b46
  doi: 10.1109/CVPR52688.2022.02025
– year: 2023
  ident: 10.1016/j.inffus.2024.102382_b56
  article-title: Applying segment-level attention on bi-modal transformer encoder for audio-visual emotion recognition
  publication-title: IEEE Trans. Affect. Comput.
  doi: 10.1109/TAFFC.2023.3258900
– ident: 10.1016/j.inffus.2024.102382_b90
  doi: 10.1109/CVPR52729.2023.01722
– ident: 10.1016/j.inffus.2024.102382_b43
  doi: 10.1109/ICCV.2015.510
– ident: 10.1016/j.inffus.2024.102382_b36
  doi: 10.1145/3133944.3133949
– start-page: 5200
  year: 2016
  ident: 10.1016/j.inffus.2024.102382_b44
  article-title: Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network
– ident: 10.1016/j.inffus.2024.102382_b29
  doi: 10.1145/3394171.3413620
– start-page: 67
  year: 2018
  ident: 10.1016/j.inffus.2024.102382_b42
  article-title: Vggface2: A dataset for recognising faces across pose and age
– ident: 10.1016/j.inffus.2024.102382_b76
  doi: 10.1145/3474085.3475292
SubjectTerms Audio-Visual Emotion Recognition; Contrastive learning; Masked autoencoder; Self-supervised learning
URI https://dx.doi.org/10.1016/j.inffus.2024.102382