A vector quantized masked autoencoder for audiovisual speech emotion recognition

Bibliographic Details
Published in: Computer vision and image understanding, Vol. 257, Art. no. 104362
Main Authors: Sadok, Samir; Leglaive, Simon; Séguier, Renaud
Format: Journal Article
Language: English
Published: Elsevier Inc., 01.06.2025
Keywords: Emotion recognition; Audiovisual speech representation learning; Masked autoencoder; Self-supervised learning
ISSN: 1077-3142 (print); 1090-235X (electronic)
Online Access: Full text
Abstract An important challenge in emotion recognition is to develop methods that can leverage unlabeled training data. In this paper, we propose the VQ-MAE-AV model, a self-supervised multimodal model that leverages masked autoencoders to learn representations of audiovisual speech without labels. The model includes vector quantized variational autoencoders that compress raw audio and visual speech data into discrete tokens. The audiovisual speech tokens are used to train a multimodal masked autoencoder that consists of an encoder–decoder architecture with attention mechanisms. The model is designed to extract both local (i.e., at the frame level) and global (i.e., at the sequence level) representations of audiovisual speech. During self-supervised pre-training, the VQ-MAE-AV model is trained on a large-scale unlabeled dataset of audiovisual speech, for the task of reconstructing randomly masked audiovisual speech tokens and with a contrastive learning strategy. During this pre-training, the encoder learns to extract a representation of audiovisual speech that can be subsequently leveraged for emotion recognition. During the supervised fine-tuning stage, a small classification model is trained on top of the VQ-MAE-AV encoder for an emotion recognition task. The proposed approach achieves state-of-the-art emotion recognition results across several datasets in both controlled and in-the-wild conditions.
Highlights
• We present a self-supervised model for audiovisual speech emotion recognition.
• The model operates on discrete audiovisual speech tokens.
• A multimodal masked autoencoder with attention fuses the audio and visual modalities.
• The model achieves state-of-the-art audiovisual speech emotion recognition results.
• Ablation studies reveal the importance of each model component.
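The tokenization step described in the abstract can be illustrated with a minimal sketch: in a VQ-VAE, each continuous frame embedding is replaced by the index of its nearest codebook vector, yielding the discrete tokens the masked autoencoder then operates on. Everything below (function name, shapes, codebook size) is a hypothetical illustration of the general technique, not the authors' implementation.

```python
# Minimal sketch of VQ-VAE token extraction: map each continuous frame
# embedding to the index of its nearest codebook vector (all shapes assumed).
import torch

def vector_quantize(z: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """z: (B, T, D) continuous frame embeddings; codebook: (K, D) learned vectors.
    Returns (B, T) integer indices, i.e., one discrete token per frame."""
    # L2 distance from every frame embedding to every codebook entry: (B, T, K).
    dists = torch.cdist(z, codebook.unsqueeze(0).expand(z.size(0), -1, -1))
    return dists.argmin(dim=-1)

# Example: 8 sequences of 50 frames with 64-dim embeddings and a 512-entry codebook.
tokens = vector_quantize(torch.randn(8, 50, 64), torch.randn(512, 64))
print(tokens.shape)  # torch.Size([8, 50])
```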
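The two self-supervised pre-training objectives named in the abstract, masked-token reconstruction and contrastive audio-visual alignment, can be sketched together as below. This is a simplified single-stream stand-in: module layout, hyperparameters, the mean-pooling of global features, and the symmetric InfoNCE loss are assumptions for exposition, not the VQ-MAE-AV architecture itself.

```python
# Hedged sketch: reconstruct randomly masked audiovisual tokens with a
# Transformer encoder-decoder, plus a contrastive term on global embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedAVModel(nn.Module):
    def __init__(self, vocab=512, dim=256, heads=4, enc_layers=4, dec_layers=2):
        super().__init__()
        self.embed_a = nn.Embedding(vocab, dim)        # audio token embeddings
        self.embed_v = nn.Embedding(vocab, dim)        # visual token embeddings
        self.mask_emb = nn.Parameter(torch.zeros(dim)) # learned [MASK] embedding
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, enc_layers)  # layers deep-copied
        self.decoder = nn.TransformerEncoder(layer, dec_layers)
        self.head = nn.Linear(dim, vocab)              # predicts original token ids

    def forward(self, tok_a, tok_v, mask_ratio=0.8):
        x = torch.cat([self.embed_a(tok_a), self.embed_v(tok_v)], dim=1)  # (B, Ta+Tv, D)
        mask = torch.rand(x.shape[:2], device=x.device) < mask_ratio
        x = torch.where(mask.unsqueeze(-1), self.mask_emb, x)  # replace masked frames
        h = self.encoder(x)                            # local, frame-level features
        logits = self.head(self.decoder(h))
        target = torch.cat([tok_a, tok_v], dim=1)
        rec_loss = F.cross_entropy(logits[mask], target[mask])  # masked positions only
        g_a = h[:, : tok_a.size(1)].mean(dim=1)        # global, sequence-level features
        g_v = h[:, tok_a.size(1) :].mean(dim=1)
        return rec_loss, g_a, g_v

def info_nce(g_a, g_v, tau=0.1):
    """Symmetric InfoNCE: matched audio/visual pairs in a batch are positives."""
    za, zv = F.normalize(g_a, dim=-1), F.normalize(g_v, dim=-1)
    logits = za @ zv.t() / tau
    labels = torch.arange(za.size(0), device=za.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

model = MaskedAVModel()
rec, ga, gv = model(torch.randint(0, 512, (8, 50)), torch.randint(0, 512, (8, 50)))
loss = rec + info_nce(ga, gv)  # joint pre-training loss; then loss.backward(), etc.
```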
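Finally, the supervised stage trains a small classification model on top of the pre-trained encoder. A minimal sketch follows; the pooling, class count, and the option to freeze the encoder (linear-probe style) are assumptions rather than details from the paper.

```python
# Sketch of supervised fine-tuning: a small classifier over the pooled output
# of a pre-trained encoder (class count, pooling, and freezing are assumed).
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    def __init__(self, encoder: nn.Module, dim=256, n_classes=8, freeze=True):
        super().__init__()
        self.encoder = encoder                     # pre-trained feature extractor
        if freeze:                                 # linear-probe style fine-tuning
            for p in self.encoder.parameters():
                p.requires_grad = False
        self.cls = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, n_classes))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        h = self.encoder(feats)                    # (B, T, D) frame-level features
        return self.cls(h.mean(dim=1))             # pool to one label per sequence

# Example with a stand-in encoder; in practice this would be the pre-trained one.
enc = nn.TransformerEncoder(nn.TransformerEncoderLayer(256, 4, batch_first=True), 2)
clf = EmotionClassifier(enc)
logits = clf(torch.randn(8, 50, 256))              # (8, 8) class logits
```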
ArticleNumber 104362
Author Séguier, Renaud
Sadok, Samir
Leglaive, Simon
Author_xml – sequence: 1
  givenname: Samir
  orcidid: 0009-0007-5956-4133
  surname: Sadok
  fullname: Sadok, Samir
  email: samir.sadok@inria.fr
– sequence: 2
  givenname: Simon
  surname: Leglaive
  fullname: Leglaive, Simon
– sequence: 3
  givenname: Renaud
  surname: Séguier
  fullname: Séguier, Renaud
BackLink https://hal.science/hal-05041905 (View record in HAL)
ContentType Journal Article
Copyright 2025 The Author(s)
Attribution
DOI 10.1016/j.cviu.2025.104362
DatabaseName ScienceDirect Open Access Titles
Elsevier:ScienceDirect:Open Access
CrossRef
Hyper Article en Ligne (HAL)
Hyper Article en Ligne (HAL) (Open Access)
DatabaseTitle CrossRef
Discipline Applied Sciences
Engineering
Computer Science
EISSN 1090-235X
ExternalDocumentID oai:HAL:hal-05041905v1
10_1016_j_cviu_2025_104362
S1077314225000852
ISICitedReferencesCount 1
ISSN 1077-3142
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Keywords Emotion recognition
Audiovisual speech representation learning
Masked autoencoder
Self-supervised learning
Language English
License This is an open access article under the CC BY-NC-ND license.
Attribution: http://creativecommons.org/licenses/by
ORCID 0009-0007-5956-4133
0000-0001-7199-7563
0000-0002-8219-1298
OpenAccessLink https://hal.science/hal-05041905
PublicationDate June 2025
PublicationTitle Computer vision and image understanding
PublicationYear 2025
Publisher Elsevier Inc
StartPage 104362
SubjectTerms Audiovisual speech representation learning
Computer Science
Computer Vision and Pattern Recognition
Emotion recognition
Machine Learning
Masked autoencoder
Multimedia
Self-supervised learning
Signal and Image Processing
Sound
Title A vector quantized masked autoencoder for audiovisual speech emotion recognition
URI https://dx.doi.org/10.1016/j.cviu.2025.104362
https://hal.science/hal-05041905
Volume 257