A vector quantized masked autoencoder for audiovisual speech emotion recognition
| Published in: | Computer Vision and Image Understanding, Vol. 257, Art. no. 104362 |
|---|---|
| Authors: | Samir Sadok, Simon Leglaive, Renaud Séguier |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier Inc., June 2025 |
| Keywords: | Emotion recognition; Audiovisual speech representation learning; Masked autoencoder; Self-supervised learning |
| ISSN: | 1077-3142 (print), 1090-235X (electronic) |
| DOI: | 10.1016/j.cviu.2025.104362 |
| License: | © 2025 The Author(s); open access article under the CC BY-NC-ND license |
| Online access: | Full text; https://hal.science/hal-05041905 |
| Abstract | An important challenge in emotion recognition is to develop methods that can leverage unlabeled training data. In this paper, we propose the VQ-MAE-AV model, a self-supervised multimodal model that leverages masked autoencoders to learn representations of audiovisual speech without labels. The model includes vector quantized variational autoencoders that compress raw audio and visual speech data into discrete tokens. The audiovisual speech tokens are used to train a multimodal masked autoencoder that consists of an encoder–decoder architecture with attention mechanisms. The model is designed to extract both local (i.e., at the frame level) and global (i.e., at the sequence level) representations of audiovisual speech. During self-supervised pre-training, the VQ-MAE-AV model is trained on a large-scale unlabeled dataset of audiovisual speech, for the task of reconstructing randomly masked audiovisual speech tokens and with a contrastive learning strategy. During this pre-training, the encoder learns to extract a representation of audiovisual speech that can be subsequently leveraged for emotion recognition. During the supervised fine-tuning stage, a small classification model is trained on top of the VQ-MAE-AV encoder for an emotion recognition task. The proposed approach achieves state-of-the-art emotion recognition results across several datasets in both controlled and in-the-wild conditions.

Highlights:
• We present a self-supervised model for audiovisual speech emotion recognition.
• The model operates on discrete audiovisual speech tokens.
• A multimodal masked autoencoder with attention fuses the audio and visual modalities.
• The model achieves state-of-the-art audiovisual speech emotion recognition results.
• Ablation studies reveal the importance of each model component. |
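The abstract outlines a two-stage pipeline: pre-trained vector quantized VAEs turn the audio and visual streams into discrete tokens, a multimodal masked autoencoder is pre-trained to reconstruct randomly masked tokens (together with a contrastive objective), and a small classifier is then fine-tuned on top of the encoder. The sketch below illustrates only the masked-token reconstruction step on already-tokenized inputs. It is a minimal, hypothetical PyTorch example, not the authors' implementation: the class and parameter names (`ToyAVMaskedAutoencoder`, vocabulary sizes, `mask_ratio`), the use of a learned mask embedding in place of dropping masked tokens before the encoder, and the omission of positional embeddings and the contrastive loss are all simplifying assumptions.

```python
# Minimal, hypothetical sketch of masked-token pre-training on discrete audiovisual
# speech tokens (assumed to come from two pre-trained VQ-VAE tokenizers).
# All names and hyperparameters are illustrative, not taken from the paper.
import torch
import torch.nn as nn


class ToyAVMaskedAutoencoder(nn.Module):
    def __init__(self, audio_vocab=512, visual_vocab=512, dim=256, n_heads=4,
                 n_enc_layers=4, n_dec_layers=2):
        super().__init__()
        # Embed discrete token indices of each modality into a shared dimension.
        self.audio_emb = nn.Embedding(audio_vocab, dim)
        self.visual_emb = nn.Embedding(visual_vocab, dim)
        # Learned embedding that replaces masked positions (a simplification of the
        # encoder-decoder MAE described in the abstract). Positional embeddings omitted.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, n_heads, dim_feedforward=4 * dim,
                                               batch_first=True)
        dec_layer = nn.TransformerEncoderLayer(dim, n_heads, dim_feedforward=4 * dim,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_enc_layers)
        self.decoder = nn.TransformerEncoder(dec_layer, n_dec_layers)
        # Per-modality heads predict the original (masked) token indices.
        self.audio_head = nn.Linear(dim, audio_vocab)
        self.visual_head = nn.Linear(dim, visual_vocab)

    def forward(self, audio_tokens, visual_tokens, mask_ratio=0.75):
        # audio_tokens, visual_tokens: (batch, seq_len) integer indices from the VQ-VAEs.
        a = self.audio_emb(audio_tokens)
        v = self.visual_emb(visual_tokens)
        x = torch.cat([a, v], dim=1)  # simple concatenation fusion of both streams
        # Randomly mask a fraction of all audiovisual positions.
        mask = torch.rand(x.shape[:2], device=x.device) < mask_ratio
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        h = self.decoder(self.encoder(x))  # attention mixes audio and visual tokens
        t_a = audio_tokens.size(1)
        logits_a = self.audio_head(h[:, :t_a])
        logits_v = self.visual_head(h[:, t_a:])
        # Reconstruction loss on masked positions only, as in masked autoencoding.
        loss = (
            nn.functional.cross_entropy(logits_a[mask[:, :t_a]], audio_tokens[mask[:, :t_a]])
            + nn.functional.cross_entropy(logits_v[mask[:, t_a:]], visual_tokens[mask[:, t_a:]])
        )
        return loss, h


# Toy usage with random indices standing in for real VQ-VAE token sequences.
model = ToyAVMaskedAutoencoder()
audio_tok = torch.randint(0, 512, (2, 50))
visual_tok = torch.randint(0, 512, (2, 50))
loss, hidden = model(audio_tok, visual_tok)
loss.backward()
```

For the fine-tuning stage described in the abstract, one would typically discard the decoder and prediction heads and train a small classification head on pooled encoder outputs using labeled emotion data.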