Bimodal variational autoencoder for audiovisual speech recognition
| Published in: | Machine Learning, Vol. 112, No. 4, pp. 1201–1226 |
|---|---|
| Main authors: | Hadeer M. Sayed, Hesham E. ElDeeb, Shereen A. Taie |
| Medium: | Journal Article |
| Language: | English |
| Published: | New York: Springer US, 01.04.2023 (Springer Nature B.V.) |
| ISSN: | 0885-6125, 1573-0565 |
| Online access: | Get full text |
| Abstract | Multimodal fusion combines information from multiple modalities into a joint representation, with the goal of improving the accuracy of classification or regression tasks. This paper proposes a Bimodal Variational Autoencoder (BiVAE) model for audiovisual feature fusion. Relying on audiovisual signals in a speech recognition task increases recognition accuracy, especially when the audio signal is corrupted. The BiVAE model is trained and validated on the CUAVE dataset. Three classifiers evaluated the fused audiovisual features: Long Short-Term Memory, Deep Neural Network, and Support Vector Machine. The experiments evaluate the fused features both when the two modalities are available and when only one modality is available (i.e., cross-modality). The results demonstrate the superiority of the proposed BiVAE model for audiovisual feature fusion over state-of-the-art models by an average accuracy difference of ≃ 3.28% on clean and ≃ 13.28% on noisy data, respectively. Additionally, BiVAE outperforms state-of-the-art models in the cross-modality case by an accuracy difference of ≃ 2.79% when only the audio signal is available and ≃ 1.88% when only the video signal is available. Furthermore, SVM achieves the best recognition accuracy among the evaluated classifiers. |
|---|---|
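The fusion scheme the abstract describes (two modality-specific encoders sharing one latent space, usable even when a modality is missing) can be sketched as follows. This is a minimal, untrained forward pass for illustration only, not the authors' implementation: the layer sizes, the tanh activations, and the posterior-averaging fusion rule are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(in_dim, out_dim):
    # Random, untrained weights; this sketch only illustrates data flow.
    return rng.normal(0, 0.1, (in_dim, out_dim)), np.zeros(out_dim)

def forward(x, layer):
    W, b = layer
    return np.tanh(x @ W + b)

AUDIO_DIM, VIDEO_DIM, LATENT_DIM = 39, 100, 32

# One encoder per modality, each producing the parameters
# (mean, log-variance) of a shared latent Gaussian.
enc_a = dense(AUDIO_DIM, 2 * LATENT_DIM)
enc_v = dense(VIDEO_DIM, 2 * LATENT_DIM)
# One decoder per modality, reconstructing from the shared latent code.
dec_a = dense(LATENT_DIM, AUDIO_DIM)
dec_v = dense(LATENT_DIM, VIDEO_DIM)

def encode(x, enc):
    h = forward(x, enc)
    return h[:, :LATENT_DIM], h[:, LATENT_DIM:]  # mu, logvar

def fuse(x_audio=None, x_video=None):
    """Return a fused latent code; either modality may be missing
    (the cross-modality setting described in the abstract)."""
    stats = []
    if x_audio is not None:
        stats.append(encode(x_audio, enc_a))
    if x_video is not None:
        stats.append(encode(x_video, enc_v))
    # Average the per-modality posteriors: a simple fusion choice
    # assumed here; the paper's exact combination rule may differ.
    mu = np.mean([m for m, _ in stats], axis=0)
    logvar = np.mean([lv for _, lv in stats], axis=0)
    eps = rng.standard_normal(mu.shape)        # reparameterization trick
    return mu + np.exp(0.5 * logvar) * eps

batch = 4
xa = rng.standard_normal((batch, AUDIO_DIM))
xv = rng.standard_normal((batch, VIDEO_DIM))

z_both = fuse(xa, xv)                # both modalities available
z_audio_only = fuse(x_audio=xa)      # cross-modality: audio only
recon_video = forward(z_audio_only, dec_v)  # video features estimated from audio

print(z_both.shape, recon_video.shape)  # (4, 32) (4, 100)
```

Decoding the audio-only latent code with the video decoder is the cross-modality scenario in which the paper reports its ≃ 2.79% accuracy gain; in practice the fused code `z` would be passed to a downstream classifier such as the SVM.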
| Author | Hadeer M. Sayed (Department of Computer Science, Fayoum University, hms08@fayoum.edu.eg); Hesham E. ElDeeb (Department of Computer and Control, Electronics Research Institute); Shereen A. Taie (Department of Computer Science, Fayoum University) |
| ContentType | Journal Article |
| Copyright | The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2021 |
| DOI | 10.1007/s10994-021-06112-5 |
| Discipline | Computer Science |
| EISSN | 1573-0565 |
| EndPage | 1226 |
| ISICitedReferencesCount | 13 |
| ISSN | 0885-6125 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 4 |
| Keywords | Deep learning Variational autoencoder Cross-modality Multimodal data fusion Audiovisual speech recognition |
| Language | English |
| ORCID | 0000-0002-6136-4823 |
| OpenAccessLink | http://dx.doi.org/10.1007/s10994-021-06112-5 |
| PageCount | 26 |
| PublicationDate | 2023-04-01 |
| PublicationPlace | New York |
| PublicationTitle | Machine learning |
| PublicationTitleAbbrev | Mach Learn |
| PublicationYear | 2023 |
| Publisher | Springer US Springer Nature B.V |
– reference: Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. In Advances in neural information processing systems (pp. 153–160). – reference: HintonGETraining products of experts by minimizing contrastive divergenceNeural Computation20021481771180010.1162/0899766027601280181010.68111 – reference: EvangelopoulosGZlatintsiAPotamianosAMaragosPRapantzikosKSkoumasGAvrithisYMultimodal saliency and fusion for movie summarization based on aural, visual, and textual attentionIEEE Transactions on Multimedia20131571553156810.1109/TMM.2013.2267205 – reference: Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length and helmholtz free energy. In Advances in neural information processing systems (pp. 3–10). – reference: JoyceJMKullback–Leibler divergence2011BerlinSpringer72072210.1007/978-3-642-04898-2_327 – reference: Dalal, N., Triggs, B. (2005). Histograms of oriented gradients for human detection. In 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR05) (Vol. 1, pp. 886–893). IEEE. – reference: Anina, I., Zhou, Z., Zhao, G., & Pietikäinen, M. (2015). Ouluvs2: A multi-view audiovisual database for non-rigid mouth motion analysis. In 2015 11th IEEE international conference and workshops on automatic face and gesture recognition (FG) (vol. 1, pp. 1–5). IEEE. 
– ident: 6112_CR18 doi: 10.1109/ICCIT51783.2020.9392697 – ident: 6112_CR9 doi: 10.1016/j.eswa.2020.113885 – volume: 60 start-page: 84 issue: 6 year: 2017 ident: 6112_CR34 publication-title: Communications of the ACM doi: 10.1145/3065386 – ident: 6112_CR43 – ident: 6112_CR57 doi: 10.1109/ICASSP40776.2020.9054127 – ident: 6112_CR11 doi: 10.3115/v1/D14-1179 – ident: 6112_CR23 doi: 10.1007/11550907_126 – volume: 100 start-page: 90 issue: 1 year: 1974 ident: 6112_CR4 publication-title: IEEE Transactions on Computers doi: 10.1109/T-C.1974.223784 – ident: 6112_CR30 doi: 10.1109/CVPR.2014.241 – ident: 6112_CR13 doi: 10.1109/CVPR.2005.177 – volume: 15 start-page: 1553 issue: 7 year: 2013 ident: 6112_CR17 publication-title: IEEE Transactions on Multimedia doi: 10.1109/TMM.2013.2267205 – volume: 41 start-page: 423 issue: 2 year: 2018 ident: 6112_CR7 publication-title: IEEE Transactions on Pattern Analysis and Machine Intelligence doi: 10.1109/TPAMI.2018.2798607 – ident: 6112_CR1 doi: 10.21437/Interspeech.2017-860 – volume: 9 start-page: 611 issue: 4 year: 2018 ident: 6112_CR55 publication-title: Insights into Imaging doi: 10.1007/s13244-018-0639-9 – volume: 15 start-page: 1799 issue: 1 year: 2014 ident: 6112_CR60 publication-title: The Journal of Machine Learning Research – volume: 48 start-page: 301 year: 2017 ident: 6112_CR53 publication-title: International Journal of Engineering Trends and Technology doi: 10.14445/22315381/IJETT-V48P253 – volume: 14 start-page: 1771 issue: 8 year: 2002 ident: 6112_CR24 publication-title: Neural Computation doi: 10.1162/089976602760128018 – volume: 30 start-page: 1719 issue: 6 year: 2014 ident: 6112_CR31 publication-title: J Inf Sci Eng – ident: 6112_CR27 doi: 10.1109/RTEICT42901.2018.9012507 – volume: 82 start-page: 54 year: 2018 ident: 6112_CR45 publication-title: Digital Signal Processing doi: 10.1016/j.dsp.2018.06.004 – year: 2020 ident: 6112_CR35 publication-title: International Journal of Advanced Computer Science and Applications 
doi: 10.14569/IJACSA.2020.0110321 – ident: 6112_CR5 doi: 10.1109/ICCTCT.2018.8551185 – volume: 28 start-page: 357 issue: 4 year: 1980 ident: 6112_CR14 publication-title: IEEE Transactions on Acoustics, Speech, and Signal Processing doi: 10.1109/TASSP.1980.1163420 – volume: 323 start-page: 533 issue: 6088 year: 1986 ident: 6112_CR47 publication-title: Nature doi: 10.1038/323533a0 – volume: 37 start-page: 233 issue: 2 year: 1991 ident: 6112_CR33 publication-title: AIChE Journal doi: 10.1002/aic.690370209 – ident: 6112_CR44 doi: 10.21437/Interspeech.2016-595 – volume: 41 start-page: 54 year: 2007 ident: 6112_CR15 publication-title: Sound and Vibration – ident: 6112_CR6 doi: 10.1109/FG.2015.7163155 – year: 2018 ident: 6112_CR3 publication-title: IEEE Transactions on Pattern Analysis and Machine Intelligence doi: 10.1109/TPAMI.2018.2889052 – ident: 6112_CR58 doi: 10.1007/978-3-642-33275-3_46 – ident: 6112_CR39 doi: 10.5244/C.29.41 – volume: 1 start-page: 111 issue: 4 year: 2011 ident: 6112_CR29 publication-title: International Journal of Artificial Intelligence and Expert Systems – ident: 6112_CR16 – ident: 6112_CR52 doi: 10.18653/v1/N16-1020 – ident: 6112_CR22 – ident: 6112_CR10 doi: 10.1109/FG.2018.00020 – ident: 6112_CR8 doi: 10.7551/mitpress/7503.003.0024 – ident: 6112_CR56 doi: 10.1109/CVPR.2017.538 – ident: 6112_CR48 doi: 10.1109/ICCVW.2013.59 – ident: 6112_CR32 – ident: 6112_CR40 doi: 10.1109/ICASSP.2002.1006168 – volume: 158 start-page: 107020 year: 2020 ident: 6112_CR49 publication-title: Applied Acoustics doi: 10.1016/j.apacoust.2019.107020 – volume: 1664 start-page: 012050 year: 2020 ident: 6112_CR2 publication-title: Journal of Physics: Conference Series, IOP Publishing – ident: 6112_CR59 doi: 10.1109/ICASSP.2019.8682566 – volume: 20 start-page: 273 issue: 3 year: 1995 ident: 6112_CR12 publication-title: Machine Learning doi: 10.1007/BF00994018 – ident: 6112_CR37 doi: 10.1007/978-3-662-44415-3_16 – ident: 6112_CR50 doi: 10.1109/ICACCP.2019.8882943 – ident: 
6112_CR21 – volume: 32 start-page: 829 year: 2020 ident: 6112_CR20 publication-title: Neural Computation doi: 10.1162/neco_a_01273 – start-page: 720 volume-title: Kullback–Leibler divergence year: 2011 ident: 6112_CR28 doi: 10.1007/978-3-642-04898-2_327 – ident: 6112_CR46 – ident: 6112_CR36 doi: 10.1109/ACII.2013.58 – ident: 6112_CR25 – ident: 6112_CR38 – volume: 16 start-page: 216 issue: 2 year: 2011 ident: 6112_CR51 publication-title: Tsinghua Science and Technology doi: 10.1016/S1007-0214(11)70032-3 – volume: 9 start-page: 1735 issue: 8 year: 1997 ident: 6112_CR26 publication-title: Neural Computation doi: 10.1162/neco.1997.9.8.1735 – volume: 91 start-page: 1306 issue: 9 year: 2003 ident: 6112_CR42 publication-title: Proceedings of the IEEE doi: 10.1109/JPROC.2003.817150 – ident: 6112_CR41 doi: 10.1109/ICASSP.2017.7952625 – volume: 2 start-page: 540 issue: 1 year: 2013 ident: 6112_CR19 publication-title: International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering – volume: 4 start-page: 441 issue: 3 year: 2007 ident: 6112_CR54 publication-title: IEEE/ACM Transactions on Computational Biology and Bioinformatics doi: 10.1109/tcbb.2007.1015 |
| StartPage | 1201 |
| SubjectTerms | Accuracy; Artificial Intelligence; Artificial neural networks; Classifiers; Computer Science; Control; Discovery Science 2020; Evaluation; Machine Learning; Mechatronics; Natural Language Processing (NLP); Robotics; Simulation and Modeling; Speech recognition; Support vector machines; Video signals; Voice recognition |
| Title | Bimodal variational autoencoder for audiovisual speech recognition |
| URI | https://link.springer.com/article/10.1007/s10994-021-06112-5 https://www.proquest.com/docview/2791432525 |
| Volume | 112 |
| Authors | Sayed, Hadeer M.; ElDeeb, Hesham E.; Taie, Shereen A. |
| DOI | 10.1007/s10994-021-06112-5 |