Bimodal variational autoencoder for audiovisual speech recognition

Published in: Machine Learning, Volume 112, Issue 4, pp. 1201–1226
Main authors: Sayed, Hadeer M.; ElDeeb, Hesham E.; Taie, Shereen A.
Medium: Journal Article
Language: English
Publisher: Springer US, New York; Springer Nature B.V. (01.04.2023)
ISSN: 0885-6125 (print), 1573-0565 (online)
Abstract: Multimodal fusion is the idea of combining information from multiple modalities in a joint representation. The goal of multimodal fusion is to improve the accuracy of results from classification or regression tasks. This paper proposes a Bimodal Variational Autoencoder (BiVAE) model for audiovisual feature fusion. Relying on audiovisual signals in a speech recognition task increases recognition accuracy, especially when the audio signal is corrupted. The BiVAE model is trained and validated on the CUAVE dataset. Three classifiers evaluated the fused audiovisual features: Long Short-Term Memory, Deep Neural Network, and Support Vector Machine. The experiments evaluate the fused features both when both modalities are available and when only one modality is available (i.e., cross-modality). The experimental results demonstrate the superiority of the proposed BiVAE model for audiovisual feature fusion over state-of-the-art models by an average accuracy difference of ≃ 3.28% and 13.28% for clean and noisy conditions, respectively. Additionally, BiVAE outperforms the state-of-the-art models in the cross-modality case by an accuracy difference of ≃ 2.79% when only the audio signal is available and 1.88% when only the video signal is available. Furthermore, SVM achieves the best recognition accuracy among the evaluated classifiers.
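The fusion approach the abstract describes builds on standard variational-autoencoder machinery. As a rough illustration only — not the paper's implementation, whose encoder/decoder architecture is not given in this record — the two pieces at the heart of any VAE (the reparameterization trick and the KL regularizer), plus the naive concatenation step a bimodal encoder starts from, can be sketched in plain Python:

```python
import math
import random

def reparameterize(mu, logvar, rng=None):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1).

    Writing the sample this way keeps it differentiable in mu and
    logvar, which is what lets a VAE be trained by backpropagation.
    """
    rng = rng or random.Random(0)
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]

def kl_divergence(mu, logvar):
    """KL(q(z|x) || N(0, I)) for a diagonal-Gaussian posterior:
    0.5 * sum(mu^2 + exp(logvar) - 1 - logvar)."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, logvar))

def fuse(audio_feats, video_feats):
    """Concatenate both modality feature vectors before encoding.

    Hypothetical placeholder for a learned bimodal encoder: in a real
    model the concatenation would feed a network producing (mu, logvar).
    """
    return list(audio_feats) + list(video_feats)

# Toy usage: fuse two 2-dim feature vectors and inspect the latent terms.
joint = fuse([0.3, -0.1], [0.7, 0.2])
mu, logvar = [0.0] * len(joint), [0.0] * len(joint)
z = reparameterize(mu, logvar)          # 4-dim latent sample
kl = kl_divergence(mu, logvar)          # vanishes when q equals the prior
```

Cross-modality inference, as evaluated in the paper, would amount to encoding with only one modality's features present and decoding (or classifying) from the shared latent `z`.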
Authors:
– Sayed, Hadeer M. (ORCID: 0000-0002-6136-4823), Department of Computer Science, Fayoum University; email: hms08@fayoum.edu.eg
– ElDeeb, Hesham E., Department of Computer and Control, Electronics Research Institute
– Taie, Shereen A., Department of Computer Science, Fayoum University
Content type: Journal Article
Copyright: The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2021.
DOI: 10.1007/s10994-021-06112-5
Discipline: Computer Science
Peer reviewed: Yes
Open access: Yes
Keywords: Deep learning; Variational autoencoder; Cross-modality; Multimodal data fusion; Audiovisual speech recognition
References AhmedNNatarajanTRaoKRDiscrete cosine transformIEEE Transactions on Computers19741001909335655510.1109/T-C.1974.2237840273.65097
KarlikBOlgacAVPerformance analysis of various activation functions in generalized mlp architectures of neural networksInternational Journal of Artificial Intelligence and Expert Systems201114111122
AfourasTChungJSSeniorAVinyalsOZissermanADeep audio-visual speech recognitionIEEE Transactions on Pattern Analysis and Machine Intelligence201810.1109/TPAMI.2018.2889052
Yang, X., Ramesh, P., Chitta, R., Madhvanath, S., Bernal, E. A., & Luo, J. (2017). Deep multimodal representation learning from temporal data. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5447–5455).
Cao, Q., Shen, L., Xie, W., Parkhi, O. M., & Zisserman, A. (2018). Vggface2: A dataset for recognising faces across pose and age. In 2018 13th IEEE international conference on automatic face and gesture recognition (FG 2018) (pp. 67–74). IEEE
ZhuJChenNXingEPBayesian inference with posterior regularization and applications to infinite latent svmsThe Journal of Machine Learning Research20141511799184732252411319.62067
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., et al. (2011). The kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding, IEEE Signal Processing Society, CONF.
YamashitaRNishioMDoRKGTogashiKConvolutional neural networks: An overview and application in radiologyInsights into Imaging20189461162910.1007/s13244-018-0639-9
Kingma, D. P, & Welling, M. (2014). Auto-encoding variational bayes. CoRR arXiv:1312.6114
EvangelopoulosGZlatintsiAPotamianosAMaragosPRapantzikosKSkoumasGAvrithisYMultimodal saliency and fusion for movie summarization based on aural, visual, and textual attentionIEEE Transactions on Multimedia20131571553156810.1109/TMM.2013.2267205
Morvant, E., Habrard, A., & Ayache, S. (2014). Majority vote of diverse classifiers for late fusion. In Joint IAPR international workshops on statistical techniques in pattern recognition (SPR) and structural and syntactic pattern recognition (SSPR) (pp 153–162). Springer.
Yu, J., Zhang, S. X., Wu, J., Ghorbani, S., Wu, B., Kang, S., Liu, S., Liu, X., Meng, H., & Yu, D. (2020). Audio-visual recognition of overlapped speech for the lrs2 dataset. In ICASSP 2020–2020 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp 6984–6988). IEEE.
KimSChoKFast calculation of histogram of oriented gradient feature by removing redundancy in overlapping blockJ Inf Sci Eng201430617191731
DavisSMermelsteinPComparison of parametric representations for monosyllabic word recognition in continuously spoken sentencesIEEE Transactions on Acoustics, Speech, and Signal Processing198028435736610.1109/TASSP.1980.1163420
Shutova, E., Kiela, D., & Maillard, J. (2016). Black holes and white rabbits: Metaphor identification with visual features. In Proceedings of the 2016 conference of the north american chapter of the association for computational linguistics: human language technologies (pp. 160–170).
HintonGETraining products of experts by minimizing contrastive divergenceNeural Computation20021481771180010.1162/0899766027601280181010.68111
Garg, A., Noyola, J., Bagadia, S. (2016). Lip reading using cnn and lstm. Technical report, Stanford University, CS231 n project report.
Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length and helmholtz free energy. In Advances in neural information processing systems (pp. 3–10).
Ranganath, R., Gerrish, S., & Blei, D. (2014). Black box variational inference. In Artificial intelligence and statistics, PMLR (pp. 814–822).
Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. In Advances in neural information processing systems (pp. 153–160).
ShuCDingXFangCHistogram of the oriented gradient for face recognitionTsinghua Science and Technology201116221622410.1016/S1007-0214(11)70032-3
Bokade, R., Navato, A., Ouyang, R., Jin, X., Chou, C. A., Ostadabbas, S., & Mueller, A. V. (2020). A cross-disciplinary comparison of multimodal data fusion approaches and applications: Accelerating learning through trans-disciplinary information sharing. In Expert Systems with Applications (pp. 113885).
HochreiterSSchmidhuberJLong short-term memoryNeural Computation1997981735178010.1162/neco.1997.9.8.1735
JoyceJMKullback–Leibler divergence2011BerlinSpringer72072210.1007/978-3-642-04898-2_327
PotamianosGNetiCGravierGGargASeniorAWRecent advances in the automatic recognition of audiovisual speechProceedings of the IEEE20039191306132610.1109/JPROC.2003.817150
Kazemi, V., & Sullivan, J. (2014). One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1867–1874).
Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., & Pantic, M. (2013). 300 faces in-the-wild challenge: The first facial landmark localization challenge. In Proceedings of the IEEE international conference on computer vision workshops (pp. 397–403).
BaltrušaitisTAhujaCMorencyLPMultimodal machine learning: A survey and taxonomyIEEE Transactions on Pattern Analysis and Machine Intelligence201841242344310.1109/TPAMI.2018.2798607
KramerMANonlinear principal component analysis using autoassociative neural networksAIChE Journal199137223324310.1002/aic.690370209
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using rnn encoder-decoder for statistical machine translation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), association for computational linguistics, Doha, Qatar (pp. 1724–1734). https://doi.org/10.3115/v1/D14-1179, https://aclanthology.org/D14-1179
CortesCVapnikVSupport-vector networksMachine Learning199520327329710.1007/BF009940180831.68098
SharmaGUmapathyKKrishnanSTrends in audio signal feature extraction methodsApplied Acoustics202015810702010.1016/j.apacoust.2019.107020
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press. http://www.deeplearningbook.org
RahmaniMHAlmasganjFSeyyedsalehiSAAudio-visual feature fusion via deep neural networks for automatic speech recognitionDigital Signal Processing201882546310.1016/j.dsp.2018.06.004
Zaytseva, E., Seguí, S., & Vitria, J. (2012). Sketchable histograms of oriented gradients for object detection. In Iberoamerican congress on pattern recognition (pp. 374–381). Springer.
Abdelaziz, A. H. (2017). Ntcd-timit: A new database and baseline for noise-robust audio-visual speech recognition. In: INTERSPEECH (pp. 3752–3756).
Jogin, M., Madhulika, M., Divya, G., Meghana, R., Apoorva, S., et al. (2018). Feature extraction using convolution neural networks (cnn) and deep learning. In 2018 3rd IEEE international conference on recent trends in electronics, information and communication technology (RTEICT) (pp. 2319–2323). IEEE.
ThireouTReczkoMBidirectional long short-term memory networks for predicting the subcellular localization of eukaryotic proteinsIEEE/ACM Transactions on Computational Biology and Bioinformatics20074344144610.1109/tcbb.2007.1015
Amberkar, A., Awasarmol, P., Deshmukh, G., & Dave, P. (2018). Speech recognition using recurrent neural networks. In: 2018 international conference on current trends towards converging technologies (ICCTCT) (pp. 1–4). IEEE.
Doersch, C. (2016). Tutorial on variational autoencoders. arXiv:160605908
RumelhartDEHintonGEWilliamsRJLearning representations by back-propagating errorsNature1986323608853353610.1038/323533a01369.68284
TarwaniKMEdemSSurvey on recurrent neural network in natural language processingInternational Journal of Engineering Trends and Technology20174830130410.14445/22315381/IJETT-V48P253
Graves, A., Fernández, S., & Schmidhuber, J. (2005). Bidirectional lstm networks for improved phoneme classification and recognition. In International conference on artificial neural networks (pp. 799–804). Springer.
AdnanSAliFAbdulmunemAAFacial feature extraction for face recognitionJournal of Physics: Conference Series, IOP Publishing20201664012050
FathimaRRaseenaPGammatone cepstral coefficient for speaker identificationInternational Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering201321540545
Shekar, B., & Dagnew, G. (2019). Grid search-based hyperparameter tuning and classification of microarray cancer data. In 2019 second international conference on advanced computational and communication paradigms (ICACCP) (pp. 1–8). IEEE.
KrizhevskyASutskeverIHintonGEImagenet classification with deep convolutional neural networksCommunications of the ACM2017606849010.1145/3065386
Petridis, S., Li, Z., & Pantic, M. (2017). End-to-end visual speech recognition with lstms. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp 2592–2596). IEEE.
GaoJLiPChenZZhangJA survey on deep learning for multimodal data fusionNeural Computation202032829864410116410.1162/neco_a_012731468.68182
Parkhi, O. M., Vedaldi, A., & Zisserman, A. (2015). Deep face recognition. In British machine vision conference (pp. 41.1–41.12,). BMVA Press. https://doi.org/10.5244/C.29.41
LakshmiKPSolankiMDaraJSKompalliABVideo genre classification using convolutional recurrent neural networksInternational Journal of Advanced Computer Science and Applications202010.14569/IJACSA.2020.0110321
Li, L., Zhao, Y., Jiang, D., Zhang, Y., Wang, F., Gonzalez, I., Valentin, E., & Sahli, H. (2013). Hybrid deep neural network–hidden markov model (dnn-hmm) based speech emotion recognition. In 2013 Humaine association conference on affective computing and intelligent interaction (pp. 312–317). IEEE.
Patterson, E. K., Gurbuz, S., Tufekci, Z., & Gowdy, J. N. (2002). Cuave: A new audio-visual database for multimodal human-computer interface research. In 2002 IEEE inte
S Davis (6112_CR14) 1980; 28
MH Rahmani (6112_CR45) 2018; 82
6112_CR30
JM Joyce (6112_CR28) 2011
6112_CR38
R Yamashita (6112_CR55) 2018; 9
6112_CR37
6112_CR36
C Shu (6112_CR51) 2011; 16
6112_CR32
6112_CR5
JS Deery (6112_CR15) 2007; 41
6112_CR9
6112_CR8
J Zhu (6112_CR60) 2014; 15
6112_CR6
6112_CR39
C Cortes (6112_CR12) 1995; 20
A Krizhevsky (6112_CR34) 2017; 60
R Fathima (6112_CR19) 2013; 2
6112_CR41
S Kim (6112_CR31) 2014; 30
6112_CR40
G Evangelopoulos (6112_CR17) 2013; 15
DE Rumelhart (6112_CR47) 1986; 323
6112_CR48
6112_CR46
T Thireou (6112_CR54) 2007; 4
6112_CR44
6112_CR43
G Potamianos (6112_CR42) 2003; 91
MA Kramer (6112_CR33) 1991; 37
N Ahmed (6112_CR4) 1974; 100
J Gao (6112_CR20) 2020; 32
6112_CR52
6112_CR50
B Karlik (6112_CR29) 2011; 1
6112_CR16
6112_CR59
G Sharma (6112_CR49) 2020; 158
6112_CR58
6112_CR13
6112_CR57
6112_CR56
6112_CR11
6112_CR10
S Adnan (6112_CR2) 2020; 1664
T Baltrušaitis (6112_CR7) 2018; 41
6112_CR18
KP Lakshmi (6112_CR35) 2020
GE Hinton (6112_CR24) 2002; 14
6112_CR1
S Hochreiter (6112_CR26) 1997; 9
KM Tarwani (6112_CR53) 2017; 48
6112_CR27
6112_CR25
6112_CR23
6112_CR22
6112_CR21
T Afouras (6112_CR3) 2018
References_xml – reference: Jogin, M., Madhulika, M., Divya, G., Meghana, R., Apoorva, S., et al. (2018). Feature extraction using convolution neural networks (cnn) and deep learning. In 2018 3rd IEEE international conference on recent trends in electronics, information and communication technology (RTEICT) (pp. 2319–2323). IEEE.
– reference: Petridis, S., Li, Z., & Pantic, M. (2017). End-to-end visual speech recognition with lstms. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp 2592–2596). IEEE.
– reference: Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., et al. (2011). The kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding, IEEE Signal Processing Society, CONF.
– reference: ThireouTReczkoMBidirectional long short-term memory networks for predicting the subcellular localization of eukaryotic proteinsIEEE/ACM Transactions on Computational Biology and Bioinformatics20074344144610.1109/tcbb.2007.1015
– reference: Bokade, R., Navato, A., Ouyang, R., Jin, X., Chou, C. A., Ostadabbas, S., & Mueller, A. V. (2020). A cross-disciplinary comparison of multimodal data fusion approaches and applications: Accelerating learning through trans-disciplinary information sharing. In Expert Systems with Applications (pp. 113885).
– reference: KarlikBOlgacAVPerformance analysis of various activation functions in generalized mlp architectures of neural networksInternational Journal of Artificial Intelligence and Expert Systems201114111122
– reference: Li, L., Zhao, Y., Jiang, D., Zhang, Y., Wang, F., Gonzalez, I., Valentin, E., & Sahli, H. (2013). Hybrid deep neural network–hidden markov model (dnn-hmm) based speech emotion recognition. In 2013 Humaine association conference on affective computing and intelligent interaction (pp. 312–317). IEEE.
– reference: ShuCDingXFangCHistogram of the oriented gradient for face recognitionTsinghua Science and Technology201116221622410.1016/S1007-0214(11)70032-3
– reference: HochreiterSSchmidhuberJLong short-term memoryNeural Computation1997981735178010.1162/neco.1997.9.8.1735
– reference: GaoJLiPChenZZhangJA survey on deep learning for multimodal data fusionNeural Computation202032829864410116410.1162/neco_a_012731468.68182
– reference: DavisSMermelsteinPComparison of parametric representations for monosyllabic word recognition in continuously spoken sentencesIEEE Transactions on Acoustics, Speech, and Signal Processing198028435736610.1109/TASSP.1980.1163420
– reference: Cao, Q., Shen, L., Xie, W., Parkhi, O. M., & Zisserman, A. (2018). Vggface2: A dataset for recognising faces across pose and age. In 2018 13th IEEE international conference on automatic face and gesture recognition (FG 2018) (pp. 67–74). IEEE
– reference: CortesCVapnikVSupport-vector networksMachine Learning199520327329710.1007/BF009940180831.68098
– reference: Abdelaziz, A. H. (2017). Ntcd-timit: A new database and baseline for noise-robust audio-visual speech recognition. In: INTERSPEECH (pp. 3752–3756).
– reference: Parkhi, O. M., Vedaldi, A., & Zisserman, A. (2015). Deep face recognition. In British machine vision conference (pp. 41.1–41.12,). BMVA Press. https://doi.org/10.5244/C.29.41
– reference: Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., & Pantic, M. (2013). 300 faces in-the-wild challenge: The first facial landmark localization challenge. In Proceedings of the IEEE international conference on computer vision workshops (pp. 397–403).
– reference: Garg, A., Noyola, J., Bagadia, S. (2016). Lip reading using cnn and lstm. Technical report, Stanford University, CS231 n project report.
– reference: AdnanSAliFAbdulmunemAAFacial feature extraction for face recognitionJournal of Physics: Conference Series, IOP Publishing20201664012050
– reference: Doersch, C. (2016). Tutorial on variational autoencoders. arXiv:160605908
– reference: Zhang, S., Lei, M., Ma, B., & Xie, L. (2019). Robust audio-visual speech recognition using bimodal dfsmn with multi-condition training and dropout regularization. In ICASSP 2019-2019 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 6570–6574). IEEE.
– reference: KimSChoKFast calculation of histogram of oriented gradient feature by removing redundancy in overlapping blockJ Inf Sci Eng201430617191731
– reference: Morvant, E., Habrard, A., & Ayache, S. (2014). Majority vote of diverse classifiers for late fusion. In Joint IAPR international workshops on statistical techniques in pattern recognition (SPR) and structural and syntactic pattern recognition (SSPR) (pp 153–162). Springer.
– reference: Kingma, D. P, & Welling, M. (2014). Auto-encoding variational bayes. CoRR arXiv:1312.6114
– reference: Yu, J., Zhang, S. X., Wu, J., Ghorbani, S., Wu, B., Kang, S., Liu, S., Liu, X., Meng, H., & Yu, D. (2020). Audio-visual recognition of overlapped speech for the lrs2 dataset. In ICASSP 2020–2020 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp 6984–6988). IEEE.
– reference: Shekar, B., & Dagnew, G. (2019). Grid search-based hyperparameter tuning and classification of microarray cancer data. In 2019 second international conference on advanced computational and communication paradigms (ICACCP) (pp. 1–8). IEEE.
– reference: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press. http://www.deeplearningbook.org
– reference: Kramer, M. A. (1991). Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37(2), 233–243. https://doi.org/10.1002/aic.690370209
– reference: Potamianos, G., Neti, C., Gravier, G., Garg, A., & Senior, A. W. (2003). Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE, 91(9), 1306–1326. https://doi.org/10.1109/JPROC.2003.817150
– reference: Afouras, T., Chung, J. S., Senior, A., Vinyals, O., & Zisserman, A. (2018). Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2018.2889052
– reference: Ranganath, R., Gerrish, S., & Blei, D. (2014). Black box variational inference. In Artificial intelligence and statistics, PMLR (pp. 814–822).
– reference: Sharma, G., Umapathy, K., & Krishnan, S. (2020). Trends in audio signal feature extraction methods. Applied Acoustics, 158, 107020. https://doi.org/10.1016/j.apacoust.2019.107020
– reference: Faruk, A., Faraby, H. A., Azad, M. M., Fedous, M. R., & Morol, M. K. (2020). Image to Bengali caption generation using deep cnn and bidirectional gated recurrent unit. In 2020 23rd international conference on computer and information technology (ICCIT) (pp. 1–6).
– reference: Zaytseva, E., Seguí, S., & Vitria, J. (2012). Sketchable histograms of oriented gradients for object detection. In Iberoamerican congress on pattern recognition (pp. 374–381). Springer.
– reference: Povey, D., Peddinti, V., Galvez, D., Ghahremani, P., Manohar, V., Na, X., Wang, Y., & Khudanpur, S. (2016). Purely sequence-trained neural networks for asr based on lattice-free mmi. In Interspeech (pp. 2751–2755).
– reference: Lakshmi, K. P., Solanki, M., Dara, J. S., & Kompalli, A. B. (2020). Video genre classification using convolutional recurrent neural networks. International Journal of Advanced Computer Science and Applications. https://doi.org/10.14569/IJACSA.2020.0110321
– reference: Patterson, E. K., Gurbuz, S., Tufekci, Z., & Gowdy, J. N. (2002). Cuave: A new audio-visual database for multimodal human-computer interface research. In 2002 IEEE international conference on acoustics, speech, and signal processing (Vol. 2, pp. II–2017). IEEE.
– reference: Yamashita, R., Nishio, M., Do, R. K. G., & Togashi, K. (2018). Convolutional neural networks: An overview and application in radiology. Insights into Imaging, 9(4), 611–629. https://doi.org/10.1007/s13244-018-0639-9
– reference: Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. Y. (2011). Multimodal deep learning. In L. Getoor & T. Scheffer (Eds.), ICML (pp. 689–696). Omnipress. http://dblp.uni-trier.de/db/conf/icml/icml2011.html#NgiamKKNLN11.
– reference: Rahmani, M. H., Almasganj, F., & Seyyedsalehi, S. A. (2018). Audio-visual feature fusion via deep neural networks for automatic speech recognition. Digital Signal Processing, 82, 54–63. https://doi.org/10.1016/j.dsp.2018.06.004
– reference: Ahmed, N., Natarajan, T., & Rao, K. R. (1974). Discrete cosine transform. IEEE Transactions on Computers, 100(1), 90–93. https://doi.org/10.1109/T-C.1974.223784
– reference: Zhu, J., Chen, N., & Xing, E. P. (2014). Bayesian inference with posterior regularization and applications to infinite latent svms. The Journal of Machine Learning Research, 15(1), 1799–1847.
– reference: Fathima, R., & Raseena, P. (2013). Gammatone cepstral coefficient for speaker identification. International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, 2(1), 540–545.
– reference: Deery, J. S. (2007). The ‘real’ history of real-time spectrum analyzers: A 50-year trip down memory lane. Sound and Vibration, 41, 54–59.
– reference: Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using rnn encoder-decoder for statistical machine translation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), association for computational linguistics, Doha, Qatar (pp. 1724–1734). https://doi.org/10.3115/v1/D14-1179, https://aclanthology.org/D14-1179
– reference: Tarwani, K. M., & Edem, S. (2017). Survey on recurrent neural network in natural language processing. International Journal of Engineering Trends and Technology, 48, 301–304. https://doi.org/10.14445/22315381/IJETT-V48P253
– reference: Shutova, E., Kiela, D., & Maillard, J. (2016). Black holes and white rabbits: Metaphor identification with visual features. In Proceedings of the 2016 conference of the north american chapter of the association for computational linguistics: human language technologies (pp. 160–170).
– reference: Yang, X., Ramesh, P., Chitta, R., Madhvanath, S., Bernal, E. A., & Luo, J. (2017). Deep multimodal representation learning from temporal data. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5447–5455).
– reference: Baltrušaitis, T., Ahuja, C., & Morency, L. P. (2018). Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 423–443. https://doi.org/10.1109/TPAMI.2018.2798607
– reference: Kazemi, V., & Sullivan, J. (2014). One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1867–1874).
– reference: Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90. https://doi.org/10.1145/3065386
– reference: Graves, A., Fernández, S., & Schmidhuber, J. (2005). Bidirectional lstm networks for improved phoneme classification and recognition. In International conference on artificial neural networks (pp. 799–804). Springer.
– reference: Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536. https://doi.org/10.1038/323533a0
– reference: Amberkar, A., Awasarmol, P., Deshmukh, G., & Dave, P. (2018). Speech recognition using recurrent neural networks. In 2018 international conference on current trends towards converging technologies (ICCTCT) (pp. 1–4). IEEE.
– reference: Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. In Advances in neural information processing systems (pp. 153–160).
– reference: Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800. https://doi.org/10.1162/089976602760128018
– reference: Evangelopoulos, G., Zlatintsi, A., Potamianos, A., Maragos, P., Rapantzikos, K., Skoumas, G., & Avrithis, Y. (2013). Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention. IEEE Transactions on Multimedia, 15(7), 1553–1568. https://doi.org/10.1109/TMM.2013.2267205
– reference: Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length and helmholtz free energy. In Advances in neural information processing systems (pp. 3–10).
– reference: Joyce, J. M. (2011). Kullback–Leibler divergence. Springer, Berlin, pp. 720–722. https://doi.org/10.1007/978-3-642-04898-2_327
– reference: Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05) (Vol. 1, pp. 886–893). IEEE.
– reference: Anina, I., Zhou, Z., Zhao, G., & Pietikäinen, M. (2015). Ouluvs2: A multi-view audiovisual database for non-rigid mouth motion analysis. In 2015 11th IEEE international conference and workshops on automatic face and gesture recognition (FG) (vol. 1, pp. 1–5). IEEE.
SubjectTerms Accuracy
Artificial Intelligence
Artificial neural networks
Classifiers
Computer Science
Control
Discovery Science 2020
Evaluation
Machine Learning
Mechatronics
Natural Language Processing (NLP)
Robotics
Simulation and Modeling
Speech recognition
Support vector machines
Video signals
Voice recognition
Title Bimodal variational autoencoder for audiovisual speech recognition
URI https://link.springer.com/article/10.1007/s10994-021-06112-5
https://www.proquest.com/docview/2791432525
Volume 112