Unconstrained vocal pattern recognition algorithm based on attention mechanism

Detailed bibliography
Published in: Digital Signal Processing, Volume 136, p. 103973
Main authors: Li, Yaqian, Zhang, Xiaolong, Zhang, Xuyao, Li, Haibin, Zhang, Wenming
Format: Journal Article
Language: English
Publication details: Elsevier Inc., 1 May 2023
ISSN: 1051-2004, 1095-4333
Description
Summary: Deep learning-based voiceprint recognition methods rely heavily on adequate datasets, especially ones that are closer to the natural environment and more complex under unconstrained conditions. Yet the data types of current open-source speech datasets are too homogeneous, and they differ from speech collected in natural application environments. Because few Chinese datasets are available, this paper proposes and produces an unconstrained Chinese speech dataset with richer data types, closer to speech collected in a natural environment. To address the inadequate extraction of acoustic features from the unconstrained speech dataset, a new two-dimensional convolutional residual network structure based on the attention mechanism is designed and applied to acoustic feature extraction. The residual block structure in the residual network is improved with the SE module and the CBAM module, yielding the SE-Cov2d and CSA-Cov2d models respectively. Finally, it is experimentally demonstrated that the attention mechanism helps the network focus on more critical feature information and fuse more differentiated features during feature extraction.
• An unconstrained Chinese speech dataset collected in a natural environment is proposed.
• A new two-dimensional convolutional residual network structure is designed and applied to acoustic feature extraction.
• SE-Cov2d obtains the smallest EER values, 2.72% and 6.76%, on the VoxCeleb and CN-Human datasets respectively.
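The abstract does not include code, but the squeeze-and-excitation (SE) channel attention it refers to can be sketched in a few lines. The following NumPy snippet is a minimal illustration of the SE idea (squeeze via global average pooling, excitation via a small bottleneck, then per-channel rescaling), not the authors' SE-Cov2d implementation; the function name, weight shapes, and reduction ratio are illustrative assumptions.

```python
import numpy as np

def se_attention(feature_map, weights1, weights2):
    """Illustrative squeeze-and-excitation attention on a (C, H, W) feature map.

    Squeeze: global average pooling gives one descriptor per channel.
    Excitation: a bottleneck FC layer (ReLU), then an expanding FC layer
    (sigmoid), produces a weight in (0, 1) per channel.
    Scale: each input channel is rescaled by its weight.
    """
    # Squeeze: one scalar per channel.
    z = feature_map.mean(axis=(1, 2))            # shape (C,)
    # Excitation: FC -> ReLU -> FC -> sigmoid.
    s = np.maximum(weights1 @ z, 0.0)            # shape (C // r,)
    s = 1.0 / (1.0 + np.exp(-(weights2 @ s)))    # shape (C,)
    # Scale: broadcast the channel weights over H and W.
    return feature_map * s[:, None, None]

# Toy example: 4 channels, assumed reduction ratio r = 2.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
w1 = rng.standard_normal((2, 4)) * 0.1  # bottleneck weights (hypothetical)
w2 = rng.standard_normal((4, 2)) * 0.1  # expansion weights (hypothetical)
y = se_attention(x, w1, w2)
assert y.shape == x.shape
```

Because the sigmoid output lies in (0, 1), the module can only attenuate channels relative to the input; in a residual block this rescaled output is added back to the skip connection, which is how the abstract's "focus on more critical feature information" is realized.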
DOI: 10.1016/j.dsp.2023.103973