Unconstrained vocal pattern recognition algorithm based on attention mechanism

Bibliographic Details
Published in: Digital Signal Processing, Vol. 136, p. 103973
Main authors: Li, Yaqian; Zhang, Xiaolong; Zhang, Xuyao; Li, Haibin; Zhang, Wenming
Format: Journal Article
Language: English
Published: Elsevier Inc., 01.05.2023
ISSN: 1051-2004, 1095-4333
Online access: Full text
Description
Abstract: Deep learning-based voiceprint recognition methods rely heavily on adequate datasets, especially ones that are closer to the natural environment and more complex under unconstrained conditions. Yet the data types in today's open-source speech datasets are too homogeneous, and they differ from the speech collected in natural application environments. Because few Chinese datasets are available, this paper proposes and produces an unconstrained Chinese speech dataset with richer data types that are closer to speech collected in a natural environment. To address the inadequate extraction of acoustic features on the unconstrained speech dataset, a new two-dimensional convolutional residual network structure based on the attention mechanism is designed and applied to acoustic feature extraction. The residual block structure of the residual network is improved with the SE module and the CBAM module to obtain the SE-Cov2d and CSA-Cov2d models, respectively. Finally, experiments demonstrate that the attention mechanism helps the network focus on more critical feature information and fuse more discriminative features during feature extraction.
Highlights:
• An unconstrained Chinese speech dataset collected in a natural environment is proposed.
• A new two-dimensional convolutional residual network structure is designed and applied to acoustic feature extraction.
• SE-Cov2d obtains the smallest EER values of 2.72% and 6.76% on the VoxCeleb and CN-Human datasets, respectively.
DOI: 10.1016/j.dsp.2023.103973
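
As a concrete illustration of the attention-augmented residual blocks described in the abstract, below is a minimal PyTorch sketch of a two-dimensional convolutional residual block whose residual branch is reweighted by a squeeze-and-excitation (SE) module, i.e. the general technique behind SE-Cov2d. The class names, channel count, and reduction ratio are illustrative assumptions, not the authors' implementation.

# Illustrative sketch (not the authors' code): a 2-D convolutional residual
# block whose residual branch is reweighted by a squeeze-and-excitation (SE)
# attention module. Channel count, reduction ratio, and layer order are
# assumptions chosen for illustration only.
import torch
import torch.nn as nn


class SEModule(nn.Module):
    """Squeeze-and-excitation: reweight channels using global context."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global average pool
        self.fc = nn.Sequential(                     # excitation: bottleneck MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # channel-wise rescaling


class SEResidualBlock2d(nn.Module):
    """Basic 2-D residual block with an SE module before the skip addition."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.se = SEModule(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.se(out)                           # attention on the residual branch
        return self.relu(out + x)                    # skip connection


# Example: a batch of 2 spectrogram-like feature maps with 32 channels.
if __name__ == "__main__":
    block = SEResidualBlock2d(channels=32)
    feats = torch.randn(2, 32, 64, 100)
    print(block(feats).shape)                        # torch.Size([2, 32, 64, 100])

A CBAM-style block (as used for CSA-Cov2d) would follow the same pattern but add a spatial-attention stage after the channel reweighting.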