Unconstrained vocal pattern recognition algorithm based on attention mechanism


Bibliographic Details
Published in: Digital Signal Processing, Vol. 136, p. 103973
Main Authors: Li, Yaqian, Zhang, Xiaolong, Zhang, Xuyao, Li, Haibin, Zhang, Wenming
Format: Journal Article
Language: English
Published: Elsevier Inc., 01.05.2023
ISSN: 1051-2004, 1095-4333
Description
Summary: Deep learning-based voiceprint recognition methods rely heavily on adequate datasets, especially ones that are close to the natural environment and more complex under unconstrained conditions. However, the data types in today's open-source speech datasets are too homogeneous, and they differ from speech collected in natural application environments. Because few Chinese datasets are available, this paper proposes and produces an unconstrained Chinese speech dataset with richer data types that are closer to those collected in a natural environment. To address the inadequate extraction of acoustic features from the unconstrained speech dataset, a new two-dimensional convolutional residual network structure based on the attention mechanism is designed and applied to acoustic feature extraction. The residual block structure of the residual network is improved with the SE module and the CBAM module to obtain the SE-Cov2d and CSA-Cov2d models, respectively. Experiments demonstrate that the attention mechanism helps the network focus on more critical feature information and fuse more differentiated features during feature extraction.

Highlights:
• An unconstrained Chinese speech dataset proposed in a natural environment.
• A new two-dimensional convolutional residual network structure designed and applied to acoustic feature extraction.
• SE-Cov2d obtains the smallest EER values of 2.72% and 6.76% on the VoxCeleb and CN-Human datasets, respectively.
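The SE module mentioned in the summary applies squeeze-and-excitation channel attention inside a residual block: channel statistics are pooled ("squeeze"), passed through a bottleneck gating network ("excitation"), and used to reweight the feature map before the skip connection. The following is a minimal NumPy sketch of that general mechanism, not the paper's SE-Cov2d implementation; the weight matrices `w1`/`w2` are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def se_block(x, reduction=4, w1=None, w2=None):
    """Squeeze-and-Excitation attention over a (C, H, W) feature map.

    Illustrative sketch only: w1/w2 stand in for the learned
    excitation weights; here they are randomly initialized.
    """
    c = x.shape[0]
    if w1 is None or w2 is None:
        rng = np.random.default_rng(0)
        w1 = rng.standard_normal((c // reduction, c)) * 0.1
        w2 = rng.standard_normal((c, c // reduction)) * 0.1
    # Squeeze: global average pooling collapses H x W to one value per channel
    z = x.mean(axis=(1, 2))                      # shape (C,)
    # Excitation: bottleneck MLP, ReLU then sigmoid, yields per-channel gates
    s = np.maximum(w1 @ z, 0.0)                  # shape (C // reduction,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))          # shape (C,), gates in (0, 1)
    # Scale: reweight each channel of the input feature map
    return x * s[:, None, None]

def se_residual_block(x, **kw):
    # Skip connection: identity plus the channel-reweighted branch
    return x + se_block(x, **kw)
```

CBAM extends this idea by adding a spatial attention map after the channel gating, which matches the summary's point that attention lets the network emphasize the more discriminative regions of the acoustic features.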
DOI:10.1016/j.dsp.2023.103973