Speaker Clustering by Co-Optimizing Deep Representation Learning and Cluster Estimation

Bibliographic Details
Published in: IEEE Transactions on Multimedia, Vol. 23, pp. 3377-3387
Main Authors: Li, Yanxiong; Wang, Wucheng; Liu, Mingle; Jiang, Zhongjie; He, Qianhua
Format: Journal Article
Language: English
Published: Piscataway, NJ: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2021
ISSN: 1520-9210, 1941-0077
Description
Summary: Speaker clustering is the task of merging speech segments uttered by the same speaker into a single cluster, and it is an effective tool for easing the management of massive amounts of audio documents. In this paper, we present a method for co-optimizing the two main steps of speaker clustering, namely feature learning and cluster estimation. In our method, the deep representation feature is learned by a deep convolutional autoencoder network (DCAN), while cluster estimation is realized by a softmax layer attached to the DCAN. We devise an integrated loss function to simultaneously minimize the reconstruction loss (for deep representation learning) and the clustering loss (for cluster estimation). Many state-of-the-art audio features and clustering methods are evaluated on experimental datasets selected from two publicly available speech corpora (AISHELL-2 and VoxCeleb1). The results show that the proposed method outperforms other speaker clustering methods in terms of normalized mutual information (NMI) and clustering accuracy (CA). Additionally, the proposed deep representation feature outperforms features widely used in previous works, in terms of both NMI and CA.
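
This record gives only the abstract, so the following PyTorch sketch is an illustrative assumption, not the paper's actual implementation: a convolutional autoencoder whose bottleneck embedding feeds a softmax clustering head, trained with an integrated loss combining a reconstruction term and a clustering term. The layer sizes, the input shape, the entropy-style clustering loss, and the weight alpha are all placeholders chosen for the sketch.

```python
# Minimal sketch of the joint objective described in the abstract:
# a deep convolutional autoencoder (DCAN) with a softmax clustering head,
# trained to minimize reconstruction loss + clustering loss.
# Architecture, input shape (1 x 64 x 64), clustering loss, and `alpha`
# are illustrative assumptions; the record does not specify them.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DCANWithClusterHead(nn.Module):
    def __init__(self, n_clusters: int, embed_dim: int = 64):
        super().__init__()
        # Encoder: spectrogram-like input -> bottleneck embedding
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, embed_dim),
        )
        # Decoder: embedding -> reconstructed input
        self.decoder = nn.Sequential(
            nn.Linear(embed_dim, 32 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
        )
        # Softmax layer producing soft cluster assignments
        self.cluster_head = nn.Linear(embed_dim, n_clusters)

    def forward(self, x):
        z = self.encoder(x)          # deep representation feature
        x_hat = self.decoder(z)      # reconstruction
        q = F.softmax(self.cluster_head(z), dim=1)  # soft cluster assignment
        return x_hat, q

def integrated_loss(x, x_hat, q, alpha: float = 0.1):
    """Reconstruction loss plus a clustering loss. The entropy of the soft
    assignments is used here as a common stand-in clustering loss that
    encourages confident cluster assignments."""
    rec = F.mse_loss(x_hat, x)
    clu = -(q * torch.log(q + 1e-8)).sum(dim=1).mean()
    return rec + alpha * clu

# One training step on a dummy batch of 8 single-channel 64x64 "segments".
model = DCANWithClusterHead(n_clusters=10)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, 1, 64, 64)
x_hat, q = model(x)
loss = integrated_loss(x, x_hat, q)
opt.zero_grad()
loss.backward()
opt.step()
```

After training, each segment's cluster is the argmax of q, and metrics such as NMI can be computed against reference speaker labels, e.g. with sklearn.metrics.normalized_mutual_info_score.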
DOI: 10.1109/TMM.2020.3024667