Exact memory-constrained UPGMA for large scale speaker clustering

Bibliographic Details
Published in: Pattern Recognition, Vol. 95, pp. 235-246
Main Authors: Cumani, Sandro; Laface, Pietro
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.11.2019
ISSN: 0031-3203, 1873-5142
Description
Summary:
• We focus on exact hierarchical clustering of large sets of utterances.
• Hierarchical clustering is challenging due to memory constraints.
• We propose an efficient, exact and parallel implementation of UPGMA clustering.
• We extend the Clustering Features concept to speaker recognition scoring functions.
• We assess the efficiency of our method on datasets including 4 million utterances.

This work focuses on clustering large sets of utterances collected from an unknown number of speakers. Since the number of speakers is unknown, we rely on exact hierarchical agglomerative clustering, followed by automatic selection of the number of clusters. Exact hierarchical clustering of a large number of vectors, however, is a challenging task due to memory constraints, which make it ineffective or infeasible for large datasets. We propose an exact, memory-constrained and parallel implementation of average linkage clustering for large-scale datasets, showing that its computational complexity is approximately O(N²), yet it is much faster (up to 40 times in our experiments) than the Reciprocal Nearest Neighbor chain algorithm, which also has O(N²) complexity. We also propose a very fast silhouette computation procedure that determines, in linear time, the set of clusters. The computational efficiency of our approach is demonstrated on datasets including up to 4 million speaker vectors.
DOI: 10.1016/j.patcog.2019.06.018
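The abstract above outlines a pipeline of average-linkage (UPGMA) agglomerative clustering followed by silhouette-based selection of the number of clusters. The paper's exact memory-constrained, parallel implementation and its linear-time silhouette procedure are not reproduced here; the sketch below only illustrates the overall idea with off-the-shelf SciPy/scikit-learn routines on synthetic data. The variable names (e.g. speaker_vectors) and the candidate cluster-count range are hypothetical, and the standard linkage call materializes the full O(N²) distance matrix, which is exactly the memory cost the paper's method is designed to avoid.

```python
# Illustrative baseline only (not the authors' memory-constrained UPGMA):
# standard average-linkage clustering plus silhouette-based model selection.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Hypothetical stand-in for speaker embeddings (e.g. i-vectors/x-vectors).
speaker_vectors = rng.standard_normal((500, 64))

# UPGMA = average linkage; SciPy builds the full O(N^2) pairwise-distance
# matrix internally, which is what becomes infeasible at millions of
# utterances and what the paper's memory-constrained algorithm avoids.
Z = linkage(speaker_vectors, method="average", metric="cosine")

# Cut the dendrogram at several candidate cluster counts and keep the cut
# with the best silhouette score (the paper instead proposes a much faster
# silhouette computation that runs in linear time).
best_k, best_score = None, -1.0
for k in range(2, 20):
    labels = fcluster(Z, t=k, criterion="maxclust")
    score = silhouette_score(speaker_vectors, labels, metric="cosine")
    if score > best_score:
        best_k, best_score = k, score

print(f"selected {best_k} clusters (silhouette = {best_score:.3f})")
```

This baseline stores all pairwise distances, so its memory grows quadratically with the number of utterances; the paper's contribution is an exact UPGMA variant that avoids this while remaining parallelizable.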