Near-optimal large-scale k-medoids clustering

Bibliographic details
Published in: Information Sciences, vol. 545, pp. 344-362
Authors: Ushakov, Anton V.; Vasilyev, Igor
Format: Journal Article
Language: English
Published: Elsevier Inc, 4 February 2021
ISSN:0020-0255, 1872-6291
Description
Abstract: The k-medoids (k-median) problem is one of the best-known unsupervised clustering problems. Due to its complexity, finding high-quality solutions for huge-scale datasets remains extremely challenging. Many approaches that find optimal or high-quality solutions are limited to small and medium-size instances. On the other hand, many parallel, distributed algorithms that can handle huge-scale datasets usually provide very poor solutions. In this paper, we develop a first parallel, distributed primal–dual heuristic algorithm for the k-medoids problem. Its main component is a very efficient parallel subgradient column generation that solves a Lagrangian dual problem and finds a tight bound on solution quality. High-quality solutions are then produced by a parallel core selection technique. We considerably reduce the computational burden and memory load by employing a nearest neighbor strategy to approximate the dissimilarity matrix. We demonstrate that our algorithm finds near-optimal solutions, confirmed by the tightness of the dual bounds, for instances that are much larger than those considered in the literature to date. Our experiments include clustering large-scale collections of face images into several thousand clusters. We show that our approach outperforms improved parallel versions of the most popular k-medoids clustering algorithms, achieving nearly linear parallel speedup.
DOI:10.1016/j.ins.2020.08.121
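
Note: the abstract mentions a Lagrangian dual solved by subgradient optimization, which yields a bound on solution quality. The following is a minimal serial sketch of the textbook Lagrangian relaxation of the k-median model, not the authors' algorithm: it omits column generation, parallelism, core selection, and the nearest neighbor approximation of the dissimilarity matrix. The function name `lagrangian_k_medoids`, its parameters, and the step-size rule are assumptions made for illustration.

```python
import math


def lagrangian_k_medoids(points, k, iters=200, theta=1.0):
    """Return (lower_bound, upper_bound, medoid_indices) for a small instance.

    Hypothetical helper: relaxes the "each point is assigned exactly once"
    constraints with multipliers lam[i] and improves them by subgradient steps.
    Requires at least two points.
    """
    n = len(points)
    # Full dissimilarity matrix (Euclidean); only viable for small n.
    d = [[math.dist(p, q) for q in points] for p in points]
    # Start each multiplier at the distance to the nearest other point.
    lam = [sorted(row)[1] for row in d]
    best_lb, best_ub, best_medoids = -math.inf, math.inf, None

    for _ in range(iters):
        # Reduced cost of opening j: sum of negative parts of d[i][j] - lam[i].
        rho = [sum(min(0.0, d[i][j] - lam[i]) for i in range(n))
               for j in range(n)]
        # The relaxed problem is solved by opening the k most negative columns.
        chosen = sorted(range(n), key=rho.__getitem__)[:k]
        lb = sum(lam) + sum(rho[j] for j in chosen)
        best_lb = max(best_lb, lb)

        # Primal heuristic: assign every point to its nearest chosen medoid.
        ub = sum(min(d[i][j] for j in chosen) for i in range(n))
        if ub < best_ub:
            best_ub, best_medoids = ub, sorted(chosen)

        # Subgradient: 1 minus the number of relaxed assignments of point i.
        g = [1 - sum(1 for j in chosen if d[i][j] < lam[i]) for i in range(n)]
        norm2 = sum(x * x for x in g)
        if norm2 == 0 or best_ub - best_lb < 1e-9:
            break  # relaxed solution is feasible, or the gap is closed
        # Polyak-style step length (an assumed, common choice).
        step = theta * (best_ub - lb) / norm2
        lam = [lam[i] + step * g[i] for i in range(n)]

    return best_lb, best_ub, best_medoids


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
lb, ub, medoids = lagrangian_k_medoids(points, k=2)
print(f"lower bound {lb:.3f}, upper bound {ub:.3f}, medoids {medoids}")
```

The gap between the two returned bounds certifies how far the heuristic solution can be from optimal, which is the role the dual bound plays in the abstract.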