PACk: an efficient partition-based distributed agglomerative hierarchical clustering algorithm for deduplication

Bibliographic Details
Title: PACk: an efficient partition-based distributed agglomerative hierarchical clustering algorithm for deduplication
Authors: Yue Wang, Vivek Narasayya, Yeye He, Surajit Chaudhuri
Source: Proceedings of the VLDB Endowment. 15:1132-1145
Publisher information: Association for Computing Machinery (ACM), 2022.
Publication year: 2022
Subject terms: 0202 electrical engineering, electronic engineering, information engineering, 02 engineering and technology, 0101 mathematics, 01 natural sciences
Description: The Agglomerative Hierarchical Clustering (AHC) algorithm is widely used in real-world applications. As data volumes continue to grow, efficient scale-out techniques for AHC are becoming increasingly important. In this paper, we propose a Partition-based distributed Agglomerative Hierarchical Clustering (PACk) algorithm using novel distance-based partitioning and distance-aware merging techniques. We have developed an efficient implementation of PACk on Spark. Compared to the state-of-the-art distributed AHC algorithm, PACk achieves 2X to 19X (median=9X) speedup across a variety of synthetic and real-world datasets.
Publication type: Article
Language: English
ISSN: 2150-8097
DOI: 10.14778/3514061.3514062
Document code: edsair.doi...........1b18933ab4877fbe7ec4e22eb9af3dcd
Database: OpenAIRE