PACk: an efficient partition-based distributed agglomerative hierarchical clustering algorithm for deduplication

Bibliographic details
Title: PACk: an efficient partition-based distributed agglomerative hierarchical clustering algorithm for deduplication
Authors: Yue Wang, Vivek Narasayya, Yeye He, Surajit Chaudhuri
Source: Proceedings of the VLDB Endowment. 15:1132-1145
Publisher information: Association for Computing Machinery (ACM), 2022.
Year of publication: 2022
Subjects: 0202 electrical engineering, electronic engineering, information engineering; 02 engineering and technology; 0101 mathematics; 01 natural sciences
Description: The Agglomerative Hierarchical Clustering (AHC) algorithm is widely used in real-world applications. As data volumes continue to grow, efficient scale-out techniques for AHC are becoming increasingly important. In this paper, we propose a Partition-based distributed Agglomerative Hierarchical Clustering (PACk) algorithm using novel distance-based partitioning and distance-aware merging techniques. We have developed an efficient implementation of PACk on Spark. Compared to the state-of-the-art distributed AHC algorithm, PACk achieves 2X to 19X (median=9X) speedup across a variety of synthetic and real-world datasets.
Document type: Article
Language: English
ISSN: 2150-8097
DOI: 10.14778/3514061.3514062
Accession number: edsair.doi...........1b18933ab4877fbe7ec4e22eb9af3dcd
Database: OpenAIRE