PACk: an efficient partition-based distributed agglomerative hierarchical clustering algorithm for deduplication

Detailed bibliography
Title:	PACk: an efficient partition-based distributed agglomerative hierarchical clustering algorithm for deduplication
Authors:	Yue Wang, Vivek Narasayya, Yeye He, Surajit Chaudhuri
Source:	Proceedings of the VLDB Endowment. 15:1132-1145
Publisher information:	Association for Computing Machinery (ACM), 2022.
Publication year:	2022
Subjects:	0202 electrical engineering, electronic engineering, information engineering, 02 engineering and technology, 0101 mathematics, 01 natural sciences
Description:	The Agglomerative Hierarchical Clustering (AHC) algorithm is widely used in real-world applications. As data volumes continue to grow, efficient scale-out techniques for AHC are becoming increasingly important. In this paper, we propose a Partition-based distributed Agglomerative Hierarchical Clustering (PACk) algorithm using novel distance-based partitioning and distance-aware merging techniques. We have developed an efficient implementation of PACk on Spark. Compared to the state-of-the-art distributed AHC algorithm, PACk achieves 2X to 19X (median=9X) speedup across a variety of synthetic and real-world datasets.
Document type:	Article
Language:	English
ISSN:	2150-8097
DOI:	10.14778/3514061.3514062
Accession number:	edsair.doi...........1b18933ab4877fbe7ec4e22eb9af3dcd
Database:	OpenAIRE