PACk: an efficient partition-based distributed agglomerative hierarchical clustering algorithm for deduplication

Bibliographic details
Title:	PACk: an efficient partition-based distributed agglomerative hierarchical clustering algorithm for deduplication
Authors:	Yue Wang, Vivek Narasayya, Yeye He, Surajit Chaudhuri
Source:	Proceedings of the VLDB Endowment. 15:1132-1145
Publisher information:	Association for Computing Machinery (ACM), 2022.
Publication year:	2022
Subject terms:	0202 electrical engineering, electronic engineering, information engineering, 02 engineering and technology, 0101 mathematics, 01 natural sciences
Description:	The Agglomerative Hierarchical Clustering (AHC) algorithm is widely used in real-world applications. As data volumes continue to grow, efficient scale-out techniques for AHC are becoming increasingly important. In this paper, we propose a Partition-based distributed Agglomerative Hierarchical Clustering (PACk) algorithm using novel distance-based partitioning and distance-aware merging techniques. We have developed an efficient implementation of PACk on Spark. Compared to the state-of-the-art distributed AHC algorithm, PACk achieves 2X to 19X (median=9X) speedup across a variety of synthetic and real-world datasets.
Publication type:	Article
Language:	English
ISSN:	2150-8097
DOI:	10.14778/3514061.3514062
Accession number:	edsair.doi...........1b18933ab4877fbe7ec4e22eb9af3dcd
Database:	OpenAIRE
