Bibliographic Details
| Title: |
PACk: an efficient partition-based distributed agglomerative hierarchical clustering algorithm for deduplication |
| Authors: |
Yue Wang, Vivek Narasayya, Yeye He, Surajit Chaudhuri |
| Source: |
Proceedings of the VLDB Endowment. 15:1132-1145 |
| Publisher Information: |
Association for Computing Machinery (ACM), 2022. |
| Publication Year: |
2022 |
| Subject Terms: |
0202 electrical engineering, electronic engineering, information engineering, 02 engineering and technology, 0101 mathematics, 01 natural sciences |
| Description: |
The Agglomerative Hierarchical Clustering (AHC) algorithm is widely used in real-world applications. As data volumes continue to grow, efficient scale-out techniques for AHC are becoming increasingly important. In this paper, we propose a Partition-based distributed Agglomerative Hierarchical Clustering (PACk) algorithm using novel distance-based partitioning and distance-aware merging techniques. We have developed an efficient implementation of PACk on Spark. Compared to the state-of-the-art distributed AHC algorithm, PACk achieves 2X to 19X (median=9X) speedup across a variety of synthetic and real-world datasets. |
| Publication Type: |
Article |
| Language: |
English |
| ISSN: |
2150-8097 |
| DOI: |
10.14778/3514061.3514062 |
| Document Code: |
edsair.doi...........1b18933ab4877fbe7ec4e22eb9af3dcd |
| Database: |
OpenAIRE |
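
For readers unfamiliar with the baseline that the abstract above refers to, the following is a minimal single-machine sketch of agglomerative hierarchical clustering used for deduplication. It is not the PACk algorithm and does not reproduce its distance-based partitioning or distance-aware merging, which this record does not describe in detail. The function name `ahc`, the choice of average linkage, the Euclidean distance, and the `threshold` stopping rule are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def ahc(points, threshold):
    """Naive AHC (hypothetical helper, not from the paper).

    Starts with one cluster per point and repeatedly merges the two
    closest clusters under average linkage. Stops once the closest pair
    is farther apart than `threshold`, so each remaining cluster can be
    read as one group of likely duplicates.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average pairwise distance between members of clusters a and b.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[x], clusters[y]) > threshold:
            break
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)

    # Return clusters as sorted lists of point indices.
    return sorted(sorted(c) for c in clusters)


records = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 10.0)]
print(ahc(records, threshold=1.0))  # [[0, 1], [2, 3], [4]]
```

This naive version re-scans every pair of clusters on each merge, which is the kind of cost that motivates scale-out approaches such as the one the abstract proposes.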