SparkSNN: A density-based clustering algorithm on Spark
| Published in: | 2018 IEEE 3rd International Conference on Big Data Analysis (ICBDA), pp. 433-437 |
|---|---|
| Main authors: | , |
| Format: | Conference proceeding |
| Language: | English |
| Published: | IEEE, 1 March 2018 |
| Subjects: | |
| Online access: | Full text |
| Abstract: | Clustering is one of the most commonly used data mining techniques. Shared nearest neighbor clustering is an important density-based clustering technique that has been widely adopted in many application domains, such as environmental science and urban computing. As data sets grow extremely large, it becomes impossible to process them on a single machine. Therefore, the scalability problem of traditional clustering algorithms running on a single machine must be addressed. In this paper, we improve the traditional density-based clustering algorithm by utilizing a powerful programming platform (Spark) and distributed computing clusters. In particular, we design and implement a Spark-based shared nearest neighbor clustering algorithm called SparkSNN, a scalable density-based clustering algorithm on Spark for big data analysis. We conduct our experiments using real data, i.e., Maryland crime data, to evaluate the performance of the proposed algorithm with respect to speed-up and scale-up. The experimental results confirm the effectiveness and efficiency of the proposed SparkSNN clustering algorithm. |
| DOI: | 10.1109/ICBDA.2018.8367722 |