SparkSNN: A density-based clustering algorithm on Spark
| Published in: | 2018 IEEE 3rd International Conference on Big Data Analysis (ICBDA) pp. 433 - 437 |
|---|---|
| Main Authors: | , |
| Format: | Conference Proceeding |
| Language: | English |
| Published: | IEEE, 01.03.2018 |
| Subjects: | |
| Summary: | Clustering is one of the most commonly used data mining techniques. Shared nearest neighbor clustering is an important density-based clustering technique that has been widely adopted in many application domains, such as environmental science and urban computing. As data volumes have grown extremely large, large-scale data can no longer be processed on a single machine. Therefore, the scalability problem of traditional clustering algorithms running on a single machine must be addressed. In this paper, we improve the traditional density-based clustering algorithm by utilizing a powerful programming platform (Spark) and distributed computing clusters. In particular, we design and implement a Spark-based shared nearest neighbor clustering algorithm called SparkSNN, a scalable density-based clustering algorithm on Spark for big data analysis. We conduct our experiments using real data, i.e., Maryland crime data, to evaluate the performance of the proposed algorithm with respect to speed-up and scale-up. The experimental results confirm the effectiveness and efficiency of the proposed SparkSNN clustering algorithm. |
| DOI: | 10.1109/ICBDA.2018.8367722 |