SparkSNN: A density-based clustering algorithm on Spark

Bibliographic details
Published in:	2018 IEEE 3rd International Conference on Big Data Analysis (ICBDA), pp. 433-437
Main authors:	Aryal, Amar Mani; Wang, Sujing
Format:	Conference proceeding
Language:	English
Published:	IEEE, 1 March 2018
Subjects:	Big Data; Clustering algorithms; data mining; density-based clustering algorithm; Indexes; Merging; Partitioning algorithms; shared nearest neighbor clustering; Silicon; Spark; Sparks
Online access:	Full text

Description
Abstract:	Clustering is one of the most commonly used data mining techniques. Shared nearest neighbor clustering is an important density-based clustering technique that has been widely adopted in many application domains, such as environmental science and urban computing. As the size of data becomes extremely large, it is impossible for large-scale data to be processed on a single machine. Therefore, the scalability problem of traditional clustering algorithms running on a single machine must be addressed. In this paper, we improve the traditional density-based clustering algorithm by utilizing a powerful programming platform (Spark) and distributed computing clusters. In particular, we design and implement a Spark-based shared nearest neighbor clustering algorithm called SparkSNN, a scalable density-based clustering algorithm on Spark for big data analysis. We conduct our experiments using real data, i.e., Maryland crime data, to evaluate the performance of the proposed algorithm with respect to speed-up and scale-up. The experimental results confirm the effectiveness and efficiency of the proposed SparkSNN clustering algorithm.
DOI:	10.1109/ICBDA.2018.8367722