A Distributed Density-Grid Clustering Algorithm for Multi-Dimensional Data

Bibliographic Details
Published in:	2020 10th Annual Computing and Communication Workshop and Conference (CCWC), pp. 0001-0008
Main authors:	Brown, Daniel; Shi, Yong
Format:	Conference paper
Language:	English
Published:	IEEE, 01.01.2020
Subjects:	Apache Spark; Clustering; Density-Based Clustering; Distributed Computing; Grid-Based Clustering; Parallel Computing
Online access:	Full text

Description
Summary:	In recent years there have been many massive leaps in technology that have also resulted in large advancements in how we collect and use data. These advancements have caused a rise in the prominence of the field of Big Data. Organizations and businesses rely heavily on data analysis in almost every field of work. This need for data analysis, combined with larger and more complex datasets, has caused many challenges for these groups as they seek to keep up. Clustering is a field of data analysis, specifically unsupervised machine learning, that is heavily used in many different industries. Traditional clustering algorithms typically suffer in performance and accuracy as datasets increase in size and dimensionality. We previously proposed a new clustering algorithm called the Fast Density-Grid clustering algorithm that successfully alleviated some of the problems related to runtimes. In modern data analysis, however, serial algorithms are still too slow to be of much use. The Fast Density-Grid algorithm was originally designed with parallelization in mind, and this paper discusses the steps taken to implement this. Our experimental results show that, when the number of records in the dataset exceeds a certain amount, the parallel form of the algorithm overtakes the traditional serial form in performance. Studying this critical point allows us to determine whether or not the algorithm is suitable for real-world use.
DOI:	10.1109/CCWC47524.2020.9031132