Efficient algorithm for big data clustering on single machine

Bibliographic Details
Published in:	CAAI Transactions on Intelligence Technology, Vol. 5, no. 1, pp. 9-14
Main Authors:	Alguliyev, Rasim M.; Aliguliyev, Ramiz M.; Sukhostat, Lyudmila V.
Format:	Journal Article
Language:	English
Published:	Beijing: The Institution of Engineering and Technology, 01.03.2020; John Wiley & Sons, Inc; Wiley
Subjects:	Accelerometers; Algorithms; Big Data; big data analysis; big data clustering; C6130 Data handling techniques; Censuses; Centroids; cluster centroids; Clustering; clustering algorithms; clustering speed; computing powers; Data analysis; Data points; Data processing; Datasets; Efficiency; Experiments; initial dataset; k-means algorithm; Massive data points; pattern clustering; Performance evaluation; Personal computers; Research Article; single machine; Standard deviation; United States > US
ISSN:	2468-2322, 2468-6557

Description
Summary:	Big data analysis requires substantial computing power, which is not always available. It has therefore become necessary to develop new clustering algorithms capable of processing such data. This study proposes a new parallel clustering algorithm based on the k-means algorithm. It significantly reduces the exponential growth of computations. The proposed algorithm splits a dataset into batches in a way that preserves the characteristics of the initial dataset and increases the clustering speed. The idea is to compute cluster centroids for each batch and then cluster those centroids in turn. Each data point is then assigned to the cluster with the nearest resulting centroid. Experiments on large real-world datasets evaluate the effectiveness of the proposed approach, which is compared with k-means and a modification of it. The experiments show that the proposed algorithm is a promising tool for clustering large datasets compared with the k-means algorithm.
DOI:	10.1049/trit.2019.0048
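
The Summary above describes the approach only at a high level: split the data into batches, find centroids per batch, cluster those centroids, then assign each point to the nearest final centroid. The following is a minimal sequential sketch of that outline in Python, not the authors' implementation. It assumes random, roughly equal-sized batches, the same k for the per-batch and final steps, and a plain Lloyd-style k-means with random initialisation; the function names (`kmeans`, `batch_kmeans`, `nearest`, `sq_dist`) and parameters are hypothetical. The parallelism mentioned in the Summary is omitted, although the per-batch runs are independent and could be dispatched to separate worker processes.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd-style k-means; returns a list of k centroid tuples.

    Assumes len(points) >= k. Initial centroids are k randomly chosen points.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its group;
        # an empty group keeps its previous centroid.
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


def batch_kmeans(points, k, n_batches, seed=0):
    """Sketch of the batch-then-merge idea from the Summary.

    Assumption: batches are formed by shuffling and striding, so each batch
    is a random sample of the whole dataset with at least k points.
    """
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    batches = [shuffled[i::n_batches] for i in range(n_batches)]

    # Step 1: cluster each batch independently and collect its centroids.
    batch_centroids = []
    for batch in batches:
        batch_centroids.extend(kmeans(batch, k, seed=seed))

    # Step 2: cluster the collected centroids to get k final centroids.
    final = kmeans(batch_centroids, k, seed=seed)

    # Step 3: assign every original point to its nearest final centroid.
    labels = [nearest(p, final) for p in points]
    return final, labels
```

Calling `batch_kmeans(points, k=3, n_batches=8)` on a list of coordinate tuples returns the final centroids and one cluster label per input point. Only the small set of batch centroids is clustered in the merge step, so no single k-means run ever sees the full dataset.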