How to Use K-means for Big Data Clustering?

Bibliographic details
Published in:	Pattern Recognition, Vol. 137; p. 109269
Main authors:	Mussabayev, Rustam; Mladenovic, Nenad; Jarboui, Bassem; Mussabayev, Ravil
Format:	Journal Article
Language:	English
Published:	Elsevier Ltd, 1 May 2023
Subjects:	Big data; Clustering; Decomposition; Divide and conquer algorithm; Global optimization; K-means; Minimum sum-of-squares; Unsupervised learning
ISSN:	0031-3203, 1873-5142

Description
Summary:
• We suggest a new parallel big data clustering scheme based on the K-means and K-means++ algorithms.
• By decomposing the dataset, the proposed global search scheme efficiently finds quality clustering solutions while processing significantly less data.
• Requirements for “true big data” clustering algorithms are formulated.
• Extensive experiments on real-world datasets show the superiority of the proposed scheme over competing algorithms.
• In line with “the more data, the better” concept, the larger the analyzed dataset is, the greater the advantage of our algorithm over other algorithms.

K-means plays a vital role in data mining and is the simplest and most widely used algorithm under the Euclidean Minimum Sum-of-Squares Clustering (MSSC) model. However, its performance drops drastically when it is applied to vast amounts of data. It is therefore crucial to scale K-means to big data while using as little as possible of the following computational resources: data, time, and algorithmic ingredients. We propose a new parallel scheme of using the K-means and K-means++ algorithms for big data clustering that satisfies the properties of a “true big data” algorithm and outperforms classical and recent state-of-the-art MSSC approaches in terms of solution quality and runtime. The new approach naturally implements global search by decomposing the MSSC problem, without using additional metaheuristics. This work shows that data decomposition is the basic approach to solving the big data clustering problem. The empirical success of the new algorithm allowed us to challenge the common belief that more data is required to obtain a good clustering solution. Moreover, the present work questions the established trend that more sophisticated hybrid approaches and algorithms are required to obtain a better clustering solution.
DOI:	10.1016/j.patcog.2022.109269
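
The summary describes clustering by decomposition: K-means with K-means++ seeding is run on small pieces of the dataset rather than on all of it. The record does not give the algorithm's details, so the following is only a minimal sketch of one plausible reading, not the authors' method. All function names and parameters (`decomposed_kmeans`, `chunk_size`, `n_chunks`) are hypothetical, and the selection rule (keep the centroids with the lowest mean squared distance on their own chunk) is an assumption made for illustration. The chunks are independent, so they could be processed in parallel; the sketch runs them sequentially for brevity.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mssc_objective(points, centroids):
    """MSSC objective: sum of squared distances to the nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def kmeans_pp_init(points, k, rng):
    """K-means++ seeding: pick each new centroid with probability
    proportional to its squared distance from the nearest chosen one."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:  # all points coincide with a centroid
            centroids.append(rng.choice(points))
        else:
            centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def kmeans(points, centroids, max_iter=50):
    """Lloyd's K-means iterations starting from the given centroids."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Empty clusters keep their previous centroid.
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def decomposed_kmeans(points, k, chunk_size, n_chunks, seed=0):
    """Hypothetical decomposition scheme: cluster random chunks and keep
    the centroids with the lowest mean squared distance on their chunk.
    The full dataset is never scanned in one pass."""
    rng = random.Random(seed)
    best_centroids, best_score = None, float("inf")
    for _ in range(n_chunks):
        chunk = rng.sample(points, min(chunk_size, len(points)))
        centroids = kmeans(chunk, kmeans_pp_init(chunk, k, rng))
        score = mssc_objective(chunk, centroids) / len(chunk)
        if score < best_score:
            best_centroids, best_score = centroids, score
    return best_centroids, best_score


if __name__ == "__main__":
    rng = random.Random(1)
    data = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
            for cx, cy in [(0, 0), (10, 10)] for _ in range(500)]
    centroids, score = decomposed_kmeans(data, k=2, chunk_size=100, n_chunks=5)
    print(centroids, score)
```

In this sketch each run touches only `chunk_size` points, which reflects the summary's claim that good solutions can be found while processing significantly less data. Dividing the objective by the chunk length keeps scores comparable if chunks differ in size.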