A survey on parallel clustering algorithms for Big Data

Bibliographic Details
Published in:	The Artificial intelligence review, Vol. 54, no. 4, pp. 2411-2443
Main Authors:	Dafir, Zineb; Lamari, Yasmine; Slaoui, Said Chah
Format:	Journal Article
Language:	English
Published:	Dordrecht, Springer Netherlands, 01.04.2021; Springer; Springer Nature B.V
Subjects:	Algorithms; Artificial Intelligence; Big Data; Classification; Clustering; Computer Science; Context; Data mining; Digital integrated circuits; Field programmable gate arrays; Information retrieval; Microprocessors; Peer relationships; Peer to peer computing; Peers; Social networks; Surveys; means Algorithms; FPGA; MPI; DBSCAN; Spark; GPU; Multi-cores CPU; MapReduce
ISSN:	0269-2821, 1573-7462

Description
Summary:	Data clustering is one of the most studied data mining tasks. It aims, through various methods, to discover previously unknown groups within data sets. In recent years, considerable progress in this field has led to the development of innovative and promising clustering algorithms. However, these traditional clustering algorithms present serious issues with respect to speed-up, throughput, and scalability. Thus, they can no longer be used directly in the context of Big Data, where data are mainly characterized by their volume, velocity, and variety. To overcome these limitations, current research is turning to parallel computing, giving rise to so-called parallel clustering algorithms. This paper presents an overview of the latest parallel clustering algorithms, categorized according to the computing platforms used to handle Big Data, namely horizontal and vertical scaling platforms. The former category includes peer-to-peer networks, MapReduce, and Spark platforms, while the latter includes multi-core processors, Graphics Processing Units, and Field Programmable Gate Arrays. In addition, it compares the performance of the reviewed algorithms based on common clustering validation criteria in the Big Data context. It thus provides the reader with an overall view of current parallel clustering techniques.
DOI:	10.1007/s10462-020-09918-2