Parallel and scalable Dunn Index for the validation of big data clusters

Bibliographic details
Published in: Parallel computing Vol. 102; p. 102751
Main authors: Ben Ncir, Chiheb-Eddine, Hamza, Abdallah, Bouaguel, Waad
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.05.2021
ISSN:0167-8191, 1872-7336
Description
Summary: Parallelizing data clustering algorithms has attracted the interest of many researchers over the past few years. Many efficient parallel algorithms have been proposed to build partitionings over huge volumes of data. The effectiveness of these algorithms is attributed to the distribution of data among a cluster of nodes and to the parallel computation models. Despite the effectiveness of parallel models in dealing with increasing volumes of data, little work has been done on the validation of big clusters. To deal with this issue, we propose a parallel and scalable model, referred to as S-DI (Scalable Dunn Index), to compute the Dunn Index measure for internal validation of clustering results. Rather than computing the Dunn Index on a single machine in the clustering validation process, the proposed measure is computed by distributing the partitioning among a cluster of nodes using a customized parallel model under the Apache Spark framework. The proposed S-DI is also enhanced by a Sketch and Validate sampling technique, which aims to approximate the Dunn Index value using a small representative data sample. Experiments on simulated and real datasets showed good scalability of the proposed measure and reliable validation compared to other existing measures when handling large-scale data.
DOI:10.1016/j.parco.2021.102751
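
For readers unfamiliar with the measure named in the summary, the following is a minimal single-machine sketch of the classical Dunn Index: the smallest distance between points of different clusters divided by the largest cluster diameter. It is not the S-DI model from the article and does not use Apache Spark. The function names `dunn_index` and `sampled_dunn_index` are hypothetical, and the stratified random sample in the second function is only a simple stand-in for the Sketch and Validate technique, whose details the record does not give.

```python
import math
import random
from itertools import combinations


def dunn_index(points, labels):
    """Classical Dunn Index: min inter-cluster distance / max cluster diameter."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the Dunn Index needs at least two clusters")

    # Largest distance between two points of the same cluster.
    max_diameter = max(
        (math.dist(a, b)
         for members in clusters.values()
         for a, b in combinations(members, 2)),
        default=0.0,
    )
    # Smallest distance between two points of different clusters.
    min_separation = min(
        math.dist(a, b)
        for members_a, members_b in combinations(clusters.values(), 2)
        for a in members_a
        for b in members_b
    )
    if max_diameter == 0.0:
        return math.inf  # every cluster is a single point (or duplicates)
    return min_separation / max_diameter


def sampled_dunn_index(points, labels, fraction, seed=0):
    """Approximate the index on a per-cluster random sample (assumed stand-in)."""
    rng = random.Random(seed)
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    sample_points, sample_labels = [], []
    for label, members in clusters.items():
        # Keep at least one point per cluster so no cluster disappears.
        k = max(1, round(fraction * len(members)))
        sample_points.extend(rng.sample(members, k))
        sample_labels.extend([label] * k)
    return dunn_index(sample_points, sample_labels)


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (5, 0), (5, 1)]
    lbs = [0, 0, 1, 1]
    print(dunn_index(pts, lbs))  # 5.0: separation 5, diameter 1
```

The exact computation compares every pair of points, so its cost grows quadratically with the number of points. That cost is the motivation the summary gives for distributing the computation and for approximating the value on a sample.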