An adaptive algorithm for clustering cumulative probability distribution functions using the Kolmogorov–Smirnov two-sample test

Bibliographic details
Published in: Expert Systems with Applications, vol. 42, no. 8, pp. 4016-4021
Main authors: Mora-López, Llanos; Mora, Juan
Format: Journal Article
Language: English
Published: Elsevier Ltd, 15 May 2015
ISSN:0957-4174, 1873-6793
Description
Abstract:
• An adaptive clustering algorithm is proposed.
• The proposed distance measure is the Kolmogorov–Smirnov statistic.
• A practical application of the algorithm demonstrates its power.
• The proposed algorithm clusters solar spectra data better than classical k-means.

This paper proposes an adaptive algorithm for clustering cumulative probability distribution functions (c.p.d.f.) of a continuous random variable, observed in different populations, into the minimum number of homogeneous clusters, making no parametric assumptions about the c.p.d.f.’s. The proposed distance function for clustering c.p.d.f.’s is based on the Kolmogorov–Smirnov two-sample statistic. The corresponding test can detect differences in position, dispersion, or shape of the c.p.d.f.’s. In this context, the statistic makes it possible to cluster the recorded data with a homogeneity criterion based on the whole distribution of each data set, and to decide whether more clusters are needed. In this sense the proposed algorithm is adaptive: it automatically increases the number of clusters only as necessary, so the number of clusters does not have to be fixed in advance. The outputs of the algorithm are, for each cluster, the common c.p.d.f. of all observed data in the cluster (the centroid) and the Kolmogorov–Smirnov statistic between the centroid and the most distant c.p.d.f. The proposed algorithm has been applied to a large data set of solar global irradiation spectra distributions. The results make it possible to reduce the information in more than 270,000 c.p.d.f.’s to only 6 clusters, which correspond to 6 different c.p.d.f.’s.
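The abstract does not give the steps of the algorithm, so the following is only a minimal sketch of the general idea, not the authors' procedure. It computes the two-sample Kolmogorov–Smirnov statistic, D = max |F_a(x) - F_b(x)| over the empirical distribution functions, and uses the standard asymptotic critical value to decide whether a data set fits an existing cluster or whether a new cluster has to be opened. The function names, the greedy one-pass assignment, the pooled-sample centroid, and the default significance level alpha = 0.05 are all assumptions made for illustration.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    # Both ECDFs are step functions, so the maximum gap occurs at a data point.
    return max(
        abs(bisect_right(a, x) / n - bisect_right(b, x) / m)
        for x in a + b
    )


def ks_critical_value(n, m, alpha=0.05):
    """Asymptotic two-sample critical value at significance level alpha."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c_alpha * math.sqrt((n + m) / (n * m))


def adaptive_cluster(samples, alpha=0.05):
    """Hypothetical greedy pass: a new cluster is opened only when the KS test
    rejects homogeneity with every existing cluster centroid."""
    clusters = []  # each cluster is a list of member samples
    for sample in samples:
        best, best_d = None, None
        for cluster in clusters:
            # Assumed centroid: the pooled observations of all members.
            pooled = [x for member in cluster for x in member]
            d = ks_statistic(sample, pooled)
            accepted = d <= ks_critical_value(len(sample), len(pooled), alpha)
            if accepted and (best_d is None or d < best_d):
                best, best_d = cluster, d
        if best is None:
            clusters.append([sample])
        else:
            best.append(sample)
    return clusters


def summarize(clusters):
    """Per cluster: (pooled centroid sample, KS distance to farthest member)."""
    summary = []
    for cluster in clusters:
        pooled = sorted(x for member in cluster for x in member)
        radius = max(ks_statistic(member, pooled) for member in cluster)
        summary.append((pooled, radius))
    return summary


if __name__ == "__main__":
    near_1 = [i / 50 for i in range(50)]
    near_2 = [i / 50 + 0.01 for i in range(50)]
    far = [i / 50 + 5 for i in range(50)]
    found = adaptive_cluster([near_1, near_2, far])
    print(len(found), [radius for _, radius in summarize(found)])
```

In this sketch the result depends on the order in which the data sets are visited, and pooling recomputes the centroid on every comparison, which would be too slow for hundreds of thousands of distributions; both are simplifications chosen for brevity.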
DOI:10.1016/j.eswa.2014.12.027