Cure: an efficient clustering algorithm for large databases


Bibliographic details
Published in:	Information systems (Oxford) Vol. 26; no. 1; pp. 35-58
Main authors:	Guha, Sudipto, Rastogi, Rajeev, Shim, Kyuseok
Format:	Journal Article
Language:	English
Published:	Elsevier Ltd, 1 March 2001
Subjects:	Clustering; Clustering Algorithms; Computer applications; Data Mining; Knowledge Discovery
ISSN:	0306-4379, 1873-6076

Description
Summary:	Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. We propose a new clustering algorithm called CURE that is more robust to outliers, and identifies clusters having non-spherical shapes and wide variances in size. CURE achieves this by representing each cluster by a certain fixed number of points that are generated by selecting well scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes and the shrinking helps to dampen the effects of outliers. To handle large databases, CURE employs a combination of random sampling and partitioning. A random sample drawn from the data set is first partitioned and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters. Our experimental results confirm that the quality of clusters produced by CURE is much better than those found by existing algorithms. Furthermore, they demonstrate that random sampling and partitioning enable CURE to not only outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality.
DOI:	10.1016/S0306-4379(01)00008-4
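
Illustration:	The representative-point step described in the summary (select well scattered points from a cluster, then shrink them toward the cluster center by a specified fraction) can be sketched roughly as follows. This is a minimal sketch, not the authors' implementation: the function names, the parameters `c` (number of representatives) and `alpha` (shrink fraction), and the farthest-point heuristic used to pick "well scattered" points are all assumptions made for illustration.

```python
import math


def centroid(points):
    """Arithmetic mean of a list of equal-length coordinate tuples."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))


def scattered_points(points, c):
    """Pick up to c well-scattered points.

    Assumption: a greedy farthest-point heuristic stands in for
    "well scattered"; the catalog summary does not specify the rule.
    """
    center = centroid(points)
    # Start with the point farthest from the cluster center.
    chosen = [max(points, key=lambda p: math.dist(p, center))]
    while len(chosen) < min(c, len(points)):
        # Add the point whose nearest already-chosen point is farthest away.
        chosen.append(
            max(points, key=lambda p: min(math.dist(p, q) for q in chosen))
        )
    return chosen


def shrink(reps, center, alpha):
    """Move each representative toward the center by the fraction alpha."""
    return [
        tuple(r_i + alpha * (c_i - r_i) for r_i, c_i in zip(r, center))
        for r in reps
    ]


def representatives(points, c=4, alpha=0.3):
    """Scattered points of one cluster, shrunk toward its center."""
    return shrink(scattered_points(points, c), centroid(points), alpha)


if __name__ == "__main__":
    # Hypothetical toy cluster: four corners of a square plus its middle.
    cluster = [(0, 0), (4, 0), (0, 4), (4, 4), (2, 2)]
    print(representatives(cluster, c=4, alpha=0.5))
```

With `alpha = 0` the representatives stay on the cluster boundary, and with `alpha = 1` they all collapse onto the center, so the fraction controls how strongly boundary points (and any outliers among them) are pulled inward.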