Isolation forests and landmarking-based representations for clustering algorithm recommendation using meta-learning

Bibliographic details
Published in:	Information Sciences, Volume 574, pp. 473-489
Main authors:	Gabbay, Itay; Shapira, Bracha; Rokach, Lior
Format:	Journal Article
Language:	English
Published:	Elsevier Inc, 1 October 2021
Subjects:	Algorithm selection; Clustering; Dataset embedding; Meta-knowledge; Meta-learning systems; Problem characterization
ISSN:	0020-0255, 1872-6291

Description
Summary:	The data clustering problem can be described as the task of organizing data into groups, where in each group the objects share some similar attributes. Most of the problems clustering algorithms address do not have a prior solution. This paper addresses the algorithm selection challenge for data clustering, while taking the difficulty in evaluating clustering solutions into account. We present a new meta-learning method for recommending the most suitable clustering algorithm for a dataset. Based on concepts from the isolation forest algorithm, we propose a new similarity measure between datasets. Our proposed dataset characterization methods generate an embedding for a dataset using this similarity measure, which is then used to improve the quality of the problem’s characterization. The method utilizes landmarking concepts to characterize the dataset and then, inspired by the DeepFM algorithm, applies meta-learning to rank the candidate algorithms that are expected to perform the best for the current dataset. This ranking could, among other things, support AutoML systems. Our approach is evaluated on a corpus of 100 publicly available benchmark datasets. We compare our method’s ranking performance to that of existing meta-learning methods and show the dominance of our method in terms of predictive performance and computational complexity.
DOI:	10.1016/j.ins.2021.06.033