Isolation forests and landmarking-based representations for clustering algorithm recommendation using meta-learning

Bibliographic Details
Published in: Information Sciences, Vol. 574, pp. 473-489
Main Authors: Gabbay, Itay; Shapira, Bracha; Rokach, Lior
Format: Journal Article
Language: English
Published: Elsevier Inc 01.10.2021
ISSN: 0020-0255, 1872-6291
Description
Summary: Data clustering is the task of organizing data into groups such that the objects in each group share similar attributes. Most problems that clustering algorithms address have no known solution in advance. This paper addresses the algorithm selection challenge for data clustering while taking into account the difficulty of evaluating clustering solutions. We present a new meta-learning method for recommending the most suitable clustering algorithm for a dataset. Based on concepts from the isolation forest algorithm, we propose a new similarity measure between datasets. Our proposed dataset characterization methods use this similarity measure to generate an embedding for a dataset, which is then used to improve the quality of the problem’s characterization. The method uses landmarking concepts to characterize the dataset and then, inspired by the DeepFM algorithm, applies meta-learning to rank the candidate algorithms expected to perform best on the current dataset. This ranking could, among other things, support AutoML systems. Our approach is evaluated on a corpus of 100 publicly available benchmark datasets. We compare our method’s ranking performance to that of existing meta-learning methods and show that our method outperforms them in terms of predictive performance and computational complexity.
DOI: 10.1016/j.ins.2021.06.033