New optimization models for optimal classification trees

Bibliographic Details
Published in: Computers & Operations Research, Vol. 164, p. 106515
Main Authors: Ales, Zacharie; Huré, Valentine; Lambert, Amélie
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.04.2024
ISSN: 0305-0548, 1873-765X
Description
Summary: The most efficient state-of-the-art methods to build classification trees are greedy heuristics (e.g. CART) that may fail to find underlying characteristics in datasets. Recently, exact linear formulations with better accuracy were introduced. However, they do not scale up to datasets with more than a few thousand data points. In this paper, we introduce four new formulations for building optimal trees. The first is a quadratic model based on the well-known formulation of Bertsimas et al. We then propose two different linearizations of this new quadratic model. The last model extends a flow formulation limited to binary datasets (Aghaei et al.) to real-valued datasets. Each model is introduced for both axis-aligned and oblique splits. We further prove that our new formulations have stronger continuous relaxations than existing models. Finally, we present computational results on 22 standard datasets with up to thousands of data points. Our exact models have reduced solution times, with learning performances as strong as or significantly better than those of state-of-the-art exact approaches.
Highlights:
• A new quadratic formulation linearized in two different ways.
• Extension of a flow formulation to non-binary datasets.
• Our three linear formulations have stronger linear relaxations than other formulations.
• An iterative algorithm to fit our models' parameters with a reduced number of iterations.
• Experimental results showing the efficiency of our formulations and our algorithm.
DOI: 10.1016/j.cor.2023.106515
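
For readers unfamiliar with the exact approaches the abstract contrasts with greedy CART splits, the sketch below is a minimal, self-contained illustration of the general idea: a depth-1 axis-aligned classification tree chosen by a mixed-integer linear program that directly minimizes the number of misclassified points. It is not any of the four formulations proposed in the paper, nor the Bertsimas et al. or Aghaei et al. models; the toy data, big-M constant, and variable names are assumptions made purely for illustration, and the PuLP modeler with its bundled CBC solver is used only because it is freely available.

import pulp

# Toy binary-classification data; features are assumed to be scaled to [0, 1].
X = [[0.1, 0.9], [0.2, 0.8], [0.8, 0.3], [0.9, 0.2], [0.4, 0.6], [0.7, 0.4]]
y = [0, 0, 1, 1, 0, 1]
n, p = len(X), len(X[0])
eps, M = 1e-4, 1.0 + 1e-4          # split margin and big-M (valid because features lie in [0, 1])

prob = pulp.LpProblem("toy_optimal_depth1_tree", pulp.LpMinimize)

a = [pulp.LpVariable(f"a_{j}", cat="Binary") for j in range(p)]   # which feature defines the split
b = pulp.LpVariable("b", lowBound=0, upBound=1)                    # split threshold
z = [pulp.LpVariable(f"z_{i}", cat="Binary") for i in range(n)]    # 1 if point i is routed to the right leaf
cL = pulp.LpVariable("class_left", cat="Binary")                   # class predicted in the left leaf
cR = pulp.LpVariable("class_right", cat="Binary")                  # class predicted in the right leaf
e = [pulp.LpVariable(f"e_{i}", lowBound=0) for i in range(n)]      # misclassification indicator for point i

prob += pulp.lpSum(e)          # objective: total number of misclassified points
prob += pulp.lpSum(a) == 1     # exactly one feature is used (axis-aligned split)

for i in range(n):
    ax = pulp.lpSum(a[j] * X[i][j] for j in range(p))   # value of the chosen feature for point i
    prob += ax <= b + M * z[i]                           # z_i = 0  =>  point goes left  (ax <= b)
    prob += ax >= b + eps - M * (1 - z[i])               # z_i = 1  =>  point goes right (ax >  b)
    if y[i] == 1:
        prob += e[i] >= 1 - cL - z[i]                    # wrong if it lands left and the left leaf predicts 0
        prob += e[i] >= 1 - cR - (1 - z[i])              # wrong if it lands right and the right leaf predicts 0
    else:
        prob += e[i] >= cL - z[i]                        # wrong if it lands left and the left leaf predicts 1
        prob += e[i] >= cR - (1 - z[i])                  # wrong if it lands right and the right leaf predicts 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
j_star = next(j for j in range(p) if pulp.value(a[j]) > 0.5)
print("split on feature", j_star, "at threshold", pulp.value(b),
      "with", int(sum(pulp.value(v) for v in e)), "misclassified points")

On this separable toy set the solver finds a split on the first feature with zero errors. Even this single-split model already needs big-M routing constraints and binary leaf-class variables, which hints at why the strength of the continuous relaxation, the focus of the paper's formulations, matters so much for scaling exact trees to deeper topologies and larger datasets.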