A Parameter-Free Cleaning Method for SMOTE in Imbalanced Classification


Bibliographic Details
Published in: IEEE Access, Vol. 7, pp. 23537-23548
Main Authors: Yan, Yuanting; Liu, Ruiqing; Ding, Zihan; Du, Xiuquan; Chen, Jie; Zhang, Yanping
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2019
ISSN: 2169-3536
Description
Summary: Oversampling is an effective technique for dealing with the class-imbalance problem. It addresses the problem by duplicating or generating minority-class samples to balance the distribution between the majority and minority classes. The synthetic minority oversampling technique (SMOTE) is one of its typical representatives. Over the past decade, researchers have proposed many variants of SMOTE. However, existing oversampling methods may generate incorrect minority-class samples in some scenarios, and how to effectively mine the inherent complex characteristics of imbalanced data remains a challenge. To this end, this paper proposes a parameter-free data cleaning method that improves SMOTE based on the constructive covering algorithm. The dataset generated by SMOTE is first partitioned into a group of covers; hard-to-learn samples are then detected based on the characteristics of the sample-space distribution. Finally, a pair-wise deletion strategy is proposed to remove the hard-to-learn samples. Experimental results on 25 imbalanced datasets show that the proposed method is superior to the comparison methods in terms of various metrics, such as F-measure, G-mean, and Recall. The method not only reduces the complexity of the dataset but also improves the performance of the classification model.
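For readers unfamiliar with the base technique the paper builds on, the interpolation step of standard SMOTE can be sketched as follows. This is a minimal illustrative version in Python with NumPy, not the authors' implementation; the function name `smote` and its parameters are chosen here for illustration, and the paper's cover-based cleaning stage is not shown.

```python
import numpy as np

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples: each new point is a
    random interpolation between a minority sample and one of its
    k nearest minority-class neighbors (the core SMOTE idea)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances among minority samples.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a sample is not its own neighbor
    k = min(k, n - 1)
    # Indices of the k nearest neighbors for each minority sample.
    nn = np.argsort(d, axis=1)[:, :k]
    synth = np.empty((n_new, X.shape[1]))
    for i in range(n_new):
        j = rng.integers(n)              # pick a random minority sample
        nb = X[nn[j, rng.integers(k)]]   # pick one of its k neighbors
        lam = rng.random()               # interpolation factor in [0, 1)
        synth[i] = X[j] + lam * (nb - X[j])
    return synth
```

Because every synthetic point lies on a segment between two existing minority samples, oversampling near the class boundary can place new points in majority-class territory; detecting and removing such hard-to-learn samples is exactly what the paper's cleaning step addresses.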
DOI: 10.1109/ACCESS.2019.2899467