Lung Cancer Classification Using the Extreme Gradient Boosting (XGBoost) Algorithm and Mutual Information for Feature Selection

Bibliographic Details
Published in: Sistemasi: Jurnal Sistem Informasi (Online), Vol. 14, No. 5, pp. 2198-2214
Main Authors: Zizilia, Regitha; Chrisnanto, Yulison Herry; Abdillah, Gunawan
Format: Journal Article
Language: English, Indonesian
Published: Islamic University of Indragiri, 01.09.2025
ISSN: 2302-8149, 2540-9719
Summary: Lung cancer is one of the deadliest cancers worldwide and is often detected too late because early symptoms are absent. This study evaluates the impact of feature selection using Mutual Information on the performance of lung cancer classification with the XGBoost algorithm. Mutual Information is used to select relevant features, including those with linear and non-linear relationships to the target variable, while XGBoost is chosen for its ability to handle large datasets and reduce overfitting. The study was conducted on a dataset of 30,000 records, with train:test split scenarios of 90:10, 80:20, 70:30, and 60:40. Before applying Mutual Information, testing accuracy ranged from 93.42% to 93.83%, and K-Fold Cross-Validation accuracy ranged from 94.59% to 94.76%. After feature selection, testing accuracy remained stable for the 70:30 and 60:40 splits, at 93.60% and 93.42% respectively, although K-Fold Cross-Validation accuracy decreased to 89.26% and 90.88%. For the 90:10 and 80:20 splits, accuracy declined: testing accuracy dropped to 88.63% and 88.85%, and K-Fold Cross-Validation accuracy fell to 88.87% and 90.24%. Feature selection using Mutual Information improves computational efficiency by reducing the number of features, and in certain data scenarios it can simplify the feature set without significantly compromising model performance, depending on the characteristics of the dataset.
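The pipeline the abstract describes — ranking features by mutual information with the target, keeping the most informative ones, then training a gradient-boosted classifier and evaluating with a holdout split plus k-fold cross-validation — can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code: the dataset, the number of selected features (k=8), and the 80:20 split are assumptions, and scikit-learn's GradientBoostingClassifier stands in for the xgboost library's XGBClassifier to keep the sketch dependency-light.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the lung-cancer dataset: 2,000 samples, 15 features,
# of which only 6 actually carry signal about the class label.
X, y = make_classification(n_samples=2000, n_features=15,
                           n_informative=6, random_state=0)

# Score every feature by its estimated mutual information with the target
# (captures non-linear as well as linear dependence) and keep the top 8.
selector = SelectKBest(mutual_info_classif, k=8).fit(X, y)
X_sel = selector.transform(X)

# One of the abstract's split scenarios (80:20), evaluated on the holdout set.
Xtr, Xte, ytr, yte = train_test_split(X_sel, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(Xtr, ytr)
test_acc = clf.score(Xte, yte)

# K-Fold Cross-Validation accuracy on the reduced feature set.
cv_acc = cross_val_score(GradientBoostingClassifier(random_state=0),
                         X_sel, y, cv=5).mean()
print(f"test accuracy: {test_acc:.3f}, 5-fold CV accuracy: {cv_acc:.3f}")
```

Running the same pipeline with and without the `SelectKBest` step reproduces the comparison made in the study: the feature count drops, so training is cheaper, and the accuracy difference shows how much predictive signal the discarded features carried.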
DOI: 10.32520/stmsi.v14i5.5345