Lung Cancer Classification Using the Extreme Gradient Boosting (XGBoost) Algorithm and Mutual Information for Feature Selection
| Published in: | Sistemasi : jurnal sistem informasi (Online) Vol. 14; no. 5; pp. 2198 - 2214 |
|---|---|
| Main Authors: | Zizilia, Regitha; Chrisnanto, Yulison Herry; Abdillah, Gunawan |
| Format: | Journal Article |
| Language: | English, Indonesian |
| Published: | Islamic University of Indragiri, 2025-09-01 |
| Subjects: | classification, k-fold cross validation, lung cancer, mutual information, XGBoost |
| ISSN: | 2302-8149, 2540-9719 |
| Abstract | Lung cancer is one of the deadliest types of cancer worldwide and is often detected too late because early symptoms are absent. This study evaluates how feature selection with Mutual Information affects the performance of lung cancer classification with the XGBoost algorithm. Mutual Information is used to select relevant features, capturing both linear and non-linear relationships with the target variable, while XGBoost is chosen for its ability to handle large datasets and reduce overfitting. The study was conducted on a dataset of 30,000 entries, with data split scenarios of 90:10, 80:20, 70:30, and 60:40. Before Mutual Information was applied, testing accuracy ranged from 93.42% to 93.83% and K-Fold Cross-Validation accuracy ranged from 94.59% to 94.76%. After feature selection, testing accuracy remained stable for the 70:30 and 60:40 split scenarios, at 93.60% and 93.42% respectively, although K-Fold Cross-Validation accuracy decreased to 89.26% and 90.88%. For the 90:10 and 80:20 split scenarios, accuracy declined: testing accuracy dropped to 88.63% and 88.85%, and K-Fold Cross-Validation accuracy fell to 88.87% and 90.24%. Feature selection with Mutual Information improves computational efficiency by reducing the number of features, and in certain data scenarios it can simplify the feature set without significantly compromising model performance, depending on the characteristics of the dataset. |
|---|---|
| Author | Zizilia, Regitha; Chrisnanto, Yulison Herry; Abdillah, Gunawan |
| DOI | 10.32520/stmsi.v14i5.5345 |
| Discipline | Computer Science |
| OpenAccessLink | https://doaj.org/article/87d2c2cb68fd4c3fbf6bbb809cd104df |
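
The feature-selection step described in the abstract can be illustrated with a minimal sketch. It assumes discrete-valued features and estimates mutual information directly from counts; the record does not say which estimator or library the authors used, so this is not their implementation. The function names, the toy data, and the feature names (`linear`, `nonlinear`, `noise`) are hypothetical, and the XGBoost training step is left out.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Estimate I(X; Y) in nats from two equal-length sequences of discrete values."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * log(c * n / (px[x] * py[y]))
    return mi


def select_top_k(features, target, k):
    """Score every feature column against the target and keep the k highest."""
    scores = {name: mutual_information(col, target) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical toy data (not from the study): 8 samples, binary target.
target = [0, 0, 1, 1, 0, 1, 0, 1]
features = {
    # Identical to the target: a simple linear relationship.
    "linear": [0, 0, 1, 1, 0, 1, 0, 1],
    # Values 0 and 2 map to class 1, value 1 maps to class 0:
    # fully informative, yet its linear correlation with the target is zero.
    "nonlinear": [1, 1, 0, 2, 1, 0, 1, 2],
    # Statistically independent of the target in this sample.
    "noise": [0, 1, 0, 1, 0, 0, 1, 1],
}

selected, scores = select_top_k(features, target, k=2)
```

In this toy sample the `nonlinear` feature scores as high as the `linear` one, which is the property the abstract attributes to Mutual Information. In practice the selected columns would then be passed to the classifier and evaluated under each split scenario and with K-Fold Cross-Validation.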