Diabetes Risk Prediction using Shapley Additive Explanations for Feature Engineering

Detailed bibliography
Title: Diabetes Risk Prediction using Shapley Additive Explanations for Feature Engineering
Authors: Chinwe Miracle Chituru, Sin-Ban Ho, Ian Chai
Source: Journal of Informatics and Web Engineering, Vol 4, Iss 2, Pp 18-35 (2025)
Publisher information: MMU Press, 2025.
Publication year: 2025
Collection: LCC:Electronic computers. Computer science
LCC:Information technology
Subjects: diabetes risk prediction, decision tree algorithm, additive explanations, feature engineering, data visualization, Electronic computers. Computer science, QA75.5-76.95, Information technology, T58.5-58.64
Description: Diabetes is prevalent globally, and its prevalence is expected to increase over the next few years. This includes people with different types of diabetes, including type 1 and type 2. The increase has several causes, chiefly dietary choices and lack of exercise. This global health challenge calls for effective prediction and early management of the disease. This research uses the decision tree algorithm to predict diabetes risk and integrates SHapley Additive exPlanations (SHAP) for feature engineering and model interpretability. Random forest and gradient boosting models were also developed to identify risk factors and to compare their predictions with those of the decision tree model. The classifiers were evaluated on accuracy, F1-score, precision, and recall. Understanding the features that drive predictions can enhance clinical decision-making as much as predictive accuracy can. On a dataset of 520 instances with 17 features, including the target output, the proposed decision tree model achieved an accuracy of 97%. The decision tree model’s categorical variables enable straightforward data visualization. After the model was developed, SHAP was applied to interpret its predictions. This is crucial for healthcare practitioners because it provides specific health metrics for identifying high-risk diabetic patients. Preliminary results indicate that polyuria, polydipsia, and age in combination are predictors of diabetes risk. This study highlights that integrating SHAP with the decision tree algorithm provides both predictive capability and transparent model interpretability. It also contributes to the growing body of literature on machine learning in healthcare. The results support applying this methodology in clinical settings, fostering trust in the approach among practitioners and patients alike.
Document type: article
File description: electronic resource
Language: English
ISSN: 2821-370X
Relation: https://journals.mmupress.com/index.php/jiwe/article/view/1387; https://doaj.org/toc/2821-370X
DOI: 10.33093/jiwe.2025.4.2.2
Access URL: https://doaj.org/article/51efe5b545bb42fbb252885db4a6b77c
Accession number: edsdoj.51efe5b545bb42fbb252885db4a6b77c
Database: Directory of Open Access Journals
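
The abstract describes applying SHAP to a trained decision tree to see which features drive a prediction. The idea behind SHAP values can be illustrated with a minimal, standard-library-only sketch that computes exact Shapley values by enumerating feature coalitions. This is not the authors' code and does not use the SHAP library: `toy_tree`, its output scores, the binary feature encoding (including `age_over_50`), and the background records are all hypothetical, chosen only so the example is small enough to follow by hand.

```python
from itertools import combinations
from math import factorial

# Hypothetical binary encoding of three features named in the abstract.
FEATURES = ["polyuria", "polydipsia", "age_over_50"]


def toy_tree(x):
    """Hand-written stand-in for a trained decision tree (NOT the paper's model)."""
    if x["polyuria"]:
        return 0.9 if x["polydipsia"] else 0.6
    if x["polydipsia"]:
        return 0.5
    return 0.3 if x["age_over_50"] else 0.1


# Made-up background records used to average out "unknown" features.
BACKGROUND = [
    {"polyuria": 0, "polydipsia": 0, "age_over_50": 0},
    {"polyuria": 0, "polydipsia": 0, "age_over_50": 1},
    {"polyuria": 1, "polydipsia": 0, "age_over_50": 1},
    {"polyuria": 1, "polydipsia": 1, "age_over_50": 0},
]


def coalition_value(model, x, known, background):
    """Mean model output when only the `known` features are fixed to x's values."""
    total = 0.0
    for b in background:
        mixed = {f: (x[f] if f in known else b[f]) for f in FEATURES}
        total += model(mixed)
    return total / len(background)


def shapley_values(model, x, background):
    """Exact Shapley values: weighted marginal contribution over all coalitions."""
    n = len(FEATURES)
    phi = {}
    for f in FEATURES:
        others = [g for g in FEATURES if g != f]
        contribution = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_f = coalition_value(model, x, set(subset) | {f}, background)
                without_f = coalition_value(model, x, set(subset), background)
                contribution += weight * (with_f - without_f)
        phi[f] = contribution
    return phi


patient = {"polyuria": 1, "polydipsia": 1, "age_over_50": 1}
baseline = coalition_value(toy_tree, patient, set(), BACKGROUND)
phi = shapley_values(toy_tree, patient, BACKGROUND)

# Rank features by the size of their contribution to this one prediction.
for name, value in sorted(phi.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:12s} {value:+.3f}")

# Additivity: baseline plus all contributions reproduces the model output.
print(f"baseline {baseline:.3f} + contributions {sum(phi.values()):.3f} "
      f"= prediction {toy_tree(patient):.3f}")
```

The final line checks the additivity property that gives SHAP its name: the per-feature contributions sum to the difference between the model's output for this patient and the average output over the background data. Brute-force enumeration is only practical for a handful of features; for a real model with 16 predictors, a dedicated tree explainer would be the usual choice.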