Diabetes Risk Prediction using Shapley Additive Explanations for Feature Engineering

Bibliographic Details
Title: Diabetes Risk Prediction using Shapley Additive Explanations for Feature Engineering
Authors: Chinwe Miracle Chituru, Sin-Ban Ho, Ian Chai
Source: Journal of Informatics and Web Engineering, Vol 4, Iss 2, Pp 18-35 (2025)
Publisher Information: MMU Press, 2025.
Publication Year: 2025
Collection: LCC:Electronic computers. Computer science
LCC:Information technology
Subject Terms: diabetes risk prediction, decision tree algorithm, additive explanations, feature engineering, data visualization, Electronic computers. Computer science, QA75.5-76.95, Information technology, T58.5-58.64
Description: Diabetes is prevalent globally, and its prevalence is expected to increase in the next few years. This includes people with different types of the disease, including type 1 and type 2 diabetes. Several factors drive the increase, chiefly dietary choices and lack of exercise. This global health challenge calls for effective prediction and early management of the disease. This research focuses on using the decision tree algorithm to predict the risk of diabetes and on model interpretability through the integration of SHapley Additive exPlanations (SHAP) for feature engineering. Random forest and gradient boosting models were also developed to identify risk factors and to compare their predictions with those of the decision tree model. The performance of these classifiers was evaluated using accuracy, F1-score, precision, and recall. Understanding the features that drive predictions can enhance clinical decision-making as much as predictive accuracy does. On a comprehensive dataset of 520 instances with 17 features, including the target output, the proposed decision tree model achieved an accuracy of 97%. The decision tree model’s categorical variables enable straightforward data visualization. After the model was developed, SHAP was applied to interpret its predictions. This is crucial for healthcare practitioners, as it provides specific health metrics for identifying high-risk diabetic patients. Preliminary results indicate that polyuria, polydipsia, and age in combination are predictors of diabetes risk. This study highlights the benefits of integrating SHAP with the decision tree algorithm: predictive capability and transparent model interpretability. It also contributes to the growing body of literature on machine learning in the healthcare industry. The results support applying this methodology to prediction in clinical settings, fostering trust in the approach among practitioners and patients alike.
Document Type: article
File Description: electronic resource
Language: English
ISSN: 2821-370X
Relation: https://journals.mmupress.com/index.php/jiwe/article/view/1387; https://doaj.org/toc/2821-370X
DOI: 10.33093/jiwe.2025.4.2.2
Access URL: https://doaj.org/article/51efe5b545bb42fbb252885db4a6b77c
Accession Number: edsdoj.51efe5b545bb42fbb252885db4a6b77c
Database: Directory of Open Access Journals
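Illustrative note: the abstract refers to Shapley-value attributions of a tree model's predictions. The sketch below may help make that idea concrete. It is not the authors' code or the SHAP library: it computes exact Shapley values by enumerating feature coalitions for a hand-written toy rule set. The function names, the three binary features (polyuria, polydipsia, age over 50), the risk scores, and the all-zero baseline are assumptions made for illustration only. Features left out of a coalition are replaced by a fixed baseline value, whereas SHAP typically averages over background data.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one instance by enumerating all coalitions.

    Assumption of this sketch: a feature outside the coalition takes its
    baseline value (SHAP usually averages over a background dataset).
    """
    n = len(instance)
    features = range(n)

    def value(coalition):
        # Model output when only the coalition's features take instance values.
        x = [instance[i] if i in coalition else baseline[i] for i in features]
        return predict(x)

    phis = []
    for i in features:
        others = [j for j in features if j != i]
        phi = 0.0
        for size in range(n):
            # Shapley weight |S|! (n - |S| - 1)! / n! for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis


def toy_risk_model(x):
    """Hypothetical hand-written rules, NOT the paper's fitted decision tree."""
    polyuria, polydipsia, age_over_50 = x
    if polyuria and polydipsia:
        return 0.9
    if polyuria or polydipsia:
        return 0.6 if age_over_50 else 0.4
    return 0.1


instance = [1, 1, 0]   # polyuria and polydipsia present, age 50 or under
baseline = [0, 0, 0]   # hypothetical reference patient with no symptoms
phi = shapley_values(toy_risk_model, instance, baseline)
print([round(p, 3) for p in phi])  # [0.4, 0.4, 0.0]
```

The attributions sum to the difference between the prediction for the instance (0.9) and the prediction for the baseline (0.1), which is the additivity property that gives SHAP its name. Exhaustive enumeration grows exponentially with the number of features, so it is practical only for small examples like this one.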