Machine Learning-Based Application for Predicting Risk of Type 2 Diabetes Mellitus (T2DM) in Saudi Arabia: A Retrospective Cross-Sectional Study

Detailed bibliography
Published in: IEEE Access, Vol. 8, pp. 199539-199561
Main authors: Syed, Asif Hassan; Khan, Tabrej
Format: Journal Article
Language: English
Publication details: Piscataway: IEEE, 2020
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN: 2169-3536
Description
Summary: Earlier detection of individuals at the highest risk of developing diabetes is crucial to avoid the disease's prevalence and progression. Therefore, we aim to build a data-driven predictive application for screening subjects at a high risk of developing Type 2 Diabetes mellitus (T2DM) in the western region of Saudi Arabia. In this context, we designed and implemented a questionnaire-based cross-sectional study using conventional diabetes risk factors for studying the prevalence and the association between the outcomes and exposure(s). We used the Chi-squared test and binary logistic regression to analyze and screen the most significant diabetes risk factors for T2DM risk prediction. Synthetic Minority Over-sampling Technique (SMOTE), a class-balancer, was used to balance the cross-sectional data. We used the balanced class data to screen the best performing classification algorithm to classify patients at high risk of diabetes with a higher F1 score. The best performing classifier's hyper-parameters were further tuned using 10-fold cross-validation for achieving an improved F1 score. Additionally, we validated our proposed model against existing models built using the National Health and Nutrition Examination Survey (NHANES) dataset and the Pima Indian Diabetes (PID) dataset. The results of the Chi-squared test and binary logistic regression showed that the exposures, namely Smoking, Healthy diet, Blood-Pressure (BP), Body Mass Index (BMI), Gender, and Region, contributed significantly (p < 0.05) to the prediction of the Response variable (subjects at high risk of diabetes). The tuned two-class Decision Forest (DF) model showed better performance with an average F1 score of 0.8453 ± 0.0268. Moreover, the DF-based model adapted reasonably well to different diabetes datasets.
An Application Programming Interface (API) of the tuned DF model was implemented and deployed as a web service at https://type2-diabetes-risk-predictor.herokuapp.com , and the implementation codes are available at https://github.com/SAH-ML/T2DM-Risk-Predictor .
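
The summary names three building blocks: a Chi-squared test for screening risk factors, SMOTE for class balancing, and the F1 score as the selection metric. The following is a minimal illustrative sketch of those three ideas in standard-library Python. It is not the authors' implementation (their code is at the repository linked above); the function names `chi_squared_2x2`, `smote`, and `f1_score`, the parameter `k`, and all toy numbers are assumptions made for illustration only.

```python
import math
import random


def chi_squared_2x2(table):
    """Pearson chi-squared statistic for a 2x2 contingency table
    (exposure x outcome), without continuity correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def smote(minority, n_new, k=3, seed=0):
    """SMOTE-style oversampling: each synthetic point lies on the line
    segment between a minority sample and one of its k nearest
    minority-class neighbours. Needs at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(x + gap * (y - x) for x, y in zip(base, neighbour))
        )
    return synthetic


def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy data only; none of these numbers come from the study.
    print(chi_squared_2x2([[10, 20], [20, 10]]))
    print(smote([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)], n_new=3, k=2))
    print(f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

In a cross-validated setting such as the 10-fold procedure described in the summary, oversampling of this kind is usually applied to the training folds only, so that synthetic points do not leak into the held-out fold; whether the study did so is not stated in this record.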
DOI: 10.1109/ACCESS.2020.3035026