Interpretable ensemble machine learning framework for cardiovascular disease prediction using EMR data and large language models in Ethiopia.
| Title: | Interpretable ensemble machine learning framework for cardiovascular disease prediction using EMR data and large language models in Ethiopia. |
|---|---|
| Authors: | Tegegnie AK; Faculty of Computing, Bahir Dar University Institute of Technology, Bahir Dar, Ethiopia., Tewolde K; Faculty of Computing, Bahir Dar University Institute of Technology, Bahir Dar, Ethiopia. |
| Source: | PloS one [PLoS One] 2026 Feb 09; Vol. 21 (2), pp. e0342256. Date of Electronic Publication: 2026 Feb 09 (Print Publication: 2026). |
| Publication Type: | Journal Article |
| Language: | English |
| Journal Info: | Publisher: Public Library of Science Country of Publication: United States NLM ID: 101285081 Publication Model: eCollection Cited Medium: Internet ISSN: 1932-6203 (Electronic) Linking ISSN: 19326203 NLM ISO Abbreviation: PLoS One Subsets: MEDLINE |
| Imprint Name(s): | Original Publication: San Francisco, CA : Public Library of Science |
| MeSH Terms: | Electronic Health Records* , Heart Disease Risk Factors* , Boosting Machine Learning Algorithms*, Humans |
| Abstract: | Cardiovascular diseases (CVDs) are leading causes of morbidity and mortality globally, with a growing burden in low- and middle-income countries such as Ethiopia. Early detection is limited by resource constraints, low screening uptake, and a lack of predictive tools tailored to local healthcare systems. This study presents an interpretable ensemble machine learning framework for predicting CVD risk via structured electronic medical record (EMR) data from public hospitals in Addis Ababa. We trained an XGBoost classifier on 20,960 anonymized records containing demographic, clinical, and physiological attributes. Preprocessing involves handling missing values, outlier capping, one-hot encoding, rare-category grouping, and dimensionality reduction. SHapley additive explanations (SHAPs) were used for feature attribution, and a large language model (Gemini) was used to translate SHAP outputs into plain-language narratives to enhance interpretability. The model achieved an accuracy of 0.99, with strong precision (0.99), recall (0.98), and F1-scores across both classes. SHAP analysis identified general_plan, history of present illness (HPI), musculoskeletal system (MSS) and diagnosis as key predictors. The integration of SHAP and LLMs provided transparent, clinician-friendly insights into model outputs, supporting adoption in resource-limited settings. This study demonstrates that combining ensemble learning with explainability techniques can yield highly accurate and interpretable CVD prediction models, offering potential for integration into clinical decision-support systems in Ethiopia. (Copyright: © 2026 Tegegnie, Tewolde. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.) |
| Competing Interests: | No authors have competing interests. |
| References: | Int Health. 2021 Jul 3;13(4):318-326. (PMID: 32945840) Nat Mach Intell. 2020 Jan;2(1):56-67. (PMID: 32607472) J Am Coll Cardiol. 2019 Mar 26;73(11):1317-1335. (PMID: 30898208) Cardiovasc J Afr. 2021 Jan-Feb;32(1):37-46. (PMID: 33646240) Prev Chronic Dis. 2012;9:E84. (PMID: 22498035) Interact J Med Res. 2023 Jan 11;12:e40721. (PMID: 36630161) NPJ Digit Med. 2018 May 8;1:18. (PMID: 31304302) |
| Entry Date(s): | Date Created: 20260209 Date Completed: 20260319 Latest Revision: 20260319 |
| Update Code: | 20260320 |
| PubMed Central ID: | PMC12885298 |
| DOI: | 10.1371/journal.pone.0342256 |
| PMID: | 41662408 |
| Database: | MEDLINE |
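
The abstract describes passing SHAP feature attributions to a large language model to obtain plain-language narratives. The record does not give the prompt format or any attribution values, so the following is only a minimal sketch of how such a step might look, assuming per-feature attributions are available as a dictionary. The function names `top_attributions` and `build_prompt`, the prompt wording, and the numeric values are hypothetical; only the feature names come from the abstract. No LLM call is made here.

```python
from typing import Dict, List, Tuple


def top_attributions(shap_values: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k features with the largest absolute attribution."""
    ranked = sorted(shap_values.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]


def build_prompt(shap_values: Dict[str, float], k: int = 3) -> str:
    """Format the top-k attributions as a plain-text prompt for an LLM."""
    lines = []
    for name, value in top_attributions(shap_values, k):
        # A positive attribution pushes the prediction toward the CVD class.
        direction = "raises" if value > 0 else "lowers"
        lines.append(f"- {name}: {direction} predicted CVD risk (SHAP {value:+.2f})")
    return (
        "Explain this cardiovascular risk prediction to a clinician "
        "in plain language, using only the factors listed.\n" + "\n".join(lines)
    )


# Hypothetical attribution values for one patient record (not from the paper).
example = {"general_plan": 0.42, "HPI": -0.31, "MSS": 0.18, "diagnosis": 0.05}
prompt = build_prompt(example)
```

Ranking by absolute value keeps both risk-raising and risk-lowering factors in the narrative; the resulting `prompt` string would then be sent to whichever LLM client is in use.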