Developing predictive models for COVID-19 positive tests based on the XGBoost and random forest algorithms with internet search data.

Bibliographic Details
Title: Developing predictive models for COVID-19 positive tests based on the XGBoost and random forest algorithms with internet search data.
Authors: Chang Y; Department of Medical Statistics, School of Public Health & Sun Yat-sen Global Health Institute & Center for Health Information Research, Sun Yat-sen University, Guangzhou, China., Chen J; Department of Medical Statistics, School of Public Health & Sun Yat-sen Global Health Institute & Center for Health Information Research, Sun Yat-sen University, Guangzhou, China., Chen X; Department of Medical Statistics, School of Public Health & Sun Yat-sen Global Health Institute & Center for Health Information Research, Sun Yat-sen University, Guangzhou, China., Wu Y; Department of Medical Statistics, School of Public Health & Sun Yat-sen Global Health Institute & Center for Health Information Research, Sun Yat-sen University, Guangzhou, China., Tang H; Department of Medical Statistics, School of Public Health & Sun Yat-sen Global Health Institute & Center for Health Information Research, Sun Yat-sen University, Guangzhou, China., Wu G; Department of Medical Statistics, School of Public Health & Sun Yat-sen Global Health Institute & Center for Health Information Research, Sun Yat-sen University, Guangzhou, China., Sun J; Department of Medical Statistics, School of Public Health & Sun Yat-sen Global Health Institute & Center for Health Information Research, Sun Yat-sen University, Guangzhou, China., Liao Y; Department of Medical Statistics, School of Public Health & Sun Yat-sen Global Health Institute & Center for Health Information Research, Sun Yat-sen University, Guangzhou, China., Chen H; Department of Medical Statistics, School of Public Health & Sun Yat-sen Global Health Institute & Center for Health Information Research, Sun Yat-sen University, Guangzhou, China., Cai S; Department of Medical Statistics, School of Public Health & Sun Yat-sen Global Health Institute & Center for Health Information Research, Sun Yat-sen University, Guangzhou, China., Hao Y; Department of Medical Statistics, School of Public Health & Sun Yat-sen Global Health Institute & Center for Health Information Research, Sun Yat-sen University, Guangzhou, China. haoyt@bjmu.edu.cn.; Peking University Center for Public Health and Epidemic Preparedness & Response, Beijing, 100191, China. haoyt@bjmu.edu.cn.; Department of Epidemiology & Biostatistics, School of Public Health, Peking University, Beijing, 100191, China. haoyt@bjmu.edu.cn.; Key Laboratory of Epidemiology of Major Diseases (Peking University), Ministry of Education, Beijing, 100191, China. haoyt@bjmu.edu.cn., Zhang W; Department of Medical Statistics, School of Public Health & Sun Yat-sen Global Health Institute & Center for Health Information Research, Sun Yat-sen University, Guangzhou, China. zhangwj227@mail.sysu.edu.cn., Du Z; Department of Medical Statistics, School of Public Health & Sun Yat-sen Global Health Institute & Center for Health Information Research, Sun Yat-sen University, Guangzhou, China. duzhch5@mail.sysu.edu.cn.; Guangzhou Joint Research Center for Disease Surveillance and Risk Assessment, Sun Yat-sen University & Guangzhou Center for Disease Control and Prevention, Guangzhou, China. duzhch5@mail.sysu.edu.cn.
Source: BMC public health [BMC Public Health] 2025 Nov 28; Vol. 25 (1), pp. 4189. Date of Electronic Publication: 2025 Nov 28.
Publication Type: Journal Article
Language: English
Journal Info: Publisher: BioMed Central Country of Publication: England NLM ID: 100968562 Publication Model: Electronic Cited Medium: Internet ISSN: 1471-2458 (Electronic) Linking ISSN: 14712458 NLM ISO Abbreviation: BMC Public Health Subsets: MEDLINE
Imprint Name(s): Original Publication: London : BioMed Central, [2001-
MeSH Terms: COVID-19*/diagnosis , COVID-19*/epidemiology , Internet* , Algorithms* , Models, Statistical* , COVID-19 Testing*/statistics & numerical data , Search Engine*, Humans ; Forecasting/methods ; SARS-CoV-2 ; Random Forest ; Boosting Machine Learning Algorithms
Abstract: Competing Interests: Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.
Background: Although strategies for COVID-19 have shifted towards normalized measures globally, establishing predictive models based on Internet search data remains crucial for swiftly controlling and preventing future outbreaks. This study aims to utilize Internet search data for early epidemic surveillance and warning.
Methods: We collected the daily number of COVID-19 positive tests and the daily Baidu Search Index (BSI) of COVID-19-related keywords. First, we used time-lagged correlation analysis to screen for keywords with a maximum correlation coefficient exceeding 0.9. Then, we used the original and the lagged BSI, respectively, to construct XGBoost and Random Forest (RF) models for short-term prediction of COVID-19. Next, we selected the top 5 predictors according to importance gain in the XGBoost model and constructed a comprehensive search index (CSI) weighted by importance gain. Finally, we used a distributed lag nonlinear model (DLNM) to evaluate the relationship between the CSI and the number of COVID-19 positive tests.
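The screening and weighting steps described in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the synthetic series, and the keyword labels `fever` and `noise` are hypothetical, the correlation is assumed to be Pearson's r between the search index at day t - lag and positive tests at day t, and the CSI weights are assumed to be importance gains normalized to sum to 1.

```python
import numpy as np


def lagged_correlation(search, cases, lag):
    """Pearson r between search[t - lag] and cases[t] (assumed definition)."""
    search = np.asarray(search, dtype=float)
    cases = np.asarray(cases, dtype=float)
    if lag > 0:
        x, y = search[:-lag], cases[lag:]
    else:
        x, y = search, cases
    return float(np.corrcoef(x, y)[0, 1])


def best_lag(search, cases, max_lag=10):
    """Return (lag, r) for the lag in 0..max_lag with the highest r."""
    scores = [(lag, lagged_correlation(search, cases, lag))
              for lag in range(max_lag + 1)]
    return max(scores, key=lambda item: item[1])


def screen_keywords(indices, cases, threshold=0.9, max_lag=10):
    """Keep keywords whose maximum lagged correlation exceeds the threshold."""
    selected = {}
    for keyword, series in indices.items():
        lag, r = best_lag(series, cases, max_lag)
        if r > threshold:
            selected[keyword] = (lag, r)
    return selected


def composite_index(features, gains):
    """Weighted sum of predictor series, weights proportional to gain."""
    names = list(gains)
    weights = np.array([gains[name] for name in names], dtype=float)
    weights = weights / weights.sum()  # normalization is an assumption
    matrix = np.column_stack([np.asarray(features[name], dtype=float)
                              for name in names])
    return matrix @ weights


# Synthetic demo data (hypothetical): the search series leads cases by 3 days.
signal = 2000 + 1000 * np.sin(np.linspace(0, 6, 70))
cases = signal[:60]
fever_index = signal[3:63]
noise_index = np.random.default_rng(0).normal(size=60)

selected = screen_keywords({"fever": fever_index, "noise": noise_index}, cases)
print(selected)
```

In practice the gains would come from a fitted model, and the top predictors may themselves be lagged versions of the search indices; both details are left out of this sketch.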
Results: We identified 20 keywords that had a maximum correlation coefficient exceeding 0.9, with lags of 1-10 days. The XGBoost models outperformed the RF models. XGBoost models using the lagged BSI (compared to the original BSI) had better predictive performance for 3-day forecasting, with an RMSE of 803.85 and a MAPE of 9.96%. Finally, we observed that the CSI was statistically associated with the number of COVID-19 positive tests, with the maximum relative risks (RR) at lags of 0, 3, 5, and 7 days being 2.18 (95%CI 1.60-2.97), 1.94 (95%CI 1.10-3.43), 1.86 (95%CI 1.01-3.44), and 2.03 (95%CI 1.00-4.11), respectively.
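For readers unfamiliar with the two error metrics reported above, the sketch below shows their standard definitions. The function names and the sample numbers are illustrative assumptions and are unrelated to the study data.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if np.any(actual == 0):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)


# Hypothetical daily counts and forecasts, for illustration only.
observed = [100, 200, 400]
forecast = [110, 180, 400]
print(rmse(observed, forecast), mape(observed, forecast))
```

RMSE penalizes large misses more heavily and scales with the counts, while MAPE is scale-free, which is why the two are often reported together.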
Conclusions: The XGBoost model with the lagged BSI can predict COVID-19 epidemics, which makes it a powerful addition to traditional surveillance systems.
(© 2025. The Author(s).)
Grant Information: 82103947 National Natural Science Foundation of China; 202206080003 Science and Technology Program of Guangzhou, China
Contributed Indexing: Keywords: Baidu search index; Feature selection; Machine learning; Time-lagged correlation analysis
Entry Date(s): Date Created: 20251129 Date Completed: 20251129 Latest Revision: 20251201
Update Code: 20251201
PubMed Central ID: PMC12664191
DOI: 10.1186/s12889-025-25569-w
PMID: 41316087
Database: MEDLINE