Natural language processing improves reliable identification of COVID-19 compared to diagnostic codes alone.
| Title: | Natural language processing improves reliable identification of COVID-19 compared to diagnostic codes alone. |
|---|---|
| Authors: | Hendrix N (Center for Professionalism and Value in Health Care, American Board of Family Medicine, Washington, DC 20036, United States); Parikh RV (Department of Epidemiology and Population Health, Stanford School of Medicine, Palo Alto, CA 94304, United States); Taskier M (Center for Professionalism and Value in Health Care, American Board of Family Medicine, Washington, DC 20036, United States); Walter G (Robert Graham Center, American Academy of Family Physicians, Washington, DC 20036, United States); Phillips RL (Center for Professionalism and Value in Health Care, American Board of Family Medicine, Washington, DC 20036, United States); Rehkopf DH (Department of Epidemiology and Population Health, Stanford School of Medicine, Palo Alto, CA 94304, United States) |
| Source: | American Journal of Epidemiology [Am J Epidemiol] 2025 Nov 04; Vol. 194 (11), pp. 3348-3354. |
| Publication Type: | Journal Article |
| Language: | English |
| Journal Info: | Publisher: Oxford University Press; Country of Publication: United States; NLM ID: 7910653; Publication Model: Print; Cited Medium: Internet; ISSN: 1476-6256 (Electronic); Linking ISSN: 00029262; NLM ISO Abbreviation: Am J Epidemiol; Subsets: MEDLINE |
| Imprint Name(s): | Publication: Cary, NC: Oxford University Press; Original Publication: Baltimore, School of Hygiene and Public Health of Johns Hopkins Univ. |
| MeSH Terms: | Natural Language Processing*; COVID-19*/diagnosis; COVID-19*/epidemiology; COVID-19*/ethnology; Humans; Male; Middle Aged; Female; Adult; Aged; SARS-CoV-2; International Classification of Diseases; Adolescent; Young Adult; Primary Health Care; Sensitivity and Specificity; Electronic Health Records; United States/epidemiology |
| Abstract: | Observational COVID-19 studies often rely on diagnostic codes, but their accuracy and potential for differential misclassification across patient subgroups are unclear. In this proof-of-concept study, we examined age, race, and ethnicity as predictors of differential misclassification by comparing the classification accuracy of diagnostic codes to classifiers based on natural language processing (NLP) of clinical notes. We assessed differential misclassification in two primary care-based samples from the American Family Cohort: first, a cohort of 5000 patients with COVID-19 status assessed by physicians based on notes; and second, 21 659 patients (of 1 560 564) who received COVID-specific antivirals. Using annotated note data, we trained and tested three NLP classifiers (tree-based, recurrent neural network, and transformer-based). Approximately 63% of likely COVID-19 patients in the two samples had a documented ICD-10 code for COVID-19. Sensitivity was highest among younger patients (68.6% for <18 years versus 60.6% for those 75+) and among Hispanic patients (68.0% versus 58.5% for Black/African American patients). The tree-based classifier had the highest area under the ROC curve (0.92), although it was less accurate among older patients. NLP performance worsened drastically when predicting data collected after training. While NLP may improve cohort identification, frequent retraining is likely needed to capture changing documentation. (© The Author(s) 2025. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved.) |
| Grant Information: | U01 FD007879 United States FD FDA HHS; U01FD007879 US Food & Drug Administration |
| Contributed Indexing: | Keywords: COVID-19; cohort identification; natural language processing; sample sizes |
| Entry Date(s): | Date Created: 20250730; Date Completed: 20251120; Latest Revision: 20251123 |
| Update Code: | 20251123 |
| PubMed Central ID: | PMC12335755 |
| DOI: | 10.1093/aje/kwaf162 |
| PMID: | 40731247 |
| Database: | MEDLINE |