Large Language Models Improve Coding Accuracy and Reimbursement in a Neonatal Intensive Care Unit

Bibliographic Details
Title: Large Language Models Improve Coding Accuracy and Reimbursement in a Neonatal Intensive Care Unit
Authors: Emma Holmes, Caroline Massarelli, Felix Richter, Stephanie Bernard, Robert Freeman, Nicholas Gavin, Courtney Juliano, Bruce D. Gelb, Benjamin S. Glicksberg, Girish N. Nadkarni, Eyal Klang
Publisher Information: Cold Spring Harbor Laboratory, 2025.
Publication Year: 2025
Description:
Importance: Diagnosis coding is essential for clinical care, research validity, and hospital reimbursement. In neonatal settings, manual coding is frequently error-prone, contributing to misclassification and financial losses. Large language models (LLMs) offer a scalable approach to improve diagnostic consistency and optimize revenue.
Objective: To compare the diagnostic accuracy of LLMs with human coders in identifying common neonatal diagnoses and assess the potential impact on revenue from Diagnosis-Related Group (DRG) assignment.
Design: This was a retrospective cross-sectional study conducted using data from 2022 to 2023. LLMs were prompted with all physician notes from the admission. Two neonatologists independently and blindly adjudicated diagnoses from three sources: human coders, GPT-4o, and GPT-o3-mini.
Setting: A single academic referral center’s neonatal intensive care unit (NICU).
Participants: The study included a consecutive sample of 100 infants admitted to the NICU who did not require respiratory support. All available physician notes from the hospital stay were included.
Exposure: Two HIPAA-compliant LLMs (GPT-4o and GPT-o3-mini) were prompted to assign diagnoses from a standardized list based on physician notes. Three prompt iterations were developed and reviewed for optimization prior to final evaluation.
Main Outcomes and Measures: The primary outcome was diagnostic accuracy compared with physician adjudication. Secondary outcomes included changes in expected DRG assignment and projected annual revenue.
Results: Among 100 infants (median gestational age 35.6 weeks, 52% male), GPT-o3-mini achieved 79.1% diagnostic accuracy (95% CI, 74.0%-84.2%), comparable to human coders at 76.3% (95% CI, 70.9%-81.7%; P = .52). GPT-4o underperformed at 58.6% (95% CI, 52.5%-64.7%; P < .001 vs both). Accuracy of GPT-o3-mini did not differ by DRG impact. Extrapolated to one year, correct GPT-o3-mini diagnoses yielded projected revenue of $5.71 million, compared to $4.82 million from human coders, an 18% increase.
Conclusions and Relevance: A HIPAA-compliant LLM demonstrated diagnostic accuracy comparable to human coders in neonatal billing while identifying higher-acuity diagnoses that improved projected reimbursement. LLMs may serve as effective adjuncts to manual coding in neonatal care, with potential clinical and financial benefit.
Key Points
Question: Can a large language model support accurate diagnosis generation for neonatal billing?
Findings: In this retrospective study of 100 neonates hospitalized in the Neonatal Intensive Care Unit, GPT-o3-mini demonstrated diagnostic accuracy comparable to human coders, as confirmed by physician review. Its implementation could yield an estimated 18% increase in revenue.
Meaning: Large language models may serve as effective adjuncts in neonatal coding, offering both diagnostic precision and financial benefit.
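
The Results above report proportion-style accuracies with 95% confidence intervals and a projected 18% revenue increase. The short Python sketch below is illustrative only, not the authors' analysis: it re-derives the relative revenue uplift from the two projected figures quoted in the abstract, and it uses a normal-approximation (Wald) interval with a placeholder denominator (n_diagnoses), since the abstract does not state the exact interval method or the number of adjudicated diagnoses.

from math import sqrt

def wald_ci(p, n, z=1.96):
    """Normal-approximation (Wald) 95% confidence interval for a proportion."""
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Projected annual revenue figures reported in the abstract (USD, millions).
revenue_llm = 5.71      # correct GPT-o3-mini diagnoses
revenue_coders = 4.82   # human coders
uplift = (revenue_llm - revenue_coders) / revenue_coders
print(f"Relative revenue increase: {uplift:.1%}")   # about 18%, as reported

# Illustration only: n_diagnoses is a placeholder, since the abstract does not
# give the diagnosis-level denominator behind the 79.1% accuracy estimate.
n_diagnoses = 500
low, high = wald_ci(0.791, n_diagnoses)
print(f"Illustrative 95% CI around 79.1% with n={n_diagnoses}: {low:.1%} to {high:.1%}")

Any standard interval method (e.g., Wilson) could be substituted; the sketch is only meant to show how the reported figures relate, not to reproduce the study's statistics.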
Document Type: Article
DOI: 10.1101/2025.07.23.25332086
Rights: URL: https://www.medrxiv.org/about/FAQ#license
Accession Number: edsair.doi...........2a899503d91c0c9e9585bfc5f4b53d75
Database: OpenAIRE