Train-Time and Test-Time Computation in Large Language Models for Error Detection and Correction in Electronic Medical Records: A Retrospective Study.
| Title: | Train-Time and Test-Time Computation in Large Language Models for Error Detection and Correction in Electronic Medical Records: A Retrospective Study. |
|---|---|
| Authors: | Cai, Qiong, Yang, Lanting, Xiao, Jiangping, Ma, Jiale, Liu, Molei, Pan, Xilong |
| Source: | Diagnostics (2075-4418); Jul2025, Vol. 15 Issue 14, p1829, 16p |
| Subject Terms: | ELECTRONIC health records, LANGUAGE models, PRODUCT quality management, COMPUTER performance, MANAGEMENT of electronic health records, REAL-time computing, FAULT diagnosis, COMPUTER science |
| Abstract: | Background/Objectives: This study examines the effectiveness of train-time computation, test-time computation, and their combination on the performance of large language modeling applied to an electronic medical record quality management system. It identifies the most effective combination of models to enhance clinical documentation performance and efficiency. Methods: A total of 597 clinical medical records were selected from the MEDEC-MS dataset, 10 of which were used for prompt engineering to guide model training. Eight large language models were employed for training, focusing on train-time computation and test-time computation. Model performance on specific error types was assessed using precision, recall, F1 score, and error correction accuracy. The dataset was divided into training and testing sets in a 7:3 ratio. The assembly model was created using binary logistic regression for assembly analysis of the top-performing models. Its performance was evaluated using area under the curve values and model weights. Results: GPT-4 and Deepseek R1 demonstrated higher overall accuracy in detecting errors. Models that focus on train-time computation exhibited shorter reasoning times and stricter error detection, while models emphasizing test-time computation achieved higher error correction accuracy. The GPT-4 model was particularly effective in addressing issues related to causal organisms, management, and pharmacotherapy, whereas models focusing on test-time computation performed better in tasks involving diagnosis and treatment. The assembly model, focusing on both train-time computation and test-time computation, outperformed any single large language model (Assembly model accuracy: 0.690 vs. GPT-4 accuracy: 0.477). Conclusions: Models focusing on train-time computation demonstrated greater efficiency in processing speed, while models focusing on test-time computation showed higher accuracy and interpretability in identifying and detecting quality issues in electronic medical records. Assembling the train-time and test-time computation strategies may strike a balance between high accuracy and model efficiency, thereby enhancing the development of electronic medical records and improving medical care. [ABSTRACT FROM AUTHOR] |
| Copyright of Diagnostics (2075-4418) is the property of MDPI and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.) | |
| Database: | Biomedical Index |
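
The abstract describes an "assembly model" built with binary logistic regression over the top-performing models and evaluated on a 7:3 train/test split, but this record gives no implementation details. The following is a minimal sketch of that general idea only, not the authors' method. It assumes each base model emits a 0/1 "error detected" flag per record; the data are synthetic, and every name in it (`fit_logistic`, `flags_a`, `flags_b`, the 0.70 and 0.75 agreement rates) is hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit binary logistic regression by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b):
    """Threshold the fitted probability at 0.5 to get a 0/1 decision."""
    return [
        1 if sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) >= 0.5 else 0
        for xi in X
    ]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic stand-in data (NOT from the study): 200 records with a true
# "contains an error" label and two hypothetical base models whose flags
# agree with the label 70% and 75% of the time.
rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(200)]
flags_a = [t if rng.random() < 0.70 else 1 - t for t in labels]
flags_b = [t if rng.random() < 0.75 else 1 - t for t in labels]
features = [[a, b_] for a, b_ in zip(flags_a, flags_b)]

# 7:3 split, mirroring the ratio stated in the abstract.
cut = len(features) * 7 // 10
X_train, X_test = features[:cut], features[cut:]
y_train, y_test = labels[:cut], labels[cut:]

# The fitted weights indicate how much each base model contributes.
w, b = fit_logistic(X_train, y_train)
acc = accuracy(y_test, predict(X_test, w, b))
print("weights:", w, "bias:", b, "test accuracy:", acc)
```

Fitting the combiner on the training portion and scoring it on the held-out portion keeps the reported accuracy from reflecting records the combiner has already seen. The printed numbers describe the synthetic data only and say nothing about the study's results.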