Medical Feature Extraction From Clinical Examination Notes: Development and Evaluation of a Two-Phase Large Language Model Framework.

Saved in:
Detailed bibliography
Title: Medical Feature Extraction From Clinical Examination Notes: Development and Evaluation of a Two-Phase Large Language Model Framework.
Authors: Abumelha M; Information Systems Department, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia.; Information Systems Department, Applied College at Khamis Mushait, King Khalid University, Abha, Saudi Arabia., Al-Ghamdi AA; Information Systems Department, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia.; Computer Science Department, School of Engineering, Computing and Design, Dar Alhekma University, Jeddah, Saudi Arabia., Fayoumi A; Information Systems Department, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia., Ragab M; Information Technology Department, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia.
Source: JMIR medical informatics [JMIR Med Inform] 2025 Dec 03; Vol. 13, pp. e78432. Date of Electronic Publication: 2025 Dec 03.
Publication Type: Journal Article
Language: English
Journal Information: Publisher: JMIR Publications Country of Publication: Canada NLM ID: 101645109 Publication Model: Electronic Cited Medium: Internet ISSN: 2291-9694 (Electronic) Linking ISSN: 22919694 NLM ISO Abbreviation: JMIR Med Inform Subsets: MEDLINE
Imprint Name(s): Original Publication: Toronto : JMIR Publications, [2013]-
MeSH Terms: Natural Language Processing* , Data Mining*/methods , Electronic Health Records* , Physical Examination*, Humans ; Large Language Models
Abstract: Background: Medical feature extraction from clinical text is challenging because of limited data availability, variability in medical terminology, and the critical need for trustworthy outputs. Large language models (LLMs) offer promising capabilities but face critical challenges with hallucination.
Objective: This study aims to develop a robust framework for medical feature extraction that enhances accuracy by minimizing the risk of hallucination, even with limited training data.
Methods: We developed a two-phase training approach. Phase 1 used instruction fine-tuning to teach feature extraction. Phase 2 introduced confidence-regularization fine-tuning with loss functions penalizing overconfident incorrect predictions, which were captured using bidirectional matching targeting hallucinated and missing features. The model was trained on the full set of 700 patient notes and on a few-shot subset of 100 patient notes. We evaluated the framework on the United States Medical Licensing Examination Step-2 Clinical Skills dataset, testing on a public split of 200 patient notes and a private split of 1839 patient notes. Performance was assessed using precision, recall, and F1-scores, with error analysis conducted on predicted features from the private test set.
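The bidirectional matching described in the Methods (flagging unmatched predictions as hallucinations and unmatched gold features as missing) can be sketched as below. This is a minimal illustration, not the authors' implementation: a token-overlap (Jaccard) similarity stands in for the paper's semantic matching, and all function names and the threshold are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity as a simple stand-in for semantic matching."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def bidirectional_match(predicted, gold, threshold=0.5):
    """Greedily match predicted features to gold features in both directions.

    Returns (hallucinated, missing): predictions with no gold match, and
    gold features with no predicted match.
    """
    matched_pred, matched_gold = set(), set()
    for i, p in enumerate(predicted):
        for j, g in enumerate(gold):
            if j in matched_gold:
                continue
            if jaccard(p, g) >= threshold:
                matched_pred.add(i)
                matched_gold.add(j)
                break
    hallucinated = [p for i, p in enumerate(predicted) if i not in matched_pred]
    missing = [g for j, g in enumerate(gold) if j not in matched_gold]
    return hallucinated, missing
```

Counting the two unmatched sets directly yields the hallucination and missing-feature totals reported in the Results.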
Results: The framework achieved an F1-score of 0.968-0.983 on the full dataset of 700 patient notes and 0.960-0.973 with a few-shot subset of 100 of 700 patient notes (14.2%), outperforming INCITE (intelligent clinical text evaluator; F1=0.883) and DeBERTa (decoding-enhanced bidirectional encoder representations from transformers with disentangled attention; F1=0.958). It reduced hallucinations by 89.9% (from 3081 to 311 features) and missing features by 88.9% (from 6376 to 708) on the private dataset compared with the baseline LLM with few-shot in-context learning. Calibration evaluation on few-shot training (100 patient notes) showed that the expected calibration error increased from 0.060 to 0.147, whereas the Brier score improved from 0.087 to 0.036. Notably, the average model confidence remained stable at 0.84 (SD 0.003) despite F1 improvements from 0.819 to 0.986.
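The two calibration metrics reported in the Results can be computed from per-feature confidences and correctness labels as follows. This is a standard-definition sketch (equal-width binning for the expected calibration error), not necessarily the authors' exact binning scheme.

```python
import numpy as np

def brier_score(conf, correct):
    """Mean squared gap between confidence and binary correctness."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    return float(np.mean((conf - correct) ** 2))

def expected_calibration_error(conf, correct, n_bins=10):
    """Weighted average, over confidence bins, of |mean confidence - accuracy|."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for k in range(n_bins):
        lo, hi = edges[k], edges[k + 1]
        # Right-inclusive bins; the first bin also includes conf == 0.
        mask = (conf > lo) & (conf <= hi) if k else (conf >= lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return float(ece)
```

The Results' pattern (ECE worsening while the Brier score improves) is possible because the Brier score rewards accuracy gains even when bin-level confidence drifts from bin-level accuracy.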
Conclusions: Our two-phase LLM framework successfully addresses critical challenges in automated medical feature extraction, achieving state-of-the-art performance while reducing hallucinated and missing features. The framework's ability to achieve high performance with minimal training data (F1=0.960-0.973 with 100 samples) demonstrates strong generalization capabilities essential for resource-constrained settings in medical education. Although traditional calibration metrics indicate misalignment, confidence injection reduced errors in practice, and inference-time filtering provided reliable outputs suitable for automated clinical assessment applications.
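The inference-time filtering mentioned in the Conclusions can be sketched as a simple confidence cutoff, assuming the model emits a confidence score alongside each extracted feature. The function name and threshold below are illustrative, not taken from the paper.

```python
def filter_features(scored_features, threshold=0.5):
    """Keep only extracted features whose confidence meets the threshold.

    scored_features: iterable of (feature_text, confidence) pairs.
    Low-confidence candidates are dropped as likely hallucinations.
    """
    return [feat for feat, conf in scored_features if conf >= threshold]
```

Example: `filter_features([("chest pain", 0.92), ("tremor", 0.31)])` keeps only `"chest pain"`.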
(©Manal Abumelha, Abdullah AL-Malaise AL-Ghamdi, Ayman Fayoumi, Mahmoud Ragab. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 03.12.2025.)
Contributed Indexing: Keywords: automated medical assessment; clinical NLP; hallucination mitigation; instruction tuning; large language models; medical feature extraction; semantic matching
Entry Date(s): Date Created: 20251031 Date Completed: 20251203 Latest Revision: 20251203
Update Code: 20251204
DOI: 10.2196/78432
PMID: 41171081
Database: MEDLINE