Use of Large Language Models to Extract Cost-Effectiveness Analysis Data: A Case Study

Bibliographic Details
Published in: Value in Health, Vol. 28, No. 11, pp. 1637-1645
Main Authors: Gu, Xujun; Zhang, Hanwen; Patil, Divya; Zafari, Zafar; Slejko, Julia; Onukwugha, Eberechukwu
Format: Journal Article
Language: English
Published: Elsevier Inc., United States, 01.11.2025
ISSN: 1098-3015, 1524-4733
Description
Summary: Cost-effectiveness analyses (CEA) generate extensive data that can support much health economic research. However, manual data collection is time-consuming and prone to errors. Developments in artificial intelligence (AI) and large language models (LLMs) offer a solution for automating this process. This study aims to evaluate the accuracy of LLM-based data extraction and to assess its feasibility for supporting CEA data collection. We evaluated the performance of a custom ChatGPT model (GPT), the Tufts CEA Registry (TCRD), and researcher-validated extraction (RVE) in extracting 36 predetermined variables from 34 selected structured articles. Concordance rates between GPT and RVE, between TCRD and RVE, and between GPT and TCRD were calculated and compared. Paired Student's t tests assessed differences in accuracy, and concordance rates across the 36 variables were reported. The accuracy of GPT (GPT vs RVE) was comparable to that of TCRD (TCRD vs RVE) (mean 0.88, SD 0.06 vs mean 0.90, SD 0.06; P = .71). GPT's performance varied across variables: it outperformed TCRD in capturing "Population and Intervention Details" but struggled with complex variables such as "Utility." This study demonstrated that LLMs such as GPT can be a promising tool for automating CEA data extraction, offering accuracy comparable to established registries. However, human supervision and expertise are essential to address challenges with complex variables.
Highlights:
• To our knowledge, this is the first study to use large language models, specifically GPT-4o, to extract cost-effectiveness analysis data (GPT) and compare the results with (1) an existing database, ie, the Tufts Cost-Effectiveness Analysis Registry, and (2) researcher-validated extraction.
• GPT exhibited overall accuracy similar to the Tufts Cost-Effectiveness Analysis Registry, with no statistically significant difference. However, GPT notably underperformed on 4 variables: types of utilities, number of utilities, ratio quadrant, and number of ratios.
• Currently, large language models can assist researchers by automating simple variable extraction, filling missing data, and acting as a reference tool during manual data extraction to enhance efficiency under human supervision. Future research should explore batch data extraction methods and strategies for handling complex or subjective variables.
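The summary describes prompting a custom ChatGPT model (built on GPT-4o) to pull 36 predetermined variables out of each article. The study used a custom ChatGPT configuration rather than a raw API call, so the Python sketch below is only an illustrative approximation: the prompt wording, the sample variable list, and the extract_cea_variables helper are assumptions, not the authors' protocol.

```python
# Illustrative approximation of LLM-based CEA variable extraction.
# The study used a custom ChatGPT model; this API-based version, the
# prompt, and the example variable list are assumptions for illustration.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A few example fields; the study extracted 36 predetermined variables.
VARIABLES = ["population", "intervention", "comparator",
             "type_of_utilities", "number_of_utilities", "ratio_quadrant"]

def extract_cea_variables(article_text: str) -> dict:
    """Ask GPT-4o to return the requested variables as a JSON object."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # force parseable JSON output
        messages=[
            {"role": "system",
             "content": "You extract cost-effectiveness analysis data. "
                        "Return a JSON object with exactly these keys: "
                        + ", ".join(VARIABLES)
                        + ". Use null when a value is not reported."},
            {"role": "user", "content": article_text},
        ],
    )
    return json.loads(response.choices[0].message.content)
```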
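The headline comparison (mean accuracy 0.88 vs 0.90, P = .71) rests on per-unit concordance rates compared with a paired Student's t test. A minimal sketch follows, assuming pairing over the 34 articles (the summary also reports concordance across the 36 variables) and using simulated toy accuracies rather than the study's data.

```python
# Minimal sketch of the paired-accuracy comparison reported in the summary.
# The per-article concordance rates here are simulated toy data, not the
# study's values; pairing over the 34 articles is an assumption.
import numpy as np
from scipy import stats

def concordance_rate(extracted: list, reference: list) -> float:
    """Share of the predetermined variables on which an extraction
    matches the researcher-validated values (RVE)."""
    return sum(e == r for e, r in zip(extracted, reference)) / len(reference)

rng = np.random.default_rng(0)
gpt_acc = rng.normal(0.88, 0.06, size=34).clip(0, 1)   # GPT vs RVE, per article
tcrd_acc = rng.normal(0.90, 0.06, size=34).clip(0, 1)  # TCRD vs RVE, per article

t_stat, p_value = stats.ttest_rel(gpt_acc, tcrd_acc)   # paired Student's t test
print(f"GPT {gpt_acc.mean():.2f} vs TCRD {tcrd_acc.mean():.2f}, P = {p_value:.2f}")
```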
DOI: 10.1016/j.jval.2025.05.008