Extended sample size calculations for evaluation of prediction models using a threshold for classification.

Bibliographic Details
Title: Extended sample size calculations for evaluation of prediction models using a threshold for classification.
Authors: Whittle, Rebecca, Ensor, Joie, Archer, Lucinda, Collins, Gary S., Dhiman, Paula, Denniston, Alastair, Alderman, Joseph, Legha, Amardeep, van Smeden, Maarten, Moons, Karel G., Cazier, Jean-Baptiste, Riley, Richard D., Snell, Kym I. E.
Source: BMC Medical Research Methodology; 7/1/2025, Vol. 25 Issue 1, p1-12, 12p
Subject Terms: SAMPLE size (Statistics), PREDICTION models, RESEARCH personnel, MODEL validation, CALIBRATION
Abstract: When evaluating the performance of a model for individualised risk prediction, the sample size needs to be large enough to precisely estimate the performance measures of interest. Current sample size guidance is based on precisely estimating calibration, discrimination, and net benefit, which should be the first stage of calculating the minimum required sample size. However, when a clinically important threshold is used for classification, other performance measures are also often reported. We extend the previously published guidance to precisely estimate threshold-based performance measures. We have reported closed-form solutions to estimate the sample size required to target sufficiently precise estimates of accuracy, specificity, sensitivity, positive predictive value (PPV), negative predictive value (NPV), and an iterative method to estimate the sample size required to target a sufficiently precise estimate of the F1-score, in an external evaluation study of a prediction model with a binary outcome. This approach requires the user to pre-specify the target standard error and the expected value for each performance measure alongside the outcome prevalence. We describe how the sample size formulae were derived and demonstrate their use in an example. Extension to time-to-event outcomes is also considered. In our examples, the minimum sample size required was lower than that required to precisely estimate the calibration slope, and we expect this would most often be the case. Our formulae, along with corresponding Python code and updated R, Stata and Python commands (pmvalsampsize), enable researchers to calculate the minimum sample size needed to precisely estimate threshold-based performance measures in an external evaluation study. These criteria should be used alongside previously published criteria to precisely estimate the calibration, discrimination, and net benefit. [ABSTRACT FROM AUTHOR]
Copyright of BMC Medical Research Methodology is the property of BioMed Central and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Complementary Index
ISSN: 1471-2288
DOI: 10.1186/s12874-025-02592-4