Understanding the evidence for artificial intelligence in healthcare

Bibliographic Details
Published in:	BMJ quality & safety Vol. 34; no. 7; p. 421
Main Authors:	Jackson, Gretchen Purcell; Shortliffe, Edward H
Format:	Journal Article
Language:	English
Published:	England BMJ Publishing Group LTD 01.07.2025
Subjects:	Algorithms; Artificial intelligence; Clinical outcomes; Health care; Health care delivery; Infectious diseases; Informatics; Medical errors; Ophthalmology; R&D; Research & development; Usability; Performance measures; Decision support, computerized; Information technology; Evaluation methodology; Patient Safety
ISSN:	2044-5415, 2044-5423

Description
Summary:	The entire evaluation sequence (technical performance, usability and workflow, and impact) should also be repeated whenever conditions change, especially if the models may learn and change their performance over time [7]. Model performance can vary for a wide variety of reasons, such as changes in the underlying data used for prediction or behavioural changes from use of the model itself. In general, most AI algorithms either predict or classify, so their performance is measured in a manner similar to the evaluation of diagnostic tests, using metrics such as sensitivity, specificity and area under the curve [12, 13]. Healthcare providers should pay particular attention to rates of false positives and false negatives, as well as their consequences, as clinical judgement is often needed in selecting performance thresholds. Studies of healthcare AI tools should explicitly report false positive and false negative rates rather than composite measures such as F1 scores (an evaluation metric that combines precision and recall) so that medical practitioners can determine their suitability for practice. The phases of clinical research used for drugs and devices are useful for framing AI evaluation, but there are important nuances, as articulated by Park and colleagues [16]. AI is similar to drugs and devices, as early-phase ‘laboratory’ studies are needed to demonstrate proof-of-concept technical performance, usability of prototypes, and potential for impact.
DOI:	10.1136/bmjqs-2025-018559
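
The summary's point about composite measures can be illustrated with a small worked example. The sketch below is not taken from the article: the two tools and their confusion-matrix counts are hypothetical, chosen so that both score the same F1 on the same population of 100 positive and 900 negative cases while differing in their false negative and false positive rates. The function name `diagnostic_metrics` is likewise an assumption made for illustration, and the code assumes every denominator is non-zero.

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return diagnostic-test style metrics from confusion-matrix counts.

    Assumes each denominator is non-zero (at least one positive case,
    one negative case, and one positive prediction).
    """
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_negative_rate": fn / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "precision": precision,
        # F1 is the harmonic mean of precision and recall,
        # written here in its equivalent count form.
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts, not data from the article.
tool_a = diagnostic_metrics(tp=90, fp=35, fn=10, tn=865)
tool_b = diagnostic_metrics(tp=80, fp=20, fn=20, tn=880)

for name, metrics in (("Tool A", tool_a), ("Tool B", tool_b)):
    print(name)
    for key, value in metrics.items():
        print(f"  {key}: {value:.3f}")
```

In this made-up scenario both tools have an F1 of 0.8, yet Tool B misses twice as many true cases as Tool A (false negative rate 0.2 versus 0.1), while Tool A raises more false alarms. Which trade-off is acceptable depends on the clinical consequences of each error type, which is why the separate rates are more informative than the composite score.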