Initial evaluation of state‐of‐the‐art deep learning models on data of Project FOREVER

Bibliographic Details
Published in:Acta ophthalmologica (Oxford, England) Vol. 103; no. S284
Main Authors: Reimann, Marcel; Andreasen, Jens Rovelt; Dahl, Anders Bjorholm; Kolko, Miriam
Format: Journal Article
Language:English
Published: Malden: Wiley Subscription Services, Inc., 01.01.2025
ISSN:1755-375X, 1755-3768
Description
Summary:
Aims/Purpose: We evaluated state‐of‐the‐art deep learning models, trained on the Artificial Intelligence for Robust Glaucoma Screening (AIROGS) Challenge dataset [1], on a cohort subset of Project FOREVER. In addition, we investigated how the current evaluation metrics can be applied to a real‐world screening scenario.

Methods: We followed and combined the reported best‐performing model designs on the challenge dataset [1] to assess the generalizability of these models to our dataset. The entire pipeline was built using open‐source packages and model weights to ensure reproducibility. Optic disc segmentation and quality assessment were performed using AutoMorph [2]. Afterward, a vision transformer was used to classify the cropped images as non‐referable or referable glaucoma. The model was then applied to a labeled subset of participants of Project FOREVER.

Results: Using AutoMorph resulted in a much smaller good‐quality training set than that reported in the AIROGS challenge: it classified one‐fifth of the images as ungradable and failed to segment the optic disc in numerous images. Overall, we achieved similar performance on our test split of the AIROGS dataset. Our results also showed that, although high specificity and sensitivity values can be reached, the precision scores of the algorithms were generally low.

Conclusions: We suggest including precision as a standard metric to report when evaluating screening algorithms. Specificity and sensitivity are insufficient because they do not capture the economic aspects of the proposed models. Low‐precision algorithms might place a high burden on healthcare systems through a large number of false positive referrals.

References
[1] de Vente C, et al. AIROGS: Artificial Intelligence for Robust Glaucoma Screening Challenge. IEEE Trans Med Imaging. 2024 Jan;43(1):542‐557. doi: 10.1109/TMI.2023.3313786.
[2] Zhou Y, Wagner SK, Chia MA, Zhao A, Woodward‐Court P, Xu M, Struyven R, Alexander DC, Keane PA. AutoMorph: Automated Retinal Vascular Morphology Quantification Via a Deep Learning Pipeline. Transl Vis Sci Technol. 2022 Jul 8;11(7):12. doi: 10.1167/tvst.11.7.12. PMID: 35833885; PMCID: PMC9290317.
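
The point made in the Conclusions, that high sensitivity and specificity can coexist with low precision, follows from Bayes' rule when the condition is rare in the screened population. The sketch below is a minimal illustration only: the function name `precision_from_rates` and all numeric values are hypothetical and are not taken from the study.

```python
def precision_from_rates(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Precision (positive predictive value) implied by sensitivity,
    specificity, and disease prevalence in the screened population."""
    # Fraction of the population that is diseased AND flagged (true positives).
    true_pos = sensitivity * prevalence
    # Fraction of the population that is healthy AND flagged (false positives).
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    flagged = true_pos + false_pos
    if flagged == 0.0:
        raise ValueError("no positive predictions; precision is undefined")
    return true_pos / flagged


if __name__ == "__main__":
    # Hypothetical operating point: 95% sensitivity and 95% specificity.
    for prevalence in (0.50, 0.10, 0.02):
        ppv = precision_from_rates(0.95, 0.95, prevalence)
        print(f"prevalence={prevalence:.2f}  precision={ppv:.3f}")
```

With these assumed rates, precision falls as prevalence falls, because false positives from the large healthy group outnumber true positives from the small diseased group. This is why a classifier's sensitivity and specificity alone do not indicate how many referrals would be unnecessary.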
DOI:10.1111/aos.17116