A framework to create, evaluate and select synthetic datasets for survival prediction in oncology


Bibliographic details
Published in: Computers in biology and medicine, vol. 192, issue Pt A, p. 110198
Main authors: Christoforou, A.T., Spohn, S.K.B., Sprave, T., Fechter, T., Rühle, A., Nicolay, N.H., Popp, I., Grosu, A.L., Peeken, J.C., Thieme, A.H., Stylianopoulos, T., Strouthos, I., Ferentinos, K., Roussakis, Y., Zamboglou, C.
Format: Journal Article
Language: English
Published: United States, Elsevier Ltd, 1 June 2025
ISSN: 0010-4825, 1879-0534
Description
Summary: Data-driven decision-making in radiation oncology (RO) relies on integrating real-world data effectively. Synthetic data (SD), generated through machine learning, offers a solution by mimicking real-world data without compromising privacy. This paper presents a general framework for generating, evaluating, and selecting high-quality tabular SD for clinical use, focusing on survival datasets in RO. Five retrospectively collected survival-based RO datasets (n = 1038 recurrent prostate cancer, n = 117 primary localised prostate cancer, n = 48 primary nodal positive (metastasised) prostate cancer, n = 1269 head and neck cancer, n = 353 gliomas) underwent cleaning and preparation. SD was generated using four different machine-learning models, with each model producing multiple variants. These variants were evaluated for privacy, clinical behaviour, and feature distribution using robust and interpretable metrics, and a single SDset was selected for each real-world dataset using a weighted ranking system. The framework successfully generated high-quality SD for every real-world dataset, with the Tabular Variational Autoencoder producing the five best-performing SDsets considering all metrics. No more than 5 % of rows overlapped between each synthetic and real-world dataset. Cox proportional hazards models fitted to the real-world and synthetic datasets achieved similar concordance indexes (average C-index = 0.701 for real-world data vs 0.699 for SD), with every SD hazard ratio falling within the 95 % confidence interval of its real-world counterpart for 4 of the 5 real-world datasets. The proposed framework enables the production and selection of SDsets that closely mirror real-world data characteristics, ensuring privacy and clinical utility in RO. This approach can facilitate data sharing in clinical research, addressing privacy-related barriers.
• A novel framework for generating and evaluating deep learning synthetic datasets.
• Framework tested on 5 clinical datasets containing survival and treatment details.
• Robust metrics assess privacy, clinical relevance, and data distribution.
• Synthetic data retains real-world characteristics, ensuring privacy and usability.
• Framework advances secure data sharing in medicine, addressing privacy issues.
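The summary mentions two steps that are easy to picture in code: checking how many synthetic rows exactly duplicate a real row, and picking one synthetic dataset per real dataset with a weighted ranking. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation. The function names, the candidate names, the three metric labels, the scores, and the weights are all hypothetical. The record does not say how the paper defines row overlap or how it weights its metrics, so exact-match overlap and a weighted sum of per-metric ranks are assumed here.

```python
from typing import Dict, List, Sequence, Tuple

Row = Tuple[object, ...]  # one record as a tuple of feature values


def row_overlap_fraction(synthetic: Sequence[Row], real: Sequence[Row]) -> float:
    """Fraction of synthetic rows that exactly match at least one real row.

    Exact matching is an assumption; the paper may define overlap differently.
    """
    if not synthetic:
        return 0.0
    real_rows = set(real)
    matches = sum(1 for row in synthetic if row in real_rows)
    return matches / len(synthetic)


def rank_candidates(
    scores: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Order candidates by a weighted sum of per-metric ranks (lowest total first).

    Every score is assumed to be "higher is better". Within a metric, a
    candidate's rank is 1 plus the number of candidates that score strictly
    higher, so tied candidates share a rank.
    """
    totals = {name: 0.0 for name in scores}
    for metric, weight in weights.items():
        for name in scores:
            better = sum(
                1 for other in scores
                if scores[other][metric] > scores[name][metric]
            )
            totals[name] += weight * (1 + better)
    return sorted(totals.items(), key=lambda item: item[1])


# Hypothetical toy data: (age, stage, event) rows, not taken from the paper.
real = [(65, "T2", 1), (71, "T3", 0), (58, "T1", 0), (80, "T3", 1)]
synthetic = [(65, "T2", 1), (70, "T3", 0), (59, "T1", 0), (77, "T2", 1)]
print("overlap fraction:", row_overlap_fraction(synthetic, real))

# Hypothetical candidate scores and weights, chosen only for illustration.
candidates = {
    "candidate_a": {"privacy": 0.98, "clinical": 0.95, "distribution": 0.90},
    "candidate_b": {"privacy": 0.99, "clinical": 0.80, "distribution": 0.85},
    "candidate_c": {"privacy": 0.90, "clinical": 0.85, "distribution": 0.92},
}
weights = {"privacy": 0.5, "clinical": 0.3, "distribution": 0.2}
for name, total in rank_candidates(candidates, weights):
    print(name, round(total, 2))
```

Ranking on per-metric ranks rather than raw scores keeps metrics with different scales comparable, which is one plausible reason to use a ranking system. Whether the paper aggregates its metrics this way is not stated in the record.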
DOI: 10.1016/j.compbiomed.2025.110198