A framework to create, evaluate and select synthetic datasets for survival prediction in oncology
| Published in: | Computers in biology and medicine, Vol. 192, Pt A, p. 110198 |
|---|---|
| Main authors: | , , , , , , , , , , , , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | United States: Elsevier Ltd, 1 June 2025 |
| ISSN: | 0010-4825, 1879-0534 |
| Online access: | Full text |
| Summary: | Data-driven decision-making in radiation oncology (RO) relies on integrating real-world data effectively. Synthetic data (SD), generated through machine learning, offers a solution by mimicking real-world data without compromising privacy. This paper presents a general framework for generating, evaluating, and selecting high-quality tabular SD for clinical use, focusing on survival datasets in RO.
Five retrospectively collected survival-based RO datasets (n = 1038 recurrent prostate cancer, n = 117 primary localised prostate cancer, n = 48 primary nodal positive (metastasised) prostate cancer, n = 1269 head and neck cancer, n = 353 gliomas) underwent cleaning and preparation. SD was generated using four different machine-learning models, with each model producing multiple variants. These were evaluated for privacy, clinical behaviour, and feature distribution using robust and interpretable metrics, with a single SDset being selected for each real-world dataset using a weighted ranking system.
The framework successfully generated high-quality SD for every real-world dataset, with the Tabular Variational Autoencoder producing the five best-performing SDsets when all metrics were considered. No more than 5 % of rows overlapped between each synthetic and real-world dataset. Cox proportional hazards models fitted to the real-world and synthetic datasets achieved similar concordance indexes (mean C-index 0.701 for real-world data vs 0.699 for SD), and for 4 of the 5 real-world datasets every SD hazard ratio fell within the 95 % confidence interval of its real-world counterpart.
The proposed framework enables the production and selection of SDsets that closely mirror real-world data characteristics, ensuring privacy and clinical utility in RO. This approach can facilitate data sharing in clinical research, addressing privacy-related barriers.
• A novel framework for generating and evaluating deep learning synthetic datasets.
• Framework tested on 5 clinical datasets containing survival and treatment details.
• Robust metrics assess privacy, clinical relevance, and data distribution.
• Synthetic data retains real-world characteristics, ensuring privacy and usability.
• Framework advances secure data sharing in medicine, addressing privacy issues. |
|---|---|
| DOI: | 10.1016/j.compbiomed.2025.110198 |
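
The summary mentions two steps that can be illustrated in a few lines: checking how many synthetic rows overlap with real-world rows, and selecting one SDset per real-world dataset with a weighted ranking. The record does not say how either was computed, so the following is only a minimal sketch under stated assumptions: overlap is taken to mean exact row matches, every metric is treated as "higher is better", and the function names, metric names, weights, and toy rows are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Row = Tuple[object, ...]


def row_overlap_fraction(real: Sequence[Row], synthetic: Sequence[Row]) -> float:
    """Fraction of synthetic rows that exactly match some real-world row.

    Assumption: 'overlap' means an exact match on every column.
    """
    if not synthetic:
        return 0.0
    real_rows = set(real)
    return sum(row in real_rows for row in synthetic) / len(synthetic)


def weighted_rank(scores: Dict[str, Dict[str, float]],
                  weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order candidates by the weighted sum of their per-metric ranks.

    Assumptions: every metric is 'higher is better' and rank 1 is best.
    Ties keep the candidates' input order, which is a simplification.
    """
    totals = {name: 0.0 for name in scores}
    for metric, weight in weights.items():
        ordered = sorted(scores, key=lambda name: scores[name][metric], reverse=True)
        for rank, name in enumerate(ordered, start=1):
            totals[name] += weight * rank
    # Lowest weighted rank sum first, i.e. best candidate first.
    return sorted(totals.items(), key=lambda item: item[1])


# Hypothetical toy rows: (age, tumour stage, event indicator).
real = [(63, "T2", 1), (70, "T3", 0), (58, "T1", 0)]
synthetic_a = [(63, "T2", 1), (66, "T2", 0), (71, "T3", 1), (59, "T1", 0)]
overlap = row_overlap_fraction(real, synthetic_a)

# Hypothetical metric scores for two synthetic variants of one dataset.
scores = {
    "variant_a": {"privacy": 0.75, "clinical": 0.90, "distribution": 0.80},
    "variant_b": {"privacy": 0.95, "clinical": 0.85, "distribution": 0.70},
}
weights = {"privacy": 2.0, "clinical": 2.0, "distribution": 1.0}
ranking = weighted_rank(scores, weights)
best_variant = ranking[0][0]
```

With these toy inputs one of four synthetic rows matches a real row, so a 5 % overlap threshold like the one reported in the summary would reject `synthetic_a`. Rank aggregation is used here because it avoids having to normalise metrics that live on different scales; the published framework may weight or combine its metrics differently.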