A framework to create, evaluate and select synthetic datasets for survival prediction in oncology
| Published in: | Computers in Biology and Medicine, Vol. 192, Issue Pt A, p. 110198 |
|---|---|
| Format: | Journal Article |
| Language: | English |
| Published: | United States: Elsevier Ltd, 1 June 2025 |
| ISSN: | 0010-4825, 1879-0534 |
| Summary: | Data-driven decision-making in radiation oncology (RO) relies on integrating real-world data effectively. Synthetic data (SD), generated through machine learning, offers a solution by mimicking real-world data without compromising privacy. This paper presents a general framework for generating, evaluating, and selecting high-quality tabular SD for clinical use, focusing on survival datasets in RO.
Five retrospectively collected survival-based RO datasets (n = 1038 recurrent prostate cancer, n = 117 primary localised prostate cancer, n = 48 primary nodal positive (metastasised) prostate cancer, n = 1269 head and neck cancer, n = 353 gliomas) underwent cleaning and preparation. SD was generated using four different machine-learning models, with each model producing multiple variants. These were evaluated for privacy, clinical behaviour, and feature distribution using robust and interpretable metrics, with a single SDset being selected for each real-world dataset using a weighted ranking system.
The framework successfully generated high-quality SD for every real-world dataset, with the Tabular Variational Autoencoder producing the five best-performing SDsets across all metrics. No more than 5 % of rows overlapped between each synthetic and real-world dataset. Cox proportional hazards models fitted to the real-world and synthetic datasets achieved similar concordance indexes (mean C-index = 0.701 for real-world data vs 0.699 for SD), and for 4 of the 5 real-world datasets every SD hazard ratio fell within the 95 % confidence interval of its real-world counterpart.
The proposed framework enables the production and selection of SDsets that closely mirror real-world data characteristics, ensuring privacy and clinical utility in RO. This approach can facilitate data sharing in clinical research, addressing privacy-related barriers.
• A novel framework for generating and evaluating deep learning synthetic datasets.
• Framework tested on 5 clinical datasets containing survival and treatment details.
• Robust metrics assess privacy, clinical relevance, and data distribution.
• Synthetic data retains real-world characteristics, ensuring privacy and usability.
• Framework advances secure data sharing in medicine, addressing privacy issues. |
|---|---|
| DOI: | 10.1016/j.compbiomed.2025.110198 |
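
The summary mentions two checks that are easy to picture in code: the share of rows that overlap between a synthetic and a real-world dataset, and a weighted ranking used to pick one SDset per dataset. The following is only a minimal sketch under stated assumptions, not the authors' implementation. The record does not define the overlap denominator or the ranking formula, so the function names (`row_overlap_fraction`, `weighted_rank`), the exact-match definition of overlap, the "higher score is better" convention, and all example values are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Row = Tuple  # one record, represented as a tuple of feature values


def row_overlap_fraction(real: Sequence[Row], synthetic: Sequence[Row]) -> float:
    """Fraction of synthetic rows that exactly match some real row.

    Assumption: overlap means an exact match on every feature, measured
    relative to the number of synthetic rows.
    """
    if not synthetic:
        return 0.0
    real_set = set(real)
    return sum(row in real_set for row in synthetic) / len(synthetic)


def weighted_rank(
    scores: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank candidates per metric (1 = best), then sum the weighted ranks.

    Assumption: a higher score is better for every metric. Ties are broken
    by insertion order. The lowest combined value is the selected candidate.
    """
    combined = {name: 0.0 for name in scores}
    for metric, weight in weights.items():
        ordered = sorted(scores, key=lambda n: scores[n][metric], reverse=True)
        for rank, name in enumerate(ordered, start=1):
            combined[name] += weight * rank
    return sorted(combined.items(), key=lambda item: item[1])


# Hypothetical toy data: (age, stage, event) rows.
real_rows = [(65, "T2", 1), (70, "T3", 0), (58, "T1", 0)]
synthetic_rows = [(65, "T2", 1), (66, "T2", 0), (71, "T3", 1), (59, "T1", 0)]
overlap = row_overlap_fraction(real_rows, synthetic_rows)

# Hypothetical scores for two synthetic variants of the same dataset.
candidates = {
    "variant_a": {"privacy": 0.95, "clinical": 0.70, "distribution": 0.80},
    "variant_b": {"privacy": 0.90, "clinical": 0.85, "distribution": 0.90},
}
metric_weights = {"privacy": 0.4, "clinical": 0.4, "distribution": 0.2}
ranking = weighted_rank(candidates, metric_weights)
selected = ranking[0][0]
```

In this toy example one of four synthetic rows duplicates a real row, so the overlap is 0.25, far above the 5 % ceiling reported in the summary. `variant_b` is selected because it ranks first on two of the three weighted metrics.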