Cancer classification and functional pathway discovery using TCGA transcriptomic profiles: A matched case-control framework


Bibliographic details
Published in: Journal of bioinformatics and computational biology, Vol. 23, No. 5, p. 2550015
Main authors: Wang, Jie-Huei; Guo, Tzung-Ying; Pai, Yen-Yi; Hou, Po-Lin; Kumari, Himani; Chan, Michael W Y
Format: Journal Article
Language: English
Published: Singapore, 1 October 2025
ISSN: 1757-6334
Description
Summary: Leveraging high-dimensional transcriptomic data from The Cancer Genome Atlas (TCGA) for cancer classification holds critical significance for advancing precision oncology. Matched Case-control Design (MCCD), by pairing similar cases with controls, can enhance statistical power and reduce confounding bias. However, high-dimensional data present challenges such as overfitting, instability, and difficulty in interpretation, collectively referred to as the "curse of dimensionality." Feature selection can help mitigate these problems by identifying representative variables and reducing redundancy. This study's innovation lies in integrating a set of existing techniques into a unified analytical workflow tailored specifically for MCCD, validated through both simulated and real TCGA datasets. We compared the performance of paired versus unpaired feature selection approaches under simulated 1:1 MCCD scenarios, and developed a modular, pluggable pipeline. This includes mean-centering, gene filtering, and a Corrected Feature Matrix (CFM) transformation step that explicitly preserves the matched structure. This transformation is then combined with machine learning classifiers to predict cancer status. We also incorporated Incremental Feature Selection (IFS) to refine gene subsets and employed gene set enrichment analysis to enhance biological interpretability. While the individual components we used, such as paired testing, CFM, IFS, and model-based gene set analysis, are not novel in themselves, we demonstrate an integrated workflow optimized for MCCD tasks. This workflow outperforms uncorrected approaches in terms of classification accuracy, feature stability, and interpretability. Our results indicate that this method can enhance cancer classification accuracy, facilitate biomarker discovery, and aid in building interpretable diagnostic models, providing a practical and scalable tool for precision medicine.
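The summary's comparison of paired versus unpaired feature selection in a 1:1 matched design can be illustrated with a small simulation. The following is a minimal sketch, not the authors' pipeline: the function names, the simulation parameters (`pair_sd`, `noise_sd`, `shift`), and the choice of a paired t statistic versus a Welch t statistic are all assumptions made for illustration, and the CFM and IFS steps are not reproduced here.

```python
import math
import random
import statistics


def simulate_matched_pairs(n_pairs, shift, pair_sd=2.0, noise_sd=0.5, seed=0):
    """Simulate one gene's expression in a hypothetical 1:1 matched design.

    Each matched pair shares a random baseline (the effect of matching);
    the case additionally receives `shift`. All parameters are illustrative.
    """
    rng = random.Random(seed)
    cases, controls = [], []
    for _ in range(n_pairs):
        baseline = rng.gauss(0.0, pair_sd)  # shared within the pair
        cases.append(baseline + shift + rng.gauss(0.0, noise_sd))
        controls.append(baseline + rng.gauss(0.0, noise_sd))
    return cases, controls


def paired_t(cases, controls):
    """Paired t statistic: computed on within-pair differences."""
    diffs = [a - b for a, b in zip(cases, controls)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


def welch_t(cases, controls):
    """Unpaired (Welch) t statistic: ignores which case belongs to which control."""
    se = math.sqrt(
        statistics.variance(cases) / len(cases)
        + statistics.variance(controls) / len(controls)
    )
    return (statistics.mean(cases) - statistics.mean(controls)) / se


cases, controls = simulate_matched_pairs(n_pairs=50, shift=0.5)
t_paired = paired_t(cases, controls)
t_unpaired = welch_t(cases, controls)
print(f"paired t = {t_paired:.2f}, unpaired t = {t_unpaired:.2f}")
```

In a 1:1 design both statistics share the same numerator (the mean case-control difference), so they differ only in the standard error. When matched samples are positively correlated, the within-pair differences cancel the shared baseline and the paired statistic is larger in magnitude, which is one way to see why ignoring the matched structure can cost power when ranking genes.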
DOI: 10.1142/S0219720025500155