Cancer classification and functional pathway discovery using TCGA transcriptomic profiles: A matched case-control framework
Leveraging high-dimensional transcriptomic data from The Cancer Genome Atlas (TCGA) for cancer classification holds critical significance for advancing precision oncology. Matched Case-control Design (MCCD), by pairing similar cases with controls, can enhance statistical power and reduce confounding...
| Published in: | Journal of bioinformatics and computational biology Vol. 23; no. 5; p. 2550015 |
|---|---|
| Main Authors: | Wang, Jie-Huei; Guo, Tzung-Ying; Pai, Yen-Yi; Hou, Po-Lin; Kumari, Himani; Chan, Michael W Y |
| Format: | Journal Article |
| Language: | English |
| Published: | Singapore, 01.10.2025 |
| Subjects: | |
| ISSN: | 1757-6334 |
| Abstract | Leveraging high-dimensional transcriptomic data from The Cancer Genome Atlas (TCGA) for cancer classification holds critical significance for advancing precision oncology. Matched Case-control Design (MCCD), by pairing similar cases with controls, can enhance statistical power and reduce confounding bias. However, high-dimensional data present challenges such as overfitting, instability, and difficulty in interpretation, collectively referred to as the "curse of dimensionality." Feature selection can help mitigate these problems by identifying representative variables and reducing redundancy. This study's innovation lies in integrating a set of existing techniques into a unified analytical workflow tailored specifically for MCCD, validated through both simulated and real TCGA datasets. We compared the performance of paired versus unpaired feature selection approaches under simulated 1:1 MCCD scenarios, and developed a modular, pluggable pipeline. This includes mean-centering, gene filtering, and a Corrected Feature Matrix (CFM) transformation step that explicitly preserves the matched structure. This transformation is then combined with machine learning classifiers to predict cancer status. We also incorporated Incremental Feature Selection (IFS) to refine gene subsets and employed gene set enrichment analysis to enhance biological interpretability. While the individual components we used, such as paired testing, CFM, IFS, and model-based gene set analysis, are not novel in themselves, we demonstrate an integrated workflow optimized for MCCD tasks. This workflow outperforms uncorrected approaches in terms of classification accuracy, feature stability, and interpretability. Our results indicate that this method can enhance cancer classification accuracy, facilitate biomarker discovery, and aid in building interpretable diagnostic models, providing a practical and scalable tool for precision medicine. |
|---|---|
| Author | Wang, Jie-Huei; Guo, Tzung-Ying; Chan, Michael W Y; Hou, Po-Lin; Kumari, Himani; Pai, Yen-Yi |
| Author_xml | 1. Wang, Jie-Huei (ORCID: 0000-0003-1596-8471), Department of Mathematics, National Chung Cheng University, Chiayi 621301, Taiwan; 2. Guo, Tzung-Ying (ORCID: 0009-0000-1401-6669), Department of Mathematics, National Chung Cheng University, Chiayi 621301, Taiwan; 3. Pai, Yen-Yi (ORCID: 0009-0008-4243-8576), Department of Mathematics, National Chung Cheng University, Chiayi 621301, Taiwan; 4. Hou, Po-Lin (ORCID: 0009-0006-9217-8093), Department of Mathematics, National Chung Cheng University, Chiayi 621301, Taiwan; 5. Kumari, Himani (ORCID: 0009-0003-5930-2249), Department of Biomedical Sciences, National Chung Cheng University, Chiayi 621301, Taiwan; 6. Chan, Michael W Y (ORCID: 0000-0003-0314-2437), Department of Biomedical Sciences, National Chung Cheng University, Chiayi 621301, Taiwan |
| BackLink | https://www.ncbi.nlm.nih.gov/pubmed/41083416 (View this record in MEDLINE/PubMed) |
| ContentType | Journal Article |
| DOI | 10.1142/S0219720025500155 |
| EISSN | 1757-6334 |
| ExternalDocumentID | 41083416 |
| Genre | Journal Article |
| ISICitedReferencesCount | 0 |
| ISSN | 1757-6334 |
| IsDoiOpenAccess | false |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 5 |
| Keywords | matched case-control design; model-based gene set analysis; machine learning classification; matched-pairs feature screening; incremental feature selection; TCGA; Corrected feature matrix transformation |
| Language | English |
| ORCID | 0009-0008-4243-8576 0000-0003-0314-2437 0009-0003-5930-2249 0009-0000-1401-6669 0009-0006-9217-8093 0000-0003-1596-8471 |
| OpenAccessLink | http://www.worldscientific.com/doi/abs/10.1142/S0219720025500155 |
| PMID | 41083416 |
| PublicationCentury | 2000 |
| PublicationDate | 2025-Oct 20251001 |
| PublicationDateYYYYMMDD | 2025-10-01 |
| PublicationDate_xml | – month: 10 year: 2025 text: 2025-Oct |
| PublicationDecade | 2020 |
| PublicationPlace | Singapore |
| PublicationPlace_xml | – name: Singapore |
| PublicationTitle | Journal of bioinformatics and computational biology |
| PublicationTitleAlternate | J Bioinform Comput Biol |
| PublicationYear | 2025 |
| StartPage | 2550015 |
| SubjectTerms | Algorithms; Case-Control Studies; Computational Biology - methods; Databases, Genetic; Gene Expression Profiling - methods; Humans; Machine Learning; Neoplasms - classification; Neoplasms - genetics; Neoplasms - metabolism; Transcriptome |
| Title | Cancer classification and functional pathway discovery using TCGA transcriptomic profiles: A matched case-control framework |
| URI | https://www.ncbi.nlm.nih.gov/pubmed/41083416 https://www.proquest.com/docview/3260802444 |
| Volume | 23 |
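
The abstract contrasts paired and unpaired feature selection under a simulated 1:1 matched case-control design. The sketch below is one way to illustrate why the distinction matters, and it is not the authors' implementation: the function names (`paired_t`, `welch_t`, `simulate_gene`, `rank_genes`), the simulation parameters, and the use of Welch's t-statistic as the unpaired comparator are all assumptions made for illustration. It does not implement the Corrected Feature Matrix transformation or Incremental Feature Selection, which the abstract names but does not define.

```python
import math
import random
import statistics


def paired_t(cases, controls):
    """Paired t-statistic computed on within-pair differences."""
    diffs = [c - k for c, k in zip(cases, controls)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


def welch_t(cases, controls):
    """Unpaired (Welch) t-statistic that ignores the matching."""
    se = math.sqrt(statistics.variance(cases) / len(cases)
                   + statistics.variance(controls) / len(controls))
    return (statistics.mean(cases) - statistics.mean(controls)) / se


def simulate_gene(n_pairs, effect, rng, pair_sd=3.0, noise_sd=1.0):
    """Simulate one gene for n_pairs matched pairs (hypothetical model).

    Each pair shares a baseline that stands in for the matching factors;
    the case additionally carries `effect`.
    """
    cases, controls = [], []
    for _ in range(n_pairs):
        baseline = rng.gauss(0.0, pair_sd)
        controls.append(baseline + rng.gauss(0.0, noise_sd))
        cases.append(baseline + effect + rng.gauss(0.0, noise_sd))
    return cases, controls


def rank_genes(genes, stat):
    """Return gene names sorted by decreasing absolute statistic."""
    scores = {name: abs(stat(ca, co)) for name, (ca, co) in genes.items()}
    return sorted(scores, key=scores.get, reverse=True)


rng = random.Random(0)
# 20 null genes plus one gene with a true case-control shift.
genes = {f"null_{i}": simulate_gene(50, 0.0, rng) for i in range(20)}
genes["signal"] = simulate_gene(50, 1.0, rng)

paired_rank = rank_genes(genes, paired_t)
unpaired_rank = rank_genes(genes, welch_t)
print("paired rank of signal gene:  ", paired_rank.index("signal") + 1)
print("unpaired rank of signal gene:", unpaired_rank.index("signal") + 1)
```

Both statistics share the same numerator, the mean case-control difference. They differ only in the standard error: the paired version removes the variance shared within each pair, so when matched samples are positively correlated it yields a larger absolute statistic for the same shift.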