Cancer classification and functional pathway discovery using TCGA transcriptomic profiles: A matched case-control framework

Bibliographic details
Published in: Journal of bioinformatics and computational biology, Volume 23; Issue 5; p. 2550015
Main authors: Wang, Jie-Huei, Guo, Tzung-Ying, Pai, Yen-Yi, Hou, Po-Lin, Kumari, Himani, Chan, Michael W Y
Format: Journal Article
Language: English
Published: Singapore, 1 October 2025
ISSN:1757-6334, 1757-6334
Abstract Leveraging high-dimensional transcriptomic data from The Cancer Genome Atlas (TCGA) for cancer classification holds critical significance for advancing precision oncology. Matched Case-control Design (MCCD), by pairing similar cases with controls, can enhance statistical power and reduce confounding bias. However, high-dimensional data present challenges such as overfitting, instability, and difficulty in interpretation, collectively referred to as the "curse of dimensionality." Feature selection can help mitigate these problems by identifying representative variables and reducing redundancy. This study's innovation lies in integrating a set of existing techniques into a unified analytical workflow tailored specifically for MCCD, validated through both simulated and real TCGA datasets. We compared the performance of paired versus unpaired feature selection approaches under simulated 1:1 MCCD scenarios, and developed a modular, pluggable pipeline. This includes mean-centering, gene filtering, and a Corrected Feature Matrix (CFM) transformation step that explicitly preserves the matched structure. This transformation is then combined with machine learning classifiers to predict cancer status. We also incorporated Incremental Feature Selection (IFS) to refine gene subsets and employed gene set enrichment analysis to enhance biological interpretability. While the individual components we used, such as paired testing, CFM, IFS, and model-based gene set analysis, are not novel in themselves, we demonstrate an integrated workflow optimized for MCCD tasks. This workflow outperforms uncorrected approaches in terms of classification accuracy, feature stability, and interpretability. Our results indicate that this method can enhance cancer classification accuracy, facilitate biomarker discovery, and aid in building interpretable diagnostic models, providing a practical and scalable tool for precision medicine.
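The paired-versus-unpaired comparison mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the simulation settings (50 pairs, 200 genes, 10 signal genes, a shared pair-level effect), the function names `paired_t` and `welch_t`, and the cutoff of 2.0 are all assumptions chosen for illustration, and the CFM and IFS steps are not reproduced here.

```python
import numpy as np

def paired_t(cases, controls):
    """Per-gene paired t-statistic computed from within-pair differences."""
    d = cases - controls                      # shape: (n_pairs, n_genes)
    n = d.shape[0]
    return d.mean(axis=0) / (d.std(axis=0, ddof=1) / np.sqrt(n))

def welch_t(cases, controls):
    """Per-gene unpaired (Welch) t-statistic that ignores the matching."""
    n1, n0 = cases.shape[0], controls.shape[0]
    se = np.sqrt(cases.var(axis=0, ddof=1) / n1
                 + controls.var(axis=0, ddof=1) / n0)
    return (cases.mean(axis=0) - controls.mean(axis=0)) / se

# Hypothetical 1:1 matched simulation (all settings are assumptions).
rng = np.random.default_rng(0)
n_pairs, n_genes, n_signal = 50, 200, 10

# A pair-level effect shared by each case and its matched control
# makes the two members of a pair positively correlated.
pair_effect = rng.normal(0.0, 2.0, size=(n_pairs, n_genes))
controls = pair_effect + rng.normal(0.0, 0.5, size=(n_pairs, n_genes))
cases = pair_effect + rng.normal(0.0, 0.5, size=(n_pairs, n_genes))
cases[:, :n_signal] += 0.5                    # true shift in the first 10 genes

t_paired = paired_t(cases, controls)
t_unpaired = welch_t(cases, controls)

# Count true signal genes exceeding an illustrative |t| cutoff.
cutoff = 2.0
det_paired = int(np.sum(np.abs(t_paired[:n_signal]) > cutoff))
det_unpaired = int(np.sum(np.abs(t_unpaired[:n_signal]) > cutoff))
print(f"signal genes detected: paired={det_paired}, unpaired={det_unpaired}")
```

In a 1:1 design both statistics share the same numerator, because the mean within-pair difference equals the difference of the group means. They differ only in the standard error, which is smaller for the paired test whenever matched samples are positively correlated. This is one way to see why respecting the matched structure can raise statistical power.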
Author_xml – sequence: 1
  givenname: Jie-Huei
  orcidid: 0000-0003-1596-8471
  surname: Wang
  fullname: Wang, Jie-Huei
  organization: Department of Mathematics, National Chung Cheng University, Chiayi 621301, Taiwan
– sequence: 2
  givenname: Tzung-Ying
  orcidid: 0009-0000-1401-6669
  surname: Guo
  fullname: Guo, Tzung-Ying
  organization: Department of Mathematics, National Chung Cheng University, Chiayi 621301, Taiwan
– sequence: 3
  givenname: Yen-Yi
  orcidid: 0009-0008-4243-8576
  surname: Pai
  fullname: Pai, Yen-Yi
  organization: Department of Mathematics, National Chung Cheng University, Chiayi 621301, Taiwan
– sequence: 4
  givenname: Po-Lin
  orcidid: 0009-0006-9217-8093
  surname: Hou
  fullname: Hou, Po-Lin
  organization: Department of Mathematics, National Chung Cheng University, Chiayi 621301, Taiwan
– sequence: 5
  givenname: Himani
  orcidid: 0009-0003-5930-2249
  surname: Kumari
  fullname: Kumari, Himani
  organization: Department of Biomedical Sciences, National Chung Cheng University, Chiayi 621301, Taiwan
– sequence: 6
  givenname: Michael W Y
  orcidid: 0000-0003-0314-2437
  surname: Chan
  fullname: Chan, Michael W Y
  organization: Department of Biomedical Sciences, National Chung Cheng University, Chiayi 621301, Taiwan
BackLink https://www.ncbi.nlm.nih.gov/pubmed/41083416$$D View this record in MEDLINE/PubMed
ContentType Journal Article
DBID CGR
CUY
CVF
ECM
EIF
NPM
7X8
DOI 10.1142/S0219720025500155
DatabaseName Medline
MEDLINE
MEDLINE (Ovid)
MEDLINE
MEDLINE
PubMed
MEDLINE - Academic
DatabaseTitle MEDLINE
Medline Complete
MEDLINE with Full Text
PubMed
MEDLINE (Ovid)
MEDLINE - Academic
DatabaseTitleList MEDLINE
MEDLINE - Academic
Database_xml – sequence: 1
  dbid: NPM
  name: PubMed
  url: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
  sourceTypes: Index Database
– sequence: 2
  dbid: 7X8
  name: MEDLINE - Academic
  url: https://search.proquest.com/medline
  sourceTypes: Aggregation Database
DeliveryMethod no_fulltext_linktorsrc
EISSN 1757-6334
ExternalDocumentID 41083416
Genre Journal Article
IEDL.DBID 7X8
ISICitedReferencesCount 0
ISSN 1757-6334
IngestDate Wed Oct 15 09:51:23 EDT 2025
Wed Oct 15 11:49:58 EDT 2025
IsDoiOpenAccess false
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 5
Keywords matched case-control design
model-based gene set analysis
machine learning classification
matched-pairs feature screening
incremental feature selection
TCGA
Corrected feature matrix transformation
Language English
LinkModel DirectLink
OpenAccessLink http://www.worldscientific.com/doi/abs/10.1142/S0219720025500155
PMID 41083416
PQID 3260802444
PQPubID 23479
ParticipantIDs proquest_miscellaneous_3260802444
pubmed_primary_41083416
PublicationCentury 2000
PublicationDate 2025-Oct
20251001
PublicationDateYYYYMMDD 2025-10-01
PublicationDate_xml – month: 10
  year: 2025
  text: 2025-Oct
PublicationDecade 2020
PublicationPlace Singapore
PublicationPlace_xml – name: Singapore
PublicationTitle Journal of bioinformatics and computational biology
PublicationTitleAlternate J Bioinform Comput Biol
PublicationYear 2025
Score 2.3809135
SourceID proquest
pubmed
SourceType Aggregation Database
Index Database
StartPage 2550015
SubjectTerms Algorithms
Case-Control Studies
Computational Biology - methods
Databases, Genetic
Gene Expression Profiling - methods
Humans
Machine Learning
Neoplasms - classification
Neoplasms - genetics
Neoplasms - metabolism
Transcriptome
Title Cancer classification and functional pathway discovery using TCGA transcriptomic profiles: A matched case-control framework
URI https://www.ncbi.nlm.nih.gov/pubmed/41083416
https://www.proquest.com/docview/3260802444
Volume 23