Support Vector Machine – Recursive Feature Elimination for Feature Selection on Multi-omics Lung Cancer Data

Bibliographic Details
Published in:Progress in Microbes and Molecular Biology Vol. 6; no. 1
Main Authors: Azman, Nuraina Syaza, A Samah, Azurah, Lin, Ji Tong, Abdul Majid, Hairudin, Ali Shah, Zuraini, Wen, Nies Hui, Howe, Chan Weng
Format: Journal Article
Language:English
Published: HH Publisher 04.04.2023
ISSN:2637-1049
Abstract Biological data obtained from sequencing technologies is growing exponentially. Multi-omics data is one type of biological data that is high-dimensional, a problem commonly known as the curse of dimensionality, which arises when a dataset contains many features or attributes but significantly fewer samples or observations. This study aims to mitigate the curse of dimensionality by applying Support Vector Machine – Recursive Feature Elimination (SVM-RFE) as the feature selection method on a lung cancer (LUSC) multi-omics dataset integrated from three single-omics datasets (genomics, transcriptomics and epigenomics), and assesses the quality of the selected feature subsets using stacked denoising autoencoder (SDAE) and variational autoencoder (VAE) deep learning classifiers. The LUSC datasets first undergo pre-processing, which includes checking for missing values, normalization, and removal of zero-variance features. The cleaned LUSC datasets are then integrated to form a multi-omics dataset. Feature selection is performed on the LUSC multi-omics data using SVM-RFE to select several candidate feature subsets. The five smallest feature subsets (FS) are then used for classification with SDAE and VAE neural networks to assess their quality. The results show that all 5 VAE models obtain an accuracy and AUC score of 1.000, while only 2 out of 5 SDAE models (FS 1000 and FS 4000) do so. The other 3 SDAE models have an AUC score of 0.500, indicating no ability to separate the binary class labels beyond chance. The study concludes that, for this specific dataset, a fine-tuned supervised VAE model classifies better than the SDAE models. Additionally, the subsets of 1000 and 4000 features are the two best-performing feature subsets selected by the SVM-RFE algorithm: both the SDAE and VAE models built with these subsets achieve the best classification results.
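The pre-processing and SVM-RFE steps described in the abstract can be sketched roughly as follows. This is a minimal illustration using scikit-learn on synthetic data, not the authors' implementation: the sample and feature counts, min-max normalization, the linear kernel, the `step` value, and the target subset size of 20 are all assumptions chosen for the example. The final lines illustrate why an AUC of 0.500 signals chance-level separation: a model that gives every sample the same score yields exactly that value.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for a high-dimensional dataset (assumption):
# few samples, many features, binary labels.
X, y = make_classification(
    n_samples=60, n_features=200, n_informative=10, random_state=0
)

# Append two constant columns to mimic zero-variance features.
X = np.hstack([X, np.zeros((X.shape[0], 2))])

# Pre-processing: check for missing values, remove zero-variance
# features, then normalize. Min-max scaling is an assumed choice; in a
# real evaluation the scaler would be fitted on the training split only.
assert not np.isnan(X).any()
X = VarianceThreshold(threshold=0.0).fit_transform(X)
X = MinMaxScaler().fit_transform(X)

# SVM-RFE: a linear SVM ranks features by the magnitude of its weights,
# and RFE repeatedly drops the lowest-ranked 10% of the initial features
# until 20 remain.
selector = RFE(estimator=SVC(kernel="linear"), n_features_to_select=20, step=0.1)
selector.fit(X, y)
X_selected = selector.transform(X)

print("before selection:", X.shape)
print("after selection: ", X_selected.shape)

# A classifier that outputs the same score for every sample cannot rank
# positives above negatives, so its AUC is 0.5.
constant_scores = np.zeros(len(y))
print("AUC of constant scores:", roc_auc_score(y, constant_scores))
```

In practice, RFE would be run several times with different values of `n_features_to_select` to produce a series of candidate subsets, each of which is then passed to a downstream classifier for comparison.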
DOI: 10.36877/pmmb.a0000327
License: https://creativecommons.org/licenses/by-nc/4.0
URI: https://doaj.org/article/aab9b943e6ad44acb058fb26f710dbf7