Identification of Orphan Genes in Unbalanced Datasets Based on Ensemble Learning
Orphan genes are associated with regulatory patterns, but experimental methods for identifying orphan genes are both time-consuming and expensive. Designing an accurate and robust classification model to detect orphan and non-orphan genes in unbalanced distribution datasets poses a particularly huge...
| Published in: | Frontiers in genetics Vol. 11; p. 820 |
|---|---|
| Main authors: | Gao, Qijuan; Jin, Xiu; Xia, Enhua; Wu, Xiangwei; Gu, Lichuan; Yan, Hanwei; Xia, Yingchun; Li, Shaowen |
| Format: | Journal Article |
| Language: | English |
| Publication details: | Frontiers Media S.A, 2 October 2020 |
| Subjects: | |
| ISSN: | 1664-8021 |
| Online access: | Get full text |
| Abstract | Orphan genes are associated with regulatory patterns, but experimental methods for identifying orphan genes are both time-consuming and expensive. Designing an accurate and robust classification model to detect orphan and non-orphan genes in datasets with an unbalanced class distribution is particularly challenging. Synthetic minority over-sampling (SMOTE) algorithms are applied as a preliminary step to deal with unbalanced gene datasets. To identify orphan genes in balanced and unbalanced Arabidopsis thaliana gene datasets, SMOTE algorithms were then combined with traditional and advanced ensemble classification algorithms: Support Vector Machine, Random Forest (RF), AdaBoost (adaptive boosting), GBDT (gradient boosting decision tree), and XGBoost (extreme gradient boosting). In a comparison of these models, SMOTE with XGBoost achieved an F1 score of 0.94 on the balanced A. thaliana gene datasets but a lower score on the unbalanced datasets. The proposed ensemble method pairs each of several data-balancing algorithms, including Borderline SMOTE (BSMOTE), Adaptive Synthetic Sampling (ADASYN), SMOTE-Tomek, and SMOTE-ENN, with the XGBoost model. The SMOTE-ENN-XGBoost model, which combines over-sampling and under-sampling with XGBoost, achieved higher predictive accuracy than XGBoost paired with the other balancing algorithms. Thus, SMOTE-ENN-XGBoost provides a theoretical basis for developing evaluation criteria for identifying orphan genes in unbalanced biological datasets. |
|---|---|
| Author | Yan, Hanwei Gao, Qijuan Jin, Xiu Xia, Enhua Wu, Xiangwei Xia, Yingchun Li, Shaowen Gu, Lichuan |
| AuthorAffiliation | 1 Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agricultural University, Hefei, China; 2 State Key Laboratory of Tea Plant Biology and Utilization, Anhui Agricultural University, Hefei, China; 3 School of Resources and Environment, Anhui Agricultural University, Hefei, China; 4 School of Information and Computer Science, Anhui Agricultural University, Hefei, China; 5 Key Laboratory of Crop Biology of Anhui Province, Anhui Agricultural University, Hefei, China |
| Author_xml | – sequence: 1 givenname: Qijuan surname: Gao fullname: Gao, Qijuan – sequence: 2 givenname: Xiu surname: Jin fullname: Jin, Xiu – sequence: 3 givenname: Enhua surname: Xia fullname: Xia, Enhua – sequence: 4 givenname: Xiangwei surname: Wu fullname: Wu, Xiangwei – sequence: 5 givenname: Lichuan surname: Gu fullname: Gu, Lichuan – sequence: 6 givenname: Hanwei surname: Yan fullname: Yan, Hanwei – sequence: 7 givenname: Yingchun surname: Xia fullname: Xia, Yingchun – sequence: 8 givenname: Shaowen surname: Li fullname: Li, Shaowen |
| CitedBy_id | crossref_primary_10_1039_D5SC00270B crossref_primary_10_1080_19475705_2024_2314565 crossref_primary_10_3390_s23041811 crossref_primary_10_3390_su132212613 crossref_primary_10_3389_fpsyt_2021_793505 crossref_primary_10_3390_jdb11020027 crossref_primary_10_1177_09670335241258667 crossref_primary_10_3389_fneur_2023_1325941 crossref_primary_10_3390_plants14131947 crossref_primary_10_1016_j_algal_2024_103603 crossref_primary_10_1016_j_dental_2023_11_013 crossref_primary_10_3389_fpls_2022_947129 crossref_primary_10_1109_ACCESS_2024_3446992 crossref_primary_10_1016_j_eswa_2023_122778 crossref_primary_10_3390_electronics12061433 crossref_primary_10_3390_plants12162893 crossref_primary_10_3390_fi17090427 crossref_primary_10_1371_journal_pone_0291260 crossref_primary_10_3389_fphar_2024_1334929 |
| Cites_doi | 10.1111/j.1365-313X.2009.03793.x 10.1186/1471-2164-14-117 10.1038/35048692 10.1042/bst0370778 10.17933/jppi.2019.090103 10.1186/s12864-015-2211-z 10.1016/j.tplants.2014.07.003 10.1145/1007730.1007735 10.1016/j.cub.2014.04.042 10.1016/j.tig.2009.07.006 10.1016/j.ygeno.2019.08.003 10.1104/pp.15.01056 10.1128/MMBR.0001610 10.1145/1007730.1007734 10.1007/s10142-013-0345340 10.1186/1471-2105-13134 10.1007/bf00058655 10.1038/nrg3053 10.1126/science.1068037 10.1186/1471-2164-14-65 10.1145/2939672.2939785 10.1002/bies.201300007 10.1186/1471-2105-10-S1-S21 10.1093/bioinformatics/btl344 10.1016/j.cmpb.2015.03.003 10.1534/genetics.116.188201 10.1613/jair.953 10.3389/fgene.2019.01077 10.1016/S0022-2836(05)80360-2 10.1038/nrg3920 10.1126/science.1128691 10.1109/Tkde.2006.17 10.1186/1471-2148-11-47 10.3389/fgene.2019.00600 10.1186/1471-2148-10-41 |
| ContentType | Journal Article |
| Copyright | Copyright © 2020 Gao, Jin, Xia, Wu, Gu, Yan, Xia and Li. |
| DBID | AAYXX CITATION 7X8 5PM DOA |
| DOI | 10.3389/fgene.2020.00820 |
| DatabaseName | CrossRef MEDLINE - Academic PubMed Central (Full Participant titles) DOAJ Directory of Open Access Journals |
| DatabaseTitle | CrossRef MEDLINE - Academic |
| DatabaseTitleList | MEDLINE - Academic |
| Database_xml | – sequence: 1 dbid: DOA name: DOAJ Directory of Open Access Journals url: https://www.doaj.org/ sourceTypes: Open Website – sequence: 2 dbid: 7X8 name: MEDLINE - Academic url: https://search.proquest.com/medline sourceTypes: Aggregation Database |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Biology |
| EISSN | 1664-8021 |
| ExternalDocumentID | oai_doaj_org_article_939c6c8554a44217a960a33fc915c6a4 PMC7567012 10_3389_fgene_2020_00820 |
| GrantInformation_xml | – fundername: State Key Laboratory of Tea Plant Biology and Utilization grantid: SKLTOF20190101 |
| GroupedDBID | 53G 5VS 9T4 AAFWJ AAKDD AAYXX ACGFS ADBBV ADRAZ AFPKN ALMA_UNASSIGNED_HOLDINGS AOIJS BAWUL BCNDV CITATION DIK EMOBN GROUPED_DOAJ GX1 HYE KQ8 M48 M~E OK1 PGMZT RNS RPM 7X8 5PM |
| ID | FETCH-LOGICAL-c439t-6519bdf7c2b53d5c79142b0452ea810f76e3a8eac2f72b90aa419e8e3f55dff93 |
| IEDL.DBID | DOA |
| ISICitedReferencesCount | 22 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000578264100001&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| ISSN | 1664-8021 |
| IngestDate | Fri Oct 03 12:53:38 EDT 2025 Tue Sep 30 15:54:04 EDT 2025 Fri Sep 05 07:30:52 EDT 2025 Tue Nov 18 21:44:30 EST 2025 Sat Nov 29 03:49:33 EST 2025 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Language | English |
| License | This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-c439t-6519bdf7c2b53d5c79142b0452ea810f76e3a8eac2f72b90aa419e8e3f55dff93 |
| Notes | Edited by: Tao Huang, Shanghai Institute for Biological Sciences (CAS), China. Reviewed by: Jun Jiang, Fudan University, China; Jing Ding, Nanjing Agricultural University, China; Xiaohui Zhang, Nanjing University, China. These authors have contributed equally to this work. This article was submitted to Systems Biology, a section of the journal Frontiers in Genetics. |
| OpenAccessLink | https://doaj.org/article/939c6c8554a44217a960a33fc915c6a4 |
| PMID | 33133122 |
| PQID | 2456852643 |
| PQPubID | 23479 |
| ParticipantIDs | doaj_primary_oai_doaj_org_article_939c6c8554a44217a960a33fc915c6a4 pubmedcentral_primary_oai_pubmedcentral_nih_gov_7567012 proquest_miscellaneous_2456852643 crossref_citationtrail_10_3389_fgene_2020_00820 crossref_primary_10_3389_fgene_2020_00820 |
| PublicationCentury | 2000 |
| PublicationDate | 2020-10-02 |
| PublicationDateYYYYMMDD | 2020-10-02 |
| PublicationDate_xml | – month: 10 year: 2020 text: 2020-10-02 day: 02 |
| PublicationDecade | 2020 |
| PublicationTitle | Frontiers in genetics |
| PublicationYear | 2020 |
| Publisher | Frontiers Media S.A |
| Publisher_xml | – name: Frontiers Media S.A |
| SSID | ssj0000493334 |
| Score | 2.3840241 |
| SourceID | doaj pubmedcentral proquest crossref |
| SourceType | Open Website Open Access Repository Aggregation Database Enrichment Source Index Database |
| StartPage | 820 |
| SubjectTerms | ensemble learning; Genetics; orphan genes; two-class; unbalanced dataset; XGBoost model |
| Title | Identification of Orphan Genes in Unbalanced Datasets Based on Ensemble Learning |
| URI | https://www.proquest.com/docview/2456852643 https://pubmed.ncbi.nlm.nih.gov/PMC7567012 https://doaj.org/article/939c6c8554a44217a960a33fc915c6a4 |
| Volume | 11 |
| WOSCitedRecordID | wos000578264100001&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| journalDatabaseRights | – providerCode: PRVAON databaseName: DOAJ Directory of Open Access Journals customDbUrl: eissn: 1664-8021 dateEnd: 99991231 omitProxy: false ssIdentifier: ssj0000493334 issn: 1664-8021 databaseCode: DOA dateStart: 20100101 isFulltext: true titleUrlDefault: https://www.doaj.org/ providerName: Directory of Open Access Journals – providerCode: PRVHPJ databaseName: ROAD: Directory of Open Access Scholarly Resources customDbUrl: eissn: 1664-8021 dateEnd: 99991231 omitProxy: false ssIdentifier: ssj0000493334 issn: 1664-8021 databaseCode: M~E dateStart: 20100101 isFulltext: true titleUrlDefault: https://road.issn.org providerName: ISSN International Centre |
| linkProvider | Directory of Open Access Journals |
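
The abstract describes SMOTE-ENN as a combination of over-sampling and under-sampling applied before XGBoost. The resampling idea can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the function names, the two-class assumption, the choice of `k`, and the majority-vote cleaning rule are all assumptions made here, and the XGBoost classification step is not shown. It also assumes the minority class has at least two samples.

```python
import math
import random


def k_nearest(idx, X, k, candidates):
    """Indices of the k candidates closest to X[idx] (Euclidean), excluding idx."""
    others = [j for j in candidates if j != idx]
    others.sort(key=lambda j: math.dist(X[idx], X[j]))
    return others[:k]


def smote(X, y, minority, k=3, seed=0):
    """Over-sample the minority class until the two classes have equal size."""
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    # Two-class assumption: everything that is not minority is majority.
    n_new = (len(y) - len(min_idx)) - len(min_idx)
    X_out, y_out = list(X), list(y)
    for _ in range(max(n_new, 0)):
        i = rng.choice(min_idx)
        j = rng.choice(k_nearest(i, X, k, min_idx))
        gap = rng.random()
        # The synthetic point lies on the segment between a minority
        # sample and one of its minority-class neighbours.
        X_out.append(tuple(a + gap * (b - a) for a, b in zip(X[i], X[j])))
        y_out.append(minority)
    return X_out, y_out


def edited_nearest_neighbours(X, y, k=3):
    """Under-sample: drop samples whose label loses the vote of their k neighbours."""
    everyone = range(len(X))
    keep = []
    for i in everyone:
        votes = [y[j] for j in k_nearest(i, X, k, everyone)]
        if votes.count(y[i]) * 2 > len(votes):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_enn(X, y, minority, k=3, seed=0):
    """Over-sample with SMOTE, then clean the result with ENN."""
    X_bal, y_bal = smote(X, y, minority, k, seed)
    return edited_nearest_neighbours(X_bal, y_bal, k)


if __name__ == "__main__":
    # Hypothetical toy data: six majority points, three minority points.
    X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
         (0.2, 0.8), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
    y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
    X_res, y_res = smote_enn(X, y, minority=1)
    print(len(X_res), y_res.count(0), y_res.count(1))
```

In practice the resampled `X_res` and `y_res` would be passed to a classifier such as XGBoost for training only, with the test split left untouched so that scores like F1 reflect the original class distribution.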