Identification of Orphan Genes in Unbalanced Datasets Based on Ensemble Learning

Orphan genes are associated with regulatory patterns, but experimental methods for identifying orphan genes are both time-consuming and expensive. Designing an accurate and robust classification model to detect orphan and non-orphan genes in unbalanced distribution datasets poses a particularly huge...

Bibliographic details
Published in: Frontiers in Genetics, Volume 11; p. 820
Main authors: Gao, Qijuan; Jin, Xiu; Xia, Enhua; Wu, Xiangwei; Gu, Lichuan; Yan, Hanwei; Xia, Yingchun; Li, Shaowen
Format: Journal Article
Language: English
Publication details: Frontiers Media S.A, 2020-10-02
ISSN: 1664-8021
Abstract Orphan genes are associated with regulatory patterns, but experimental methods for identifying orphan genes are both time-consuming and expensive. Designing an accurate and robust classification model to distinguish orphan from non-orphan genes in datasets with an unbalanced class distribution is particularly challenging. The synthetic minority over-sampling technique (SMOTE) was selected as a preliminary step to deal with unbalanced gene datasets. To identify orphan genes in balanced and unbalanced Arabidopsis thaliana gene datasets, SMOTE was then combined with traditional and advanced classification algorithms, namely Support Vector Machine (SVM), Random Forest (RF), AdaBoost (adaptive boosting), GBDT (gradient boosting decision tree), and XGBoost (extreme gradient boosting). In a comparison of these models, SMOTE with XGBoost achieved an F1 score of 0.94 on the balanced A. thaliana gene datasets, but a lower score on the unbalanced datasets. The proposed ensemble method combines different data-balancing algorithms, including Borderline SMOTE (BSMOTE), Adaptive Synthetic Sampling (ADASYN), SMOTE-Tomek, and SMOTE-ENN, each separately with the XGBoost model. The SMOTE-ENN-XGBoost model, which combines over-sampling and under-sampling with XGBoost, achieved higher predictive accuracy than the other balancing algorithms combined with XGBoost. Thus, SMOTE-ENN-XGBoost provides a theoretical basis for developing evaluation criteria for identifying orphan genes in unbalanced biological datasets.
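The resampling step named in the abstract (SMOTE over-sampling followed by Edited Nearest Neighbours cleaning) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation: the function names (`nearest`, `smote`, `enn`, `smote_enn`), the toy two-feature data, the majority-vote ENN rule, and the choice of k = 3 are all assumptions made for illustration, and the XGBoost classifier that would be trained on the resampled data is left out. Libraries such as imbalanced-learn offer complete implementations of these samplers.

```python
import math
import random


def nearest(point, pool, k, skip=None):
    """Indices of the k points in pool closest to point (Euclidean), ignoring index skip."""
    candidates = [i for i in range(len(pool)) if i != skip]
    candidates.sort(key=lambda i: math.dist(point, pool[i]))
    return candidates[:k]


def smote(minority, n_new, k=3, rng=None):
    """SMOTE: build n_new synthetic points on segments between minority neighbours.

    Assumes the minority class has at least two samples.
    """
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(nearest(minority[i], minority, k, skip=i))
        gap = rng.random()  # position along the segment from sample i to neighbour j
        synthetic.append([a + gap * (b - a) for a, b in zip(minority[i], minority[j])])
    return synthetic


def enn(X, y, k=3):
    """Edited Nearest Neighbours (majority-vote variant): drop every sample
    whose label disagrees with most of its k nearest neighbours."""
    kept = []
    for i in range(len(X)):
        votes = [y[j] for j in nearest(X[i], X, k, skip=i)]
        if 2 * votes.count(y[i]) > len(votes):
            kept.append(i)
    return [X[i] for i in kept], [y[i] for i in kept]


def smote_enn(X, y, minority_label=1, k=3, seed=0):
    """Over-sample the minority class to parity with SMOTE, then clean with ENN."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_new = max(len(X) - 2 * len(minority), 0)  # majority count minus minority count
    synthetic = smote(minority, n_new, k, random.Random(seed))
    return enn(list(X) + synthetic, list(y) + [minority_label] * len(synthetic), k)


# Hypothetical toy data: label 1 stands for "orphan" (minority), 0 for "non-orphan".
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.1, 0.0], [0.0, 0.3],
     [5.0, 5.0], [5.2, 5.1], [5.1, 4.9]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
X_bal, y_bal = smote_enn(X, y)
print(len(X), "->", len(X_bal), "samples;", y_bal.count(1), "labelled as minority")
```

In a full pipeline, resampling of this kind would be applied to the training split only, so that the F1 score reported on held-out data reflects the original class distribution.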
Author Yan, Hanwei
Gao, Qijuan
Jin, Xiu
Xia, Enhua
Wu, Xiangwei
Xia, Yingchun
Li, Shaowen
Gu, Lichuan
AuthorAffiliation 1 Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agriculture University, Hefei, China
2 State Key Laboratory of Tea Plant Biology and Utilization, Anhui Agricultural University, Hefei, China
3 School of Resources and Environment, Anhui Agricultural University, Hefei, China
4 School of Information and Computer Science, Anhui Agricultural University, Hefei, China
5 Key Laboratory of Crop Biology of Anhui Province, Anhui Agricultural University, Hefei, China
CitedBy_id crossref_primary_10_1039_D5SC00270B
crossref_primary_10_1080_19475705_2024_2314565
crossref_primary_10_3390_s23041811
crossref_primary_10_3390_su132212613
crossref_primary_10_3389_fpsyt_2021_793505
crossref_primary_10_3390_jdb11020027
crossref_primary_10_1177_09670335241258667
crossref_primary_10_3389_fneur_2023_1325941
crossref_primary_10_3390_plants14131947
crossref_primary_10_1016_j_algal_2024_103603
crossref_primary_10_1016_j_dental_2023_11_013
crossref_primary_10_3389_fpls_2022_947129
crossref_primary_10_1109_ACCESS_2024_3446992
crossref_primary_10_1016_j_eswa_2023_122778
crossref_primary_10_3390_electronics12061433
crossref_primary_10_3390_plants12162893
crossref_primary_10_3390_fi17090427
crossref_primary_10_1371_journal_pone_0291260
crossref_primary_10_3389_fphar_2024_1334929
ContentType Journal Article
Copyright Copyright © 2020 Gao, Jin, Xia, Wu, Gu, Yan, Xia and Li.
DBID AAYXX
CITATION
7X8
5PM
DOA
DOI 10.3389/fgene.2020.00820
DatabaseName CrossRef
MEDLINE - Academic
PubMed Central (Full Participant titles)
DOAJ Directory of Open Access Journals
DatabaseTitle CrossRef
MEDLINE - Academic
DatabaseTitleList MEDLINE - Academic


Database_xml – sequence: 1
  dbid: DOA
  name: DOAJ Directory of Open Access Journals
  url: https://www.doaj.org/
  sourceTypes: Open Website
– sequence: 2
  dbid: 7X8
  name: MEDLINE - Academic
  url: https://search.proquest.com/medline
  sourceTypes: Aggregation Database
DeliveryMethod fulltext_linktorsrc
Discipline Biology
EISSN 1664-8021
ExternalDocumentID oai_doaj_org_article_939c6c8554a44217a960a33fc915c6a4
PMC7567012
10_3389_fgene_2020_00820
GrantInformation_xml – fundername: State Key Laboratory of Tea Plant Biology and Utilization
  grantid: SKLTOF20190101
ISICitedReferencesCount 22
ISSN 1664-8021
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Language English
License This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
LinkModel DirectLink
Notes Edited by: Tao Huang, Shanghai Institute for Biological Sciences (CAS), China
Reviewed by: Jun Jiang, Fudan University, China; Jing Ding, Nanjing Agricultural University, China; Xiaohui Zhang, Nanjing University, China
These authors have contributed equally to this work
This article was submitted to Systems Biology, a section of the journal Frontiers in Genetics
OpenAccessLink https://doaj.org/article/939c6c8554a44217a960a33fc915c6a4
PMID 33133122
PQID 2456852643
PQPubID 23479
PublicationCentury 2000
PublicationDate 2020-10-02
PublicationDecade 2020
PublicationTitle Frontiers in genetics
PublicationYear 2020
Publisher Frontiers Media S.A
StartPage 820
SubjectTerms ensemble learning
Genetics
orphan genes
two-class
unbalanced dataset
XGBoost model
Title Identification of Orphan Genes in Unbalanced Datasets Based on Ensemble Learning
URI https://www.proquest.com/docview/2456852643
https://pubmed.ncbi.nlm.nih.gov/PMC7567012
https://doaj.org/article/939c6c8554a44217a960a33fc915c6a4
Volume 11