Assessing simulation-based supervised machine learning for demographic parameter inference from genomic data.

Bibliographic Details
Title: Assessing simulation-based supervised machine learning for demographic parameter inference from genomic data.
Authors: Quelin A; UMR 7206 Eco-Anthropologie (EA), CNRS, Muséum National d'Histoire Naturelle, Université Paris Cité, Paris, France. arnaud.quelin@mnhn.fr.; UMR 9015 - Laboratoire Interdisciplinaire des Sciences du Numérique (LISN), CNRS, INRIA, Université Paris-Saclay, Orsay, France. arnaud.quelin@mnhn.fr., Austerlitz F; UMR 7206 Eco-Anthropologie (EA), CNRS, Muséum National d'Histoire Naturelle, Université Paris Cité, Paris, France., Jay F; UMR 9015 - Laboratoire Interdisciplinaire des Sciences du Numérique (LISN), CNRS, INRIA, Université Paris-Saclay, Orsay, France.
Source: Heredity [Heredity (Edinb)] 2025 Jul; Vol. 134 (7), pp. 417-426. Date of Electronic Publication: 2025 Jun 06.
Publication Type: Journal Article
Language: English
Journal Info: Publisher: Nature Publishing Group Country of Publication: England NLM ID: 0373007 Publication Model: Print-Electronic Cited Medium: Internet ISSN: 1365-2540 (Electronic) Linking ISSN: 0018067X NLM ISO Abbreviation: Heredity (Edinb) Subsets: MEDLINE
Imprint Name(s): Publication: <2003->: London : Nature Publishing Group
Original Publication: London, Oliver and Boyd.
MeSH Terms: Supervised Machine Learning* , Computer Simulation* , Genetics, Population*/methods , Genomics*/methods , Machine Learning* ; Models, Genetic ; Bayes Theorem ; Humans ; Algorithms ; Neural Networks, Computer
Abstract: The ever-increasing availability of high-throughput DNA sequences and the development of numerous computational methods have led to considerable advances in our understanding of the evolutionary and demographic history of populations. Several demographic inference methods have been developed to take advantage of these massive genomic data. Simulation-based approaches, such as approximate Bayesian computation (ABC), have proved particularly efficient for complex demographic models. However, taking full advantage of the comprehensive information contained in massive genomic data remains a challenge for demographic inference methods, which generally rely on partial information from these data. Using advanced computational methods, such as machine learning, is valuable for efficiently integrating more comprehensive information. Here, we showed how simulation-based supervised machine learning methods applied to an extensive range of summary statistics are effective in inferring demographic parameters for connected populations. We compared three machine learning (ML) methods: a neural network, the multilayer perceptron (MLP), and two ensemble methods, random forest (RF) and the gradient boosting system XGBoost (XGB), to infer demographic parameters from genomic data under a standard isolation with migration model and a secondary contact model with varying population sizes. We showed that MLP outperformed the other two methods and that, on the basis of permutation feature importance, its predictions involved a larger combination of summary statistics. Moreover, they outperformed all three tested ABC algorithms. Finally, we demonstrated how a method called SHAP, from the field of explainable artificial intelligence, can be used to shed light on the contribution of summary statistics within the ML models.
(© 2025. The Author(s).)
Competing Interests: The authors declare no competing interests.
Grant Information: ANR-20-CE45-0010-01 RoDAPoG Agence Nationale de la Recherche (French National Research Agency)
Entry Date(s): Date Created: 20250605 Date Completed: 20250702 Latest Revision: 20250704
Update Code: 20250704
PubMed Central ID: PMC12216105
DOI: 10.1038/s41437-025-00773-x
PMID: 40473775
Database: MEDLINE
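
Illustrative sketch (not part of the record): the abstract describes training supervised learners on summary statistics computed from simulations, then ranking those statistics by permutation feature importance. The toy example below shows that workflow in miniature, using only the Python standard library. Everything in it is an assumption made for illustration: the simulator, the two summary statistics, and the parameter are hypothetical, and a simple k-nearest-neighbour regressor stands in for the MLP, RF, and XGBoost models that the study actually compared.

```python
import random

def simulate(rng, n):
    """Toy simulator (hypothetical): draw a parameter from a uniform prior
    and return ([summary statistics], parameter) pairs."""
    data = []
    for _ in range(n):
        theta = rng.uniform(0.0, 1.0)                # stand-in demographic parameter
        informative = theta + rng.gauss(0.0, 0.05)   # statistic that tracks theta
        noise = rng.uniform(0.0, 1.0)                # statistic unrelated to theta
        data.append(([informative, noise], theta))
    return data

def knn_predict(train, x, k=10):
    """Predict theta as the mean parameter of the k nearest training simulations."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)),
    )[:k]
    return sum(theta for _, theta in nearest) / k

def mse(train, test):
    """Mean squared error of the k-NN predictions on the test simulations."""
    return sum((knn_predict(train, x) - theta) ** 2 for x, theta in test) / len(test)

def permutation_importance(train, test, feature, rng):
    """Increase in test MSE after shuffling one summary statistic across test rows."""
    column = [x[feature] for x, _ in test]
    rng.shuffle(column)
    permuted = []
    for (x, theta), value in zip(test, column):
        x_new = list(x)
        x_new[feature] = value
        permuted.append((x_new, theta))
    return mse(train, permuted) - mse(train, test)

rng = random.Random(0)          # fixed seed so the run is repeatable
train = simulate(rng, 500)      # simulations used to fit the predictor
test = simulate(rng, 100)       # held-out simulations used for evaluation
baseline = mse(train, test)
importances = [permutation_importance(train, test, j, rng) for j in range(2)]
print(f"test MSE: {baseline:.4f}")
print("permutation importance per statistic:", [round(v, 4) for v in importances])
```

In this toy setting, shuffling the informative statistic should degrade the predictions far more than shuffling the unrelated one, which is the kind of signal permutation feature importance is meant to expose. SHAP, also mentioned in the abstract, assigns contributions per prediction rather than per feature overall and is not shown here.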