On optimal selection of summary statistics for approximate Bayesian computation
How best to summarize large and complex datasets is a problem that arises in many areas of science. We approach it from the point of view of seeking data summaries that minimize the average squared error of the posterior distribution for a parameter of interest under approximate Bayesian computation...
| Published in: | Statistical applications in genetics and molecular biology, Volume 9; p. Article34 |
|---|---|
| Main authors: | Nunes, Matthew A; Balding, David J |
| Format: | Journal Article |
| Language: | English |
| Published: | Germany, 01.01.2010 |
| ISSN: | 1544-6115 |
| Abstract | How best to summarize large and complex datasets is a problem that arises in many areas of science. We approach it from the point of view of seeking data summaries that minimize the average squared error of the posterior distribution for a parameter of interest under approximate Bayesian computation (ABC). In ABC, simulation under the model replaces computation of the likelihood, which is convenient for many complex models. Simulated and observed datasets are usually compared using summary statistics, typically in practice chosen on the basis of the investigator's intuition and established practice in the field. We propose two algorithms for automated choice of efficient data summaries. Firstly, we motivate minimisation of the estimated entropy of the posterior approximation as a heuristic for the selection of summary statistics. Secondly, we propose a two-stage procedure: the minimum-entropy algorithm is used to identify simulated datasets close to that observed, and these are each successively regarded as observed datasets for which the mean root integrated squared error of the ABC posterior approximation is minimized over sets of summary statistics. In a simulation study, we both singly and jointly inferred the scaled mutation and recombination parameters from a population sample of DNA sequences. The computationally-fast minimum entropy algorithm showed a modest improvement over existing methods while our two-stage procedure showed substantial and highly-significant further improvement for both univariate and bivariate inferences. We found that the optimal set of summary statistics was highly dataset specific, suggesting that more generally there may be no globally-optimal choice, which argues for a new selection for each dataset even if the model and target of inference are unchanged. |
|---|---|
| Author | Nunes, Matthew A; Balding, David J |
| Author_xml | – sequence: 1 givenname: Matthew A surname: Nunes fullname: Nunes, Matthew A email: m.nunes@lancs.ac.uk organization: Lancaster University. m.nunes@lancs.ac.uk – sequence: 2 givenname: David J surname: Balding fullname: Balding, David J |
| BackLink | https://www.ncbi.nlm.nih.gov/pubmed/20887273$$D View this record in MEDLINE/PubMed |
| CitedBy_id | crossref_primary_10_1371_journal_pone_0099581 crossref_primary_10_3390_e23080961 crossref_primary_10_1038_nprot_2014_025 crossref_primary_10_1246_bcsj_20180027 crossref_primary_10_1016_j_gde_2018_06_016 crossref_primary_10_1080_10618600_2021_1981341 crossref_primary_10_1038_s41540_017_0010_7 crossref_primary_10_1109_JBHI_2025_3546844 crossref_primary_10_12688_wellcomeopenres_15048_1 crossref_primary_10_3390_stats4030045 crossref_primary_10_12688_wellcomeopenres_15048_2 crossref_primary_10_3390_e27070683 crossref_primary_10_1073_pnas_1102900108 crossref_primary_10_1093_molbev_mst140 crossref_primary_10_1214_17_STS618 crossref_primary_10_1016_j_jtbi_2023_111467 crossref_primary_10_1002_wics_1486 crossref_primary_10_1002_jcc_23504 crossref_primary_10_1016_j_spl_2015_08_003 crossref_primary_10_1016_j_copbio_2013_03_012 crossref_primary_10_1093_molbev_msu277 crossref_primary_10_1007_s41109_025_00694_y crossref_primary_10_1016_j_csda_2015_05_005 crossref_primary_10_1093_sysbio_syw077 crossref_primary_10_1146_annurev_statistics_030718_105212 crossref_primary_10_1007_s00180_025_01607_4 crossref_primary_10_1016_j_tcs_2021_09_039 crossref_primary_10_1111_j_1365_2966_2012_21371_x crossref_primary_10_1214_12_STS406 crossref_primary_10_1111_j_1365_294X_2011_05322_x crossref_primary_10_1039_c2ib00175f crossref_primary_10_1371_journal_pcbi_1004845 crossref_primary_10_7717_peerj_5198 crossref_primary_10_1073_pnas_1400425111 crossref_primary_10_1371_journal_pcbi_1010683 crossref_primary_10_1146_annurev_ecolsys_110617_062431 crossref_primary_10_1016_j_csda_2016_07_005 crossref_primary_10_1007_s10845_024_02343_0 crossref_primary_10_1155_2013_210646 crossref_primary_10_1371_journal_pgen_1003905 crossref_primary_10_1111_rssc_12042 crossref_primary_10_1534_genetics_112_143164 crossref_primary_10_1093_comnet_cnae017 crossref_primary_10_1093_tse_tdab011 crossref_primary_10_1002_ece3_698 crossref_primary_10_1111_ecog_04824 crossref_primary_10_1515_sagmb_2013_0010 
crossref_primary_10_1016_j_epidem_2019_100368 crossref_primary_10_1007_s11222_019_09905_w crossref_primary_10_1111_2041_210X_12489 crossref_primary_10_1016_j_cels_2016_12_002 crossref_primary_10_1016_j_epidem_2023_100665 crossref_primary_10_1371_journal_pcbi_1002803 crossref_primary_10_1128_spectrum_01449_24 crossref_primary_10_1007_s00158_022_03185_1 crossref_primary_10_1007_s42113_023_00180_7 crossref_primary_10_1111_ahg_12606 crossref_primary_10_1111_jeb_12280 crossref_primary_10_1038_hdy_2015_38 crossref_primary_10_1098_rsos_180384 crossref_primary_10_1016_j_ecolmodel_2022_110251 crossref_primary_10_1016_j_ijsolstr_2018_10_020 crossref_primary_10_1214_15_STS534 crossref_primary_10_1007_s11222_012_9335_7 crossref_primary_10_1093_biolinnean_blad115 crossref_primary_10_1111_mec_12729 crossref_primary_10_1080_02664763_2015_1134447 crossref_primary_10_1080_10485252_2023_2292690 crossref_primary_10_1111_biom_12081 crossref_primary_10_1038_hdy_2013_52 crossref_primary_10_1080_10618600_2024_2379349 crossref_primary_10_1515_sagmb_2012_0014 crossref_primary_10_1371_journal_pone_0172516 crossref_primary_10_1007_s11222_018_9817_3 crossref_primary_10_1080_03610926_2017_1348523 crossref_primary_10_1371_journal_pone_0018155 |
| ContentType | Journal Article |
| DBID | CGR CUY CVF ECM EIF NPM 7X8 |
| DOI | 10.2202/1544-6115.1576 |
| DatabaseName | Medline MEDLINE MEDLINE (Ovid) MEDLINE MEDLINE PubMed MEDLINE - Academic |
| DatabaseTitle | MEDLINE Medline Complete MEDLINE with Full Text PubMed MEDLINE (Ovid) MEDLINE - Academic |
| DatabaseTitleList | MEDLINE MEDLINE - Academic |
| Database_xml | – sequence: 1 dbid: NPM name: PubMed url: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed sourceTypes: Index Database – sequence: 2 dbid: 7X8 name: MEDLINE - Academic url: https://search.proquest.com/medline sourceTypes: Aggregation Database |
| DeliveryMethod | no_fulltext_linktorsrc |
| Discipline | Biology |
| EISSN | 1544-6115 |
| ExternalDocumentID | 20887273 |
| Genre | Journal Article |
| ID | FETCH-LOGICAL-c463t-bc1663a2d54dcdd13b7b9d15706c3e6f7f3b1db41c42d2935a065190045d4d9b2 |
| IEDL.DBID | 7X8 |
| ISICitedReferencesCount | 113 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000283287600002&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| ISSN | 1544-6115 |
| IngestDate | Fri Sep 05 13:51:01 EDT 2025 Mon Jul 21 06:03:08 EDT 2025 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Language | English |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-c463t-bc1663a2d54dcdd13b7b9d15706c3e6f7f3b1db41c42d2935a065190045d4d9b2 |
| Notes | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 23 |
| PMID | 20887273 |
| PQID | 756664346 |
| PQPubID | 23479 |
| ParticipantIDs | proquest_miscellaneous_756664346 pubmed_primary_20887273 |
| PublicationCentury | 2000 |
| PublicationDate | 2010-01-01 |
| PublicationDateYYYYMMDD | 2010-01-01 |
| PublicationDate_xml | – month: 01 year: 2010 text: 2010-01-01 day: 01 |
| PublicationDecade | 2010 |
| PublicationPlace | Germany |
| PublicationPlace_xml | – name: Germany |
| PublicationTitle | Statistical applications in genetics and molecular biology |
| PublicationTitleAlternate | Stat Appl Genet Mol Biol |
| PublicationYear | 2010 |
| SSID | ssj0021230 |
| Score | 2.226092 |
| SourceID | proquest pubmed |
| SourceType | Aggregation Database Index Database |
| StartPage | Article34 |
| SubjectTerms | Bayes Theorem; Computational Biology - methods; Computer Simulation; Databases, Genetic; Haplotypes - genetics; Models, Genetic |
| Title | On optimal selection of summary statistics for approximate Bayesian computation |
| URI | https://www.ncbi.nlm.nih.gov/pubmed/20887273 https://www.proquest.com/docview/756664346 |
| Volume | 9 |
| WOSCitedRecordID | wos000283287600002&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| hasFullText | |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| link | http://cvtisr.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwpZ07T8MwEMdPQEFi4f0oL3lgNY3jR5oJAaJioe0AUrcofm0khRZEvz1nJ-2GGFgyRI4Une3zzz7f_QGu-0wlZaIl9eiAqVDaUc2MopnJeUhFDxASxSay4bA_meTj9m7OrL1WufSJ0VHb2oQz8l6G3IGrp1C303caRKNCcLVV0FiHDkeSCTe6sskqiBCccsyHlELgDonJpmZjitv93urdDZOZ-p0u4yoz2P3n_-3BTouX5K4ZD_uw5qoD2GoEJxeHMBpVpEYn8YZtZlEBB7uF1J60OWwk5Bc1pZsJ0iyJFce_sfnckfty4ULGJTFRCCL26BG8Dh5fHp5oK6lAjVB8TrVhiBhlaqWwxlrGdaZzizZIlOFO-cxzzawWzIjUIgnIEhEFmQHBzwqb6_QYNqq6cqdAws4KW_q05FHJKpd5qb3zCdMSwcB0gSwNVeCQDXGIsnL156xYmaoLJ42xi2lTWqNIg9NDojr7--Nz2G4i-eE45AI6Hqeru4RN84V2-riKQwGfw_HzD4vrvoc |
| linkProvider | ProQuest |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=On+optimal+selection+of+summary+statistics+for+approximate+Bayesian+computation&rft.jtitle=Statistical+applications+in+genetics+and+molecular+biology&rft.au=Nunes%2C+Matthew+A&rft.au=Balding%2C+David+J&rft.date=2010-01-01&rft.issn=1544-6115&rft.eissn=1544-6115&rft.volume=9&rft.spage=Article34&rft_id=info:doi/10.2202%2F1544-6115.1576&rft.externalDBID=NO_FULL_TEXT |
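
The abstract describes ABC as replacing the likelihood with simulation and comparing datasets through summary statistics, and it proposes choosing the subset of statistics that minimizes the estimated entropy of the posterior approximation. A minimal sketch of that first idea follows. Everything concrete in it is an assumption made for illustration: the toy model (a normal mean rather than the mutation and recombination parameters of the paper), the candidate summaries, the uniform prior, the acceptance fraction, and the histogram entropy estimate, which is swapped in here because the record does not say which estimator the authors use. The second, two-stage procedure is not sketched.

```python
import itertools
import math
import random
import statistics


def simulate(theta, n, rng):
    """Toy model (assumption): n i.i.d. draws from Normal(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


# Hypothetical candidate summaries; "range" carries no information about theta here.
SUMMARIES = {
    "mean": statistics.fmean,
    "median": statistics.median,
    "range": lambda x: max(x) - min(x),
}


def summarize(data, names):
    return [SUMMARIES[name](data) for name in names]


def abc_rejection(observed, names, prior_draws, sims, keep_frac=0.05):
    """Keep the prior draws whose simulated summaries lie closest to the observed ones."""
    s_obs = summarize(observed, names)
    sim_summ = [summarize(d, names) for d in sims]
    # Scale each summary by its spread across simulations so no single one dominates.
    scales = [statistics.pstdev(col) or 1.0 for col in zip(*sim_summ)]
    dists = [
        math.sqrt(sum(((s - o) / sc) ** 2 for s, o, sc in zip(row, s_obs, scales)))
        for row in sim_summ
    ]
    n_keep = max(1, int(keep_frac * len(prior_draws)))
    order = sorted(range(len(dists)), key=dists.__getitem__)
    return [prior_draws[i] for i in order[:n_keep]]


def histogram_entropy(sample, lo, hi, bins=20):
    """Plug-in differential entropy estimate from a fixed-width histogram on [lo, hi]."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in sample:
        idx = min(bins - 1, max(0, int((x - lo) / width)))
        counts[idx] += 1
    n = len(sample)
    return -sum((c / n) * math.log(c / (n * width)) for c in counts if c)


def min_entropy_selection(observed, prior_draws, sims, lo, hi):
    """Try every non-empty subset of summaries; return the lowest-entropy result."""
    best = None
    for r in range(1, len(SUMMARIES) + 1):
        for names in itertools.combinations(SUMMARIES, r):
            post = abc_rejection(observed, names, prior_draws, sims)
            h = histogram_entropy(post, lo, hi)
            if best is None or h < best[0]:
                best = (h, names, post)
    return best


def make_toy_inputs(seed=1, n_sims=2000, n_obs=30):
    """Draw an 'observed' dataset at theta = 2 plus prior draws from Uniform(-5, 5)."""
    rng = random.Random(seed)
    observed = simulate(2.0, n_obs, rng)
    prior_draws = [rng.uniform(-5.0, 5.0) for _ in range(n_sims)]
    sims = [simulate(t, n_obs, rng) for t in prior_draws]
    return observed, prior_draws, sims


if __name__ == "__main__":
    observed, prior_draws, sims = make_toy_inputs()
    h, names, post = min_entropy_selection(observed, prior_draws, sims, -5.0, 5.0)
    print("selected summaries:", names)
    print("entropy estimate:", round(h, 3), "posterior mean:", round(statistics.fmean(post), 3))
```

The same set of simulations is reused for every candidate subset, so the search costs only repeated distance calculations rather than fresh simulation. An exhaustive search over subsets is feasible only for a handful of candidate statistics; with more candidates some restricted search would be needed.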