On optimal selection of summary statistics for approximate Bayesian computation

How best to summarize large and complex datasets is a problem that arises in many areas of science. We approach it from the point of view of seeking data summaries that minimize the average squared error of the posterior distribution for a parameter of interest under approximate Bayesian computation...

Published in: Statistical applications in genetics and molecular biology, Volume 9; p. Article34
Main authors: Nunes, Matthew A; Balding, David J
Format: Journal Article
Language: English
Published: Germany 01.01.2010
ISSN: 1544-6115
Abstract How best to summarize large and complex datasets is a problem that arises in many areas of science. We approach it from the point of view of seeking data summaries that minimize the average squared error of the posterior distribution for a parameter of interest under approximate Bayesian computation (ABC). In ABC, simulation under the model replaces computation of the likelihood, which is convenient for many complex models. Simulated and observed datasets are usually compared using summary statistics, typically in practice chosen on the basis of the investigator's intuition and established practice in the field. We propose two algorithms for automated choice of efficient data summaries. Firstly, we motivate minimisation of the estimated entropy of the posterior approximation as a heuristic for the selection of summary statistics. Secondly, we propose a two-stage procedure: the minimum-entropy algorithm is used to identify simulated datasets close to that observed, and these are each successively regarded as observed datasets for which the mean root integrated squared error of the ABC posterior approximation is minimized over sets of summary statistics. In a simulation study, we both singly and jointly inferred the scaled mutation and recombination parameters from a population sample of DNA sequences. The computationally-fast minimum entropy algorithm showed a modest improvement over existing methods while our two-stage procedure showed substantial and highly-significant further improvement for both univariate and bivariate inferences. We found that the optimal set of summary statistics was highly dataset specific, suggesting that more generally there may be no globally-optimal choice, which argues for a new selection for each dataset even if the model and target of inference are unchanged.
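The minimum-entropy heuristic described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the toy Normal model, the candidate statistics, the 2% acceptance rate, the helper names (`simulate`, `abc_reject`, `knn_entropy`, `min_entropy_subset`) and the nearest-neighbour entropy estimator with k = 4 are all assumptions made for this illustration. It runs rejection ABC once per subset of candidate summary statistics and keeps the subset whose posterior sample has the lowest estimated entropy. The two-stage procedure is not shown.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model (not the paper's genetic model): n draws from Normal(theta, 1).
def simulate(theta, n=50):
    return rng.normal(theta, 1.0, size=n)

# Hypothetical candidate summaries; some are informative about theta, some are not.
CANDIDATES = {"mean": np.mean, "median": np.median, "var": np.var, "range": np.ptp}

def summaries(x):
    return np.array([f(x) for f in CANDIDATES.values()])

def knn_entropy(sample, k=4):
    """k-th nearest-neighbour entropy estimate for a 1-D sample (assumed estimator)."""
    n = len(sample)
    dist = np.abs(sample[:, None] - sample[None, :])
    eps = np.sort(dist, axis=1)[:, k]  # column 0 is the zero self-distance
    # Digamma at a positive integer m: -gamma + sum_{j=1}^{m-1} 1/j
    digamma = lambda m: -np.euler_gamma + np.sum(1.0 / np.arange(1, m))
    # log(2) is the log-volume of the 1-D unit ball [-1, 1]
    return digamma(n) - digamma(k) + np.log(2.0) + np.mean(np.log(eps))

def abc_reject(cols, thetas, sim_stats, obs_stats, keep=0.02):
    """Rejection ABC: keep the prior draws whose chosen summaries are closest to the observed ones."""
    scale = sim_stats[:, cols].std(axis=0)  # put summaries on a common scale
    d = np.linalg.norm((sim_stats[:, cols] - obs_stats[cols]) / scale, axis=1)
    n_keep = int(keep * len(thetas))
    return thetas[np.argsort(d)[:n_keep]]

def min_entropy_subset(thetas, sim_stats, obs_stats):
    """Exhaustive search over non-empty subsets for the lowest-entropy ABC posterior sample."""
    best = None
    for r in range(1, len(CANDIDATES) + 1):
        for cols in itertools.combinations(range(len(CANDIDATES)), r):
            post = abc_reject(list(cols), thetas, sim_stats, obs_stats)
            h = knn_entropy(post)
            if best is None or h < best[0]:
                best = (h, cols, post)
    return best

observed = simulate(1.5)                               # stand-in for the observed dataset
thetas = rng.uniform(-5.0, 5.0, size=10000)            # draws from a Uniform(-5, 5) prior
sim_stats = np.array([summaries(simulate(t)) for t in thetas])
entropy, cols, posterior = min_entropy_subset(thetas, sim_stats, summaries(observed))
names = [list(CANDIDATES)[c] for c in cols]
print(names, round(float(entropy), 3), round(float(posterior.mean()), 3))
```

Exhaustive search is only feasible here because there are four candidate statistics. With a larger pool the number of subsets grows exponentially, so some restriction of the search would be needed.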
Author Nunes, Matthew A
Balding, David J
Author_xml – sequence: 1
  givenname: Matthew A
  surname: Nunes
  fullname: Nunes, Matthew A
  email: m.nunes@lancs.ac.uk
  organization: Lancaster University
– sequence: 2
  givenname: David J
  surname: Balding
  fullname: Balding, David J
BackLink https://www.ncbi.nlm.nih.gov/pubmed/20887273$$D View this record in MEDLINE/PubMed
ContentType Journal Article
DOI 10.2202/1544-6115.1576
DatabaseName Medline
MEDLINE
MEDLINE (Ovid)
MEDLINE
MEDLINE
PubMed
MEDLINE - Academic
DatabaseTitle MEDLINE
Medline Complete
MEDLINE with Full Text
PubMed
MEDLINE (Ovid)
MEDLINE - Academic
DatabaseTitleList MEDLINE
MEDLINE - Academic
Database_xml – sequence: 1
  dbid: NPM
  name: PubMed
  url: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
  sourceTypes: Index Database
– sequence: 2
  dbid: 7X8
  name: MEDLINE - Academic
  url: https://search.proquest.com/medline
  sourceTypes: Aggregation Database
DeliveryMethod no_fulltext_linktorsrc
Discipline Biology
EISSN 1544-6115
ExternalDocumentID 20887273
Genre Journal Article
IEDL.DBID 7X8
ISICitedReferencesCount 113
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000283287600002&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
ISSN 1544-6115
IngestDate Fri Sep 05 13:51:01 EDT 2025
Mon Jul 21 06:03:08 EDT 2025
IsPeerReviewed true
IsScholarly true
Language English
LinkModel DirectLink
Notes ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
PMID 20887273
PQID 756664346
PQPubID 23479
ParticipantIDs proquest_miscellaneous_756664346
pubmed_primary_20887273
PublicationCentury 2000
PublicationDate 2010-01-01
PublicationDateYYYYMMDD 2010-01-01
PublicationDate_xml – month: 01
  year: 2010
  text: 2010-01-01
  day: 01
PublicationDecade 2010
PublicationPlace Germany
PublicationPlace_xml – name: Germany
PublicationTitle Statistical applications in genetics and molecular biology
PublicationTitleAlternate Stat Appl Genet Mol Biol
PublicationYear 2010
SSID ssj0021230
Score 2.226092
SourceID proquest
pubmed
SourceType Aggregation Database
Index Database
StartPage Article34
SubjectTerms Bayes Theorem
Computational Biology - methods
Computer Simulation
Databases, Genetic
Haplotypes - genetics
Models, Genetic
Title On optimal selection of summary statistics for approximate Bayesian computation
URI https://www.ncbi.nlm.nih.gov/pubmed/20887273
https://www.proquest.com/docview/756664346
Volume 9
WOSCitedRecordID wos000283287600002&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
hasFullText
inHoldings 1
isFullTextHit
isPrint
linkProvider ProQuest
openUrl ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=On+optimal+selection+of+summary+statistics+for+approximate+Bayesian+computation&rft.jtitle=Statistical+applications+in+genetics+and+molecular+biology&rft.au=Nunes%2C+Matthew+A&rft.au=Balding%2C+David+J&rft.date=2010-01-01&rft.eissn=1544-6115&rft.volume=9&rft.spage=Article34&rft_id=info:doi/10.2202%2F1544-6115.1576&rft_id=info%3Apmid%2F20887273&rft_id=info%3Apmid%2F20887273&rft.externalDocID=20887273