Fast and Memory-Efficient Dynamic Programming Approach for Large-Scale EHH-Based Selection Scans

Bibliographic details
Published in:	Molecular Biology and Evolution, Volume 42, Issue 11
Main authors:	Rahman, Amatur; Smith, T Quinn; Szpiech, Zachary A
Format:	Journal Article
Language:	English
Publication details:	UK: Oxford University Press, 1 November 2025
Subjects:	Algorithms; Computation; Computer Simulation; Dynamic Programming; Genomics; Genotypes; Haplotypes; Homozygosity; Homozygote; Humans; Machine learning; Models, Genetic; Positive selection; Run time (computers); Selection, Genetic; Software; Source code; Statistics; population genetics; haplotypes; positive selection; software
ISSN:	0737-4038, 1537-1719

Description
Summary:	Haplotype-based statistics are widely used to find genomic regions under positive selection. At the heart of many such statistics is the computation of extended haplotype homozygosity (EHH), which captures the decay of homozygosity away from a focal site. This computation, repeated for potentially millions of sites, is demanding because it involves tracking counts of unique haplotypes iteratively over long genomic distances and across many individuals. As a result, existing tools do not scale well to large population datasets such as the 1000 Genomes Project or the UK Biobank with 500,000 individuals. Optimizing the computation becomes crucial as datasets grow, especially when handling large sample sizes or generating training data for machine learning algorithms. Here, we propose a dynamic programming algorithm that substantially improves runtime and memory usage over existing tools on both real and simulated data. On real phased data, we achieve a 5–50x speedup with a minimal memory footprint. Our simulations show an even more pronounced performance gap with large populations (up to 15x speedup and 46x memory reduction). EHH-based statistics designed for unphased genotypes run an order of magnitude faster, and multi-parameter support results in a 20x runtime improvement. Source code and binaries are available at https://github.com/szpiech/selscan as selscan v2.1.
DOI:	10.1093/molbev/msaf275
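
Illustration (not part of the catalog record): the summary describes EHH as the decay of haplotype homozygosity away from a focal site, computed by tracking counts of distinct haplotypes as the window extends. The sketch below shows only the standard definition, in which EHH is the fraction of carrier pairs that remain identical over the extended region. It is a naive per-site baseline, not the dynamic programming algorithm of the paper and not selscan code; the function name `ehh_decay` and its parameters are hypothetical, and haplotypes are assumed to be equal-length strings of allele characters.

```python
from collections import Counter
from math import comb


def ehh_decay(haplotypes, focal, core_allele, direction=1):
    """Naive EHH away from `focal` (hypothetical helper, for illustration).

    haplotypes  : list of equal-length strings, one character per site
    focal       : index of the focal site
    core_allele : allele at the focal site that defines the carriers
    direction   : +1 to extend right, -1 to extend left

    Returns one EHH value per site, starting at the focal site itself.
    """
    carriers = [h for h in haplotypes if h[focal] == core_allele]
    n = len(carriers)
    if n < 2:
        return []  # EHH is undefined with fewer than two carriers

    n_sites = len(carriers[0])
    groups = [0] * n  # every carrier starts in one shared haplotype class
    total_pairs = comb(n, 2)
    ehh = []

    site = focal
    while 0 <= site < n_sites:
        # Split each current class by the allele seen at this site.
        relabel = {}
        groups = [
            relabel.setdefault((g, h[site]), len(relabel))
            for g, h in zip(groups, carriers)
        ]
        # Count the pairs of carriers that are still identical.
        counts = Counter(groups)
        same_pairs = sum(comb(c, 2) for c in counts.values())
        ehh.append(same_pairs / total_pairs)
        site += direction
    return ehh


haps = ["0100", "0101", "0110", "1100"]
right = ehh_decay(haps, focal=1, core_allele="1", direction=1)
left = ehh_decay(haps, focal=1, core_allele="1", direction=-1)
print(right, left)
```

In this toy input all four haplotypes carry the core allele, so EHH starts at 1 at the focal site and drops as differing alleles split the carriers into smaller classes. Repeating this scan for every focal site and every sample is the cost that the summary identifies as the scaling bottleneck.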