Mixture modeling of next generation sequencing data and its applications to genotyping and estimating genotype frequencies

Bibliographic details
Main author: Lihm, Jayon
Format: Dissertation
Language: English
Published: ProQuest Dissertations & Theses 01.01.2013
ISBN:9781303807114, 1303807114
Description
Summary: Estimating the probability that an individual has a nucleotide different from the reference nucleotide at a base pair position is important in next generation sequencing (NGS) research. I present a method for modeling the frequency of single nucleotide polymorphism variants in the exome capture sequence data of an individual. A mixture distribution was used to model the proportion of alternative alleles at a specified base pair position, assuming a biallelic single nucleotide polymorphism model. I measured the proportion of alternative alleles for positions in chromosome 1 exome sequencing data from two trios taken from the Pilot 3 data in the 1000 Genomes Project. The measurements were based on the counts of reference and alternative alleles calculated by the SAMtools software. The mixture model studied here had two point distributions and five continuous distributions. I applied the expectation-maximization algorithm to obtain the maximum likelihood estimates of the mixture model parameters for each individual. The fitted mixture model described the properties of the distribution of the alternative allele proportions well. The estimates of the mixing proportions were used to estimate the genotype frequencies in the data. Each individual had different estimates of the model parameters, but the estimates of the genotype fractions of the six individuals were similar. The estimated fractions of the members of each trio were similar to each other. I next combined the two approaches of clustering and mixture modeling to genotype the exomic base pair positions of an individual using next generation sequencing data. The alternative allele proportion at a position was used to compute the Bayesian posterior probability of a single nucleotide polymorphism at that position. I developed a software package named "SNVclust" to generate the alternative allele proportions and genotypes of an individual.
This software was used to make a call set of single nucleotide polymorphism positions and genotypes for each of the three members of a trio from the 1000 Genomes Project. The results from this software were compared with the single nucleotide polymorphisms released by the 1000 Genomes Project and with results from two other programs. I found that an average coverage greater than 43 is needed to use SNVclust for whole exome sequencing data.
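The general idea of fitting mixing proportions by expectation-maximization and then reading genotype posteriors from them can be illustrated with a much simpler model than the one summarized above. The sketch below is not SNVclust and not the dissertation's seven-component mixture: it assumes a three-component binomial mixture (homozygous reference, heterozygous, homozygous alternative) with fixed, purely illustrative alternative-allele probabilities, and it estimates only the mixing weights. All names (`ALT_PROB`, `posteriors`, `fit_weights`) and all numeric values are hypothetical.

```python
import math

# Assumed per-genotype probability that a read shows the alternative allele.
# These values are illustrative only and are not taken from the dissertation.
ALT_PROB = {"hom_ref": 0.01, "het": 0.5, "hom_alt": 0.99}


def log_kernel(alt, depth, p):
    """Binomial log-likelihood without the binomial coefficient,
    which is identical across genotypes and cancels in the posterior."""
    return alt * math.log(p) + (depth - alt) * math.log(1.0 - p)


def posteriors(alt, depth, weights):
    """Posterior genotype probabilities at one position, given mixing weights."""
    logs = {
        g: math.log(max(weights[g], 1e-300)) + log_kernel(alt, depth, p)
        for g, p in ALT_PROB.items()
    }
    top = max(logs.values())  # subtract the maximum for numerical stability
    unnorm = {g: math.exp(v - top) for g, v in logs.items()}
    total = sum(unnorm.values())
    return {g: v / total for g, v in unnorm.items()}


def fit_weights(sites, n_iter=50):
    """EM over the mixing weights only; component probabilities stay fixed.

    `sites` is a list of (alt_count, depth) pairs, one per position.
    """
    weights = {g: 1.0 / len(ALT_PROB) for g in ALT_PROB}
    for _ in range(n_iter):
        # E-step: responsibility of each genotype for each position.
        resp = [posteriors(alt, depth, weights) for alt, depth in sites]
        # M-step: each weight becomes the mean responsibility.
        weights = {g: sum(r[g] for r in resp) / len(sites) for g in ALT_PROB}
    return weights


if __name__ == "__main__":
    # Toy counts, invented for illustration: (alternative reads, total reads).
    toy_sites = [(0, 40)] * 8 + [(20, 40)] + [(40, 40)]
    w = fit_weights(toy_sites)
    print(w)
    print(posteriors(20, 40, w))
```

In this simplified setting the fitted weights play the role of estimated genotype fractions, and `posteriors` gives a per-position genotype call. A model closer to the one described in the summary would also need continuous components and estimated component parameters, which this sketch deliberately leaves out.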