Department of Statistics Seminar
North Carolina State University

presents

Dipak K. Dey

Department of Statistics, University of Connecticut

Model Selection and Diagnostics to Identify Genetic Markers for Single-nucleotide Polymorphisms

Abstract

The distribution of genetic variation among populations is conveniently measured by Wright's F_ST, which is a scaled variance taking on values in [0,1]. For certain types of genetic markers, and for single-nucleotide polymorphisms (SNPs) in particular, it is reasonable to presume that the genotypes at most loci detected by those markers are selectively neutral. For such loci, the distribution of genetic variation among populations is determined by the size of local populations, the pattern and rate of migration among those populations, and the rate of mutation. Because the demographic parameters (population size and migration rates) are common across all loci, locus-specific estimates of F_ST will depart from a common mean only for loci with unusually high or low rates of mutation and for loci that are closely associated with genomic regions having a substantial effect on fitness. Thus, loci showing significantly more variation than background are likely to mark genomic regions subject to diversifying selection among the sample populations, while those showing significantly less variation than background are likely to mark genomic regions subject to stabilizing selection across the sample populations. We propose several Bayesian hierarchical models to estimate locus-specific effects on F_ST, and we apply these models to SNP data from the HapMap project. Because loci that are physically associated with one another are likely to show similar patterns of variation, we introduce conditional autoregressive models to incorporate the local correlation among loci. We estimate the posterior distributions of the model parameters using Markov chain Monte Carlo (MCMC) simulations. Model comparison using several criteria, including DIC and LPML, reveals that a model with locus- and population-specific effects is superior to other models for the data used in the analysis.
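
For readers unfamiliar with the "scaled variance" mentioned above, the following minimal sketch computes the textbook form of Wright's F_ST for a single biallelic locus: the variance of allele frequencies across populations divided by its maximum possible value, p(1 - p), where p is the mean frequency. This is only the classical definition, not the Bayesian hierarchical estimator discussed in the talk; the function name `wright_fst` and the toy frequencies are illustrative assumptions.

```python
def wright_fst(allele_freqs):
    """Textbook F_ST for one biallelic locus.

    allele_freqs: per-population frequencies of one allele (hypothetical input).
    Returns Var(p) / (p_bar * (1 - p_bar)), which lies in [0, 1].
    """
    n = len(allele_freqs)
    p_bar = sum(allele_freqs) / n
    if not 0.0 < p_bar < 1.0:
        # The locus is monomorphic overall, so the ratio is undefined.
        raise ValueError("F_ST is undefined when the mean allele frequency is 0 or 1")
    # Population (not sample) variance of allele frequencies across populations.
    var_p = sum((p - p_bar) ** 2 for p in allele_freqs) / n
    return var_p / (p_bar * (1.0 - p_bar))


# Toy data (made up for illustration): three loci observed in four populations.
toy_loci = {
    "locus_a": [0.50, 0.52, 0.48, 0.50],  # little differentiation
    "locus_b": [0.10, 0.90, 0.20, 0.80],  # strong differentiation
    "locus_c": [0.25, 0.25, 0.25, 0.25],  # identical frequencies everywhere
}
for name, freqs in toy_loci.items():
    print(name, round(wright_fst(freqs), 4))
```

Identical frequencies in every population give 0, and populations fixed for alternative alleles give 1, which matches the [0,1] range stated in the abstract.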
To detect loci for which locus-specific effects are not well explained by the common F_ST, we propose an approach that uses the Kullback-Leibler divergence (KLD) to measure the divergence between the posterior distributions of locus-specific effects and the common F_ST. With this method, we identify 15 SNP loci that have unusually large values of F_ST. By comparing the map positions of these SNP loci with known gene locations, we find that 10 of the 15 are located within or near identified genes.
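
The abstract does not say how the KLD between two posterior distributions is evaluated. As one possible illustration, the sketch below approximates each posterior by a normal distribution fitted to its MCMC draws and uses the closed-form KLD between two univariate normals. The Gaussian approximation, the function name `gaussian_kld`, and the simulated draws are all assumptions made for this example and are not taken from the talk.

```python
import math
import random
from statistics import fmean, variance


def gaussian_kld(samples_p, samples_q):
    """KL(P || Q) with P and Q replaced by normals fitted to the draws.

    Closed form for univariate normals:
    0.5 * (log(v_q / v_p) + (v_p + (m_p - m_q)**2) / v_q - 1)
    """
    m_p, m_q = fmean(samples_p), fmean(samples_q)
    v_p, v_q = variance(samples_p), variance(samples_q)
    return 0.5 * (math.log(v_q / v_p) + (v_p + (m_p - m_q) ** 2) / v_q - 1.0)


# Simulated stand-ins for MCMC output (hypothetical values, not HapMap results).
rng = random.Random(0)
common = [rng.gauss(0.10, 0.02) for _ in range(5000)]         # common F_ST draws
typical_locus = [rng.gauss(0.11, 0.02) for _ in range(5000)]  # close to common
outlier_locus = [rng.gauss(0.30, 0.03) for _ in range(5000)]  # far from common

print("typical locus KLD:", gaussian_kld(typical_locus, common))
print("outlier locus KLD:", gaussian_kld(outlier_locus, common))
```

Under these assumptions, a locus whose posterior sits far from the common F_ST yields a much larger divergence than a typical locus, which is the kind of signal the abstract describes using to flag candidate loci.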

Friday, November 2, 2007
3:35 - 4:35 pm
301 Riddick Hall
Refreshments will be served in the common area of 301 Riddick at 3:00 pm.