Sparse Segment Identification with Genomic Application

Jessie Jeng
Department of Statistics
North Carolina State University

4:15-5:15 pm
Monday, November 5, 2012
2225 SAS Hall, NCSU Campus

In current high-throughput genomic data analysis, genetic signals are often very sparse and assume certain structures arising from biological phenomena. An important example is DNA copy number variation (CNV), which plays a significant role in population diversity and complex diseases. Motivated by CNV analysis based on high-density single nucleotide polymorphism (SNP) data, we consider the problem of identifying sparse segments in a long sequence of observations with Gaussian noise, where the number, length, and location of the segments are unknown.

We study fundamental properties of segment identification by characterizing the identifiable and unidentifiable regions. Only in the identifiable region is it possible to consistently separate the segments from noise. We develop an efficient likelihood ratio selection (LRS) procedure and show that it is asymptotically optimal, in the sense that it can separate the segments from noise whenever they lie in the identifiable region. The proposed method is demonstrated with simulations and an analysis of a family trio dataset. The results show that the LRS procedure can detect the true segments with greater power than some standard signal identification methods.

