In many genetic epidemiological studies, a common goal is to estimate the genotype-specific distributions (penetrance functions) when the observed data arise from a mixture of scientifically meaningful subpopulations. A feature of such data is that only the probability that each observation belongs to a given genotype population is available, not the actual population membership. Although the mixing probabilities can be obtained, inference is complicated by unobserved genotypes and sometimes by random right-censoring. Using semiparametric theory, we characterize the complete class of consistent estimators in these problems, which includes members such as one type of nonparametric maximum likelihood estimator (NPMLE) and least squares or weighted least squares estimators. We identify the efficient estimator in the class, which reaches the semiparametric efficiency bound, and implement it using a simple procedure that remains consistent even if several components of the estimator are mis-specified. In addition, our close inspection of two commonly used NPMLEs in these problems yields a surprising result: the NPMLE in one form is highly inefficient, while in the other form it is inconsistent. We extend the methods to censored mixture samples by inverse probability weighting (IPW), nonparametric imputation, and augmented IPW. Finally, we apply these estimators to the Cooperative Huntington's Observational Research Trial (COHORT) to estimate survival functions for Huntington gene mutation carriers and non-carriers. The estimated survival rates in carriers are useful in genetic counseling: they provide guidance on interpreting the risk of death associated with a positive genetic test and help at-risk subjects make informed decisions about whether to undergo genetic mutation testing.
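To make the least squares member of this class concrete, the following is a minimal sketch for the uncensored case only. It assumes two genotype populations and rests on the identity that, given mixing probabilities q_i, the expectation of the indicator I(S_i <= t) equals q_i1 F_1(t) + q_i2 F_2(t), so regressing the indicators on q_i at each t gives estimates of the genotype-specific distribution functions. The function name `ols_mixture_cdf`, the simulated exponential data, and the chosen mixing probabilities are illustrative assumptions; this is not the efficient estimator described above, and it does not handle censoring.

```python
import numpy as np

def ols_mixture_cdf(times, q, grid):
    """Ordinary least squares estimate of genotype-specific CDFs.

    times : (n,) observed, uncensored event times
    q     : (n, p) mixing probabilities; each row sums to 1
    grid  : (m,) time points at which to estimate each CDF
    Returns a (p, m) array whose row k estimates F_k on the grid.
    Hypothetical helper for illustration only.
    """
    times = np.asarray(times, dtype=float)
    q = np.asarray(q, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # n x m matrix of indicators I(S_i <= t_j)
    indicators = (times[:, None] <= grid[None, :]).astype(float)
    # Solve min ||q F - indicators||^2 for all grid points at once
    coef, *_ = np.linalg.lstsq(q, indicators, rcond=None)
    return coef

# Simulated example (assumed setup): carriers ~ Exp(mean 1), non-carriers ~ Exp(mean 3)
rng = np.random.default_rng(0)
n = 20000
q1 = rng.choice([0.1, 0.5, 0.9], size=n)      # probability of being a carrier
q = np.column_stack([q1, 1.0 - q1])
is_carrier = rng.random(n) < q1               # latent membership, never used in estimation
times = np.where(is_carrier,
                 rng.exponential(1.0, size=n),
                 rng.exponential(3.0, size=n))

grid = np.array([0.5, 1.0, 2.0])
est = ols_mixture_cdf(times, q, grid)
print("estimated carrier CDF:    ", est[0])
print("estimated non-carrier CDF:", est[1])
```

The estimates from such an unconstrained regression need not be monotone or lie in [0, 1] in small samples, and the sketch makes no claim about efficiency; it only illustrates how the mixing probabilities stand in for the unobserved genotypes.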