An Introduction to Problems With Missing Data

Marie Davidian

In practical applications, although the intention may be to collect data according to some carefully planned study design, things do not always work out as hoped. In other settings, resources may dictate that not all information of interest can be gathered on all subjects in a study. To reduce costs, the full set of necessary information may be gathered on only a subset of the subjects by design, leaving it missing for the remainder of the subjects.

In all of these cases, some of the data that ideally would have been collected are missing, either by misfortune or by design. Inference is usually focused on some aspect of the distribution of the "full" data, i.e., the data that would have been available had there been no missingness. For example, it may be desired to estimate the mean response in the entire population of subjects if exposed to a particular treatment; the concern is that, under missingness, the subjects whose data are actually available may not represent a true random sample from this population. More generally, the ability to make accurate inference may be compromised in the presence of missing data.
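The concern that the available subjects may not be a random sample of the population can be illustrated with a small simulation. The sketch below is only an illustration under invented assumptions: the normal response distribution, the missingness probabilities, and the names `simulate`, `observed_a`, and `observed_b` are all hypothetical and are not taken from any particular study. It compares the mean of the "full" data with the mean of the responses that remain observed under two missingness mechanisms.

```python
import random
import statistics


def simulate(n=100_000, seed=0):
    """Return the full-data mean and the observed-data means under two scenarios."""
    rng = random.Random(seed)

    # "Full" data: the responses that would be available with no missingness.
    # A normal distribution with mean 10 and SD 2 is an arbitrary assumption.
    full = [rng.gauss(10.0, 2.0) for _ in range(n)]

    # Scenario A (assumed): each response is missing with probability 0.4,
    # regardless of its value.
    observed_a = [y for y in full if rng.random() >= 0.4]

    # Scenario B (assumed): responses above 10 are missing with probability 0.7,
    # all others with probability 0.1, so missingness depends on the response.
    observed_b = [y for y in full if rng.random() >= (0.7 if y > 10.0 else 0.1)]

    return (
        statistics.fmean(full),
        statistics.fmean(observed_a),
        statistics.fmean(observed_b),
    )


full_mean, mean_a, mean_b = simulate()
print(f"full-data mean:                     {full_mean:.3f}")
print(f"observed mean, value-independent:   {mean_a:.3f}")
print(f"observed mean, value-dependent:     {mean_b:.3f}")
```

Under these assumptions, the observed mean in scenario A should stay close to the full-data mean, while in scenario B it should fall noticeably below it, because subjects with large responses are underrepresented among those observed.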

An extensive literature exists on approaches to taking account of missing data in these situations, and the area remains one of active research. This lecture will provide a basic introduction to the issues, terminology, and notational conventions critical for appreciating the problems associated with missing data.