Handling Missing Data CASE STUDY - Exercise Four
Estimation of bias from imputation
This exercise examines the impact of the response mechanism, the imputation method and the response rate on the point estimator, using the Generalized System for Imputation Simulations (GENESIS v.1.0). To install and run GENESIS v.1.0, you need to be running SAS© Release 8.02 for Windows. A beta version of GENESIS v.1.0 is available to download from this site free of charge for use within this case study. For information on this and later versions of GENESIS contact
Students can address several questions in this section. Using various response mechanisms, response rates and imputation methods, look at the following questions:
Background
1) Even though imputation leads to a complete data file, inference, in particular point estimation, is valid only if the additional underlying assumptions are satisfied.
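Let U be a population of size N. We want to estimate the population mean $\bar{Y} = \frac{1}{N}\sum_{i \in U} y_i$. To that end, we draw a simple random sample s of size n and observe each value $y_i$, $i \in s$. It is well known that the sample mean $\bar{y} = \frac{1}{n}\sum_{i \in s} y_i$ is an unbiased estimator of $\bar{Y}$. When some values are missing and are replaced by imputed values, the sample mean becomes the imputed estimator
$$\bar{y}_I = \frac{1}{n}\left(\sum_{i \in s_r} y_i + \sum_{i \in s_m} y_i^*\right) \qquad (1)$$
where $s_r$ and $s_m$ denote the sets of respondents and nonrespondents in the sample and $y_i^*$ is the value imputed for the missing $y_i$.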
In fact, to make inference in the presence of imputation, we have no choice but to make assumptions on the response mechanism or on the variable of interest y. Indeed, one can either completely specify the response mechanism, or completely specify a model for the variable of interest y, in which case the response mechanism takes a more general form. We now illustrate that the imputed estimator (1) is unbiased for the population mean only when these assumptions hold.
Assumptions on the response mechanism
Let us suppose that we make assumptions with regard to the response mechanism. For example, it is sometimes justified to suppose that the response mechanism is missing completely at random (MCAR). Recall from exercise one that an MCAR response mechanism is a mechanism for which every unit has the same probability of responding: $p_i = p$ for all $i \in U$, where $p_i$ denotes the probability that unit i responds to item y.
Under this mechanism, it is easy to see that the imputed estimator (1) is unbiased for the population mean.
What will happen if we assume that the response mechanism is MCAR when in reality it is not? In this case, the imputed estimator may be biased. To illustrate this point, let us take the case of a response mechanism for which the probability of responding to item y varies from one unit to the next (in other words, unit i responds with its own probability $p_i$). In this case, the imputed estimator (1) is biased:
$$\mathrm{Bias} \approx \frac{\mathrm{Cov}(p, y)}{\bar{P}} \qquad (2)$$
where $\mathrm{Cov}(p, y)$ is the covariance between the probability of response and the variable of interest y in the population, and $\bar{P} = \frac{1}{N}\sum_{i \in U} p_i$ is the mean of the probabilities of response in the population. Note that the bias (2) is equal to 0 if this covariance is 0, which is the case, for example, for an MCAR response mechanism (every unit has the same probability of response).
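The effect of a non-MCAR mechanism can be checked with a small simulation. The sketch below is illustrative only: the population, the response probabilities and the function name `respondent_mean_bias` are all hypothetical, and it assumes respondent-mean imputation, so that the imputed estimator equals the mean of the respondents. It compares the simulated bias with the covariance of p and y divided by the mean response probability.

```python
import random


def respondent_mean_bias(mcar=False, n_pop=2000, n=200, n_reps=2000, seed=1):
    """Return (simulated_bias, approximate_bias) of the mean-imputed estimator."""
    rng = random.Random(seed)
    # Hypothetical population of y-values (not the case study's data).
    y = [rng.gauss(50, 10) for _ in range(n_pop)]
    y_min, y_max = min(y), max(y)
    if mcar:
        # MCAR: every unit has the same response probability.
        p = [0.65] * n_pop
    else:
        # Assumed mechanism: response probability rises linearly with y (0.4 to 0.9).
        p = [0.4 + 0.5 * (yi - y_min) / (y_max - y_min) for yi in y]

    y_bar = sum(y) / n_pop
    p_bar = sum(p) / n_pop
    cov_py = sum((pi - p_bar) * (yi - y_bar) for pi, yi in zip(p, y)) / n_pop
    approx_bias = cov_py / p_bar  # covariance of p and y over the mean of p

    total_error = 0.0
    used = 0
    for _ in range(n_reps):
        sample = rng.sample(range(n_pop), n)  # simple random sample
        resp = [y[i] for i in sample if rng.random() < p[i]]
        if not resp:
            continue  # no respondents: nothing to impute from
        r_mean = sum(resp) / len(resp)
        # Mean imputation: each missing value is replaced by the respondent mean.
        imputed = resp + [r_mean] * (n - len(resp))
        total_error += sum(imputed) / n - y_bar
        used += 1
    return total_error / used, approx_bias


if __name__ == "__main__":
    print("non-MCAR (simulated, approximate):", respondent_mean_bias())
    print("MCAR     (simulated, approximate):", respondent_mean_bias(mcar=True))
```

With `mcar=True` both numbers should be close to 0; with the assumed y-dependent mechanism both should be clearly positive and close to each other.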
Assumptions on the variable of interest
Instead of modelling the response mechanism, some prefer to model the variable of interest (the response mechanism then takes a more general form). Let us now suppose that the method used is ratio imputation: an auxiliary value x is available for all units in the sample and each missing value $y_i$ is replaced by
$$y_i^* = \frac{\bar{y}_r}{\bar{x}_r}\, x_i$$
where $\bar{y}_r$ and $\bar{x}_r$ are the means of y and x among the respondents (see Table 1, exercise 2). This method implicitly assumes the ratio model
$$y_i = \beta x_i + \varepsilon_i, \qquad E(\varepsilon_i) = 0 \qquad (3)$$
It is then easy to show that the imputed estimator (1) is unbiased provided the assumed model (3) is valid for the units in the population. What can be said if model (3) is not valid? Once again, the imputed estimator may be biased. Indeed, suppose that the real model linking variables y and x is not (3) but rather
It is then easy to show that, according to model (4),
the imputed estimator (1) is biased:
where
Note that bias (5) is equal to 0 if the response mechanism is MCAR (every unit has the same probability of response).
These two examples clearly illustrate that if the starting assumption (MCAR mechanism or model) is not satisfied, then the imputed estimator will likely be biased.
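The model-based argument can be illustrated in the same way. In the minimal sketch below, y is generated either from a ratio model through the origin (`intercept=0`) or from a line with a nonzero intercept. The intercept model is only an assumed example of misspecification and is not necessarily the alternative model used in this exercise; the population, the response probabilities and the function name `ratio_imputation_bias` are likewise hypothetical.

```python
import random


def ratio_imputation_bias(intercept, mcar=False, n_pop=2000, n=200,
                          n_reps=2000, seed=7):
    """Simulated bias of the ratio-imputed mean for a hypothetical population."""
    rng = random.Random(seed)
    x = [rng.uniform(10, 100) for _ in range(n_pop)]
    # intercept == 0 gives a ratio model; intercept != 0 is an assumed misspecification.
    y = [intercept + 2.0 * xi + rng.gauss(0, 5) for xi in x]
    if mcar:
        p = [0.6] * n_pop
    else:
        # Assumed mechanism: response probability increases with x (0.3 to 0.9).
        p = [0.3 + 0.6 * (xi - 10) / 90 for xi in x]
    true_mean = sum(y) / n_pop

    total_error = 0.0
    used = 0
    for _ in range(n_reps):
        sample = rng.sample(range(n_pop), n)  # simple random sample
        resp = {i for i in sample if rng.random() < p[i]}
        if not resp:
            continue  # no respondents: the ratio cannot be computed
        # Ratio of respondent totals, equal to the ratio of respondent means.
        ratio = sum(y[i] for i in resp) / sum(x[i] for i in resp)
        # Observed y for respondents, ratio * x for nonrespondents.
        imputed = [y[i] if i in resp else ratio * x[i] for i in sample]
        total_error += sum(imputed) / n - true_mean
        used += 1
    return total_error / used


if __name__ == "__main__":
    print("ratio model, response depends on x:    ", ratio_imputation_bias(0.0))
    print("intercept model, response depends on x:", ratio_imputation_bias(30.0))
    print("intercept model, MCAR:                 ", ratio_imputation_bias(30.0, mcar=True))
```

Under these assumptions the bias should be near 0 when the ratio model holds, clearly negative when the intercept model holds and response depends on x, and near 0 again under MCAR.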
2) Imputation has the effect of modifying relationships.
Since the theoretical treatment of relationships under imputation is rather complex, let us consider instead a population of size N = 10 with two variables x and y. Table 1 shows the data for the population:
Table 1
The coefficient of correlation between x and y in the population is
For simplicity, let us suppose that we had a census instead of a survey. The data collected is shown in the following table:
Table 2
We randomly generate missing values, indicated by "." in Table 2, independently for x and y so that the response rate is approximately 70%. Let us suppose that marginal mean imputation is used (i.e., for a missing value for item x, we impute the mean of the respondents to item x, and likewise for item y).
Table 3 shows the data after imputation, where the imputed values are flagged using *:
Table 3
The coefficient of correlation between x and y in the imputed data set is
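The same comparison can be reproduced on any small data set. The sketch below uses made-up values for ten units (they are not the values of Tables 1 to 3) and a missingness pattern fixed by hand, three missing values per variable, so that the result is reproducible. It computes the correlation before and after marginal mean imputation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


def mean_impute(values):
    """Replace each missing value (None) by the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Hypothetical complete data for N = 10 units.
X_FULL = [12, 15, 18, 20, 23, 25, 28, 30, 34, 38]
Y_FULL = [25, 31, 35, 42, 44, 52, 55, 61, 70, 75]

# Same data with a 70% response rate per item; None marks a missing value.
X_OBS = [12, None, 18, 20, 23, None, 28, 30, None, 38]
Y_OBS = [None, 31, 35, 42, None, 52, 55, 61, 70, None]

r_full = pearson(X_FULL, Y_FULL)
r_imputed = pearson(mean_impute(X_OBS), mean_impute(Y_OBS))

if __name__ == "__main__":
    print(f"correlation, complete data:     {r_full:.3f}")
    print(f"correlation, mean-imputed data: {r_imputed:.3f}")
```

Because every imputed value sits exactly at the respondent mean, it contributes nothing to the cross-product of deviations, which pulls the correlation toward 0.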
3) If imputed values are treated as observed values, the variance of the estimator can be substantially underestimated, especially if the proportion of nonresponse is high.
Survey statisticians have studied this issue extensively in recent years. A number of articles have emphasized that imputed values must not be treated as if they had been observed, especially if the rate of nonresponse is significant. For example, with a response rate of 70%,
treating imputed values as observed values may lead to an underestimation in
the variance,
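A rough feel for the size of this underestimation can be obtained by simulation. The sketch below is only an illustration under stated assumptions: values are drawn independently from a hypothetical normal distribution, nonresponse is MCAR, respondent-mean imputation is used, and the function name `naive_to_true_variance_ratio` is made up. It compares the average of the naive variance estimator (sample variance of the completed data divided by n) with the Monte Carlo variance of the imputed estimator.

```python
import random


def naive_to_true_variance_ratio(response_rate=0.7, n=100, n_reps=4000, seed=11):
    """Average naive variance estimate divided by the simulated true variance."""
    rng = random.Random(seed)
    estimates = []
    naive_vars = []
    for _ in range(n_reps):
        # Hypothetical sample values (independent normal draws).
        ys = [rng.gauss(50, 10) for _ in range(n)]
        # MCAR nonresponse: each value is observed with the same probability.
        resp = [v for v in ys if rng.random() < response_rate]
        if len(resp) < 2:
            continue
        r_mean = sum(resp) / len(resp)
        # Mean imputation, then treat the completed data as fully observed.
        imputed = resp + [r_mean] * (n - len(resp))
        est = sum(imputed) / n
        s2 = sum((v - est) ** 2 for v in imputed) / (n - 1)
        naive_vars.append(s2 / n)
        estimates.append(est)
    m = sum(estimates) / len(estimates)
    true_var = sum((e - m) ** 2 for e in estimates) / (len(estimates) - 1)
    return (sum(naive_vars) / len(naive_vars)) / true_var


if __name__ == "__main__":
    print("ratio at 70% response: ", naive_to_true_variance_ratio(0.7))
    print("ratio at 100% response:", naive_to_true_variance_ratio(1.0))
```

Under these assumptions the ratio is roughly the square of the response rate, because the imputed values add no spread to the completed data while the estimator is really based on the respondents only. Other imputation methods and response mechanisms would give different figures.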
Summary
In conclusion, when imputation is used to deal with partial (item) nonresponse, it is important to: