
 

Handling Missing Data

CASE STUDY - Exercise Four

 

Estimation of bias from imputation

 

This exercise examines the impact of the response mechanism, the imputation method and the response rate on the point estimator using the Generalized System for Imputation Simulations (GENESIS v.1.0). To install and run GENESIS v.1.0, you need SAS© Release 8.02 for Windows. A beta version of GENESIS v.1.0 is available to download from this site free of charge for use within this case study. For information on this and later versions of GENESIS, contact David Haziza (david.haziza@statcan.ca).

 

Students can address several questions in this section. Using various response mechanisms, response rates and imputation methods, consider the following questions:

 

1. Compare the relative bias of the point estimator using various response mechanisms, response rates and imputation methods.
2. Consider the distribution of the imputed estimator.
3. Compare the population variance with the variance after imputation, for example in the case of mean and hot-deck imputation (the notation is defined below).

4. Compare the MSE of the imputed estimator under different imputation methods.  For example, use ratio imputation with different correlation between the variable of interest and the auxiliary variable used for imputation.

     

Background

 

1) Even though imputation leads to a complete data file, inference, in particular point estimation, is valid only if the additional underlying assumptions are satisfied.

 

Let U be a population of size N. We want to estimate the population mean

\[ \bar{Y} = \frac{1}{N} \sum_{i \in U} y_i . \]

To that end, we draw a simple random sample s of size n and observe each value

\[ y_i, \quad i \in s . \]

It is well known that the sample mean

\[ \bar{y} = \frac{1}{n} \sum_{i \in s} y_i \]

is an unbiased estimator of $\bar{Y}$ in the case of full response. When there is nonresponse, it is impossible to calculate the mean $\bar{y}$ since certain y values are missing. We can define an imputed estimator for $\bar{Y}$, designated $\bar{y}_I$, given by:

\[ \bar{y}_I = \frac{1}{n} \left( \sum_{i \in s_r} y_i + \sum_{i \in s_m} y_i^* \right) \tag{1} \]

where $s_r$ is the set of r units that responded to item y, $s_m$ is the set of m units that did not respond to item y ($r + m = n$), and $y_i^*$ is the imputed value created in order to "fill the hole" for the missing value $y_i$.
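As a minimal sketch of how the imputed estimator (1) is computed, the following Python fragment fills each missing value and averages over the whole sample. The function name `imputed_mean`, the `impute` callback and the toy data are illustrative assumptions, not part of GENESIS; mean imputation is used only as an example of an imputation rule.

```python
from statistics import mean

def imputed_mean(y, impute):
    """Imputed estimator (1): average of observed and imputed values.

    y      -- sample values, with None marking a missing response
    impute -- hypothetical callback returning the imputed value for unit i
    """
    filled = [yi if yi is not None else impute(i) for i, yi in enumerate(y)]
    return sum(filled) / len(filled)

# Toy sample of n = 10 units with m = 3 missing values (illustrative data).
y_sample = [None, 5, 3, 9, None, 6, 11, 13, None, 12]
respondent_mean = mean(v for v in y_sample if v is not None)

# Mean imputation: every hole is filled with the respondent mean.
y_bar_I = imputed_mean(y_sample, lambda i: respondent_mean)
print(y_bar_I, respondent_mean)
```

With mean imputation the imputed estimator coincides with the mean of the respondents, which is why its properties depend entirely on how the respondents are selected.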

 

In fact, to make inference in the presence of imputation, we have no choice but to make assumptions about the response mechanism or about the variable of interest y. One can either completely specify the response mechanism, or completely specify a model for the variable of interest y, in which case the response mechanism takes a more general form. We now illustrate that the imputed estimator (1) is unbiased for the population mean only if the assumptions made at the outset are satisfied.

 

Assumptions on the response mechanism

Let us suppose that we make assumptions with regard to the response mechanism. For example, it is sometimes justified to suppose that the response mechanism is missing completely at random (MCAR). Recall from exercise one that an MCAR response mechanism is a mechanism for which:

  i) the probability of responding to item y is constant for all units in the population (more formally, $p_i = P(\text{unit } i \text{ responds to item } y) = p$ for all $i \in U$); and

  ii) the units respond to item y independently of one another.

 

Under this mechanism, it is easy to see that the imputed estimator (1) is unbiased for the population mean $\bar{Y}$.

What will happen if we assume that the response mechanism is MCAR when in reality it is not? In this case, the imputed estimator may be biased. To illustrate this point, let us take the case of a response mechanism for which the probability of responding to item y varies from one unit to the next (in other words, $p_i = P(\text{unit } i \text{ responds to item } y)$ depends on $i$).

In this case, the imputed estimator (1) is biased:

\[ B(\bar{y}_I) \approx \frac{1}{\bar{p}} \cdot \frac{1}{N} \sum_{i \in U} (p_i - \bar{p})(y_i - \bar{Y}) \tag{2} \]

where

\[ \bar{p} = \frac{1}{N} \sum_{i \in U} p_i \]

is the mean of the probabilities of response in the population. Note that the bias (2) is equal to 0 if the covariance between the probability of response and the variable of interest y is 0 in the population, which is the case, for example, for an MCAR response mechanism ($p_i = p$ for all $i$). In general, however, the bias is different from 0. In fact, expression (2) justifies the formation of imputation classes.
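The role of the covariance between the response probability and y can be checked numerically. The sketch below assumes mean imputation, so that the imputed estimator equals the respondent mean, whose large-sample limit is the mean of y weighted by the response probabilities. The y values are those of the small population in Table 1 of this exercise; the response probabilities `p_assumed` and the function name `approx_bias` are hypothetical choices made for illustration.

```python
def approx_bias(p, y):
    """Approximate bias: population covariance of (p, y) divided by mean of p."""
    N = len(y)
    p_bar = sum(p) / N
    y_bar = sum(y) / N
    cov = sum((pi - p_bar) * (yi - y_bar) for pi, yi in zip(p, y)) / N
    return cov / p_bar

y_pop = [2, 5, 3, 9, 11, 6, 11, 13, 11, 12]
# Assumed response probabilities: units with larger y respond more often.
p_assumed = [0.5, 0.5, 0.5, 0.6, 0.7, 0.7, 0.8, 0.9, 0.9, 0.9]

bias_formula = approx_bias(p_assumed, y_pop)

# Same quantity computed directly: probability-weighted mean minus true mean.
weighted_mean = sum(pi * yi for pi, yi in zip(p_assumed, y_pop)) / sum(p_assumed)
bias_direct = weighted_mean - sum(y_pop) / len(y_pop)

# Under MCAR every unit has the same probability, so the covariance vanishes.
bias_mcar = approx_bias([0.7] * len(y_pop), y_pop)

print(bias_formula, bias_direct, bias_mcar)
```

Because the assumed probabilities increase with y, the respondents over-represent large values and the bias is positive; with a constant probability it is zero up to rounding error.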

 

Assumptions on the variable of interest

Instead of modelling the response mechanism, some prefer to model the variable of interest (the response mechanism then takes a more general form). Let us now suppose that the method used is ratio imputation: an auxiliary value x is available for all units in the sample and each missing value $y_i$ is replaced by

\[ y_i^* = \frac{\bar{y}_r}{\bar{x}_r} \, x_i , \]

where $\bar{x}_r$ and $\bar{y}_r$ are the respondent means of variables x and y respectively. Ratio imputation naturally suggests the use of a model of the form

\[ y_i = \beta x_i + \varepsilon_i, \qquad E(\varepsilon_i) = 0 \tag{3} \]

 

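(see Table 1, exercise 2). It is then easy to show that the imputed estimator (1) is unbiased provided the assumed model (3) is valid for the units in the population. What can be said if model (3) is not valid? Once again, the imputed estimator may be biased. Indeed, suppose that the real model linking variables y and x is not (3) but rather

\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad E(\varepsilon_i) = 0 . \tag{4} \]

It is then easy to show that, under model (4), the imputed estimator (1) is biased:

\[ B(\bar{y}_I) \approx \beta_0 \, \frac{\bar{X} - \bar{X}_p}{\bar{X}_p} \tag{5} \]

where $\bar{X} = \frac{1}{N} \sum_{i \in U} x_i$ and $\bar{X}_p = \sum_{i \in U} p_i x_i \big/ \sum_{i \in U} p_i$, with $p_i$ the probability that unit i responds to item y.

Note that bias (5) is equal to 0 if the response mechanism is MCAR ($p_i = p$ for all $i$, so that $\bar{X}_p = \bar{X}$). However, in general, this bias is not equal to 0.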

 

These two examples clearly illustrate that if the starting assumption (MCAR mechanism or model) is not satisfied, then the imputed estimator will likely be biased.
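Ratio imputation itself is straightforward to sketch. The fragment below is an illustration only: the function name `ratio_impute` is hypothetical, the data are the ten (x, y) pairs used in Table 1 of this exercise, and the choice of which y values are missing is an assumption. It also checks the identity that, with x fully observed, the ratio-imputed mean equals the respondent ratio multiplied by the sample mean of x.

```python
def ratio_impute(x, y):
    """Replace each missing y (None) by (y_bar_r / x_bar_r) * x_i."""
    respondents = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    x_bar_r = sum(xi for xi, _ in respondents) / len(respondents)
    y_bar_r = sum(yi for _, yi in respondents) / len(respondents)
    ratio = y_bar_r / x_bar_r
    filled = [yi if yi is not None else ratio * xi for xi, yi in zip(x, y)]
    return filled, ratio

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]                  # auxiliary variable, fully observed
y = [None, 5, 3, 9, None, 6, 11, 13, None, 12]       # assumed pattern of missing y values

y_filled, ratio = ratio_impute(x, y)
y_bar_I = sum(y_filled) / len(y_filled)              # imputed estimator (1)
x_bar = sum(x) / len(x)

print(ratio, y_bar_I, ratio * x_bar)
```

The last two printed values agree, which makes it clear why the estimator behaves well when y is roughly proportional to x and why an intercept in the true model, combined with respondents whose x values differ systematically from those of the nonrespondents, introduces bias.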

 

2) Imputation has the effect of modifying relationships

 

Since the theoretical treatment of relationships under imputation is rather complex, let us consider instead a population of size N = 10 with two variables x and y. Table 1 shows the data for the population:

Table 1

x     1     2     3     4     5     6     7     8     9    10
y     2     5     3     9    11     6    11    13    11    12

The coefficient of correlation between x and y in the population is $\rho = 0.8450$.

 

For simplicity, let us suppose that we had a census instead of a survey. The data collected is shown in the following table:

 

Table 2

x     1     .     3     4     5     6     7     .     9     .
y     .     5     3     9     .     6    11    13     .    12

The missing values, indicated by "." in Table 2, were generated at random and independently for x and y so that the response rate is approximately 70%. Let us suppose that marginal mean imputation is used (i.e., for a missing value for item x, we impute the mean of respondents $\bar{x}_r = 5$, and for a missing value for item y, we impute the mean of respondents $\bar{y}_r = 8.42$).

Table 3 shows the data after imputation, where the imputed values are flagged using *:

 

Table 3

x     1      5*     3      4      5      6      7      5*     9      5*
y     8.42*  5      3      9      8.42*  6      11     13     8.42*  12

 

The coefficient of correlation between x and y in the imputed data set is 0.2141. We note that imputation attenuated the relationship (or the association) between the variables x and y and that the effect was substantial (0.8450 reduced to 0.2141). One could argue that the missing values denoted by "." in Table 2 are only one particular realisation of the response mechanism used. To address this issue, we repeatedly generated realisations of the response mechanism and calculated the coefficient of correlation for each realisation. The average of these coefficients over the realisations was lower than the true value 0.8450, which clearly shows the attenuation of the correlation between the variables. How can this attenuation problem be dealt with? This question is currently under investigation and the initial findings indicate that "sophisticated" methods of imputation and/or estimation need to be used.
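The attenuation in this small example can be reproduced with a few lines of Python. This is a sketch for checking the arithmetic, not the GENESIS procedure; the helper names `pearson` and `mean_impute` are hypothetical. It uses the unrounded respondent means, so the results agree with the figures quoted above only up to rounding in the last decimal place.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    s_ab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    s_aa = sum((ai - a_bar) ** 2 for ai in a)
    s_bb = sum((bi - b_bar) ** 2 for bi in b)
    return s_ab / sqrt(s_aa * s_bb)

def mean_impute(values):
    """Marginal mean imputation: replace None by the respondent mean."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [v if v is not None else fill for v in values]

# Table 1: complete population data.
x_pop = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_pop = [2, 5, 3, 9, 11, 6, 11, 13, 11, 12]

# Table 2: the same data with missing values (None stands for ".").
x_obs = [1, None, 3, 4, 5, 6, 7, None, 9, None]
y_obs = [None, 5, 3, 9, None, 6, 11, 13, None, 12]

r_pop = pearson(x_pop, y_pop)
r_imp = pearson(mean_impute(x_obs), mean_impute(y_obs))
print(f"population: {r_pop:.4f}  after imputation: {r_imp:.4f}")
```

Every imputed value sits exactly at a marginal mean, so it contributes nothing to the cross-product of deviations; only the units observed on both variables carry information about the relationship, which is what drives the correlation towards 0.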

 

3) If imputed values are treated as observed values, the variance of the estimator can be substantially underestimated, especially if the proportion of nonresponse is high.

 

Survey statisticians have studied this issue extensively in recent years. A number of articles have emphasized that imputed values must not be treated as if they had been observed, especially if the rate of nonresponse is significant. For example, with a response rate of 70%, treating imputed values as observed values may lead to an underestimation of the variance of as much as 50%. Confidence intervals that treat imputed values as observed may be narrower than those obtained using a correct estimator that accounts for imputation, giving a false sense of accuracy. Note also the importance of flagging the imputed values, as in Table 3, for variance estimation.
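The order of magnitude of this underestimation can be illustrated with the y values of Table 2. The sketch below rests on assumptions that are not part of the original exercise: the ten units are treated as a simple random sample, mean imputation is used, the response mechanism is MCAR, and the finite population correction is ignored, so that the respondent-based quantity $s_r^2 / r$ serves as a reasonable variance estimate for the imputed mean.

```python
from statistics import variance

y_resp = [5, 3, 9, 6, 11, 13, 12]    # observed y values from Table 2
n = 10                               # sample size
r = len(y_resp)                      # number of respondents (70% response)

# Mean imputation: the n - r holes are filled with the respondent mean.
y_bar_r = sum(y_resp) / r
y_imputed = y_resp + [y_bar_r] * (n - r)

# Naive variance estimate of the mean: imputed values treated as observed.
naive = variance(y_imputed) / n

# Respondent-based estimate: only r values carry real information.
respondent_based = variance(y_resp) / r

ratio = naive / respondent_based
print(ratio)
```

Algebraically the ratio equals r(r - 1) / (n(n - 1)), roughly the square of the response rate, so at a 70% response rate the naive estimate is about half of the respondent-based one. This is in line with the underestimation of as much as 50% mentioned above.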

 

Summary

In conclusion, when imputation is used to deal with partial (item) nonresponse, it is important to:

  • model the response mechanism or the characteristics carefully to ensure that the models "stand up" for inference purposes;

  • calculate the variance estimator of the imputed estimator correctly; and 

  • use more sophisticated imputation and/or estimation methods to preserve the relationships between the variables.