# [S] How to "count" NA's in a data set?

Marc R. Feldesman (feldesmanm@pdx.edu)
Mon, 19 Oct 1998 21:28:24 -0700

Today I received a very sparse data set. The data consist of 151 cases
on which a maximum of 42 variables were measured. Both from inspecting the
data set and from running simple summary statistics, it is obvious that
there is an enormous amount of randomly missing data. This isn't
surprising, given that the data are measurements of the heads of
fossil human ancestors. Eventually I want to do some multiple imputation
on the dataset, but the volume of missing data is rather daunting at the
moment.

Before considering multiple imputation, I really need to figure out what
the missingness pattern is. I've never confronted a dataset quite this
sparse before (some measurements are missing on as many as 90 of 151 cases,
while other variables are missing on only a few cases), and the number of
distinct patterns of missing and non-missing variables across cases is
quite large.
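To make the first step concrete: here is the kind of tabulation I have in
mind, sketched in Python on a small made-up matrix (the values and its
6 x 4 shape are hypothetical; my real matrix is 151 x 42). I don't yet know
the S idiom for this, which is part of what I'm asking.

```python
# Toy data: 6 cases by 4 variables; None marks a missing value.
data = [
    [1.0,  None, 3.1,  2.2],
    [0.9,  2.0,  None, None],
    [1.1,  2.1,  3.0,  2.3],
    [None, None, None, 2.0],
    [1.2,  2.2,  2.9,  None],
    [1.0,  1.9,  3.2,  2.1],
]

n_vars = len(data[0])

# Number of missing values in each variable (column) and each case (row).
missing_per_var = [sum(row[j] is None for row in data) for j in range(n_vars)]
missing_per_case = [sum(v is None for v in row) for row in data]

print("missing per variable:", missing_per_var)
print("missing per case:    ", missing_per_case)
```

That much tells me which variables and which cases are the worst offenders,
but not how the missingness co-occurs.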

I'd like to know:

(1) which cases have no missing variables (the easy one);
(2) which cases have all variables but variable 1, all but variable 2,
etc. (relatively trivial to do);
(3) which cases have all variables but 1 & 2, 1 & 3, 1 & 4, etc. (now
things get a bit more complicated).

In other words, there will come a point where I will want to know what the
"optimal" data set might look like: the subset with the largest number of
variables and cases. I'll also be interested in the ordered set of less
optimal subsets, ranging from most to least complete.
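Again to make the request concrete, here is a rough sketch (in Python, on
the same hypothetical 6 x 4 toy matrix) of what I mean by tabulating the
missingness patterns and scoring candidate subsets. Note the sketch only
considers variable subsets that actually occur as some case's observed
pattern; the fully general search over all subsets is combinatorial, which
is exactly why I'm asking for efficient approaches.

```python
from collections import Counter

# Toy data: 6 cases by 4 variables; None marks a missing value.
data = [
    [1.0,  None, 3.1,  2.2],
    [0.9,  2.0,  None, None],
    [1.1,  2.1,  3.0,  2.3],
    [None, None, None, 2.0],
    [1.2,  2.2,  2.9,  None],
    [1.0,  1.9,  3.2,  2.1],
]

# Each case's missingness pattern: the tuple of variable indices observed.
patterns = [tuple(j for j, v in enumerate(row) if v is not None) for row in data]

# How many cases share each pattern, listed from most to least complete.
counts = Counter(patterns)
for pat, n in sorted(counts.items(), key=lambda kv: -len(kv[0])):
    print(len(pat), "variables observed on", n, "case(s):", pat)

# Candidate "optimal" subsets: for each observed pattern, count the cases
# that are complete on that variable set (i.e., whose own pattern is a
# superset), and score the subset by (variables kept) * (complete cases).
best = max(
    counts,
    key=lambda pat: len(pat) * sum(set(pat) <= set(p) for p in patterns),
)
n_complete = sum(set(best) <= set(p) for p in patterns)
print("best variable subset:", best, "with", n_complete, "complete cases")
```

The scoring rule (variables times cases) is just one arbitrary choice; in
practice I'd want the whole ranked list of subsets, not a single winner.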

Only after I get this information can I make any rational decision about
what and how much needs imputation.

I've never had to deal with a problem like this before. Any [S or
otherwise] suggestions on how to approach it efficiently would be welcome.

Thanks.

Dr. Marc R. Feldesman
email: feldesmanm@pdx.edu
email: feldesman@ibm.net
fax: 503-725-3905

"Don't know where I'm goin'
Don't like where I've been
There may be no exit
But, hell I'm goin' in" Jimmy Buffett