[S] Alternatives to looping over big dataframes?

Bruce Western (western@lotka.princeton.edu)
Wed, 5 Aug 98 17:33:12 EDT


We have a dataframe, called data, with repeated measures on a large
sample of survey respondents (4600 respondents, over 100,000
observations). Each respondent may be observed as few as 1 time or
as many as 11 times. The dataframe also includes an id variable
(IDCODE) that assigns a unique number to each respondent.

We want to create a new dataframe that includes only those respondents
who are completely observed at all 11 time points. We can do this by
first deleting rows of the dataframe with NA's, then looping over
respondent id's, but this is very slow:

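# drop every row that contains an NA
data1 <- na.omit(data)
data2 <- NULL
id <- unique(data$IDCODE)
for(i in 1:length(id)) {
    # rows remaining for this respondent after NA removal
    temp <- data1[data1$IDCODE == id[i], ]
    # keep the respondent only if all 11 time points survive
    if(nrow(temp) == 11) data2 <- rbind(data2, temp)
}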

This approach takes about 6 hours on our SPARC 20 (Version 3.4 Release
1 for Sun SPARC, SunOS 4.1.3_U1 : 1996).
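
To make the intended result concrete, here is a small made-up example
of the same procedure. The dataframe toy, the variables TIME and Y, and
the three respondents are invented for illustration only; they are not
our survey data. Respondent 1 is complete, respondent 2 has one NA, and
respondent 3 was observed only 10 times, so only respondent 1 should
survive.

```r
# Toy data (invented for illustration): 3 respondents, 11 time points each.
toy <- data.frame(IDCODE = rep(1:3, rep(11, 3)),
                  TIME   = rep(1:11, 3),
                  Y      = as.numeric(1:33))
toy$Y[15] <- NA      # respondent 2 has one missing measurement
toy <- toy[-33, ]    # respondent 3 was observed only 10 times

# Same procedure as in the loop above, applied to the toy data.
toy1 <- na.omit(toy)
toy2 <- NULL
id <- unique(toy$IDCODE)
for(i in 1:length(id)) {
    temp <- toy1[toy1$IDCODE == id[i], ]
    if(nrow(temp) == 11) toy2 <- rbind(toy2, temp)
}

# toy2 now holds only the rows of the completely observed respondent(s).
```

On data this small the loop is instant; the slowness only shows up at
the scale of our real dataframe.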

Can anyone think of a way of creating the object data2 without looping,
or at least by using loops more efficiently?

Cheers,

Bruce Western

--
Bruce Western
Department of Sociology            Phone: (609) 258-2445
Princeton University               Fax: (609) 258-2180
Princeton NJ 08544-1010            E-mail: western@princeton.edu

-----------------------------------------------------------------------
This message was distributed by s-news@wubios.wustl.edu. To unsubscribe
send e-mail to s-news-request@wubios.wustl.edu with the BODY of the
message: unsubscribe s-news