[S] Summary of Responses to my Data Frame Subsetting Question

Humbolt, Allen (HumboltA@kochind.com)
Fri, 29 May 1998 15:11:11 -0500

My original question was similar to what I restated below. I've changed the
first level of variable B below to 17.50 from 17.43 since it is indeed
possible for me to have ties in my goal of finding those records where A is
closest to B within each Class. I thank Patrick Connolly for pointing out
the relevance of knowing how I wish to handle ties. In my case, it isn't
critical which record I get as long as A is reasonably close to B. My
actual application involves the identification of which options data are "at
the money". The "strike" of the option is what I called A below, and the
current price level is what I called B.

I have a data frame called "mydata" with data like the following.
Class   A      B  Index
    1  16  17.50      1
    1  17  17.50      2
    1  18  17.50      3
    1  19  17.50      4
    2  17  18.02      5
    2  18  18.02      6
    2  19  18.02      7
My goal is to select the subset of this data frame where A is closest to B
within each class. My desired result for the above data would be the
following.
Class   A      B  Index
    1  17  17.50      2
    2  18  18.02      6
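
For anyone who wants to reproduce the example, the data and the goal can be sketched in R (a dialect of S) as below. This is only a plain per-class baseline, not one of the solutions I received; `closest.row` is a hypothetical helper name, and the sketch assumes a tie may go to whichever tied row comes first.

```r
# Example data from this message.
mydata <- data.frame(
  Class = c(1, 1, 1, 1, 2, 2, 2),
  A     = c(16, 17, 18, 19, 17, 18, 19),
  B     = c(17.50, 17.50, 17.50, 17.50, 18.02, 18.02, 18.02),
  Index = 1:7
)

# Hypothetical helper: the row of one class where A is closest to B.
# which.min() returns the first minimum, so a tie goes to the earlier row.
closest.row <- function(d) d[which.min(abs(d$A - d$B)), ]

# Apply the helper to each class separately and stack the results.
result <- do.call(rbind, lapply(split(mydata, mydata$Class), closest.row))
print(result)
```

In Class 1 both A = 17 and A = 18 are 0.50 away from B, so the baseline keeps the first of the two, which is acceptable for my purposes.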

I received several solutions which worked on my example. I wish to share
the two solutions which were simple and fast not only on my small example,
but on a real and rather large data frame.

From Bill Venables
> newdat <- mydata[order(mydata$Class, abs(mydata$A - mydata$B)),]
> newdat <- newdat[c(1, diff(as.numeric(newdat$Class))) > 0.1,]
Note: Bill had "==1" instead of ">0.1". The ">0.1" happens to be more
appropriate for my actual data, where Class is a numeric representation of
dates and any change of more than zero (1 for a daily change, 3 for a
weekend change) marks a new date and hence a new Class.
On a real data set this reduced 18,185 rows to 1,967 rows in 1.6 seconds on
my PC.
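
The sort-then-keep-first idea can be tried on the example data with the minimal R sketch below. It assumes a numeric Class and takes diff() on the sorted copy, so it does not rely on the input already being in Class order; the shuffled row order and the `keep` variable are illustrative assumptions, not part of Bill's reply.

```r
# Example data from this message, deliberately shuffled.
mydata <- data.frame(
  Class = c(1, 1, 1, 1, 2, 2, 2),
  A     = c(16, 17, 18, 19, 17, 18, 19),
  B     = c(17.50, 17.50, 17.50, 17.50, 18.02, 18.02, 18.02),
  Index = 1:7
)
mydata <- mydata[c(7, 3, 1, 6, 2, 5, 4), ]

# Sort by Class, and within each Class by the distance between A and B.
newdat <- mydata[order(mydata$Class, abs(mydata$A - mydata$B)), ]

# Keep the first row overall plus every row where Class changes.
keep <- c(TRUE, diff(as.numeric(newdat$Class)) > 0)
newdat <- newdat[keep, ]
print(newdat)
```

Which of two tied rows survives depends on their order in the input, which is fine when any reasonably close record will do.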

From Nicole DePriest Demers
> ordered.dif <- order(abs(mydata$A-mydata$B))
> newdat <- mydata[ordered.dif[tapply(ordered.dif, list(mydata$Class),
> newdat <- newdat[order(newdat$Class),]
On a real data set this reduced 18,185 rows to 1,967 rows in 1.1 seconds on
my PC.
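
A tapply()-based selection of this general kind can be sketched in R as follows. This is not a reconstruction of Nicole's code, only an illustration under my own assumptions: `gap`, `rows`, and `pick` are hypothetical names, and a tie goes to the earlier row.

```r
# Example data from this message.
mydata <- data.frame(
  Class = c(1, 1, 1, 1, 2, 2, 2),
  A     = c(16, 17, 18, 19, 17, 18, 19),
  B     = c(17.50, 17.50, 17.50, 17.50, 18.02, 18.02, 18.02),
  Index = 1:7
)

# Distance between A and B for every row.
gap <- abs(mydata$A - mydata$B)

# For each Class, find the row number with the smallest distance.
rows <- seq_len(nrow(mydata))
pick <- tapply(rows, mydata$Class, function(i) i[which.min(gap[i])])

# tapply() returns one row number per Class, in Class order.
newdat <- mydata[as.vector(pick), ]
print(newdat)
```

Because tapply() returns its results in Class order, this version needs no separate sort at the end.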

I received several other solutions which worked nicely on my small example
data frame but were too slow on my real and much larger data, an issue I
did not raise when asking my question. There were also a few solutions
which I didn't investigate because I didn't understand the solution or the
output.
I thank Douglas Bates, Charles Berry, Charles Pollak,
buttrey@sun10or.or.nps.navy.mil, Don MacQueen, Patrick Connolly, Jens
Oehlschlaegel, Jan Schelling, james.holtman@cbis.com, Bill Venables, and
Nicole DePriest Demers for taking the time to respond to my question.

Allen Humbolt
Quantitative Analyst
Koch Industries, Inc.

This message was distributed by s-news@wubios.wustl.edu.