# [S] Computations on subsets of a data frame (by()? Aggregate()?)

Pedro de Barros (pbarros@ualg.pt)
Mon, 08 Jun 1998 12:14:08 +0100

Dear S'ers.
I have several problems that require applying the same function to several
subsets of a data frame.
I have tried by(), which normally computes the results I want, but I cannot
get the output into the form I need.
I have also tried aggregate.data.frame(), agg.f(), agg.table() and
several other functions that seemed able to solve this problem, but I have
not managed to make them work. Some of them only accept functions that
return a single scalar, while others produce a multi-way array of mode
"list", which I cannot work out how to associate with the indices I want.
Does anyone have a (preferably "clean" ;-) solution to this?
A further elaboration of my problem is below:

I have a data frame (say, JUNK) containing 2 or more factor variables (or
other variables which can be converted to factors), and several data variables.
I want to do several types of processing on the data variables, separately
by levels of the factor(s).
In the example below, "A" and "B" are factors, while "C" and "D" are data
columns.
JUNK
A B C D
1 1 1 10
1 1 2 7
1 1 3 5
1 1 4 4
1 2 1 12
1 2 2 9
1 2 3 7
1 2 4 5
1 3 1 8
1 3 2 6
1 3 3 5
1 3 4 3
2 1 1 18
2 1 2 13
2 1 3 11
2 1 4 9
2 1 5 8
2 2 1 10
2 2 2 7
2 2 3 5
2 2 4 4
2 3 1 14
2 3 2 11
2 3 3 9
2 3 4 7
2 3 5 6
2 4 1 15
2 4 2 13
2 4 3 12
2 4 4 10

The first operation I want is to calculate, for each element of "D", the
difference between the first element of its group and that element (the
groups being the combinations of "A" and "B"). The result I want is shown
below:

JUNK (Modified)
A B C D
1 1 1 0
1 1 2 3
1 1 3 5
1 1 4 6
1 2 1 0
1 2 2 3
1 2 3 5
1 2 4 7
1 3 1 0
1 3 2 2
1 3 3 3
1 3 4 5
2 1 1 0
2 1 2 5
2 1 3 7
2 1 4 9
2 1 5 10
2 2 1 0
2 2 2 3
2 2 3 5
2 2 4 6
2 3 1 0
2 3 2 3
2 3 3 5
2 3 4 7
2 3 5 8
2 4 1 0
2 4 2 2
2 4 3 3
2 4 4 5

I manage to obtain the result for each group using
by(JUNK$D, list(JUNK$B, JUNK$A), FUN=function(x){x[1,]-x})
but the results come as a multiway list, with no association with the
original factors except for the dimnames....
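
One possible way to keep the result attached to the original factors is
sketched below. It is written in R syntax and assumes the ave() function is
available (it is part of base R; the post does not establish whether a given
S-PLUS release provides it). JUNK2 is a hypothetical name for the modified
copy.

```r
# Rebuild the example data frame from the listing in the post.
JUNK <- data.frame(
  A = factor(rep(c(1, 2), times = c(12, 18))),
  B = factor(rep(c(1, 2, 3, 1, 2, 3, 4), times = c(4, 4, 4, 5, 4, 5, 4))),
  C = c(1:4, 1:4, 1:4, 1:5, 1:4, 1:5, 1:4),
  D = c(10, 7, 5, 4, 12, 9, 7, 5, 8, 6, 5, 3,
        18, 13, 11, 9, 8, 10, 7, 5, 4, 14, 11, 9, 7, 6, 15, 13, 12, 10)
)

# ave() applies FUN within each A-by-B group and returns a vector in the
# original row order, so it can be assigned straight back to a column.
JUNK2 <- JUNK
JUNK2$D <- ave(JUNK$D, JUNK$A, JUNK$B, FUN = function(x) x[1] - x)

print(JUNK2)
```

Because the values come back in the original row order, the new "D" column
stays aligned with "A", "B" and "C" and no dimnames need to be decoded.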

The second thing I need to do is probably more complicated. It requires
using several data columns simultaneously, but within the groups defined by
the factors. I want to run a regression (linear and non-linear) of, in this
case, D on C. In practice I will want to use more independent variables.

In this case, the result I would like to obtain would be something like

Regression results

A B B0 B1 B2 r2
1 1 ? ? ? ?
1 2 ? ? ? ?
1 3 ? ? ? ?
2 1 ? ? ? ?
2 2 ? ? ? ?
2 3 ? ? ? ?
2 4 ? ? ? ?

where B0, B1 and B2 would be the regression coefficients and r2 the r2
statistic. I am also interested in using other goodness-of-fit statistics,
or in comparing different models within each group.
Again, aggregate() and similar functions only accept functions that return
a scalar, while by() does not give me the result in the form I want.
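
A table of this shape can be sketched by splitting the data frame by group,
fitting each piece, and binding the one-row results together. The sketch
below is in R syntax and rests on assumptions: the post does not say which
model produces B0, B1 and B2, so a quadratic in C is used purely as an
illustration, and fit_one, groups and results are hypothetical names.

```r
# Rebuild the example data frame from the listing in the post.
JUNK <- data.frame(
  A = factor(rep(c(1, 2), times = c(12, 18))),
  B = factor(rep(c(1, 2, 3, 1, 2, 3, 4), times = c(4, 4, 4, 5, 4, 5, 4))),
  C = c(1:4, 1:4, 1:4, 1:5, 1:4, 1:5, 1:4),
  D = c(10, 7, 5, 4, 12, 9, 7, 5, 8, 6, 5, 3,
        18, 13, 11, 9, 8, 10, 7, 5, 4, 14, 11, 9, 7, 6, 15, 13, 12, 10)
)

# Fit one group and return a single-row data frame.
# Assumption: D ~ C + C^2 stands in for whatever model is really wanted.
fit_one <- function(d) {
  fit <- lm(D ~ C + I(C^2), data = d)
  b <- unname(coef(fit))
  data.frame(A = d$A[1], B = d$B[1],
             B0 = b[1], B1 = b[2], B2 = b[3],
             r2 = summary(fit)$r.squared)
}

# One data frame per observed A-by-B combination (empty combinations dropped).
groups <- split(JUNK, list(JUNK$A, JUNK$B), drop = TRUE)

# Stack the per-group rows, then sort by A and B as in the desired table.
results <- do.call(rbind, lapply(groups, fit_one))
results <- results[order(results$A, results$B), ]
rownames(results) <- NULL

print(results)
```

Since fit_one returns a data frame rather than a scalar, further columns
(other goodness-of-fit statistics, or coefficients from a second model) can
be added inside it without changing the rest of the sketch.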

I am sure someone has done this before.
Any tips?

Thanks in advance. I will post a summary to the list.
Pedro
======================================================================
Pedro de Barros Tel: +351 89 800918/900/100
Universidade do Algarve Fax: +351 89 818353
Unidade de Ciencias e Tecnologias dos Recursos Aquáticos
E-mail: pbarros@ualg.pt
Campus de Gambelas, 8000 FARO, PORTUGAL
