Re: Statistica anova

William Gould (wgould@stata.com)
Sat, 3 Sep 94 13:52:13 EDT


Duncan Murdoch, in direct communication with me, pointed out two typographical
errors in my previous posting, <QQxfsi26057.199409030611@relay3.UU.NET>,
dealing with the logic behind the main effect test performed by SAS, SPSS,
Stata, etc. "As it is," he says correctly, "it is very confusing."

I wrote:

>The DEFINITION of a main effect of B has to do with prediction IN BALANCED
>DATA, said more succinctly as the absence of knowledge about B.

It should say, "said more succinctly as the absence of knowledge about A".

I also wrote:

>Statistica answers the question as follows: Not knowing A, I note that
>the mean of Y for A=1 and B=1 are the same and therefore the point
>estimate is 0.

It should say "the mean of Y for B=1 and B=2".

Murdoch then continues with a substantive point of his own:

> [...]
> there's one more way to look at this problem. Perhaps the A=2, B=2
> combination is impossible, and that's why we have no observations there.
> (E.g. A is sex, and B is use of oral contraceptives.) In that case, it
> hardly makes sense to take account of the A=2 observations at all when doing
> the comparison. Why should you care about the male experience when you are
> trying to decide whether to take oral contraceptives or not? Here you
> should really compare the 11 cell to the 12 cell. I think this is what you
> end up doing, isn't it?

Answer: Yes, although there is a subtle issue having to do with the
variance of the estimate and hence significance levels.

Here is the output from estimating Y on A and B, and then Y on B among only
the A==1 observations. (I use Stata for obvious reasons of personal bias --
remember, my address is wgould@stata.com -- but any of the other packages will
produce the same results and format them in roughly the same way):

----------------------------------------------------------------------------
. anova y a b

          Number of obs =       6     R-square     =  0.9143
          Root MSE      = .707107     Adj R-square =  0.8571

    Source |  Partial SS    df           MS         F   Prob > F
-----------+----------------------------------------------------
     Model |       16.00     2         8.00     16.00     0.0251
           |
         a |       16.00     1        16.00     32.00     0.0109
         b |        4.00     1         4.00      8.00     0.0663
           |
  Residual |        1.50     3          .50
-----------+----------------------------------------------------
     Total |       17.50     5         3.50

. anova y b if a==1

          Number of obs =       4     R-square     =  0.8000
          Root MSE      = .707107     Adj R-square =  0.7000

    Source |  Partial SS    df           MS         F   Prob > F
-----------+----------------------------------------------------
     Model |        4.00     1         4.00      8.00     0.1056
           |
         b |        4.00     1         4.00      8.00     0.1056
           |
  Residual |        1.00     2          .50
-----------+----------------------------------------------------
     Total |        5.00     3   1.66666667

----------------------------------------------------------------------------

In the first case, I run the model which we have been discussing. In the
second, "anova y b if a==1", I estimate the model of y on B on just the
A=1 observations. Note the b line in each:

                     Source |  Partial SS    df       MS        F   Prob > F
                 -----------+-----------------------------------------------
(from Y on A B)           b |        4.00     1     4.00     8.00     0.0663
(from Y on B if A=1)      b |        4.00     1     4.00     8.00     0.1056

The sums of squares and the F are the same, but the significance levels are
different. As I will explain, the Partial SS and MS must be the same.
It is merely by chance in this example that the F's came out the same -- they
did so because the sample variance of Y|A=1 was the same as for Y|A=2. The
same or not, the significance levels will differ because, mechanically, the
first is evaluated as F(1,3) and the second as F(1,2).
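
One way to check that the Partial SS for b must agree is to redo the
arithmetic by hand. What follows is only a rough sketch in Python, not anything
the packages literally execute. It assumes 2 observations in each A=1 cell,
which is inferred from the 4 observations and the SS of 4.00 in the output and
is not stated in this posting, and contrast_ss is a made-up helper name. With
the A=2, B=2 cell empty, the additive model has three parameters for three
occupied cells, so it reproduces the cell means, and the B contrast rests on
the two A=1 cells in either fit.

```python
def contrast_ss(diff, n1, n2):
    """Sum of squares (1 df) for a difference between two cell means."""
    return diff ** 2 / (1 / n1 + 1 / n2)

# Values read off the output above; the cell sizes are an inference.
diff_b = 2.0          # mean(Y | A=1, B=2) - mean(Y | A=1, B=1)
n11, n12 = 2, 2       # ASSUMED number of observations in each A=1 cell

ss_b = contrast_ss(diff_b, n11, n12)   # the same contrast in both fits

mse_all = 1.50 / 3    # Residual SS / df from "anova y a b"
mse_a1 = 1.00 / 2     # Residual SS / df from "anova y b if a==1"

f_all = ss_b / mse_all
f_a1 = ss_b / mse_a1
print(ss_b, f_all, f_a1)
```

Only the denominators (mse_all and mse_a1) can differ between the two fits;
here they happen to coincide, which is why the F's coincide too.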

Why?

The point estimate of the B effect for "females" is 2 no matter how one thinks
about the problem. The significance of the difference 2 depends on the
variance of Y. If Y naturally varies hugely, even within B, then we would
not be surprised to see a difference of 2 even when the true effect is 0.
If Y does not vary much -- if the 2 is large relative to the background
variance -- then we trust the measurement more.

We do not, however, know the residual variance, so we estimate it. In
"anova y b if a==1", we estimate the variance using only the "female" (A=1)
observations. In "anova y a b" -- using all the data -- we also use the "male"
(A=2) observations to improve the variance estimate (under the assumption that
the variances are the same). In this case, the male observations reinforce the
finding that the variance is "small" -- in fact, the sample variance among
"males" is exactly what we observed among "females", so they confirm that the
variance is exactly what we thought it was when we estimated
"anova y b if a==1".

Our variance estimate did not change but, because it is now based on more
observations, we are more certain of it, and thus more certain of our
measurement of the B effect. The denominator degrees of freedom for
the F account for this fact (if we knew the variance a priori, we would
use a chi-square distribution to evaluate the significance level).
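
The effect of the denominator degrees of freedom can be checked numerically.
The following is a minimal sketch in Python using only the standard library;
the function names are made up for illustration. It uses the fact that an
F(1,d) statistic is the square of a t with d degrees of freedom, integrates the
t density by Simpson's rule, and treats the known-variance case as a
chi-square with 1 degree of freedom.

```python
import math

def t_density(u, df):
    """Density of Student's t with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + u * u / df) ** (-(df + 1) / 2)

def f_tail_1(f_stat, df2, steps=2000):
    """Prob > F for F(1, df2), using F(1, df2) = t(df2)^2 and Simpson's rule."""
    t = math.sqrt(f_stat)
    h = t / steps                      # steps must be even for Simpson's rule
    total = t_density(0.0, df2) + t_density(t, df2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df2)
    return 1 - 2 * (total * h / 3)     # two-sided t tail = upper F tail

def chi2_tail_1(x):
    """Prob > x for a chi-square with 1 df (variance treated as known)."""
    return math.erfc(math.sqrt(x / 2))

p_a1 = f_tail_1(8.0, 2)      # "anova y b if a==1": F(1,2)
p_all = f_tail_1(8.0, 3)     # "anova y a b":       F(1,3)
p_known = chi2_tail_1(8.0)   # limiting case: variance known a priori
print(round(p_a1, 4), round(p_all, 4), round(p_known, 4))
```

The first two values should agree with the Prob > F column for b (0.1056 and
0.0663); the third shows how much smaller the significance level would be if
the variance of .50 were known rather than estimated.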

B did not have to become more significant. Had the "male" observations
exhibited substantial variation, the pooled variance estimate would have been
larger, the F smaller, and the B effect less significant.
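
To make that last point concrete, here is a small sketch in Python of the
pooling arithmetic. The split of the Residual SS into 1.00 (A=1) and 0.50
(A=2) relies on the additive model reproducing the cell means when one cell is
empty; the "noisy" value of 11.00 is made up purely for illustration, and
pooled_f is a hypothetical helper name.

```python
def pooled_f(ss_b, resid_ss_a1, df_a1, resid_ss_a2, df_a2):
    """F for the B effect when residual variance is pooled over A=1 and A=2."""
    mse = (resid_ss_a1 + resid_ss_a2) / (df_a1 + df_a2)
    return ss_b / mse

ss_b = 4.00              # Partial SS for b, from the output above
resid_a1 = 1.00          # Residual SS from "anova y b if a==1" (2 df)
resid_a2 = 1.50 - 1.00   # remainder of the pooled Residual SS (1 df)

f_observed = pooled_f(ss_b, resid_a1, 2, resid_a2, 1)

# HYPOTHETICAL: suppose the A=2 observations had been far noisier.
resid_a2_noisy = 11.00   # made-up value, for illustration only
f_noisy = pooled_f(ss_b, resid_a1, 2, resid_a2_noisy, 1)
print(f_observed, f_noisy)
```

With the observed residuals the F is 8; with the made-up noisy residuals it
drops to 1, and an F of 1 on (1,3) degrees of freedom has a Prob > F of roughly
0.39, well above the 0.1056 obtained from the A=1 observations alone.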

--Bill.
wgould@stata.com