As the author of one of the packages that appears to have it right (Stata),
I would like to agree, but I cannot. My current belief is that STATISTICA
has it different and which is right hinges on what you mean by a main effect.
The argument I make below is akin to how many angels can dance on the head of
a pin, but it is important if you use ANOVA with anybody's statistical
package. The fact that the answer is not obvious to everybody (and I admit
that it was not obvious to me at the outset) is proof that we should all be
careful about using ANOVA.
Let us begin by providing a definition of what we mean by a main effect.
Let's start with two factors, A and B. We observe the following:
     |   A=1       A=2
-----+--------------------------
 B=1 |   x11       x12
 B=2 |   x21       x22
where xij is the mean of the data in the cell. Let mij be the true mean
within the cell (think of m as mu; sorry, no Greek letters). The main effect
of A is defined as the test that
(m11+m21)/2 = (m12+m22)/2 (1)
To wit, in the absence of knowledge of B, does knowledge of A allow us to
better predict the outcome? Similarly, the main effect of B is defined as
a test that
(m11+m12)/2 = (m21+m22)/2 (2)
which means, in the absence of knowledge of A, does knowledge of B allow
us to better predict the outcome?
To be perfectly clear about what a main effect is, consider the following
table of the probability of surviving a bout with one of two diseases
according to the drug administered to you:
         |        Disease
         |   A=1       A=2
---------+--------------------------
Drug B=1 |   1         0
     B=2 |   0         1
If you have disease 1 and are administered drug 1, you live. If you have
disease 2 and are administered drug 2, you live. In all other cases, you
die.
This table has no main effects of either drug or disease, although there is
a whopping interaction effect. Does that mean you are indifferent between
the two drugs in the absence of knowledge about which disease infects you?
Given an equal chance of having either disease, you are, but if you know
disease 1 is 99 times as prevalent as disease 2, you have a strong preference
for drug 1.
That is, the MAIN EFFECT is DEFINED as if the data were balanced even when
they are not.
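The arithmetic behind the drug example can be sketched in a few lines of Python. The function and variable names are invented for illustration. The sketch averages the survival table over diseases twice: once with equal weights, as the main-effect definition does, and once with disease 1 weighted 99 to 1.

```python
# Survival probabilities from the table above: survival[drug][disease].
survival = {1: {1: 1.0, 2: 0.0},
            2: {1: 0.0, 2: 1.0}}

def drug_mean(drug, prevalence):
    """Mean survival for a drug, weighting each disease by its prevalence."""
    total = sum(prevalence.values())
    return sum(survival[drug][dis] * w for dis, w in prevalence.items()) / total

balanced = {1: 1, 2: 1}   # the main-effect definition: equal weights
skewed = {1: 99, 2: 1}    # disease 1 is 99 times as prevalent as disease 2

for label, weights in (("balanced", balanced), ("99:1", skewed)):
    print(label, drug_mean(1, weights), drug_mean(2, weights))
```

With equal weights both drugs average 0.5, so there is no main effect of drug. With the 99:1 weights, drug 1 averages 0.99 and drug 2 averages 0.01.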
Equations (1) and (2) given above are the formal definition of the main
effect in a 2 x 2 factorial model. Estimate the model using plain old
linear regression, work out the means in the table, perform the appropriate
linear hypothesis test, and you will get the same results produced by
formal ANOVA. The math is easy:
Estimate:
y = a*(A=2) + b*(B=2) + d*(A=2)*(B=2) + c + noise
where (A=2) is a dummy equal to 1 if A is 2, (B=2) is a dummy for B=2,
and a, b, d, and c are the estimated regression coefficients.
The table of means is then:
     |   A=1       A=2
-----+--------------------------
 B=1 |   c         a+c
 B=2 |   b+c       a+b+d+c
Equation 1, the main effect of A, is now
(c + b+c)/2 = (a+c + a+b+d+c)/2 (1')
which simplifies to
2a + d = 0 (1'')
Similarly, equation 2, the main effect of B, is
(c + a+c)/2 = (b+c + a+b+d+c)/2 (2')
which simplifies to
2b + d = 0 (2'')
We have done this in the context of a 2 x 2 FACTORIAL model, where the
interaction coefficient d enters both tests. If we constrain the interaction
effect d to be 0, the hypotheses become the natural ones: the test of the
main effect of A is a test that a=0; the test of the main effect of B is
that b=0. The 2 x 2 case, unfortunately, is too simple to show the
difference in the definitions of main effects. This difference, as the
argument develops, is important, so let's take an aside and do a 2 x 3 case.
A 2 x 3 case
------------
Our tableau of theoretical means is:
     |   A=1       A=2           A=3
-----+--------------------------------------
 B=1 |   m11       m12           m13
 B=2 |   m21       m22           m23
The test of the main effect of A is a test that
(m11+m21)/2 = (m12+m22)/2 = (m13+m23)/2
Note, there are two constraints being simultaneously tested. The test of
the main effect of B is a test that:
(m11+m12+m13)/3 = (m21+m22+m23)/3
Let us estimate the linear regression:
Y = a*(A=2) + b*(A=3) + d*(B=2) + e*(A=2)*(B=2) + f*(A=3)*(B=2) + c
The table is then:
     |   A=1       A=2           A=3
-----+--------------------------------------
 B=1 |   c         a+c           b+c
 B=2 |   d+c       a+d+e+c       b+d+f+c
The test of the main effect of A is a test that
(c + d+c)/2 = (a+c + a+d+e+c)/2 = (b+c + b+d+f+c)/2
which simplifies to 2a+e=0 and 2b+f=0. Thus the main effect of A is not
what we might naturally term the effect (main or not) of A, namely that the
A coefficients are 0. In the factorial model, the main effect is NOT a test
that a=0 and b=0; the interaction coefficients enter as well.
The test of the main effect of B is a test that
(c + a+c + b+c)/3 = (d+c + a+d+e+c + b+d+f+c)/3
which simplifies to 3d+e+f=0 (which is not the same as a test that d=0).
If we were now to consider a 2 x 3 model, NOT factorial, we constrain e=f=0,
and the test of the main effect of A becomes a=0 and b=0. The test of
the main effect of B becomes d=0.
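The cell means implied by the 2 x 3 regression can be tabulated mechanically, which gives a check on any hand simplification. The following is a minimal sketch in Python. The function names and the coefficient values are made up for illustration, and exact fractions are used so the comparisons are not blurred by rounding.

```python
from fractions import Fraction

def cell_mean(A, B, a, b, d, e, f, c):
    """Cell mean implied by the dummy-coded regression, A in {1,2,3}, B in {1,2}."""
    m = c
    if A == 2:
        m += a
    if A == 3:
        m += b
    if B == 2:
        m += d
        if A == 2:
            m += e          # interaction (A=2)*(B=2)
        if A == 3:
            m += f          # interaction (A=3)*(B=2)
    return m

def a_level_mean(A, coef):
    """Unweighted mean over B of the cell means at one level of A."""
    return Fraction(sum(cell_mean(A, B, **coef) for B in (1, 2)), 2)

def b_level_mean(B, coef):
    """Unweighted mean over A of the cell means at one level of B."""
    return Fraction(sum(cell_mean(A, B, **coef) for A in (1, 2, 3)), 3)

# Hypothetical integer coefficients, chosen only to exercise the algebra.
coef = dict(a=3, b=-2, d=5, e=7, f=-4, c=10)

diff_A2 = a_level_mean(2, coef) - a_level_mean(1, coef)
diff_A3 = a_level_mean(3, coef) - a_level_mean(1, coef)
diff_B2 = b_level_mean(2, coef) - b_level_mean(1, coef)

print(diff_A2 == coef["a"] + Fraction(coef["e"], 2))
print(diff_A3 == coef["b"] + Fraction(coef["f"], 2))
print(diff_B2 == coef["d"] + Fraction(coef["e"] + coef["f"], 3))
```

All three comparisons print True: the A differences are a + e/2 and b + f/2, and the B difference is d + (e+f)/3. Setting e = f = 0 leaves just a, b, and d.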
The 2 x 3 case with a missing cell
----------------------------------
Let's imagine the data are unbalanced; so unbalanced that we do not observe
the A=2, B=2 cell. Estimating our regression, the 2 x 3 factorial tableau is:
     |   A=1       A=2           A=3
-----+--------------------------------------
 B=1 |   c         a+c           b+c
 B=2 |   d+c       ?             b+d+f+c
What now is the test of the main effect of A (the test of difference in
A means in absence of knowledge of B)? One definition is:
(c + d+c)/2 = a+c = (b+c + b+d+f+c)/2
This is the sophisticated definition, sophisticated in the sense that it is
the one sophisticated packages such as Stata and, it must be noted, SAS, use.
Its sophistication, however, is merely that it follows from another,
different development of main effects based on matrix algebra that disguises
what is really being tested. Other definitions are possible and, thought
about substantively, no doubt equally good.
Now for angels dancing: Why is there a question mark in the table for the
mean of A=2 and B=2?
1) Obvious: you just told me there was no data for the cell, so I
do not know it.
2) Obvious: since there is no data, I cannot estimate the coefficient
e, and since the mean of the cell is a+d+e+c, that becomes a+d+c+? which
becomes ?.
Think there is no difference between the answers? All right, let's consider
the 2x3 model (sans factorial). The hypothesis test for the main effect of
A is either
(c + d+c)/2 = a+c = (b+c + b+d+c)/2
or
(c + d+c)/2 = (a+c + a+d+c)/2 = (b+c + b+d+c)/2
depending on whether you buy into obvious explanation 1 or 2! Under
explanation 1, no data in the cell means you act as if you are ignorant
about it. Under explanation 2, however, you could estimate all the
coefficients necessary to PREDICT the cell's mean, so you use it!
Under obvious explanation 1, you don't know the mean of the cell so you
don't use it. Under obvious explanation 2, you can't estimate e, so you
don't use the cell. But, if you constrain e to be 0, the fact that you
can't estimate it becomes irrelevant!
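The two readings can be compared by writing each hypothesis as weights on the coefficients of the nonfactorial 2 x 3 model. The following is a minimal sketch in Python. The helper names are invented for illustration, and only the A=2 versus A=1 comparison is shown.

```python
from fractions import Fraction

COEFS = ("a", "b", "d", "c")

def cell(A, B):
    """Additive (no interaction) cell mean as weights on (a, b, d, c)."""
    return {"a": int(A == 2), "b": int(A == 3), "d": int(B == 2), "c": 1}

def average(cells):
    """Unweighted average of several cell means, coefficient by coefficient."""
    n = len(cells)
    return {k: Fraction(sum(cl[k] for cl in cells), n) for k in COEFS}

def minus(u, v):
    return {k: u[k] - v[k] for k in COEFS}

missing = {(2, 2)}   # the unobserved A=2, B=2 cell

def level_mean(A, use_predicted):
    """Mean for one level of A; the empty cell is skipped unless use_predicted."""
    cells = [cell(A, B) for B in (1, 2)
             if use_predicted or (A, B) not in missing]
    return average(cells)

for name, use_predicted in (("explanation 1", False), ("explanation 2", True)):
    contrast = minus(level_mean(2, use_predicted), level_mean(1, use_predicted))
    print(name, {k: str(v) for k, v in contrast.items()})
```

Under explanation 1 the contrast is a - d/2, so the B coefficient leaks into the test for A. Under explanation 2 it is simply a.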
Back to the 2 x 2 problem at hand
---------------------------------
So, what is STATISTICA doing? What are the rest of us vendors doing?
Miller provided a 2 x 2 example with the A=2, B=2 cell unobserved:
     |   A=1       A=2
-----+--------------------------
 B=1 |   x11       x12
 B=2 |   x21       ?
She asked about a 2 x 2 nonfactorial layout.
Stata and the others say, in the nonfactorial layout, I can estimate all
the ingredients necessary to PREDICT the A=2 and B=2 mean, and the
test of the main effect of A is a test that
[predicted(x11)+predicted(x21)]/2 = [predicted(x12)+predicted(x22)]/2
Statistica is saying, sure you can, but there's no data for the cell so
I'm not going to use that prediction. My test for the main effect of A
is
[predicted(x11)+predicted(x21)]/2 = predicted(x12)
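The two tests just described can be put side by side numerically. The following is a minimal sketch in Python, assuming the nonfactorial model y = c + a*(A=2) + b*(B=2). With three observed cells and three coefficients, the fit reproduces the observed cell means exactly. The function names and the cell means are hypothetical, and the sketch computes only the two contrasts, not either package's test statistic.

```python
def additive_fit(x11, x12, x21):
    """Fit y = c + a*(A=2) + b*(B=2) exactly to the three observed cell means."""
    c = x11
    a = x12 - x11
    b = x21 - x11
    return a, b, c

def main_effect_contrasts(x11, x12, x21):
    """Return the A contrast with and without the predicted A=2, B=2 cell."""
    a, b, c = additive_fit(x11, x12, x21)
    x22_hat = a + b + c                       # prediction for the empty cell
    a1_mean = (x11 + x21) / 2
    with_prediction = a1_mean - (x12 + x22_hat) / 2
    observed_only = a1_mean - x12
    return with_prediction, observed_only

# Hypothetical cell means, for illustration only.
with_prediction, observed_only = main_effect_contrasts(10.0, 14.0, 16.0)
print(with_prediction)   # equals -a, here -4.0
print(observed_only)     # equals b/2 - a, here -1.0
```

The first contrast depends only on a. The second also depends on b, so the two definitions test different hypotheses whenever the B effect is not zero.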
Conclusion
----------
What do YOU mean by a main effect?
The language of ANOVA, once we move from simple cases, is insufficiently
precise. It leaves room for honest statisticians to disagree as to
what the linear hypothesis ought to be corresponding to the vague words.
Stata, SAS, and the rest have one definition.
Statistica has another.
In defense of Stata and SAS, both will provide the SYMBOLIC FORM of the
test being performed (I don't know about the rest of us). Stata and SAS
will reveal exactly what it is that they are testing so that you can think
about it without having to work through the math. In the real world of
messy data, analysts should examine this report.
--Bill Gould
wgould@stata.com