First, let me state the problem. The data provided is:
>     A   B   Y
>     1   1   1
>     1   1   2
>     1   2   3
>     1   2   4
>     2   1   5
>     2   1   6
An ANOVA model of Y on A and B, but not A*B, is estimated. The default
output computed by SPSS is
>                     Sum of            Mean               Sig
> Source             Squares    DF    Square        F    of F
> Main Effects        16.000     2     8.000   16.000    .025
>   A                 16.000     1    16.000   32.000    .011
>   B                  4.000     1     4.000    8.000    .066
and in this SPSS is not alone -- Stata, a package I authored, produces
the same answer. The puzzle arises when one examines the means of
the data:
----------------------------------------------------------------------------
. tabulate a b, summarize(y) mean
Means of y
| b
a | 1 2 Total
-----------+----------------------+----------
1 | 1.5 3.5 | 2.5
2 | 5.5 . | 5.5
-----------+----------------------+----------
Total | 3.5 3.5 | 3.5
----------------------------------------------------------------------------
Note that the mean(Y|B=1) is equal to mean(Y|B=2). How can the main effect
of B be even a little significant?
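The reported sums of squares and F statistics can be reproduced from first
principles by comparing residual sums of squares of nested regressions.
Here is a minimal numpy sketch of that calculation -- my own illustration,
not the code any of these packages actually run:

```python
import numpy as np

# the six observations from the posting
A = np.array([1, 1, 1, 1, 2, 2])
B = np.array([1, 1, 2, 2, 1, 1])
Y = np.array([1, 2, 3, 4, 5, 6], dtype=float)

def rss(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

ones = np.ones_like(Y)
a2 = (A == 2).astype(float)   # dummy for A=2
b2 = (B == 2).astype(float)   # dummy for B=2

full = np.column_stack([ones, a2, b2])
rss_full = rss(full, Y)       # residual SS of the full model, df = 3
mse = rss_full / 3

# partial SS: increase in residual SS when each term is dropped
ss_A = rss(np.column_stack([ones, b2]), Y) - rss_full
ss_B = rss(np.column_stack([ones, a2]), Y) - rss_full

print(ss_A, ss_A / mse)   # SS and F for A
print(ss_B, ss_B / mse)   # SS and F for B
```

This recovers SS(A)=16 with F=32 and SS(B)=4 with F=8, the figures in the
table above.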
In a previous posting, I attempted to explain how the calculation is made
by Statistica and by others. Let me use the notation Yij to refer to Y
for a=i and b=j. To summarize my previous discussion, in this special case,
Stata, SAS, SPSS, and others say the test of B is a test that
[predicted(Y11) + predicted(Y21)]/2 = [predicted(Y12) + predicted(Y22)]/2
whereas followers of the "cell means" approach, of which Statistica is one,
say the test of B is a test that
[predicted(Y11) + predicted(Y21)]/2 = predicted(Y12)
I also attempted to summarize the logic behind each approach and argued
that the difference hinged on something akin to how many angels can dance
on the head of a pin.
I would now like to attempt to explain the approach used by SAS, SPSS, Stata,
etc., although I feel again that we are talking about angels. Here's why:
The DEFINITION of a main effect of B has to do with prediction IN BALANCED
DATA -- said more succinctly, with prediction in the absence of knowledge
about A. This feature of the main effect is a far more important determinant
of the misinterpretation of ANOVA output than what we are now discussing.
It happens, however,
that in simple cases, this feature of the main effect falls away and the
definition of the main effect merely happens to correspond to what one might
term the effect of B. Thus, this discussion is of interest, but is subject
to misinterpretation. In more complicated ANOVA models, it is the balanced-
data feature that will most likely bite users.
In any case, let us continue. I provide a mathematical appendix below,
but this problem is more revealingly talked through.
The problem states that a 2x2 nonfactorial model is fit to the data.
It is the nonfactorial part that is the key. We know:
Means of y
| b
a | 1 2 Total
-----------+----------------------+----------
1 | 1.5 3.5 | 2.5
2 | 5.5 . | 5.5
-----------+----------------------+----------
Total | 3.5 3.5 | 3.5
To formulate the hypothesis, think about point estimates. We are being
asked: What is our point estimate of the effect of B in the absence
of knowledge of A?
Statistica answers the question as follows: Not knowing A, I note that
the mean of Y for B=1 and the mean of Y for B=2 are the same, and therefore
the point estimate is 0.
SAS, SPSS, Stata, etc., would answer: You said a nonfactorial model? Well,
in that case, you have told me that the science of the problem is such that
A and B do not interact. I note that for A=1, the difference is
3.5-1.5 = 2, and given no interaction, I would expect the same difference
for A=2 if only I had observed it, so my point estimate of the B effect,
even in the absence of knowledge of A, is 2.
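Both point estimates are easy to compute directly. A small numpy check
(again my own illustration of the arithmetic, not anyone's actual code):

```python
import numpy as np

A = np.array([1, 1, 1, 1, 2, 2])
B = np.array([1, 1, 2, 2, 1, 1])
Y = np.array([1, 2, 3, 4, 5, 6], dtype=float)

# Statistica-style estimate: difference of the marginal means of Y by B
est_marginal = Y[B == 2].mean() - Y[B == 1].mean()   # 3.5 - 3.5 = 0

# additive-model estimate: coefficient b in Y = c + a*(A=2) + b*(B=2)
X = np.column_stack([np.ones_like(Y),
                     (A == 2).astype(float),
                     (B == 2).astype(float)])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
est_model = beta[2]                                  # 3.5 - 1.5 = 2

print(est_marginal, est_model)
```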
Which argument do you find more appealing? To decide, think of the
following: You are a member of a group with a horrible disease and Y
is the expected number of years until your death. A is a characteristic
you have, one that can only be determined by autopsy. 50% of people
turn out to be Type A=1 and 50% Type A=2. B is a drug. A and B are
known, a priori, to not interact, because A has to do with a characteristic
of your heart and B affects your liver. With this disease, it is merely
a question of which goes first.
Statistica says: you should be indifferent whether you take the
drug; the point estimate is 0. Taking all people together, there is
no difference in average length of remaining life.
We say: You should want to take it, the point estimate is positive. Among
people with Type A=1 hearts, it appears liver treatment B=2 prolongs life.
We would expect liver treatment B=2 to work the same even among those with
Type A=2 hearts, but we do not know this yet.
To wit: I find the argument convincing that, under the assumptions of the
model, the SAS, SPSS, Stata, ..., answer is preferred.
The more embarrassing case for SAS, SPSS, Stata, ..., is when the question is
asked about this model of Y on A, B, and A*B -- when there is an interaction
between A and B. In that case, at a substantive level, I think one would have
to throw up one's hands and say, I have no idea about the main effect of B.
Unfortunately, the formulas we (Stata) have implemented to make the language
of ANOVA operational are based on orthogonalization of upper-Hermite forms.
To make a long story short, in a 2x2 table with a missing cell, there are
simply no more parameters to be fit; we orthogonalize with respect to a zero
matrix and produce the same results for the main effects as when the
interaction is not included! (Despite this, the formulas used have much to
recommend them.)
As I rethink these issues now, I am struck by two thoughts:
1) Someone on this list reported that Systat will not estimate models
with missing cells. I've heard this before and, before, I have
always thought, "How very inelegant of them." I am now more
sympathetic to that position.
2) I am more sympathetic, but there are instances in which handling
missing cells the way "we" do is helpful. There is, however,
a danger. Perhaps if any of the requested effects are estimated
to be 0, we should either refuse to report the results -- saying
the model is not estimable -- or report it with a warning.
A mathematical appendix follows and has a slightly different take on this
whole issue.
-- Bill
wgould@stata.com
Mathematical appendix
---------------------
We can most easily understand the issues by writing the underlying regression
model for the 2 x 2, noninteracted table:
Y = a*(A=2) + b*(B=2) + c + residual
The table of predicted means is:
| B=1 B=2
-----+--------------------------
A=1 | c b+c
A=2 | a+c a+b+c
One test of the main effect of B, the one reported by SAS, SPSS, Stata,
et al., as I explained previously is
(c + a+c)/2 = (b+c + a+b+c)/2 (1)
or, simplifying:
b = 0 (2)
Another test of the main effect, the one reported by Statistica, is:
(c + a+c)/2 = b+c (1')
The argument for (1') is that the A=2, B=2 mean is unobserved. However,
(1') can be rewritten:
(c + a+c)/2 = (b+c + b+c)/2
which places it in the form of (1). In this form, it becomes clear that
the cell-means model is equivalent to assuming that the unobserved mean
for A=2, B=2 is the same as the mean for A=1, B=2. In effect, we see
the table:
               B                                         B
        |  1    2                                 |  1    2
     ---+-----------   but act as if we see    ---+-----------
  A   1 | 1.5  3.5                          A   1 | 1.5  3.5
      2 | 5.5   ?                               2 | 5.5  3.5
Clearly, we have biased the test to underreport the effect of B under the
assumptions given.
This is not the interpretation proponents of the cell-mean model would want
us to take. They would merely say the test is,
(c + a+c)/2 = b+c (1')
or, simplifying:
a = 2b (2')
They would say my recasting of (1') into my formulation may make it sound
absurd, but the test was never intended to be a test of coefficient b -- it
is a test of the main effect of B and, as stated, valid.
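Numerically, the two hypotheses really are different statements about the
same fitted coefficients: with these data the estimates are c=1.5, a=4,
b=2, so b is nonzero while a - 2b is exactly zero. A quick numpy check
(my illustration):

```python
import numpy as np

# fit Y = c + a*(A=2) + b*(B=2) to the six observations
A = np.array([1, 1, 1, 1, 2, 2])
B = np.array([1, 1, 2, 2, 1, 1])
Y = np.array([1, 2, 3, 4, 5, 6], dtype=float)
X = np.column_stack([np.ones_like(Y),
                     (A == 2).astype(float),
                     (B == 2).astype(float)])
(c, a, b), *_ = np.linalg.lstsq(X, Y, rcond=None)

print(b)          # tested by (2): nonzero, so SAS/SPSS/Stata reject
print(a - 2 * b)  # tested by (2'): exactly zero, so Statistica reports no effect
```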
I personally do not find this response convincing, but others may disagree.
--Bill
wgould@stata.com