Re: Statistica anova

Robert Terry (rterry@acpub.duke.edu)
Tue, 6 Sep 94 09:05:52 EDT


On Mon, 5 Sep 1994, James H Steiger wrote:

> Date: Mon, 5 Sep 94 21:42:25 EDT
> From: James H Steiger <steiger@unixg.ubc.ca>
> To: Multiple recipients of list <edstat-l@jse.stat.ncsu.edu>
> Subject: Re: Statistica anova
>
> Maria Czyz, following a thread initiated by Paige Miller,
> had given the following interesting example of a 2x2 ANOVA
> with a missing cell:
>
> Czyz had written
>
> >try analyzing these data:
> > A B Y
> > 1 1 1
> > 1 1 2
> > 1 2 3
> > 1 2 4
> > 2 1 5
> > 2 1 6
> >The default results computed by SPSS are:
> >                    Sum of             Mean              Sig
> >Source             Squares    DF     Square        F    of F
> >Main Effects        16.000     2      8.000   16.000    .025
> >  A                 16.000     1     16.000   32.000    .011
> >  B                  4.000     1      4.000    8.000    .066
> >
> >Is there really a marginally significant B main
> >effect? Here are the cell means and marginal means:
> >
> >                Factor B
> >Factor A        1        2        Marg.Means
> >           --------------------
> >1        |   1.5      3.5    |     2.5
> >2        |   5.5    missing  |     5.5
> >           --------------------
> >             3.5      3.5
> >
>
> Statistica's default hypothesis simply ignores B (by treating all
> cells equally and averaging their means) when computing a "main
> effect" for A. We could call the Statistica type of hypothesis a "full
> data hypothesis," just for the sake of simplicity in subsequent
> discussion.
>
> If we can move away from the more petty aspects of this discussion
> (i.e., an attempt to score points for/against one's favorite
> statistical package), I think we might discover an interesting
> substantive issue here.
>
> In many cases, I would find myself agreeing with Paige Miller that it
> makes more sense to compare factor A only for levels of B that are
> available.
>
I think Steiger (along with Gould) has finally elevated the
discussion to one which has educational value.
As I have stated before (as has Gould, more eloquently), the proper
analysis in this situation DEPENDS entirely upon what you believe about
the POPULATION INTERACTION between factors A and B. Under an assumption
of NO INTERACTION, STATISTICA's test is biased; under an assumption
of a particular interaction (e.g., the missing cell has a value of 3.5 in
the population), STATISTICA's test is more efficient.
Now, my problem with this is that if I really believe that 3.5 serves
as the POPULATION value for the missing cell, this indicates that an
INTERACTION actually does exist, and any MAIN EFFECT is quite
misleading, in the sense that such an effect is NOT CONSTANT over levels
of the other factor. On these grounds, I find both tests somewhat
suspect: one presumes no interaction, while the other essentially
assumes an interaction, which then leads to an ambiguous interpretation
of the MAIN effect.
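
To make the dependence on the assumed missing-cell value concrete, here is a minimal sketch using the six observations from Czyz's example. The function names and the choice of Python are illustrative assumptions of this sketch; it reproduces no package's actual algorithm, only the arithmetic on the cell means.

```python
from collections import defaultdict

# Data from Czyz's example quoted above: (A, B, Y)
data = [(1, 1, 1), (1, 1, 2), (1, 2, 3), (1, 2, 4), (2, 1, 5), (2, 1, 6)]

# Collect observations per cell and compute the observed cell means.
cells = defaultdict(list)
for a, b, y in data:
    cells[(a, b)].append(y)
means = {cell: sum(ys) / len(ys) for cell, ys in cells.items()}

# Under NO INTERACTION the model is additive, so the missing cell mean
# is implied by the other three: mu22 = mu21 + mu12 - mu11.
implied_mu22 = means[(2, 1)] + means[(1, 2)] - means[(1, 1)]


def interaction_contrast(mu22):
    """(mu22 - mu21) - (mu12 - mu11); zero means no interaction."""
    return (mu22 - means[(2, 1)]) - (means[(1, 2)] - means[(1, 1)])


def simple_effects_of_a(mu22):
    """Effect of A (level 2 minus level 1) at B=1 and at B=2."""
    return means[(2, 1)] - means[(1, 1)], mu22 - means[(1, 2)]


def unweighted_a_effect(mu22):
    """Difference of unweighted row means, given an assumed mu22."""
    row1 = (means[(1, 1)] + means[(1, 2)]) / 2
    row2 = (means[(2, 1)] + mu22) / 2
    return row2 - row1


# Compare the additive value with the hypothetical value of 3.5.
for label, mu22 in [("additive", implied_mu22), ("assumed 3.5", 3.5)]:
    print(label, mu22, interaction_contrast(mu22),
          simple_effects_of_a(mu22), unweighted_a_effect(mu22))
```

With the additive value for the missing cell, the effect of A is the same at both levels of B and the interaction contrast is zero. With 3.5 plugged in, the effect of A is present at B=1 and vanishes at B=2, so a single "main effect" number averages over two different simple effects.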
Note that this problem may also arise in factorial designs with
COMPLETE data. SAS's type II SS for MAIN effects of A and B are both
predicated on models which assume NO INTERACTION in the POPULATION. Of
course, with complete data, one can somewhat "gauge" the validity of this
assumption by "testing" the interaction with a sufficiently high alpha
level. In situations where one cannot evaluate the underlying
assumptions, such models are incomplete from the standpoint of being able
to test assumptions. This is generally true of all missing data designs
with which I am familiar.
I must admit that I prefer SAS's approach which is to give the results
for many different tests under a variety of assumptions. It is then
incumbent upon the user to know what is being tested and what assumptions
are being made. Far too often, I have heard from people using STATISTICA
that they like the package BECAUSE they really don't have to know very
much (READ - THINK).
As far as pragmatics are concerned, I don't have a good answer to
this question. Although we would all prefer that users of any package
not depend upon defaults, this is unrealistic in view of the amount of
training that otherwise bright individuals will receive in statistics.
Perhaps Herman's analytic strategy may be of use here: we need to know all
possible states of the world (users' general knowledge) and the
probabilities that they would take a variety of actions, the consequences
and costs of all those actions, and determine whether the greatest good
is obtained by using one or more defaults over reasoned statistical
thinking.
Maybe a statistical approach would suggest that our "best" bet is to
take a decidedly "cookbook" approach to data analysis :-).
I may be out of a job ....

Robert Terry
Dept. of Psychology
Duke University