As I understand Al Best's reply to the query by Jinping Shi
concerning interaction terms in logistic regression, Best regards
Shi's difficulty as stemming from the use of Wald tests, rather
than likelihood-ratio (Wilks) tests, and Best resolves the problem
in favor of the additive model by stating that "there is no evidence
from the Lack-of-Fit test that you need more terms." He favors
the use of software such as JMP, which
"does real logistic regression and can test effects
using the likelihood ratio tests."
I see four different flaws in this answer.
(1) The deep problem underlying Shi's query derives from the fact
that his design is unbalanced, in a way that makes questions
about additivity very hard to answer without much larger
sample size; yet Best's answer does not address this problem
at all.
(2) The conclusion that Best seems to favor is completely unwarranted
on the basis of Shi's data. In general, failure to reject the
additive model, in a low-power test, is NOT sound evidence for
lack of interaction; and in particular, there is a 3-parameter
interactive model that fits these data even better than the
(also 3-parameter) additive model--namely, large treatment effect
for gender=1, with ZERO treatment effect for gender=0, plus a
gender difference, for treatment=0. The plausibility of such
an interactive alternative is an open question--one that has to
be considered in the detailed context of the scientific problem
(which Shi has not yet supplied, though I requested it). More
generally, with such low power, data consistent with additivity
are usually consistent also with models containing large and
important interactions.
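To make this indeterminacy concrete, here is a small sketch in Python. The cell counts are made up for illustration (Shi's actual counts were not supplied); the point does not depend on them. With one cell empty, the additive model and the interactive model just described each have 3 parameters and only 3 observed proportions to fit, so both fit perfectly and have identical maximized likelihoods; they disagree only about the unobserved cell.

```python
import math

def logit(p): return math.log(p / (1 - p))
def inv_logit(x): return 1 / (1 + math.exp(-x))

def loglik(cells, probs):
    # grouped binomial log-likelihood (binomial constants dropped)
    return sum(y * math.log(p) + (n - y) * math.log(1 - p)
               for (y, n), p in zip(cells, probs))

# Hypothetical (successes, n) for the three OBSERVED cells, in the order
# (gender=0, trt=0), (gender=1, trt=0), (gender=1, trt=1).
# The fourth cell (gender=0, trt=1) is empty, as in Shi's design.
cells = [(8, 20), (10, 40), (25, 40)]
obs_logits = [logit(y / n) for y, n in cells]

# Additive model: logit = b0 + bg*gender + bt*trt (3 parameters)
b0 = obs_logits[0]
bg = obs_logits[1] - b0
bt = obs_logits[2] - b0 - bg
additive_probs = [inv_logit(b0), inv_logit(b0 + bg), inv_logit(b0 + bg + bt)]

# Interactive model: treatment effect ONLY when gender=1 (also 3 parameters)
a0 = obs_logits[0]
ag = obs_logits[1] - a0
ai = obs_logits[2] - a0 - ag
interactive_probs = [inv_logit(a0), inv_logit(a0 + ag), inv_logit(a0 + ag + ai)]

# Both models reproduce the 3 observed proportions exactly, so their
# maximized log-likelihoods are identical -- the data cannot separate them.
print(loglik(cells, additive_probs), loglik(cells, interactive_probs))

# They differ only in what they predict for the EMPTY cell (gender=0, trt=1):
print(inv_logit(b0 + bt))  # additive: the treatment effect carries over
print(inv_logit(a0))       # interactive: zero treatment effect for gender=0
```

Any goodness-of-fit comparison on the observed cells alone is therefore uninformative about the interaction.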
(3) The issue has nothing to do with Wald versus LR tests, which
(typically, and in this instance) give very similar results.
There is nothing "unreal" about Shi's logistic regressions--
though as I pointed out in my earlier comment, the main
points can be seen most clearly just by considering the
observed differences in proportions and their sampling
errors. Raising the question of Wald tests vs. Wilks tests
places an irrelevant technical question in the limelight.
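For readers who want to check the "very similar results" claim numerically, here is a sketch with invented 2x2 counts (again, not Shi's data): the Wald chi-square for a log odds ratio and the corresponding likelihood-ratio (Wilks) chi-square typically land close together at moderate sample sizes.

```python
import math

# Hypothetical counts: treated 25/40 successes, control 18/40.
y1, n1 = 25, 40   # treated
y0, n0 = 18, 40   # control

def ll(y, n, p):
    # binomial log-likelihood (constants dropped)
    return y * math.log(p) + (n - y) * math.log(1 - p)

# Wald: (log OR / SE)^2, SE from the usual 1/a + 1/b + 1/c + 1/d formula
log_or = math.log((y1 * (n0 - y0)) / (y0 * (n1 - y1)))
se = math.sqrt(1/y1 + 1/(n1 - y1) + 1/y0 + 1/(n0 - y0))
wald = (log_or / se) ** 2

# LR (Wilks): 2 * (log-lik with separate proportions - log-lik pooled)
p_pool = (y1 + y0) / (n1 + n0)
lr = 2 * (ll(y1, n1, y1/n1) + ll(y0, n0, y0/n0)
          - ll(y1, n1, p_pool) - ll(y0, n0, p_pool))

print(round(wald, 2), round(lr, 2))  # the two statistics are close
```

Neither statistic rescues an underpowered design; the choice between them is a second-order matter here.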
(4) The key feature of the JMP software illustrated by Best's output
is that comparison of models is made explicit. I do agree
with Best that model comparison is generally useful and that
LR tests make this (slightly) easier--because of the additivity
of log LR within a model hierarchy. In the present case,
however, the model comparisons led Best to an unjustified
conclusion (that the additive model is good enough), while the
software Shi used originally suggested (correctly) that the
true model is not well determined by the data. Boosting JMP
seems unwarranted on the basis of the example.
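The additivity of log LR that makes model comparison convenient can be seen in a toy computation. The counts below are made up, and (unlike Shi's design) all four cells are filled so that a clean nested hierarchy exists: null (one probability), gender-only, saturated. The LR statistics for the two steps sum exactly to the LR statistic for the overall comparison.

```python
import math

def ll(y, n, p):
    # grouped binomial log-likelihood (constants dropped)
    return y * math.log(p) + (n - y) * math.log(1 - p)

# Hypothetical (successes, n) in a complete 2x2, keyed by (gender, trt)
cells = {(0, 0): (8, 20), (0, 1): (12, 20), (1, 0): (10, 40), (1, 1): (25, 40)}

def fit_ll(grouping):
    # maximized log-lik when cells sharing a group label share one probability
    groups = {}
    for key, (y, n) in cells.items():
        g = grouping(key)
        yy, nn = groups.get(g, (0, 0))
        groups[g] = (yy + y, nn + n)
    pooled = {g: y / n for g, (y, n) in groups.items()}
    return sum(ll(y, n, pooled[grouping(k)]) for k, (y, n) in cells.items())

ll_null   = fit_ll(lambda k: 0)     # one shared probability
ll_gender = fit_ll(lambda k: k[0])  # probability depends on gender only
ll_sat    = fit_ll(lambda k: k)     # saturated: one probability per cell

lr1 = 2 * (ll_gender - ll_null)     # null -> gender
lr2 = 2 * (ll_sat - ll_gender)      # gender -> saturated
lr_total = 2 * (ll_sat - ll_null)   # null -> saturated

# Log-LR statistics add exactly along the nested hierarchy:
print(lr1 + lr2, lr_total)
```

This convenience is real, but it is bookkeeping; it does not by itself license the conclusion that the smaller model is adequate.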
----------------------------------------------------------------
While I am busy being negative, let me also disagree sharply with
the summary statement Jinping Shi offered in response to a variety
of inputs from the list:
"I guess the so-called regression techniques developed
in the last few decades are not the right tools given the
fact that it can be so confusing and so unreliable."
On the face of things, the problem is not with statistical techniques;
it is with the question being asked and the design of the experiment.
No method of analyzing data will answer an additivity question
for a 2 x 2 factorial design when the sample size in one of
the 4 cells is zero; and in the present context, N=20 is almost
the same as N=0.
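A back-of-the-envelope calculation shows why. In the best case (cell proportions near 0.5), the interaction in a 2x2 design is a contrast of four cell logits, each estimated with variance roughly 1/(n*p*(1-p)); with n = 20 per cell the resulting standard error is so large that even huge interactions are compatible with the data.

```python
import math

# Rough sketch with assumed numbers (not Shi's): precision of an
# interaction estimate on the log-odds scale with n = 20 per cell.
n, p = 20, 0.5                   # p = 0.5 maximizes information per subject
var_logit = 1 / (n * p * (1 - p))          # variance of one cell's logit
se_interaction = math.sqrt(4 * var_logit)  # interaction = contrast of 4 logits
half_width = 1.96 * se_interaction         # 95% CI half-width

print(round(se_interaction, 2))        # about 0.89 on the log-odds scale
print(round(math.exp(half_width), 1))  # CI covers odds ratios out past 5
```

An interval that cannot exclude interaction odds ratios of 5 or more answers no additivity question.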
----------------------------------------------------------------
In a deeper sense, relevant to this list, there IS a problem concerning
statistical techniques: DO SOME TECHNIQUES FACILITATE GOOD EDUCATION
AND AVERT MISUNDERSTANDING MORE THAN OTHER (MATHEMATICALLY EQUIVALENT)
TECHNIQUES? Twenty-four years ago, at the end of my first year of serious
teaching of statistics, I concluded that the answer is "yes, definitely."
Right now, my answer is "probably", but discussions like the one
initiated by Shi's question make me think that any powerful set of
data-analytic methods will, by the very reason of their power, be hard
to teach and easy to misunderstand.
Dave Krantz (dhk@columbia.edu)