Re: [S] more powerful model without intercept?

Alan Zaslavsky (zaslavsk@hcp.med.harvard.edu)
Wed, 18 Mar 1998 13:56:35 -0500 (EST)


> From: Lutz Prechelt <prechelt@ira.uka.de>
>
> > summary(fit _ lm(time ~ expect+AP+sess+prg+subtask+langsN+oomethN+
> log(cpp+1), data=d))$r.squared
> [1] 0.5015439
> > summary(fit2_ lm(time ~ expect+AP+sess+prg+subtask+langsN+oomethN+
> log(cpp+1)-1, data=d))$r.squared
> [1] 0.8114594
>
> The only difference between the models is that the second one
> must do WITHOUT an intercept.
> How can it possibly be (so much) better?

Right, these are identical models, but differently coded. When you
have a factor in the model but no intercept, S codes the first factor
in the model with one parameter per level instead of dropping a level,
since otherwise there would be some arbitrary constraint on the mean at
one of the levels. That's why you see the same number of parameters (no
intercept, but one extra parameter for the first factor) in the two models.
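
One way to see the two codings side by side is to look at the design
matrices. This is only a sketch in R syntax (which I assume carries over
to S); the data frame and the level names "lo" and "hi" are made up for
illustration.

```r
# Hypothetical toy data: a response and a two-level factor.
set.seed(1)
d <- data.frame(y = rnorm(20), a = factor(rep(c("lo", "hi"), 10)))

# With an intercept: one intercept column plus one contrast column.
X.int <- model.matrix(y ~ a, data = d)

# Without an intercept: one indicator column per level of the factor.
X.noint <- model.matrix(y ~ a - 1, data = d)

colnames(X.int)
colnames(X.noint)

# Both codings use the same number of parameters.
ncol(X.int) == ncol(X.noint)
```

The indicator columns of the no-intercept coding add up to a column of
ones, so the intercept is still in the column space even though it is
not an explicit term.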

Look at the following four models:

d <- data.frame(y=rnorm(20), x=rnorm(20), a=factor(rnorm(20)>0))
attach(d)
mod1 <- lm(y~x)
mod2 <- lm(y~x-1)
mod3 <- lm(y~a)
mod4 <- lm(y~a-1)

If you fit these, you will find that the first two give different fits:
there is no factor, so dropping the intercept really does remove a
parameter. The last two give the same fitted values, because you get a
two-parameter coding of factor "a" either way. The strange thing is that
the R^2 values for models 3 and 4 differ.
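
A small check of that claim, again sketched in R syntax with made-up
data. I have shifted the response away from zero (the "3 +" is my own
assumption, not part of the example above) so that the gap between the
two R^2 values is easy to see.

```r
# Hypothetical data: response with a nonzero mean, two-level factor.
set.seed(1)
d <- data.frame(a = factor(rep(c("lo", "hi"), 10)))
d$y <- 3 + rnorm(20)

mod3 <- lm(y ~ a, data = d)       # intercept + one contrast
mod4 <- lm(y ~ a - 1, data = d)   # one parameter per level

# The fitted values agree, so the two fits are the same model.
same.fit <- isTRUE(all.equal(fitted(mod3), fitted(mod4)))
same.fit

# The reported R^2 values nevertheless differ.
r3 <- summary(mod3)$r.squared
r4 <- summary(mod4)$r.squared
c(with.intercept = r3, without.intercept = r4)
```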

The source of the "problem" appears to be in these lines excerpted from
summary.lm:

if (int) { <snip>
    # intercept and no weights
    mn <- mean(fv)   # same as mean(yy)
    r2 <- sum((fv - mn)^2)/sum((yy - mn)^2)
} <snip> else {
    # no intercept
    r2 <- sum((fv)^2)/sum((yy)^2)
<snip>

"int" is determined by whether there is an explicit intercept in the
model, so it is false for mod2 and mod4. In most models, like mod1 and
mod3, the null model is the one with only an intercept, so you would
pull out that intercept (subtract the mean) to calculate the Total Sum
of Squares (TSS) for the denominator of R^2.

In a model like mod2, the null model is the model with no
coefficients at all (all predictions are 0), so the (uncorrected) sum
of squares is the relevant TSS for calculating R^2.
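
The two definitions can be reproduced by hand. The sketch below uses R
syntax and made-up data; the helper names r2.centered and r2.uncentered
are my own and do not come from summary.lm.

```r
# Hypothetical data with a nonzero mean in the response.
set.seed(2)
d <- data.frame(x = rnorm(20))
d$y <- 3 + 0.5 * d$x + rnorm(20)

mod1 <- lm(y ~ x, data = d)       # null model: intercept only
mod2 <- lm(y ~ x - 1, data = d)   # null model: all predictions are 0

# R^2 against the intercept-only null: sums of squares about the mean.
r2.centered <- function(fit, y) {
  fv <- fitted(fit)
  mn <- mean(y)
  sum((fv - mn)^2) / sum((y - mn)^2)
}

# R^2 against the all-zero null: uncorrected sums of squares.
r2.uncentered <- function(fit, y) {
  sum(fitted(fit)^2) / sum(y^2)
}

# Each hand computation should agree with what summary() reports.
all.equal(r2.centered(mod1, d$y), summary(mod1)$r.squared)
all.equal(r2.uncentered(mod2, d$y), summary(mod2)$r.squared)
```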

For mod4, it's not clear what the appropriate null model is. The
behavior of S is consistent with the interpretation that your null
model is the same as for mod2, i.e. you really don't want any
intercept hanging about. But if you just put in -1 to change
the factor coding, you might have meant that the null model still has
an intercept. Then you will be puzzled when you see a different R^2
from mod3.
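
If the intercept-only null model is what you had in mind, one option is
to fit that null model explicitly and compare residual sums of squares.
This is a sketch in R syntax with made-up data; the names null.int and
r2.vs.intercept are hypothetical, and I am assuming deviance() returns
the residual sum of squares for an lm fit, as it does in R.

```r
# Hypothetical data: response with a nonzero mean, two-level factor.
set.seed(3)
d <- data.frame(a = factor(rep(c("lo", "hi"), 10)))
d$y <- 3 + rnorm(20)

mod3 <- lm(y ~ a, data = d)
mod4 <- lm(y ~ a - 1, data = d)

# Explicit null model containing only an intercept.
null.int <- lm(y ~ 1, data = d)

# R^2 of mod4 measured against the intercept-only null:
# 1 - RSS(model) / RSS(null).
r2.vs.intercept <- 1 - deviance(mod4) / deviance(null.int)

# This should agree with the R^2 that summary() reports for mod3.
all.equal(r2.vs.intercept, summary(mod3)$r.squared)
```

Stating the null model in the code this way makes the question you are
asking of the fit explicit, whatever the factor coding.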

I would call this behavior a definite "feature" rather than a bug, but
you have to be pretty careful about asking the right question of your
summary!

Alan Zaslavsky
zaslavsk@hcp.med.harvard.edu