Re: [S] Problem with predict()

Bill Venables (wvenable@attunga.stats.adelaide.edu.au)
Mon, 25 May 1998 17:57:48 +0930


Terry M. Therneau writes:
>
> Thomas Lumley has kindly pointed out a bug in the "prediction"
> part of survexp for coxph models, when there is a factor variable
> for which the new data contains a subset of the old data's
> levels. I thought I'd make it work correctly, i.e., like
> predict.lm. That was before I looked at what predict.lm does.

...

> Why doesn't the prediction make use of the contrast matrix? The
> contrasts matrix is referenced in the predict.lm call to
> model.matrix() that builds a new X matrix, but it appears that
> the internal C code pays NO attention to the row labels of the
> contrasts. It makes the assumption that the new data frame has
> the same number of levels, in the same order, for every factor in
> the original data frame. It does read the columns, which only
> fixes the problem of a change in the global 'contrasts' option
> between the time of the lm() and predict() calls.
>
> This is a serious bug.

Indeed, a some of us have been harping on about it for years, but
to no obvious effect. (You touch a raw nerve here, Terry...)

The standard, if obscure, advice is "if in doubt, use predict.gam,
explicitly". This is in the blue book, no less. To see why this
is a safe procedure you need to look at the protocol adopted by
predict.gam. A slightly simplified version is as follows:

1. Put together the original model frame and new data frame into a
single data frame of (n.old + n.new) rows. (This ensures, that
factor levels are all complete, among other things.)

2. Re-fit the model using all data to generate the model matrix,
but only the rows corresponding to the original data (the rows
with a response present, in other words) to do the estimation,

3. Predict for all cases, new and old,

4. Check that the predictions for the original data agree with the
original predictions sufficiently well (this is more an issue
for smoothing splines and loess terms where it will not be
exact), If not, issue a warning.

5. The predictions for the new cases only are returned as the
value.

This is a potentially lengthy and cumbersome procedure and most of
the predict methods do not do it for simple efficiency's sake. So
if you use anything where the form of the model matrix is not
strictly calculable row by row (terms involving poly(), bs(),
ns(), factors with incomplete levels, and scores of others) using
predict() will probably get you WRONG answers without warning.

In my view the correct procedure is always to provide a SAFE
technique as the default and if there is a quick risky alternative
you provide that as an alternative for the desparate or the
foolhardy. You should not provide a risky technique as the
default, especially one that fails without any noise or smoke.

When I first raised this with friends in Seattle the response I
got was "Eh? You want to make the modelling software *slower*?"
and after the laughter subsided the talk went to another subject.
Well, my answer is "Yes, if that's what it takes to make it safe".
No qualifications, no apologies. (I am not convinced that
adequate safety necessarily requires all the steps involved in
predict.gam every time, either.)

> Predict.coxph and survexp.coxph are slightly more serious -- I
> forgot to pass along the contrasts matrix. In fixing this, I now
> have to decide whether to propogate the global S mistake. I
> favor not, but this will make coxph not consistent with the rest
> of the package.
>
> Opinions?

Unequivocally you should go against the global S mistake and do
the sensible, safe, correct thing.

Bill
-----------------------------------------------------------------------
This message was distributed by s-news@wubios.wustl.edu. To unsubscribe
send e-mail to s-news-request@wubios.wustl.edu with the BODY of the
message: unsubscribe s-news