Prof Brian D Ripley (ripley@stats.ox.ac.uk)
Wed, 17 Jun 1998 18:34:33 +0100 (BST)

MathSoft have introduced a potentially serious bug in 4.5 Professional (and I
presume Standard) that was not in the beta release. This affects many uses of
predict with data frames containing factors: for example, I get incorrect
results on my test examples with predict.tree, including the shuttle data
example in V&R2 p.424.

The problem is a new argument drop.unused.levels in model.frame.default, which
defaults to T. (Despite the help page, it is the fourth argument, not the
third.) The explanation says that without this there may be singular model
matrices. However, forming the model matrix is the job of model.matrix, not
model.frame, and singularity is only important when fitting, not predicting.

There has been considerable discussion here of the need, for safe prediction,
to have the same sets of levels in the original data and in newdata.
The new default drop.unused.levels=T makes this impossible unless the original
data and newdata both have the same set of used levels. So it has made an
existing nuisance into a man trap.
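
The recoding effect can be sketched outside S. The Python below is only an
illustration, assuming a factor is stored as a level set plus 1-based integer
codes; the helpers make_factor and drop_unused_levels are hypothetical names
for this sketch, not S-PLUS functions.

```python
def make_factor(values, levels):
    """Encode values as 1-based integer codes against a fixed level set."""
    index = {lev: i + 1 for i, lev in enumerate(levels)}
    return {"levels": list(levels), "codes": [index[v] for v in values]}


def drop_unused_levels(factor):
    """Keep only the levels that occur, renumbering the codes to match."""
    used = [lev for i, lev in enumerate(factor["levels"], start=1)
            if i in factor["codes"]]
    index = {lev: i + 1 for i, lev in enumerate(used)}
    values = [factor["levels"][c - 1] for c in factor["codes"]]
    return {"levels": used, "codes": [index[v] for v in values]}


levels = ["low", "medium", "high"]
train = make_factor(["low", "medium", "high", "low"], levels)
newdata = make_factor(["high", "high", "low"], levels)  # "medium" never occurs

dropped = drop_unused_levels(newdata)
print(newdata["codes"])   # codes against the full level set
print(dropped["levels"])  # level set after dropping
print(dropped["codes"])   # codes after dropping: "high" has a different code
```

In this sketch the original data keep all three levels, while newdata lose
"medium", so the same level no longer has the same code in the two frames.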

I have no idea how far-reaching this is: to find out one would have to examine
all uses of model.frame throughout the system _and_ all user-contributed code.
I do know that I _usually_ get wrong answers with predict.tree, as it maps
the factors to their codes. I suspect that this applies to most
model-fitting functions that use model.frame.
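
To show why working from codes goes wrong, here is a small Python sketch. It
is not the predict.tree code: predict_by_code is a hypothetical stand-in for a
fitted rule that was learned on the codes of the full level set, and the level
names are invented for the illustration.

```python
def codes(values, levels):
    """1-based integer codes of values against the given level set."""
    return [levels.index(v) + 1 for v in values]


def predict_by_code(code):
    """Hypothetical fitted rule: learned on the codes of the full level set
    (low=1, medium=2, high=3), it sends code 3 to the right branch."""
    return "right" if code == 3 else "left"


full_levels = ["low", "medium", "high"]
new_values = ["high", "low", "high"]

# Codes against the level set used at fitting time.
expected = [predict_by_code(c) for c in codes(new_values, full_levels)]

# Codes after unused levels are dropped from newdata: "medium" is absent,
# so "high" is renumbered from 3 to 2 and falls on the wrong side.
used_levels = [lev for lev in full_levels if lev in new_values]
actual = [predict_by_code(c) for c in codes(new_values, used_levels)]

print(expected)
print(actual)
```

No error is raised in the second case; the predictions are simply different,
which is what makes the problem hard to notice.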

I have replaced model.frame.default by a copy with drop.unused.levels=F.
I trust MathSoft will think very hard indeed in future about introducing
such a pervasive change without beta testing, and that they will look hard
at the coverage of their verification suite. (We find the V&R2 scripts
quite a useful test....)

