Re: [S] Prediction problem

Bill Venables (wvenable@attunga.stats.adelaide.edu.au)
Wed, 4 Mar 1998 09:42:19 +1030


James Clark writes:
> Can someone please explain why the error below occurs when I try to
> predict using f2. I don't understand why the result isn't the same
> as using f1. I have come across this problem before with other functions,
> this is just the latest incarnation.
>
> > f1 <- lm(y ~ poly(x, deg=5), old.data)
> > dd <- 5
> > f2 <- lm(y ~ poly(x, deg=dd), old.data)
> > predict.gam(f1, newdata = new.x)
> [1] 4.1063
> > predict.gam(f2, newdata = new.x)
> Error in safe.predict.gam(object, newdata, type, s..: Length of variable 2 is 1
> != length of row names (1000)
> Dumped
>

This seems to me to be a serious design flaw and one I have
complained about several times in the past, (with the effect
evident...). I can't offer a solution but I can offer a clumsy
work-around. First, what's going on? Had you typed

> traceback()

after you saw Dumped, (always a good idea, in fact The Golden
Rule) you would have got something like the following:

> traceback()
Message: Length of variable 2 is 1 != length of row names (41)
6: safe.predict.gam(object, newdata, type, se.fit, terms)
5: model.frame.default(formula = structure(.Data = expression(x, dd
4: eval(Call)
3: safe.predict.gam(object, newdata, type, se.fit, terms)
2: predict.gam(f2, nd)
1:

Not very informative, but thinking about it you start to see what
is happening. model.frame.default is being used to set up a data
frame for safe.predict.gam to evaluate and use. This data frame
contains the variables needed to evaluate the predictions, namely
variable 1, x, from old.data, which in my case is of length 41
(and in your case, I bet, if of length 1000), and variable 2, dd,
which is of length 1, so the two do not fit in the one data
frame. This is only detected, though, when the data frame is
used in evaluation again by safe.predict.gam, and blooey! The
message now starts to make sense.

The (admittedly unsatisfactory) work-around is to operate with
formulae that contain only variables that can be placed within a
common data frame. Here is a mock-up example.

> tmp <- -20:20
> old.data <- data.frame(x = tmp, y = tmp + tmp^2 + rnorm(tmp))
> new.x <- data.frame(x = 21)

Now suppose we want to find out what happens to predictions as
the degree of the fitted polynomial is increased. A natural way
to go about it would seem to be:

> for(i in 1:4) {
+ ff <- lm(y ~ poly(x, i), old.data)
+ print(predict.gam(ff, new.x))
+ }
Error in safe.predict.gam(object, newdata,..: Length of variable
2 is 1 != length of row names (41)
Dumped

but the design flaw frustrates us. It does not seem to be
immediately clear how to fix it in this case, but substitute()
comes to the rescue:

> for(i in 1:4) {
+ form <- substitute(y ~ poly(x, .i), list(.i = i))
+ ff <- lm(form, old.data)
+ print(predict.gam(ff, new.x))
+ }
[1] 161.64
[1] 462.64
[1] 463.56
[1] 463.5
>

(Statistically this gives the wrong impression, of course. The
apparent stability of predictions certainly does not hold up
indefinitely. Exercise: when does it start to collapse?)

Not very satisfying, but possibly instructive.

Bill

-- 
Bill Venables, Head, Dept of Statistics,    Tel.: +61 8 8303 5418
University of Adelaide,                     Fax.: +61 8 8303 3696
South AUSTRALIA.     5005.   Email: Bill.Venables@adelaide.edu.au

----------------------------------------------------------------------- This message was distributed by s-news@wubios.wustl.edu. To unsubscribe send e-mail to s-news-request@wubios.wustl.edu with the BODY of the message: unsubscribe s-news