Re: [S] comparing spline fits with different df?

Prof Brian Ripley (ripley@stats.ox.ac.uk)
Wed, 8 Apr 1998 21:47:20 +0100 (BST)


Bill Shipley wrote:
>
> Bill Shipley wrote:
> >
> > Hello,
> > I know that there are a number of different methods that have been suggested
> > to choose the "best" span value, or smoother df, when using spline
> > smoothers. However, I would like to know if there is any way of testing
> > whether or not a spline fit with x df provides a significantly better fit
> > than a spline fit with y df? In other words, assuming normality and
> > homogeneity of variance for the residuals, what is the sampling distribution
> > of the smoother cross-validation score?
>
> Brian Ripley responded:
>
> None I would trust very much. If the asymptotic theory is applicable,
> you could just use AIC to compare them, that is
>
> AIC = n log (RSS/n) + 2 * df.
>
> Only use this for df small compared with n, though.
>
> How about this: take a large number of bootstrap samples and fit with x df
> to get the empirical distribution of cv. Now see if the cv obtained with y
> df in the original sample falls in the lower 95% of this bootstrap
> distribution. If not, then y df gives a significantly larger cv score.

Hey ho. This falls into a common trap of assuming that the bootstrap
gives samples from the original distribution. It does not, by any means.
In this case the original sample will (I hope) have distinct values,
whereas a bootstrap resample will have lots of repeated values. Most
spline smoothing code does not handle repeated values well (you don't say
which one you are using), and in any case the problem is nothing like the
original one: for example, the distinct points are on average farther
apart. I suspect that the optimal degree of smoothing for a bootstrap
resample is considerably larger than for the original sample, as the
effective sample size for the resample really is much smaller.
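
To put a rough number on the repeated-values point above, here is a
minimal sketch (in Python rather than S, using only the standard library)
that counts how many of the n original points turn up in a bootstrap
resample. The function names, n = 200 and the 500 replicates are
illustrative assumptions, not anything from the original problem. For
moderate n the expected fraction of distinct points is 1 - (1 - 1/n)^n,
which is close to 1 - 1/e, so roughly a third of the original design
points are missing from each resample.

```python
import random


def distinct_fraction(n, rng):
    """Fraction of the n original points present in one bootstrap resample."""
    # Draw n indices with replacement, as the nonparametric bootstrap does.
    resample = [rng.randrange(n) for _ in range(n)]
    return len(set(resample)) / n


def mean_distinct_fraction(n=200, n_boot=500, seed=0):
    """Average distinct fraction over n_boot resamples (illustrative defaults)."""
    rng = random.Random(seed)
    total = sum(distinct_fraction(n, rng) for _ in range(n_boot))
    return total / n_boot


def expected_distinct_fraction(n):
    """Exact expectation: 1 - (1 - 1/n)**n, which tends to 1 - 1/e."""
    return 1.0 - (1.0 - 1.0 / n) ** n


print(mean_distinct_fraction(), expected_distinct_fraction(200))
```

The simulated and exact values should agree closely, which is one way to
see that the resample behaves like a smaller data set with ties rather
than a fresh draw from the original distribution.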

My understanding is that the bootstrap assesses \hat\theta - \theta
by looking at \theta^* - \hat\theta (and often does that in very
complicated ways). So in this problem you need to compare results
on bootstrap resamples under both x and y df, _and_ you need some theory
to demonstrate that the resampling biases are of a high enough asymptotic
order.

Trust me (as Bill V says), you really don't want to try to prove such
a procedure works. And it is vastly slower than AIC.
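
For comparison, the AIC route quoted earlier in this message takes a
couple of lines once each fit has been reduced to its residual sum of
squares and its df. The sketch below is in Python rather than S; the
function name aic and the RSS, n and df values are made up purely for
illustration, and df is assumed to be the equivalent degrees of freedom
reported by the smoother. The caveat still applies: use this only when
df is small compared with n.

```python
import math


def aic(rss, n, df):
    """AIC = n log(RSS/n) + 2 * df, as in the formula quoted above."""
    return n * math.log(rss / n) + 2 * df


# Hypothetical numbers for two spline fits to the same n = 100 points.
n = 100
aic_x = aic(rss=52.0, n=n, df=4)  # smoother fit, fewer df
aic_y = aic(rss=50.0, n=n, df=8)  # rougher fit, more df

# The fit with the smaller AIC is preferred.
preferred = "x df" if aic_x < aic_y else "y df"
print(aic_x, aic_y, preferred)
```

With these invented values the small drop in RSS does not pay for the
four extra df, so the fit with fewer df is preferred.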

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595