[S] summary, df in smooths

Jane Elith (j.elith@botany.unimelb.edu.au)
Thu, 01 Oct 1998 15:03:15 +1000

Thanks very much to those who sent suggestions. I had some requests for
summaries, so edited comments follow. In essence all replies focussed on

A modified AIC approach has also recently been reported in the literature,
with Splus code:

Hurvich, C. M., Simonoff, J. S. & Tsai, C. L. (1998) Smoothing parameter
selection in nonparametric regression using an improved Akaike information
criterion. Journal of the Royal Statistical Society Series B -
Methodological 60, 271-293.

and web site: http://www.blackwellpublishers.co.uk/rss

So far I've used the improved AIC and have minimised the $cv.crit; with our
data the latter gives more consistent results in an automated setting and
is simple to apply. I've still to test the other two approaches.
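
For reference, the Hurvich-Simonoff-Tsai criterion has a simple closed form
for any linear smoother yhat = H y:
AICc = log(sigma2_hat) + (1 + tr(H)/n) / (1 - (tr(H) + 2)/n).
A minimal sketch in Python (my own translation, not the authors' S-PLUS code;
the ridge-penalised polynomial is only a convenient toy smoother with an
explicit hat matrix, standing in for a smoothing spline):

```python
import numpy as np

def aicc(y, yhat, tr_h):
    # Improved AIC of Hurvich, Simonoff & Tsai (1998) for a linear
    # smoother whose hat matrix has trace tr_h (the effective df).
    n = len(y)
    sigma2 = np.mean((y - yhat) ** 2)      # ML estimate of error variance
    return np.log(sigma2) + (1.0 + tr_h / n) / (1.0 - (tr_h + 2.0) / n)

# Toy linear smoother: ridge-penalised polynomial regression, chosen
# only because its hat matrix is available in closed form.
rng = np.random.RandomState(0)
x = np.linspace(0.0, 1.0, 80)
y = np.sin(2.0 * np.pi * x) + 0.3 * rng.randn(80)
X = np.vander(x, 8)                        # degree-7 polynomial basis

scores = {}
for lam in (1e-6, 1e-4, 1e-2, 1.0):
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(8), X.T)
    scores[lam] = aicc(y, H @ y, np.trace(H))
best_lam = min(scores, key=scores.get)
print(best_lam, scores[best_lam])
```

Smaller AICc is better; the second term blows up as tr(H) approaches n,
which is what keeps it away from the near-interpolating fits plain AIC can
pick.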

Thanks for the help.

###### the original request with Brian Ripley's comments:

> I would appreciate some advice on the following (using Splus v3.3, win95):
> We have a set of data: 790 pairs of measurements on trees, each describing
> height and reldbh (relative diameter at breast height). We want to specify
> a function to describe the reldbh vs ht relationship, and have been looking
> at smoothing splines. We require a method for balancing the trade-off
> between bias and variance in a given smooth. Our question is how to select
> a value for the smoothing parameter through approximation of an appropriate
> number of degrees of freedom. Four different approaches have yielded four
> different 'answers'. More explicitly:
> 1. Use of step.gam with the initial model:
> reldbh ~ 1
> and the scope argument:
> scopeall
> $ht:
> . ~ 1 + s(ht, 2) + s(ht, 3) + s(ht, 4) + s(ht, 5) + s(ht, 6) + s(ht, 7) +
> s(ht, 8) + s(ht, 9) + s(ht, 10)
> uses AIC and selects 6df
Um. It uses Hastie's AIC, not AIC, and you need a good initial model.

> 2. Use of step.gam with initial model a smooth ..eg:
> reldbh ~ s(ht,2)
> and a matched scope argument with the lowest df = 2 in this case
> uses AIC and selects 11df (as does starting with eg 3 or 5 or 7df)
> 3. Use of an anova (test="F") to compare a number of gams (each simply
> gam(reldbh ~ s(ht,x))), with x varying eg between 3 and 12, shows that the
> reduction in RSS is significant in sequential comparisons until df = 10.

AIC will choose a larger model than tests like this, and be better for
prediction.

> 4. Direct use of smooth.spline without specifying the df and allowing GCV
> selects ~108 df.

I think approach 4 is the correct one, but smooth.spline has a bug (look
at the example on the help page, which gives silly answers for me). Try
using cv rather than gcv in smooth.spline, or Ramsey's pspline library instead.

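On the cv-versus-gcv point: for any linear smoother both criteria come
straight from the hat matrix, via the standard identities
GCV = (RSS/n) / (1 - tr(H)/n)^2 and exact leave-one-out
CV = mean(((y_i - yhat_i) / (1 - H_ii))^2). A sketch in Python (function
names and the toy ridge smoother are mine, not anything in S-PLUS):

```python
import numpy as np

def gcv_and_loocv(y, H):
    # GCV and exact leave-one-out CV for a linear smoother yhat = H @ y.
    # GCV replaces the individual leverages H_ii by their average tr(H)/n.
    n = len(y)
    r = y - H @ y                          # ordinary residuals
    gcv = np.mean(r ** 2) / (1.0 - np.trace(H) / n) ** 2
    loocv = np.mean((r / (1.0 - np.diag(H))) ** 2)
    return gcv, loocv

# Example with a small ridge-penalised polynomial smoother.
rng = np.random.RandomState(1)
x = np.linspace(0.0, 1.0, 60)
y = x ** 2 + 0.1 * rng.randn(60)
X = np.vander(x, 5)
H = X @ np.linalg.solve(X.T @ X + 1e-3 * np.eye(5), X.T)
gcv, loocv = gcv_and_loocv(y, H)
```

When a few observations have large leverage H_ii the two criteria can
disagree badly, which is one reason cv=T and GCV can point to very
different df.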
> Since we will be using randomisation tests later on and fitting a new
> smooth to each set of data in each randomisation we would like to be
> confident that our approach to selecting df for the smooth is a sound one.
> We have read the apparently relevant sections of Hastie & Tibshirani,
> Chambers and Hastie, Venables & Ripley and the Splus guides, but without
> stats and maths training are struggling to come to a conclusion about the
> best approach.
> Thanks for any help
> Jane Elith and Terry Walshe
> PhD students, Environmental Science.

###### George Watters:

I have recently coauthored a paper on using smoothing splines to model the
relationship between claw size and body size in king crabs... I offer our
approach as one alternative. We used an approach to choosing an
appropriate df which you don't seem to have included in your list.
Basically, we used an approximate F-test, but compared smooth.spline()
objects instead of gam() objects. I guess our approach would be something
between your ideas 3 and 4. The reference for our paper is

Watters, G. and Hobday, A.J. 1998. A new method for estimating the
morphometric size at maturity of crabs. Canadian Journal of Fisheries and
Aquatic Sciences 55:704-714.

###### David J Cummins:

I have found that the best approach is to use a penalized version of
generalized cross-validation. GCV was fourth on your list. GCV is unstable
and often results in under-smoothing, as you saw with the 108 df, but if
you place a stronger penalty on the variance side of the optimality
criterion, you can get a much more stable and still asymptotically
efficient result.
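
Cummins gives no formula, but one common way to realise "a stronger penalty
on the variance side" (my assumption here, purely for illustration) is to
inflate the trace term in GCV by a factor alpha > 1. A Python sketch showing
that the stronger penalty selects at least as much smoothing on a toy
problem:

```python
import numpy as np

def penalized_gcv(y, H, alpha):
    # GCV with the effective-df term inflated by alpha; alpha = 1 is
    # ordinary GCV, alpha > 1 penalises model complexity more heavily.
    n = len(y)
    rss = np.sum((y - H @ y) ** 2)
    return (rss / n) / (1.0 - alpha * np.trace(H) / n) ** 2

rng = np.random.RandomState(2)
x = np.linspace(0.0, 1.0, 100)
y = np.sin(2.0 * np.pi * x) + 0.3 * rng.randn(100)
X = np.vander(x, 9)                        # degree-8 polynomial basis
lams = np.logspace(-6, 2, 30)

def best_lambda(alpha):
    # Grid search for the penalty minimising the (penalized) GCV score.
    scores = [penalized_gcv(y, X @ np.linalg.solve(X.T @ X + lam * np.eye(9), X.T), alpha)
              for lam in lams]
    return lams[int(np.argmin(scores))]

lam_gcv, lam_pen = best_lambda(1.0), best_lambda(2.0)
```

The inflated criterion pushes the minimiser toward larger penalties, i.e.
fewer effective df, which is the stabilising effect Cummins describes.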

###### Henrik Aalborg-Nielsen:

I think I would use a CV criterion and use smooth.spline() directly:

CV <- rep(NA,50) # or whatever maximum df you think is appropriate
for(df in 2:50) CV[df] <- smooth.spline(x=x, y=y, df=df, cv=T)$cv.crit

And choose df to minimize CV. For "automatic" handling of future data
sets "of the same kind" you may just want to use the same df.

NOTE: Hastie & Tibshirani (1990) have a section on automatic selection
of smoothing parameters (section 3.4) in their book 'Generalized
Additive Models' (Chapman & Hall). I think one of the messages is
that one should be careful about minimizing CV in that the optimum is
very flat.
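
The flat-optimum warning is easy to see numerically. A Python sketch (again
a ridge-polynomial stand-in for smooth.spline, using the exact leave-one-out
identity instead of refitting n times): trace the CV curve over a grid of
penalties and count how many grid points land within 5% of the minimum.

```python
import numpy as np

# Trace out a leave-one-out CV curve over a grid of smoothing penalties
# and count how many grid points are nearly as good as the best one.
rng = np.random.RandomState(3)
x = np.linspace(0.0, 1.0, 90)
y = np.sin(2.0 * np.pi * x) + 0.3 * rng.randn(90)
X = np.vander(x, 8)

lams = np.logspace(-6, 1, 25)
cv = np.empty(len(lams))
for i, lam in enumerate(lams):
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(8), X.T)
    r = (y - H @ y) / (1.0 - np.diag(H))   # exact leave-one-out residuals
    cv[i] = np.mean(r ** 2)
n_near = int(np.sum(cv <= 1.05 * cv.min()))  # grid points within 5% of the minimum
```

On examples like this a whole range of smoothing values is practically
indistinguishable by CV, which is exactly why an automated minimiser can
bounce around between randomisation runs.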

This message was distributed by s-news@wubios.wustl.edu. To unsubscribe
send e-mail to s-news-request@wubios.wustl.edu with the BODY of the
message: unsubscribe s-news