Re: [S] confusion about prune.tree

Prof Brian D Ripley (ripley@stats.ox.ac.uk)
Wed, 22 Jul 1998 07:09:59 +0100 (BST)


On Tue, 21 Jul 1998, Andrea Brown wrote:

> I am calculating some regression trees but when I use prune.tree() and
> best=argument I find that I don't always get back a tree the size I
> indicated. Instead the tree is grown to a larger size and then the
> deviances are calculated. Is this an error in the code, or does this
> have something to do with a minimum default value related to the
> cost-complexity parameter? I am aware that prune.tree has some bugs but
> these don't seem to be related to this problem.

You do need to tell us which version of S-PLUS you are using in questions
like this, as prune.tree in 3.4 is very different from that in 3.3, and
there have been minor changes since. I do try to get MathSoft to
sort this out, but the code in my treefix library is always more recent.
If perchance you are using 3.3 or earlier, do use treefix; those bugs are
serious.

First, as far as I know, prune.tree never `grows a tree to a larger size'.
It chooses one of a set of pruned versions of the original tree: for a
given cost-complexity index alpha (called k in the S-PLUS code), the
subtree minimizing

fit + alpha * size
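To make the criterion concrete, here is a minimal sketch in Python (not S) of picking the subtree that minimizes fit + alpha * size. The sizes are those of the V&R example; the fit values are made up for illustration and are not the deviances of any real tree, and the names `subtrees` and `best_subtree` are hypothetical.

```python
# Hypothetical (size, fit) pairs for a nested sequence of pruned subtrees.
# The fit values are invented for illustration only.
subtrees = [(19, 40.0), (11, 48.0), (5, 66.0), (2, 90.0), (1, 110.0)]


def best_subtree(alpha, subtrees):
    """Return the (size, fit) pair minimizing fit + alpha * size."""
    return min(subtrees, key=lambda t: t[1] + alpha * t[0])


# As alpha grows, the chosen subtree shrinks, but only through the
# sizes present in the sequence; no other sizes can ever be selected.
for alpha in (0, 2, 5, 10, 25):
    size, fit = best_subtree(alpha, subtrees)
    print(f"alpha={alpha:>2}: size={size}, fit={fit}")
```

With these made-up numbers the chosen size steps through 19, 11, 5, 2 and 1 as alpha increases, which is why a size such as 8 cannot be obtained.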

Those trees are not of all sizes, and really you should be selecting one by
choosing alpha (e.g. by cross-validation) and not size. Even so, Pregibon
and Clark in their wisdom introduced the best= parameter, and it has been
maintained for backwards compatibility. In the latest version the code is

if(!missing(best))
    index <- ind[sum(best <= size)]

So, in the example in V&R

> prune.misclass(bwt.tr1)
$size:
[1] 19 11 5 2 1
.....
> prune.misclass(bwt.tr1, best=8)

gives a tree of size 11, the nearest larger match: best <= size holds for
the sizes 19 and 11, so the second tree in the sequence is chosen. But you
should really only ask for one of these sizes. I think the help page is
actually quite clear:

    best: integer requesting the size (i.e. number of terminal
          nodes) of a specific subtree in the cost-complexity
          sequence to be returned. This is an alternative way to
          select a subtree than by supplying a scalar
          cost-complexity parameter k. If there is no tree in the
          sequence of the requested size, the next largest is
          returned.

but maybe only in current versions (the last sentence is not in
the help page for 3.4).
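For anyone who wants to check what a given best= will return, the selection rule can be transliterated into a few lines of Python. This is only a sketch of the one S line quoted above: it assumes the sizes are stored in decreasing order, as in the prune.misclass output, and that ind simply indexes that sequence; the function name `pick_size` is hypothetical.

```python
def pick_size(best, sizes):
    """Mimic S's ind[sum(best <= size)] for a decreasing size sequence.

    Assumes `sizes` is sorted from largest to smallest, so the count of
    entries with best <= size is the (1-based) position of the smallest
    tree that is at least as large as `best`.
    """
    count = sum(best <= s for s in sizes)
    if count == 0:
        # best exceeds every available size; this sketch returns None.
        # What prune.tree itself does in that case is not covered here.
        return None
    return sizes[count - 1]  # convert the 1-based S index to 0-based


sizes = [19, 11, 5, 2, 1]  # sizes from the prune.misclass(bwt.tr1) output
for best in (8, 11, 3, 1, 25):
    print(best, "->", pick_size(best, sizes))
```

Requesting 8 gives 11 and requesting 3 gives 5, while requesting a size that is in the sequence returns exactly that size.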

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
