[S] Apology for private messages / validating models

Frank E Harrell Jr (fharrell@virginia.edu)
Thu, 12 Mar 1998 09:42:19 -0500


I'm sorry for sending private replies to s-news. I finally figured out what I was doing wrong:
I had set Microsoft Outlook Express to put all messages with [S] in the subject into an S folder,
and I had mistakenly thought that the s-news handler was not forming Reply-To fields correctly, so I got
into the habit of manually specifying s-news in replies.

Regarding requests for summaries of the discussions about bootstrap and cross-validation
of predictive models, I'm now writing up a summary of my simulation studies, at least. This should
be ready in a few hours.

I would like to raise a few issues based on some perceptive queries by Brian Ripley:

1. When stepwise variable selection is being done, Efron and Gong have stated that the ordinary
bootstrap provides the "right" amount of variation when the stepwise algorithm is repeated
for each bootstrap repetition. Is it "right" enough for everyday use when someone really does
force us to do stepwise model development? (A rough sketch of this loop appears just after this list.)
2. Each bootstrap re-sample contains, on average, about 0.632 of the distinct original observations.
One could say that it is only validating a model fit to a 0.632 sample. On the other hand, duplicate
and triplicate observations, etc., fill the bootstrap sample out to the original sample size.
I've often thought of the duplicate observations as resulting in "increased overfitting", so that when
the bootstrap model fit is tested on the original sample, where we get "ordinary overfitting", the
difference in performance (the estimate of optimism in R-squared, etc.) estimates the net
effect of ordinary overfitting. Be that as it may, does the bootstrap validate the process of
fitting a model of size n or of size .632n? (The second sketch below checks the 0.632 figure empirically.)
3. To add more confusion, in the context of cross-validation Shao
showed that the number of observations held back
for validation should often be larger than the number used to train
the model. This is because in this case one is not interested in an
accurate model (you fit the whole sample to get that), but an
accurate estimate of prediction error is mandatory so as to know which
variables to allow into the final model. Shao suggests a
cross-validation strategy in which approximately n^.75
observations are used in each training sample and the remaining
observations are used in the test sample. A repeated balanced or
Monte Carlo splitting approach is used, and accuracy estimates are
averaged over 2n repeated splits (for the Monte Carlo method). Shao's
reference is given below, followed by a small sketch of the Monte Carlo version.
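
For concreteness, here is a rough sketch of the kind of loop I have in mind for point 1. It is
written in R; the data, variable names, and number of re-samples are invented for illustration,
and it is only a sketch, not production code or the exact procedure of any particular library.

## Optimism bootstrap for R-squared, with the stepwise selection repeated
## inside every re-sample (illustrative data and settings)
set.seed(1)
n <- 200
x <- matrix(rnorm(n * 10), n, 10)
colnames(x) <- paste0("x", 1:10)
dat <- data.frame(x, y = x[, 1] + rnorm(n))

rsq <- function(fit, data) {
  pred <- predict(fit, newdata = data)
  1 - sum((data$y - pred)^2) / sum((data$y - mean(data$y))^2)
}

sel <- step(lm(y ~ ., data = dat), trace = 0)   # stepwise selection on the full sample
apparent <- rsq(sel, dat)

B <- 100
optimism <- sapply(1:B, function(i) {
  boot <- dat[sample(n, replace = TRUE), ]         # bootstrap re-sample
  bsel <- step(lm(y ~ ., data = boot), trace = 0)  # repeat the stepwise selection each time
  rsq(bsel, boot) - rsq(bsel, dat)                 # re-sample performance minus performance on the original sample
})
apparent - mean(optimism)                          # optimism-corrected R-squared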
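
And a quick empirical check of the 0.632 figure in point 2 (again illustrative R code): the
expected fraction of distinct original observations appearing in a re-sample is 1 - (1 - 1/n)^n,
which approaches 1 - exp(-1) = 0.632 as n grows.

## Fraction of distinct original observations appearing in a bootstrap re-sample
n <- 1000
frac <- sapply(1:2000, function(i) length(unique(sample(n, replace = TRUE))) / n)
mean(frac)        # simulated average, close to 0.632
1 - (1 - 1/n)^n   # exact expectation for this n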

@article{sha93lin,
  author  = {Shao, Jun},
  title   = {Linear model selection by cross-validation},
  journal = JASA,
  year    = {1993},
  volume  = {88},
  pages   = {486--494},
  annote  = {cross-validation; jackknife; bootstrap; model validation; variable selection; regression (general)}
}
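
To make the splitting scheme concrete, here is a hedged sketch of the Monte Carlo version in R,
scoring a single candidate subset of variables; in Shao's setting this prediction-error estimate
would be computed for every candidate subset and the subset with the smallest estimate chosen.
The data and the candidate model are invented for illustration.

## Monte Carlo cross-validation: train on about n^.75 observations,
## test on the rest, average over 2n random splits (illustrative data)
set.seed(2)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- dat$x1 + 0.5 * dat$x2 + rnorm(n)

n.train <- round(n^0.75)   # about n^.75 observations used for training
B <- 2 * n                 # average over 2n Monte Carlo splits

pe <- sapply(1:B, function(i) {
  train <- sample(n, n.train)                     # random training split
  fit   <- lm(y ~ x1 + x2, data = dat[train, ])   # one candidate subset
  pred  <- predict(fit, newdata = dat[-train, ])
  mean((dat$y[-train] - pred)^2)                  # squared error on the held-back observations
})
mean(pe)   # estimated prediction error for this candidate subset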

Another interesting article is:

@article{efr97imp,
  author  = {Efron, Bradley and Tibshirani, Robert},
  title   = {Improvements on cross-validation: {The} .632+ bootstrap method},
  journal = JASA,
  year    = {1997},
  volume  = {92},
  pages   = {548--560},
  annote  = {internal and external variability; internal and external validation; model validation; bootstrap; simulation setup}
}

Whichever method statisticians use, I'm glad to see the amount of interest in this
area. We see far too many published models that are either unvalidated or validated
on a hold-out sample that is too small.

---------------------------------------------------------------------------
Frank E Harrell Jr
Professor of Biostatistics and Statistics
Director, Division of Biostatistics and Epidemiology
Dept of Health Evaluation Sciences
University of Virginia School of Medicine
http://www.med.virginia.edu/medicine/clinical/hes/biostat.htm
