Modern Model Selection Methods
Robert Stine, University of Pennsylvania
This workshop includes five lectures on
statistical model selection. The lecture topics are listed below, with some
elaboration following each. The problem of variable selection in regression is
used throughout to unify the methods, though the ideas generalize to many other
types of models.
(1) An overview of model selection
Automated data collection, data warehousing, and ever-faster computing combine
to make it possible to fit many variations of complex models. The combination of
many predictors, large samples, and powerful software makes it easy to build
such models that hold the potential to reveal hidden structure. The problem for
the statistician is to decide whether what was found is meaningful.
Well-known criteria like Mallows' Cp, AIC, and BIC often produce very
different solutions. Which is right? All three originated to solve different
problems with simpler models, and one must ask whether they remain useful in
judging models with so many parameters. Several (two or three) examples will
be used to illustrate the use of these methods.
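As a minimal sketch of how the three criteria can disagree, the following computes each one for a sequence of nested regression models. The sample size and residual sums of squares are hypothetical numbers chosen for illustration, and the formulas are the usual Gaussian-regression forms (AIC and BIC up to an additive constant, Cp with the error variance estimated from the largest model).

```python
import math

# Hypothetical inputs (not from any real data set): n observations and the
# residual sum of squares of nested models with k = 0..5 predictors,
# each model also containing an intercept.
n = 100
rss = [500.0, 300.0, 280.0, 270.0, 266.0, 265.0]

# Mallows' Cp needs an estimate of the error variance; here it is taken
# from the largest model (5 predictors + intercept = 6 parameters).
p_full = len(rss)
sigma2 = rss[-1] / (n - p_full)

def aic(rss_k, p):
    # Gaussian AIC up to an additive constant: penalty of 2 per parameter.
    return n * math.log(rss_k / n) + 2 * p

def bic(rss_k, p):
    # Gaussian BIC up to an additive constant: penalty of log(n) per parameter.
    return n * math.log(rss_k / n) + p * math.log(n)

def mallows_cp(rss_k, p):
    return rss_k / sigma2 - n + 2 * p

def best_k(criterion):
    # Number of predictors k that minimizes the criterion (p = k + 1).
    scores = [criterion(r, k + 1) for k, r in enumerate(rss)]
    return scores.index(min(scores))

best_aic = best_k(aic)
best_bic = best_k(bic)
best_cp = best_k(mallows_cp)
print(best_aic, best_bic, best_cp)
```

With these made-up numbers, AIC and Cp select three predictors while BIC, with its heavier penalty, selects two.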
This first lecture reviews these classical selection criteria and contrasts them
with some recent innovations in model selection. New tools for model selection
have originated from various perspectives, including risk inflation, minimax
estimation, empirical Bayes, multiple hypothesis testing, and information theory.
The trend has been to make the selection criteria adaptive to the problem at
hand, and the following lectures explore some of these developments more closely.
(2) Thresholding and multiple shrinkage
A classical problem in model selection is regression with just as many orthogonal
predictors as observations, such as in the use of Fourier methods (trigonometric
regression) in time series analysis. Recently, Donoho and Johnstone have shown
how a simple thresholding procedure provides the means to decide which coefficients
are important to retain for the final model. The solution has certain optimal
properties from a minimax perspective.
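A minimal sketch of such a thresholding rule, assuming the coefficient estimates come from orthogonal predictors and share a known standard error sigma. The threshold used here, sigma * sqrt(2 log n), is the "universal" threshold associated with Donoho and Johnstone; the function names and the coefficient values are hypothetical, chosen only for illustration.

```python
import math

def universal_threshold(n, sigma=1.0):
    # Threshold sigma * sqrt(2 log n) for n coefficient estimates.
    return sigma * math.sqrt(2.0 * math.log(n))

def hard_threshold(coefs, sigma=1.0):
    # Keep a coefficient unchanged if it exceeds the threshold, else zero it.
    t = universal_threshold(len(coefs), sigma)
    return [c if abs(c) > t else 0.0 for c in coefs]

def soft_threshold(coefs, sigma=1.0):
    # Shrink every coefficient toward zero by the threshold amount.
    t = universal_threshold(len(coefs), sigma)
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coefs]

# Hypothetical estimated coefficients, in units of their standard error.
coefs = [5.1, -0.4, 0.9, -3.2, 0.2, 1.5, -0.7, 2.6]
kept = hard_threshold(coefs)
shrunk = soft_threshold(coefs)
print(kept)
```

Hard thresholding keeps or kills each coefficient, whereas soft thresholding also shrinks the survivors; which variant is preferable depends on the loss being considered.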
One arrives at a similar variable selection method through multiple shrinkage in a
Bayesian setting and through a decision-theoretic idea known as risk inflation.
This lecture explores these dissimilar, but convergent, approaches to variable
selection. Examples use wavelet regression for nonlinear smoothing.
(3) Adaptive methods
Thresholding methods are appropriate when only a few of the underlying parameters
are large. In other problems, many of the coefficients are important to
retain and thresholding misses important parts of the "signal." Several
recent variations on thresholding methods address this problem, and these
originate in diverse areas, including empirical Bayes and multiple hypothesis
testing. Time permitting, we will also discuss some more computational
approaches to this problem, such as those based on cross-validation.
Examples extending those from the previous lecture are used to see the value
of these improvements.
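One way to see the difference is to compare a fixed threshold with a data-dependent one. The sketch below uses the Benjamini-Hochberg step-up procedure from multiple hypothesis testing as an example of an adaptive rule; treating it as representative of the adaptive methods in this lecture is an assumption, and the z-scores, the level q = 0.1, and the function names are hypothetical.

```python
import math

def two_sided_p(z):
    # Two-sided normal p-value for a z-score.
    return math.erfc(abs(z) / math.sqrt(2.0))

def bh_select(z_scores, q=0.1):
    # Benjamini-Hochberg step-up: find the largest rank k whose ordered
    # p-value satisfies p_(k) <= k * q / m, then keep the k smallest p-values.
    m = len(z_scores)
    order = sorted(range(m), key=lambda i: two_sided_p(z_scores[i]))
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if two_sided_p(z_scores[i]) <= rank * q / m:
            k_max = rank
    return sorted(order[:k_max])

# Hypothetical z-scores with several moderately large effects.
z = [5.1, -0.4, 0.9, -3.2, 0.2, 1.5, -0.7, 2.6, 2.4, -2.3, 2.2, -2.1]

adaptive = bh_select(z)

# Fixed threshold sqrt(2 log m) for comparison.
t = math.sqrt(2.0 * math.log(len(z)))
fixed = [i for i, v in enumerate(z) if abs(v) > t]
print(adaptive, fixed)
```

Because the adaptive cutoff relaxes as more coefficients look significant, it retains the moderately large effects that the fixed threshold discards in this example.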
(4) Information theory and statistics
Information theory offers an alternative view of many statistical problems, in
particular model selection. Coding theory, a part of information theory, concerns
the efficient compression of data into a minimal length message. These same ideas
can be applied to statistical model selection. The connection is relatively
intuitive: a good-fitting statistical model is able to compress its data well.
Think of how a regression model reduces the initial sum of squares down to the
residual sum of squares.
Before we can talk about these applications to selection in any depth,
we need to lay some foundations. This lecture introduces the important
results from coding theory that are needed to see the relationship between
information theory and modeling that is developed in the next class. Anyone
who has ever wondered how those disk compression tools like WinZip work will
find the answer here as well.
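To give a flavor of the coding results involved, here is a small sketch under simple assumptions: a four-symbol source with known probabilities (hypothetical values chosen to be powers of one half). It assigns each symbol a code length of ceil(-log2 p) bits, checks the Kraft inequality that guarantees a prefix code with those lengths exists, and compares the average length with the entropy.

```python
import math

# Hypothetical symbol probabilities for a four-letter source.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Shannon code lengths: a symbol with probability p gets ceil(-log2 p) bits.
lengths = {s: math.ceil(-math.log2(p)) for s, p in probs.items()}

# Kraft inequality: a prefix code with these lengths exists if the sum of
# 2^(-length) over all symbols does not exceed 1.
kraft_sum = sum(2.0 ** -l for l in lengths.values())

# Entropy is the lower bound on the expected code length in bits per symbol.
entropy = -sum(p * math.log2(p) for p in probs.values())
mean_length = sum(probs[s] * lengths[s] for s in probs)

print(lengths, kraft_sum, entropy, mean_length)
```

Because these probabilities are exact powers of one half, the average code length equals the entropy; for general probabilities it can exceed the entropy by up to one bit per symbol.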
(5) Information theory and model selection
Coding automatically protects from over-fitting. When coding, the
compressed message must include enough information so that the receiver can
recover the original data. For a statistical model, this means that the
message must include the parameters that identify the model used to compress
the data. A good fit alone is not enough since the number of bits needed to
encode the parameters may be larger than the bits saved in compressing the
data. The resulting two-part codes (model parameters, model data) then
suggest a means to model selection: pick the model with the shortest
message. This is the approach of Rissanen's MDL criterion.
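A two-part code length can be sketched as follows, under a common large-sample approximation: each of the p parameters costs about (1/2) log2 n bits, and the Gaussian data cost is (n/2) log2(RSS/n) bits up to a constant shared by all models. The sample size, the residual sums of squares, and the function names are hypothetical, and this approximation is only one of several ways to encode parameters.

```python
import math

# Hypothetical inputs: n observations and the residual sum of squares of
# nested models with k = 0..5 predictors plus an intercept.
n = 100
rss = [500.0, 300.0, 280.0, 270.0, 266.0, 265.0]

def data_bits(rss_k):
    # Bits to encode the data given the model, up to an additive constant
    # that is the same for every model and so does not affect the choice.
    return 0.5 * n * math.log2(rss_k / n)

def parameter_bits(p):
    # Bits to encode p parameters, each to precision of order 1/sqrt(n).
    return 0.5 * p * math.log2(n)

# Total message length for each model; p = k + 1 counts the intercept.
total_bits = [data_bits(r) + parameter_bits(k + 1) for k, r in enumerate(rss)]
best = total_bits.index(min(total_bits))
print(best)
```

Adding a predictor always shortens the data part, but beyond two predictors in this example the saving is smaller than the cost of describing the extra parameter, so the shortest message uses two.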
Going further, all of the previous approaches to model selection (AIC, BIC,
thresholding, empirical Bayes, adaptive thresholding) can be cast as
particular methods for coding a model. This commonality reveals further
connections among the methods and suggests how they can be customized and
Robert Stine received his PhD in Statistics from Princeton University and
currently teaches Statistics at the Wharton School of the University of
Pennsylvania. Professor Stine has published widely on resampling methods
for assessing statistical variation, and maintains interests in
exploratory data analysis, statistical graphics, and statistical