Robust (Linear) Regression Discussion

Doug Martin (doug@statsci.com)
Wed, 07 Jan 1998 13:26:56 -0800


Folks following the robust regression discussion (others can ignore):

This has been an interesting discussion, to which I'll add my two cents'
worth.

Theoretical Considerations
--------------------------

There has been a long and ongoing line of work on the theoretical side,
part of which I have been involved in with Victor Yohai and Ruben Zamar
(and which we are currently continuing). I am a firm believer that we need
to use such work as a foundational rationale, at least to some extent, even
though the results are typically asymptotic, as a guide to providing
software tools.

Robustness Concepts
-------------------

The initial focus on robustness, going back to Tukey (1961), was on
robustness of efficiency under symmetric error distributions. Huber (1964)
laid the first theoretical groundwork for contaminated normal mixture
distributions with a symmetric! contamination distribution, i.e.,
his min-max asymptotic variance result. Maximum bias due to
(vanishingly) small fractions of asymmetric contamination then appeared with
Hampel's influence-curve gross-error sensitivity, and optimal control
of this bias with Hampel-optimal estimates (which minimize variance at
the Gaussian model, subject to a bound on influence). The research
community felt it was unfortunate that such estimates have a breakdown
point (the smallest fraction of contamination that can ruin the estimate)
that decreases like 1/(1+p), with p the number of independent variables.
This led to an exciting flurry of research on high-breakdown-point estimates,
i.e., breakdown point one-half estimates, initiated by the independent
work of Donoho and Stahel on covariance and multivariate scatter in the
early 1980's, and Rousseeuw's work on LMS and LTS also in the early 80's
(with LMS apparently proposed much earlier by Hampel). I will come to more
recent bias-robust regression work shortly.

But first, as for LMS: In spite of the beauty of the concept and its
geometric interpretation, this estimator has been known for some time now
to have a poor asymptotic rate of convergence (and the implied small-sample
inefficiency). It is for this reason that LMS was deprecated in favor of LTS
in an earlier release of S-PLUS. There is a simple empirical way to see
why LMS is far from a desirable estimator: it involves minimizing a very!
rough function. Just generate 100 N(0,1) R.V.'s, subtract a constant mu
from each, and plot the median of the squared (or absolute) residuals versus
mu, and you will see what I mean. This is because the median is a
relatively rough function of the data.
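If you want to see this without taking my word for it, here is a small
S-style sketch of exactly that experiment (the seed and the grid of mu
values are arbitrary choices of mine):

    # Roughness of the LMS criterion in the simplest (location) case:
    # plot median((x - mu)^2) over a grid of mu values for one Gaussian sample.
    set.seed(13)                                  # any seed will do
    x   <- rnorm(100)                             # 100 N(0,1) observations
    mu  <- seq(-0.5, 0.5, length = 401)           # grid of candidate locations
    lms <- sapply(mu, function(m) median((x - m)^2))
    plot(mu, lms, type = "l",
         xlab = "mu", ylab = "median of squared residuals")

The resulting curve is anything but smooth, which is the whole point.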

What then is the role of LMS? Peter Rousseeuw has a number of very
beautiful examples in engineering and science contexts where the
signal-to-noise ratio is high and the context seems natural, e.g., in
computer vision. LMS will no doubt survive for some time. But users need
to be aware of the above caveats.
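Before turning to LTS, it may help to have the two objective functions in
front of us (standard definitions, with r_i(beta) the i-th residual and
r_(1)^2 <= ... <= r_(n)^2 the ordered squared residuals):

    \hat\beta_{LMS} = \arg\min_\beta \; \mathrm{med}_i \; r_i^2(\beta)

    \hat\beta_{LTS} = \arg\min_\beta \; \sum_{i=1}^{h} r_{(i)}^2(\beta),
    \qquad h \approx (1 - \alpha) n \ \text{with trimming fraction} \ \alpha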

Now as for LTS: Things are quite a bit better here, but not as good as
one might think. The function to be minimized is still a little! rough
compared with smoother alternatives -- S-estimators in particular. The
asymptotic variance of LTS is that of an M-estimate with a hard-rejection
psi-function, the rejection point being determined by the trimming fraction
and lying at the upper and lower quartiles for 50% trimming (see Rousseeuw,
1984). We (actually Jeff Wang, with whom I am working at MathSoft) computed
the Gaussian efficiency of LTS for a range of trimming fractions (read:
breakdown points) and compared it with a smooth S-estimator with matched
breakdown point. The results were a bit surprising. At 50% trimming/
breakdown, LTS has a very low Gaussian efficiency of 7.1%. The smooth
50% breakdown point S-estimate has a low, but better, efficiency of 28.7%,
and it strongly dominates LTS at other matched breakdown points (this and
other results will be available in a month or so in a paper Jeff and
I are preparing). More on S-estimates below (see Rousseeuw and Yohai, 1984).

This does not mean one should not use LTS, just that one should be
aware of the Gaussian efficiency price one is paying as a function
of the contamination fraction/breakdown point (and should probably opt
for a lower breakdown point when confident that the fraction of
contamination is well below the trimming fraction).

In passing I want to publicly thank Peter Rousseeuw and his team for their
pathbreaking work and ongoing contributions of very solid software to
S-PLUS. This has been, and will continue to be, a substantial contribution.

Now for the bias-robust approach. In a very small part of his 1964 paper,
Peter Huber showed that the median minimizes the maximum asymptotic bias
under asymmetric contamination (among all translation-equivariant
estimates). Either in his paper or in his book (1981) he downplayed
the result as being relatively uninteresting (theoretically). I did not
believe this to be the case and initiated work with Ruben Zamar which
led to several papers on bias robust estimation of location and scale,
and then to the bias robust regression estimate results of Martin, Yohai
and Zamar (1989). By bias robust we mean minimizing the maximum asymptotic
bias under asymmetric contamination models, i.e., minimizing the maximum
bias curve (as a function of the contamination fraction). This gives the
complete bias/breakdown picture: Hampel's local results (under regularity)
at one end of the curve, and the breakdown point at the other (singular)
end. Quite fun, and one part of the paper showed the following: LMS is
approximately the min-max bias robust estimate for all fractions of
contamination. So, contrary to Brian's remark, there is a well-defined
and fairly satisfying sense in which LMS is robust --- except on the
variance/Gaussian efficiency side, which is dismal.
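To pin down what I mean by the maximum bias curve (standard notation; the
neighborhood F_eps(F0) is the one sketched earlier):

    B_T(\varepsilon) = \sup_{F \in F_\varepsilon(F_0)} | T(F) - T(F_0) |

Hampel's gross-error sensitivity is essentially the slope of B_T at zero,
and the breakdown point is the smallest epsilon at which B_T explodes, so
this one curve really does tie the two ends of the story together.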

The obvious approach would be to minimize the maximum bias curve for
all fractions of contamination, subject to a constraint that the Gaussian
efficiency be at your desired level. Yohai and Zamar have a paper
coming out this year which does precisely this for small fractions
of contamination (while maintaining BP = .5), and the result is very
satisfying, I think: it is an M-estimate with a redescending psi-function
whose character is that of a smoothed-out version of the hard-rejection
rule. Kind of what one would intuitively want to use anyway, assuming
a good computational algorithm. (I believe it remains to extend the
result to all fractions of contamination.)
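I won't try to reproduce the Yohai-Zamar optimal psi here, but for readers
who want a picture of what "a smoothed-out version of the hard-rejection
rule" looks like, the familiar Tukey bisquare psi-function has exactly that
character; a small S-style sketch (the function name is mine):

    # Tukey bisquare psi: linear near zero, then smoothly redescends to zero,
    # i.e., a smoothed-out version of hard rejection.
    psi.bisq <- function(u, c = 4.685) {          # c = 4.685 gives roughly 95%
      ifelse(abs(u) <= c, u * (1 - (u / c)^2)^2, 0)   # Gaussian efficiency
    }
    u <- seq(-8, 8, length = 401)
    plot(u, psi.bisq(u), type = "l", xlab = "u", ylab = "psi(u)")
    abline(h = 0, lty = 2)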

Asymptotic vs Finite Sample Size
--------------------------------

Virtually all theoretical results on robustness are asymptotic in
nature, and that's life. However, among the various estimators
available, I believe that M-estimates with smooth psi-functions
probably give among the best approximations to their asymptotic
behavior at finite sample sizes. Jeff and I, with input from Victor
Yohai, are currently doing some Monte Carlo studies to be included
in a paper under way.
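To make concrete what such a finite-sample check involves (this is my own
toy sketch in the simplest location setting, not the study just mentioned):
simulate many Gaussian samples, compute the robust estimate and the sample
mean on each, and compare Monte Carlo variances.

    # Toy finite-sample Gaussian efficiency check, location case only.
    # Huber M-estimate of location via iteratively reweighted means,
    # with the MAD held fixed as the scale estimate.
    huber.loc <- function(x, k = 1.345, tol = 1e-6) {
      mu <- median(x)
      s  <- mad(x)
      repeat {
        r  <- (x - mu) / s
        w  <- pmin(1, k / abs(r))                 # Huber weights psi(r)/r
        mu.new <- sum(w * x) / sum(w)
        if (abs(mu.new - mu) < tol * s) break
        mu <- mu.new
      }
      mu
    }
    n <- 20; nrep <- 1000
    est <- matrix(0, nrep, 2)
    for (i in 1:nrep) {
      x <- rnorm(n)                               # Gaussian samples: efficiency loss only
      est[i, ] <- c(mean(x), huber.loc(x))
    }
    var(est[, 1]) / var(est[, 2])                 # Monte Carlo Gaussian efficiency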

Computation
-----------

Ah yes, the real issue, as Brian in essence pointed out. I believe there
is no clear solution -- we just have to do the best we can. At the
present time, I believe the best approach is the Yohai, Stahel and Zamar
(1991) method: compute a good initial regression S-estimate via the
sampling approach (regression coefficients and a robust scale of the
residuals), and then compute the nearest local-minimum M-estimate (with a
smooth rho, as above). See also Yohai (1988). Use the exhaustive sampling
approach in step one, at least for small p and n combinations.

The real role of a smooth S-estimate is (going to be) as an initial
estimate for an M-estimate based on a bounded rho function (a redescending
psi-function in the estimating equation).
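To be concrete about that second step only (a bare-bones sketch of the
idea, not the S-PLUS implementation; the names m.step, beta0 and s0, and
the bisquare choice, are my own illustrative choices), given initial
coefficients beta0 and a robust residual scale s0 from the S-estimation
step:

    # M-step sketch: iteratively reweighted least squares started from the
    # initial S-estimate (beta0, s0), with the robust scale s0 held fixed.
    # X is the design matrix (include a column of ones for an intercept).
    wt.bisq <- function(u, c = 4.685) ifelse(abs(u) <= c, (1 - (u / c)^2)^2, 0)

    m.step <- function(X, y, beta0, s0, c = 4.685, maxit = 50, tol = 1e-6) {
      beta <- beta0
      for (it in 1:maxit) {
        r   <- as.vector(y - X %*% beta)
        w   <- wt.bisq(r / s0, c)                 # weights from the smooth rho
        fit <- lsfit(X, y, wt = w, intercept = F) # weighted least squares step
        beta.new <- fit$coef
        if (max(abs(beta.new - beta)) < tol * s0) break
        beta <- beta.new
      }
      beta
    }

Starting this iteration from the high-breakdown S-estimate is what lets
you keep the breakdown point while buying back Gaussian efficiency.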

S-PLUS Implementation
---------------------

We currently have a working version of the above approach, and with
a little luck it will appear in version 4.1 for NT/Windows this Winter.
If that happens, it will also appear in the beta release, and we are
most interested in feedback from interested users. (We emphasize
ease of use in the lm paradigm, convenient tabular and graphical
comparison of LS and robust fits, and approximate inference). This
should no doubt spark further discussion.

OTHER COMMENTS
--------------

Blindly Dropping it In
----------------------

Brian made a number of really on-target comments, as did some others.
I am completely in the camp of there being "too many blind uses of LS"
without a robust alternative with which to compare. Proper comparisons,
especially graphical ones, are essential here, and they will be used if
software provides them conveniently.

Finance Applications
--------------------

It is this arena that has sparked my renewed interest in robustness.
I have a paper or two with Tim Simin, to be finished in the next month,
that apply robust regression to the estimation of "beta" (the measure of
risk and return). There are often very potent outliers in returns,
causing LS to provide a very misleading fit. This seems to have been
completely overlooked by both the finance research community and the
commercial providers of beta calculations. By the way, there is a recent
paper by Knez and Ready (Journal of Finance, 1997) which uses LTS with
a small trimming fraction to model monthly cross-sectional equity
returns versus beta, market cap., firm size, etc. This exceptionally
good paper is up for an award; it has really opened the door for the
application of robustness in finance, and it might not have happened
(so soon) if LTS had not been available in S-PLUS.
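As a purely synthetic illustration of the beta point (made-up numbers; I am
assuming the ltsreg() interface that ships with S-PLUS): simulate a market
model with a true beta of 1.2, plant a couple of potent outliers in the
months where the market was weakest, and compare the LS and LTS slopes.

    # Synthetic market-model example: true beta = 1.2, two gross outliers.
    set.seed(7)
    n      <- 60                                  # five years of monthly returns
    market <- rnorm(n, mean = 0.01, sd = 0.04)
    asset  <- 0.002 + 1.2 * market + rnorm(n, sd = 0.02)
    bad    <- order(market)[1:2]                  # the two weakest market months
    asset[bad] <- asset[bad] + 0.40               # asset spikes up in down markets
    lsfit(market, asset)$coef                     # LS slope (pulled down by the outliers)
    ltsreg(market, asset)$coef                    # LTS slope for comparison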

There is also a very challenging issue in financial data: yes, there are
outliers, but they are not always just errors (sometimes they are); they
are often generated by heavy-tailed distributions for returns. When
constructing a portfolio, do you down-weight or not? The down-weighted
points might have given you better portfolio weights for the future, or
worse. What is the predictable part of the distribution? I have some
tentative thoughts and directions (but I'm not saying for a while).

The Hettmansperger and Sheather Example
---------------------------------------

I have not looked at this example, but Victor Yohai once told
me that he thought this was not a legitimate criticism of BP=.5
estimators, but that instead such estimates could point to
special structure in the data. Anyway, he arrives here tomorrow
for a week, and we can follow up on this one.

My Overall Experience
---------------------

In regression, in time series, wherever: you want a good robust
estimate to compare with the standard procedure, nice graphical
comparisons, and good approximate inference. More often than not, you
will discover things in the data and in the model-building effort that you
would otherwise have overlooked (for a truly striking example in a
"time-series"-like application, see Kleiner, Martin and Thomson, JRSS
Discussion Paper, 1979, and Martin and Thomson, IEEE Proceedings, 1982 or so).

Robust methods should be more available in a natural, bread-and-butter way.
That is the software providers' challenge. Robustness is not a be-all and
end-all, just one of many useful tools the data-analyst statistician should
have at their fingertips.

Sorry if this has ended up a bit long (once you get me started ....).

Cheers,

Doug