Re: Summary of Robust Regression Algorithms

Prof Brian Ripley
Wed, 7 Jan 1998 08:30:48 +0000 (GMT)

David Ross wrote:
> Recently I posted a simple example of how a 'robust' method (simple
> median) can seem non-robust. This was a bit of a troll; most respondents
> have characterized the example as pathological, or
> model-inappropriate, or not relevant to practice. Now, let me refer
> you to a little paper by Hettmansperger and Sheather in the American
> Statistician (May 1992 Vol 46 p79) in which they give an example of an
> lms regression where one small data entry error made a huge difference
> in the fit (on data for which a linear model is not unreasonable).
> (My thanks to colleague John Grove for showing me this paper.)
> The phenomenon behind this real world example is precisely the same as
> in my toy example, but it would be immune to much of the criticism the
> latter received.

Whoa. This line is drifting. The original comment was about `robust'
tools, then least trimmed squares were mentioned. Now LTS is not, in
my understanding, strictly a `robust' method but rather a `resistant'
one. We then moved on to least median of squares (LMS) which is not the
same thing, and was indeed replaced by LTS for this sort of reason.

I think the moral is that methods do get analysed, and that it is worth
taking the trouble to follow what is known about the methods you use. (In
this case there were references to this in section 8.3 of both editions of
V&R, including one to a deeper analysis of the problem, so perhaps it would
not have been too hard, but in other cases it is time-consuming.) Another
caveat is that LMS (and LTS) are theoretical procedures; all we have in
S-PLUS are unproved computational approximations (that is, not proved to
be good approximations, as far as I know).
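For readers who have not seen the two criteria side by side, here is a toy
pure-Python sketch of my own (invented data, invented search strategy --
nothing to do with the S-PLUS code): LMS minimises the median of the squared
residuals, LTS the sum of the h smallest, and the crude pairwise `elemental
set' search below is exactly the kind of unproved computational approximation
in question.

```python
import random
from statistics import median

random.seed(0)

# Toy data: a clean linear trend y = 1 + 2x with two gross outliers.
xs = list(range(20))
ys = [1.0 + 2.0 * x + random.gauss(0, 0.1) for x in xs]
ys[3] = 60.0
ys[17] = -40.0

def lms(res2):
    # Least median of squares: median of the squared residuals.
    return median(res2)

def lts(res2, h):
    # Least trimmed squares: sum of the h smallest squared residuals.
    return sum(sorted(res2)[:h])

def fit(objective):
    # Crude `elemental set' search: try the line through every pair of
    # points and keep the one with the smallest objective value. This
    # is an unproved approximation -- the true optimum needs a global
    # search over all lines, not just pairwise ones.
    best_val, best = float("inf"), None
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            if xs[i] == xs[j]:
                continue
            b = (ys[j] - ys[i]) / (xs[j] - xs[i])
            a = ys[i] - b * xs[i]
            res2 = [(y - (a + b * x)) ** 2 for x, y in zip(xs, ys)]
            val = objective(res2)
            if val < best_val:
                best_val, best = val, (a, b)
    return best

h = len(xs) // 2 + 1  # trim roughly half, the usual default
a1, b1 = fit(lms)
a2, b2 = fit(lambda r: lts(r, h))
print("LMS slope: %.3f  LTS slope: %.3f" % (b1, b2))
```

On this sample both criteria shrug off the two gross errors and should
recover a slope near 2, but neither search is guaranteed to find the true
optimum of its criterion.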

My best example of this sort of not knowing the literature is the
Hauck-Donner (1977) phenomenon: a small t-value in a logistic regression
indicates either an insignificant OR a very significant effect, but
step.glm assumes the first, and I bet few users of glm() stop to think.
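For anyone who has not met the phenomenon, it is easy to reproduce. The
sketch below is pure Python on an invented toy sample (not step.glm or any
S-PLUS code): fitting by Newton's method on nearly separated data, the Wald
t-value for the slope comes out around 1, while the likelihood-ratio
chi-squared on the same one degree of freedom is around 7.5 -- well past the
5% point of 3.84. The small t-value is hiding a very significant effect.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def loglik(b0, b1, xs, ys):
    # Bernoulli log-likelihood, clamped away from log(0).
    ll = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(b0 + b1 * x), 1e-12), 1.0 - 1e-12)
        ll += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return ll

def fit_logistic(xs, ys, iters=100):
    # Newton's method with step-halving; returns the MLE and the
    # standard error of the slope from the inverse information matrix.
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        step, base = 1.0, loglik(b0, b1, xs, ys)
        while step > 1e-8 and loglik(b0 + step * d0, b1 + step * d1, xs, ys) < base:
            step /= 2.0   # halve the step until the likelihood improves
        b0 += step * d0
        b1 += step * d1
    se1 = math.sqrt(h00 / det)   # sqrt of the (slope, slope) entry of H^-1
    return b0, b1, se1

# Invented, nearly separated sample: only the two points at +/-0.1 overlap.
xs = [-3.0, -2.0, -1.0, -0.1, 0.1, 1.0, 2.0, 3.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1, se1 = fit_logistic(xs, ys)
wald = b1 / se1
# Half the responses are 1, so the null (intercept-only) MLE is b0 = 0
# and loglik(0, 0, ...) is the maximised null log-likelihood.
lr = 2.0 * (loglik(b0, b1, xs, ys) - loglik(0.0, 0.0, xs, ys))
print("slope %.2f  Wald t %.2f  LR chisq %.2f" % (b1, wald, lr))
```

The near-separation inflates the standard error of the slope faster than the
estimate itself, which is exactly why the Wald statistic collapses while the
likelihood ratio does not.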

> The moral is that robust methods should be taken as just one more set of
> exploratory tools, part of a broader analysis. However, I've seen lots of
> examples where people have blindly dropped data into a robust procedure,
> and taken the result without further analysis (under the assumption
> that all they've lost is some power). This is likely to become more
> common, especially in areas of engineering and finance where there is
> motivation to have computers build quite complex models in real time
> with little or no human intervention. Heck, I'm often tempted to do
> something similar myself.

Well, I've seen many, many more examples where people have blindly dropped
data into a non-robust procedure. The distinction between `robust' and
`resistant' in my book is that `robust' methods have some guarantees on
efficiency over a range of distributions, and it is those bounds that
provide the insurance. LMS and LTS are not robust (and not efficient)
and I see their main value in helping to `unmask' the effect of
outliers and to provide good starting values for robust procedures.
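As a toy illustration of what that insurance buys (pure Python, my own
invented numbers, not any S-PLUS function): a Huber M-estimate of location,
started from the resistant median, is essentially the mean on clean data --
so nearly fully efficient -- yet a single gross error barely moves it,
because its influence is bounded.

```python
def huber_location(data, c=1.345, iters=50):
    # Huber M-estimate of location by iterative reweighting, with the
    # scale fixed at MAD/0.6745 (consistent at the normal). Started
    # from the median -- a resistant start, in the same spirit as
    # using LTS to start a robust regression.
    n = len(data)
    mu = sorted(data)[n // 2]                        # crude median
    mad = sorted(abs(x - mu) for x in data)[n // 2]  # crude MAD
    s = mad / 0.6745 if mad > 0 else 1.0
    for _ in range(iters):
        # Weight 1 inside the Huber corner, downweight beyond it.
        w = [1.0 if abs(x - mu) <= c * s else c * s / abs(x - mu)
             for x in data]
        mu = sum(wi * xi for wi, xi in zip(w, data)) / sum(w)
    return mu

clean = [4.8, 5.1, 4.9, 5.2, 5.0, 4.7, 5.3]
dirty = clean + [50.0]   # one gross error
print(huber_location(clean), huber_location(dirty))
```

On the clean sample every weight is 1 and the estimate is just the mean; the
gross error in the second sample gets a weight of well under 1%.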

There are many areas (yes, including finance and engineering) where there
is so much data that human intervention is not a viable option. So we
do need automated tools, and we do need to prove useful results about
them. If we do not provide these, other people (neural networks, data
mining ...) will if they have not already done so. One example is the
thread on data visualization, which to those disciplines means looking at
multivariate datasets with thousands to millions of points, not the sort
of problems in Cleveland's book.

Maybe, if Doug Martin still wants to, he should say what HE had in mind.

Brian D. Ripley,        
Professor of Applied Statistics,
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595