Whoa. This thread is drifting. The original comment was about `robust'
tools; then least trimmed squares (LTS) was mentioned. Now LTS is not, in
my understanding, strictly a `robust' method but rather a `resistant'
one. We then moved on to least median of squares (LMS), which is not the
same thing, and was indeed superseded by LTS for this sort of reason.
I think the moral is that methods do get analysed and that it is worth
taking the trouble to follow what is known about the methods you use. (In
this case there were references to this in section 8.3 of both editions of
V&R, including one to a deeper analysis of the problem, so perhaps it would
not have been too hard, but in other cases it is time-consuming.) Another
caveat is that LMS (and LTS) are theoretical procedures; all we have in
S-PLUS are unproved computational approximations (that is, not proved to
be good approximations, as far as I know).
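To make the two criteria concrete, here is a toy sketch of my own (not the S-PLUS code, and deliberately naive): fit the line through every pair of points and keep the one minimising the criterion, which is the kind of elemental-set approximation these implementations rely on. LMS minimises the median of the squared residuals; LTS minimises the sum of the h smallest.

```python
import math
from itertools import combinations
from statistics import median

# Toy data: y = 2x exactly, with the last four responses wrecked.
x = list(range(1, 21))
y = [2.0 * xi for xi in x[:16]] + [0.0, 0.0, 0.0, 0.0]

def sq_resids(a, b):
    """Squared residuals from the line y = a + b*x."""
    return [(yi - a - b * xi) ** 2 for xi, yi in zip(x, y)]

def best_line(criterion):
    """Elemental-set search: try the line through every pair of points
    and keep the one minimising `criterion` (a function of the squared
    residuals).  An approximation, not an exact optimiser."""
    best, best_val = None, math.inf
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue
        b = (y[j] - y[i]) / (x[j] - x[i])
        a = y[i] - b * x[i]
        val = criterion(sq_resids(a, b))
        if val < best_val:
            best, best_val = (a, b), val
    return best

h = 15  # LTS trim: keep the 15 smallest of the 20 squared residuals
lms_line = best_line(lambda r2: median(r2))        # LMS criterion
lts_line = best_line(lambda r2: sum(sorted(r2)[:h]))  # LTS criterion

# Ordinary least squares for comparison.
n = len(x); xbar = sum(x) / n; ybar = sum(y) / n
b_ols = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))

print("LMS slope:", lms_line[1])      # recovers the clean slope 2
print("LTS slope:", lts_line[1])
print("OLS slope:", round(b_ols, 3))  # dragged far from 2 by the outliers
```

Both resistant criteria recover the slope of the clean majority here, while least squares is wrecked; but nothing in this search is proved to find the true LMS/LTS optimum in general.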
My best example of the cost of not knowing the literature is the
Hauck-Donner (1977) phenomenon: a small t-value in a logistic regression
indicates either an insignificant OR a very significant effect, but
step.glm assumes the former, and I bet few users of glm() stop to think
about which.
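A small illustration of the phenomenon (my own construction, not the 1977 paper's example): a one-parameter logistic model fitted by Newton's method, on two datasets. The second has a far stronger effect but is nearly separated, and its Wald t-value comes out *smaller*; the likelihood-ratio statistic is not fooled.

```python
import math

def fit_logistic(x, y, iters=200):
    """MLE of b in P(y=1 | x) = 1/(1 + exp(-b*x)) (one-parameter model),
    fitted by damped Newton steps on the log-likelihood."""
    b, info = 0.0, 1e-12
    for _ in range(iters):
        p = [1.0 / (1.0 + math.exp(-b * xi)) for xi in x]
        score = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        info = sum(xi * xi * pi * (1.0 - pi) for xi, pi in zip(x, p))
        if info < 1e-12:
            break
        b += max(-1.0, min(1.0, score / info))  # damp steps for stability
    return b, info

def loglik(b, x, y):
    """Logistic log-likelihood, written via log1p for stability."""
    return sum(yi * -math.log1p(math.exp(-b * xi))
               + (1 - yi) * -math.log1p(math.exp(b * xi))
               for xi, yi in zip(x, y))

datasets = {
    # Moderate effect, classes overlap:
    "moderate": ([-2, -1, -1, 1, 1, 2] * 5, [0, 0, 1, 0, 1, 1] * 5),
    # Much stronger effect, nearly separated (one mild exception):
    "near-separated": ([-1] * 20 + [1] * 20 + [0.1],
                       [0] * 20 + [1] * 20 + [0]),
}
stats = {}
for name, (x, y) in datasets.items():
    b, info = fit_logistic(x, y)
    wald = b * math.sqrt(info)  # b / se, with se = 1/sqrt(observed info)
    lr = 2.0 * (loglik(b, x, y) - loglik(0.0, x, y))
    stats[name] = (b, wald, lr)
    print(f"{name}: b = {b:.2f}, Wald z = {wald:.2f}, LR chi-sq = {lr:.1f}")
```

As the fitted coefficient heads off towards infinity the information collapses faster than the coefficient grows, so the Wald statistic shrinks; a likelihood-ratio test (what drop1/anova can give you) is the safer automatic summary.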
> The moral is that robust methods should be taken as just one more set of
> exploratory tools, part of a broader analysis. However, I've seen lots of
> examples where people have blindly dropped data into a robust procedure,
> and talken the result without further analysis (under the assumption
> that all they've lost is some power). This is likely to become more
> common, especially in areas of engineering and finance where there is
> motivation to have computers build quite complex models in real time
> with little or no human intervention. Heck, I'm often tempted to do
> something similar myself.
Well, I've seen many, many more examples where people have blindly dropped
data into a non-robust procedure. The distinction between `robust' and
`resistant' in my book is that `robust' methods have some guarantees on
efficiency over a range of distributions, and it is those bounds that
provide the insurance. LMS and LTS are not robust (and not efficient)
and I see their main value in helping to `unmask' the effect of
outliers and to provide good starting values for robust procedures.
There are many areas (yes, including finance and engineering) where there
is so much data that human intervention is not a viable option. So we
do need automated tools, and we do need to prove useful results about
them. If we do not provide these, other communities (neural networks, data
mining, ...) will, if they have not already done so. One example is the
thread on data visualization, which to those disciplines means looking at
multivariate datasets with thousands to millions of points, not the sort
of problems in Cleveland's book.
Maybe, if Doug Martin still wants to, he should say what HE had in mind.
--
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel: +44 1865 272861 (self)
1 South Parks Road,                    +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax: +44 1865 272595