# Re: [S] confused by cov.mve

Thu, 11 Jun 1998 11:18:04 +0930

patrik ohagen writes:
>
> Dear S-PLUSers
>
> I am using v4.5 on a win95 machine.
>
> I find cov.mve to be a bit confusing. This is my
> interpretation of what is going on:
>
> 1) we estimate the minimum volume ellipsoid[e]
> 2) this ellipsoid[e] is used to detect "outliers"
> 3) the outliers get weight=0 (other obs get weight=1)
> 4) the final estimates are calculated with cov.wt(X, wt=weight)
>
> So if no observation is classified as an outlier we would
> expect the cov.mve function and the cov.wt function to give
> the same estimates, right ?

Here is a more precise specification of what is going on (from
p. 266 of the first reference I can find in the shambles on my desk):

"Let there be n observations and p variables. The method seeks
an ellipsoid containing [at most] (n + p + 1)/2 points of minimum
volume. Having found such an ellipsoid by random search it
returns a (product-moment) covariance estimate of those points
whose Mahalanobis distance from the mean, computed via the
ellipsoid covariance is not too large (specifically within the
97.5% point)."
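
The second stage of the quoted description (reweighting once an ellipsoid is in hand) can be sketched as follows. This is only an illustration and not the code inside cov.mve: the function name `reweight_2d` is hypothetical, the random search is skipped and the ellipsoid centre and covariance are simply supplied, the case is restricted to p = 2 so the matrix inverse can be written out by hand, and the "97.5% point" is assumed to be that of a chi-squared distribution on p degrees of freedom, which the quotation does not actually state.

```python
import math

def reweight_2d(points, center, cov, level=0.975):
    """Reweighting stage only, for p = 2 (hypothetical helper, not cov.mve).

    points : list of (x, y) pairs
    center : (mx, my), ellipsoid centre, assumed already found
    cov    : ((a, b), (b, d)), ellipsoid covariance, assumed already found
    Returns the 0/1 weights, the mean of the retained points, their matrix
    of sums of squares and cross-products (not yet divided by anything),
    and the number of retained points.
    """
    (a, b), (_, d) = cov
    det = a * d - b * b
    # ASSUMPTION: chi-squared cutoff. For 2 degrees of freedom the
    # quantile has the closed form -2 * log(1 - level).
    cutoff = -2.0 * math.log(1.0 - level)

    weights = []
    for x, y in points:
        dx, dy = x - center[0], y - center[1]
        # Squared Mahalanobis distance using the explicit 2 x 2 inverse.
        d2 = (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
        weights.append(1 if d2 <= cutoff else 0)

    kept = [p for p, w in zip(points, weights) if w == 1]
    n_kept = len(kept)
    mx = sum(p[0] for p in kept) / n_kept
    my = sum(p[1] for p in kept) / n_kept
    sxx = sum((p[0] - mx) ** 2 for p in kept)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in kept)
    syy = sum((p[1] - my) ** 2 for p in kept)
    return weights, (mx, my), ((sxx, sxy), (sxy, syy)), n_kept

# Made-up data: five points near the origin and one gross outlier.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0), (10.0, 10.0)]
weights, mean, sscp, n_kept = reweight_2d(pts, (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0)))
```

The sketch stops at the matrix of sums of squares and cross-products, so the choice of divisor is left to the caller. It also hands back the 0/1 weights, which is the piece of information one would want in order to see which points were set aside.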

[Open, slightly frivolous question: Since the variance matrix
estimate provides another metric for sample space, what happens
if you iterate this procedure, i.e. start again using this metric
to identify a new minimum ellipsoid, &c.?]

> BUT as far as I understand the matrix of squares and
> crossproducts for the cov.mve procedure is normalized by
> sum(weight)-1 and the other matrix of squares and
> crossproducts is normalized by sum(weights) (i.e. n)
>
> Why ?

This is a question more to do with cov.wt than with cov.mve.

If you supply weights in a variance matrix calculation in the
general case there is no particular reason why sum(weights) - 1
(= n' - 1, say) should be an appropriate divisor.

The present case is not the general case, of course, but again
since the omission of points from the calculation is non-random,
there is no particular reason to favour a divisor of n'-1 (where
n' is the number of points not rejected), since this is not the
size of a random sample. What should be the divisor is something
of a moot point, but n' is simple and probably as good as any.
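
To make the two candidate divisors concrete, here is a small one-variable sketch with 0/1 weights. The function name `weighted_variance` and the data are invented for illustration, and nothing here is taken from the source of cov.wt or cov.mve.

```python
def weighted_variance(x, w, subtract_one):
    """Weighted variance with divisor n' - 1 (subtract_one=True) or n'.

    Here n' = sum(w); with 0/1 weights it is the number of points kept.
    """
    n_eff = sum(w)
    mean = sum(wi * xi for xi, wi in zip(x, w)) / n_eff
    ss = sum(wi * (xi - mean) ** 2 for xi, wi in zip(x, w))
    return ss / (n_eff - 1 if subtract_one else n_eff)

# Made-up data: the last point is treated as rejected (weight 0).
x = [1.0, 2.0, 3.0, 4.0, 100.0]
w = [1, 1, 1, 1, 0]

var_n = weighted_variance(x, w, subtract_one=False)        # divisor n' = 4
var_n_minus_1 = weighted_variance(x, w, subtract_one=True)  # divisor n' - 1 = 3
```

With the same weights the two answers differ only by the factor (n' - 1)/n'. So if the two routines do use different divisors, as the question supposes, their estimates would disagree by that factor even when no point is rejected.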

It might be argued that the final weights would have been a
useful thing to include in the output from cov.mve so that you
can identify the points cov.mve considered worthy of exclusion
(and much more so than the gizzards of the genetic algorithm) but
for some reason they are explicitly removed by the code:

ans <- cov.wt(x, wt = weights, cor = cor)
ans$wt <- NULL

I cannot say why this information is considered too dangerous to
publish...

Bill Venables.
