> I have a problem, that I hope some might be able to help me with.
> I am planning to use S+ cluster analysis for N=5000. I have the
> dissimilarity matrix (lower diagonal) in a file in the following format
> (the data is wrapped to one number per line)
>
> 1
[...]
> 15
> The usual way to think about and present this kind of data is
> probably something like
>
> 1
> 2 3
> 4 5 6
> 7 8 9 10
> 11 12 13 14 15
>
> Where 1 is the distance between elements a1 and a2; 2 is the distance
> between a1 and a3; 3 is for a2 and a3; 4 is for elements a1 and a4 etc.
> thus the 0-distance for d(a1,a1), d(a2,a2)... d(aN,aN) is not included in
> the data. I might be able to add these
> zero-distances in if needed (I prefer not to, if possible).
> The question is how is a lower diagonal
> dissimilarity matrix read in for cluster analysis. The chapters on cluster
> analysis, which I currently have, mention that some cluster functions
> from the cluster library accept dissimilarity matrix (like pam and fanny)
> but I have not been able to find out how to specify this kind of data
> format and read it in to S-Plus.
>
> Having a potential problem with the matrix-file format and not knowing
> how to read the data to S-Plus my question is in three parts:
>
> a) Given a complete dissimilarity matrix in an external text file (lower
> diagonal and zero-diagonal line included, not wrapped) how do I read in
> the data?
>
> b) Is it possible to input the data without having the d(ax,ax)
> zero-diagonal line?
>
> c) Does it matter that the matrix is wrapped to one number per line?
>
There is no problem in principle here, but I don't think your problems
stop with file formats (see below). Most of the cluster analysis functions
cope with the format generated by dist(). Steps needed to create the
dissimilarity matrix:
(1) read in the data as a vector
tmp <- scan("file.dis")
This reads in free format, so one-per-line does not matter.
(2) read it into the upper triangle of a matrix (6 x 6 for the 15-value
example; use your own N in place of 6)
A <- matrix(0, 6, 6)
A[col(A) > row(A)] <- tmp
so A is
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0    1    2    4    7   11
[2,]    0    0    3    5    8   12
[3,]    0    0    0    6    9   13
[4,]    0    0    0    0   10   14
[5,]    0    0    0    0    0   15
[6,]    0    0    0    0    0    0
(3) Complete the matrix by
A <- A + t(A)
so A is
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0    1    2    4    7   11
[2,]    1    0    3    5    8   12
[3,]    2    3    0    6    9   13
[4,]    4    5    6    0   10   14
[5,]    7    8    9   10    0   15
[6,]   11   12   13   14   15    0
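Steps (1) to (3) can be collected into one helper that works for any N.
This is only a sketch: the name read.upper and the temporary demo file are
made up for illustration, it is written against R (a current S dialect)
and has not been tried on S-PLUS 3.3, and N is recovered from the number of
values read, assuming the file holds exactly N(N-1)/2 numbers.

```r
# Hypothetical helper: rebuild the full symmetric matrix from the
# wrapped upper-triangle values (diagonal omitted).
read.upper <- function(file)
{
    tmp <- scan(file, quiet = TRUE)           # free format, so wrapping is irrelevant
    n <- (1 + sqrt(1 + 8 * length(tmp))) / 2  # solves n(n-1)/2 = length(tmp)
    if (n != round(n)) stop("length is not of the form N(N-1)/2")
    A <- matrix(0, n, n)
    A[col(A) > row(A)] <- tmp                 # fill upper triangle, column by column
    A + t(A)                                  # mirror into the lower triangle
}

# Demo with the 15-value example, written one number per line.
f <- tempfile()
writeLines(as.character(1:15), f)
A <- read.upper(f)
```

For the real data the call would be read.upper("file.dis"). The length check
guards against a file that is truncated or that does include the diagonal.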
The vector format given by dist and daisy is the lower triangle
in S = Fortran storage order (down each column in turn), that is
1 2 4 7 11 3 5 8 12 6 9 13 10 14 15
here. See help(dist) for the details.
To convert to that format use
d <- A[lower.tri(A)]
attr(d, "Size") <- nrow(A)
and (from the help page)
dist2full <- function(dis)
{
n <- attr(dis, "Size")
full <- matrix(0, n, n)
full[lower.tri(full)] <- dis
full + t(full)
}
goes back.
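To see how the two orderings relate, the file's values can also be permuted
straight into the dist layout without forming the square matrix. The sketch
below is an illustration only: upper2dist is a made-up name, it relies on R's
sequence() function (which may not be available in S-PLUS 3.3), and it
assumes the input holds exactly N(N-1)/2 values in the order described in the
question. For a pair i < j, d(ai, aj) sits at position (j-1)(j-2)/2 + i in
the file.

```r
# Hypothetical helper: reorder the file's values (upper triangle, column by
# column) into the dist() layout (lower triangle, column by column).
upper2dist <- function(tmp)
{
    n <- (1 + sqrt(1 + 8 * length(tmp))) / 2   # solves n(n-1)/2 = length(tmp)
    i <- rep(1:(n - 1), (n - 1):1)             # smaller index of each pair, dist order
    j <- i + sequence((n - 1):1)               # larger index of each pair
    d <- tmp[(j - 1) * (j - 2) / 2 + i]        # position of d(i, j) in the file
    attr(d, "Size") <- n
    d
}

d <- upper2dist(1:15)
```

On the 15-value example this gives the same vector as taking the lower
triangle of the completed matrix.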
However, a 5000-square matrix will take up a lot of space (200Mb) and the
cluster analysis will be very large. I don't fancy your chances of doing
this on S-PLUS 3.3 for Windows. Even the dissimilarity file will be of the
order of 100Mb. You need a specialist cluster analysis program to handle
such a large problem. I would also advise that hierarchical cluster
analysis is not going to be very helpful on such a large problem, and I
doubt if the specialist programs have fast methods for partitioning
techniques such as PAM and FANNY. Indeed, the combinatorial
explosion in the number of partitions makes all such partitioning
techniques slow down rapidly as N increases.
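The storage figures quoted above follow from simple arithmetic, assuming
8-byte double precision values. The variable names are illustrative only.

```r
n <- 5000
full.mb <- n * n * 8 / 1e6     # full square matrix of doubles, in Mb
pairs   <- n * (n - 1) / 2     # number of distinct dissimilarities
dist.mb <- pairs * 8 / 1e6     # dist-style vector of doubles, in Mb
```

The text file has one entry per pair, so at an assumed 8 or so characters per
number (a guess at its formatting) it comes out at a similar size to the
dist-style vector.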
-- 
Brian D. Ripley, ripley@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595