# [S] Need help on Cluster Analysis- Input distance matrix (fwd)

Stefan H. Jonsson (stefanj@pop.psu.edu)
Tue, 29 Sep 1998 16:41:03 -0400 (EDT)

Dear S-news list members

I have a problem that I hope someone might be able to help me with.
I am planning to use S+ cluster analysis with N=5000. I have the
dissimilarity matrix (lower triangle) in a file in the following format
(the data are wrapped to one number per line):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

The usual way to think about and present this kind of data is
probably something like

1
2 3
4 5 6
7 8 9 10
11 12 13 14 15

where 1 is the distance between elements a1 and a2, 2 is the distance
between a1 and a3, 3 is the distance between a2 and a3, 4 is the distance
between a1 and a4, and so on. Thus the zero distances d(a1,a1),
d(a2,a2), ..., d(aN,aN) are not included in the data. I might be able to
add these zero distances if needed (I would prefer not to, if possible).
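If it helps to check the ordering, the position of d(ai,aj) (with i < j)
in the wrapped file should be, if I have counted right:

    (j - 1) * (j - 2) / 2 + i

so d(a1,a2) is number 1, d(a2,a3) is number 3, d(a1,a4) is number 4, etc.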
The question is how a lower-triangle
dissimilarity matrix is read in for cluster analysis. The chapters on
cluster analysis that I currently have mention that some functions
in the cluster library accept a dissimilarity matrix (e.g. pam and fanny),
but I have not been able to find out how to specify this kind of data
format and read it into S-Plus.
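For what it is worth, what I had in mind was something like the following
(untested, and I am not sure pam will accept the matrix in this form, or
whether a 5000 x 5000 matrix will even fit in memory):

    # read the wrapped lower triangle; scan() ignores line breaks,
    # so one number per line should not matter
    n <- 5000
    d <- scan("dist.txt")          # length n*(n-1)/2, no zero diagonal
    # expand to a full symmetric matrix with a zero diagonal;
    # col(m) > row(m) selects the upper triangle in column-major
    # order, which matches the file order d(1,2); d(1,3), d(2,3); ...
    m <- matrix(0, n, n)
    m[col(m) > row(m)] <- d
    m <- m + t(m)
    # then, hopefully, something like pam(m, k, diss = T)

Here "dist.txt" and the diss argument to pam are just my guesses.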

Having a potential problem with the matrix-file format, and not knowing
how to read the data into S-Plus, my question is in three parts:

a) Given a complete dissimilarity matrix in an external text file (lower
triangle with the zero diagonal included, not wrapped), how do I read in
the data?

b) Is it possible to input the data without the d(ax,ax) zero-diagonal
entries?

c) Does it matter that the matrix is wrapped to one number per line?

I am using Version 3.3 Release 1 for Sun SPARC, SunOS 5.3 : 1995

PS The distance matrix is actually in a file with n*(n-1)/2 lines, each
line in the following format:

A B C

where A is the number-id of element a,
B is the number-id of element b, and
C is the distance between a and b.

I am planning to cut out the first two columns, to save space and,
hopefully, complexity.
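Something like this is what I was planning for dropping the id columns
within S-Plus itself (assuming the file is called "pairs.txt"):

    # read the three columns and keep only the distances;
    # this relies on the lines already being in the wrapped order
    x <- matrix(scan("pairs.txt"), ncol = 3, byrow = T)
    d <- x[, 3]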