data analysis example

Bob Hayden (hayden@oz.plymouth.edu)
Mon, 12 Jun 1995 22:01:16 -0400


About a year ago I posted a review of _A Handbook of Small Data Sets_,
by D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski,
Routledge, Chapman Hall, 29 West 35th Street, New York, NY 10001, ISBN
0-412-39920-2, $64.95 in the U.S. The problems with the data disk
that I complained about then are not obvious to beginners. I analyzed
the first data set to illustrate those problems and as an example of
how I would approach this data. I'd be most interested to see such
analyses from others. You need not be as polemical as I have been.
We just have not had a good fight on EdStat-L in a while ;-). The
current version of the review is tacked on at the end.

--------------------------------------------------------------------

The first data set in _A Handbook of Small Data Sets_ provides a
good illustration of the problems with the data disk that
accompanies the book. Looking at the data and the problems may
be helpful in unscrambling some of the other data files provided.
It also illustrates some important statistical ideas. This is
because the problems with the disk do not reflect just the
pickiness of computers. Instead, the problem is usually that the
arrangement of the data in the disk file does not reflect the
logical structure of the data.

The data here concern an experiment in which seeds are planted
and watered in either a covered or uncovered container. Here is
the data file as it appears on disk (and in the book).

22 41 66 82 79 0
25 46 72 73 68 0
27 59 51 73 74 0
23 38 78 84 70 0

45 65 81 55 31 0
41 80 73 51 36 0
42 79 74 40 45 0
43 77 76 62 * 0

The numbers represent the number of seeds germinating under the
given conditions. The first 4X6 matrix represents the uncovered
boxes and the second the covered boxes. The six columns
represent six levels of watering, from least on the left to most
on the right.

The first thing we need to do is to identify the variables.
Obviously, one is number of seeds germinating. Presumably, this
is the dependent variable. It is the only variable contained in
the file on the disk. Missing are the two independent variables.
One of these is a categorical variable with two values -- covered
and uncovered. Although these are implicit in the layout of the
numbers in the table above, we need to be explicit in coding this
information into the computer. The other independent variable is
the amount of water applied. If this had been measured and
reported, it would be a measurement variable, but what we
actually have are six ordered categories labeled 1-6 in the book. We
do not know if Level 2 represents twice as much water as Level 1
-- it might be three times as much or only 20% more. We used the
given numerical codes of 1-6 for the level of watering, although
very low, low, medium low, medium high, high, and very high might
have been more appropriate. We also coded covering numerically
-- 0 for uncovered and 1 for covered. Most statistics (and
database) software puts values of the variables in the columns and
uses the rows to record the individual cases, records,
observations, boxes, or whatever. Here is how our data set looked
after being placed in this form. The columns represent number
germinating, level of watering, and presence of covering, respectively.

22 1 0
25 1 0
27 1 0
23 1 0
45 1 1
41 1 1
42 1 1
43 1 1
41 2 0
46 2 0
59 2 0
38 2 0
65 2 1
80 2 1
79 2 1
77 2 1
66 3 0
72 3 0
51 3 0
78 3 0
81 3 1
73 3 1
74 3 1
76 3 1
82 4 0
73 4 0
73 4 0
84 4 0
55 4 1
51 4 1
40 4 1
62 4 1
79 5 0
68 5 0
74 5 0
70 5 0
31 5 1
36 5 1
45 5 1
* 5 1
0 6 0
0 6 0
0 6 0
0 6 0
0 6 1
0 6 1
0 6 1
0 6 1
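
For anyone who would rather not retype the numbers, here is a
minimal sketch of the rearrangement in Python. It assumes the disk
file has exactly the layout shown earlier (two blocks separated by
a blank line, "*" for the missing count); the names RAW and
to_cases are made up for this illustration.

```python
# The two 4x6 blocks as they appear in the disk file; "*" marks the missing count.
RAW = """\
22 41 66 82 79 0
25 46 72 73 68 0
27 59 51 73 74 0
23 38 78 84 70 0

45 65 81 55 31 0
41 80 73 51 36 0
42 79 74 40 45 0
43 77 76 62 * 0
"""


def to_cases(raw):
    """Return one (germinated, water_level, covered) tuple per box."""
    blocks = [block.splitlines() for block in raw.strip().split("\n\n")]
    cases = []
    for level in range(1, 7):                      # columns are water levels 1-6
        for covered, block in enumerate(blocks):   # 0 = uncovered, 1 = covered
            for line in block:
                cell = line.split()[level - 1]
                count = None if cell == "*" else int(cell)
                cases.append((count, level, covered))
    return cases


cases = to_cases(RAW)
for count, level, covered in cases:
    print("*" if count is None else count, level, covered)
```

The loop order (level, then covering, then box) is chosen only so
that the output matches the listing above; any order that keeps the
three values of a case together would serve as well.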

If we had had the misfortune to have been programmed in a
traditional statistical methods course, we would now note that we
have a measurement variable as the dependent variable and two
categorical independent variables. The technique designed for
such situations is two-way analysis of variance. We do not think
it is appropriate here, but you might want to try it to see what
you learn.

If we were a bit more clever, we might note that the levels of the
watering variable are ordered. This suggests that we might be
able to do a regression with this as the independent variable.
We might do separate regressions for covered and uncovered, or we
might include the covering variable as a second independent
variable. This approach would enable us to test hypotheses about
the difference in the slopes between the two situations. Or, we
could do an Analysis of Covariance with covering as the
independent variable and watering as the covariate. We do not
think any of these methods would be appropriate here, but you may
want to try them to see what happens. If word gets around that
you know anything at all about statistics, you may find people on
your doorstep looking for help. It can be valuable to see the
computer output of a large number of inappropriate analyses so
that you learn to recognize them quickly.

Since we were too dumb to take the methods course, we just made
graphs of the data. (When two points fall (approximately) on top
of one another, we used a 2 as the plotting symbol.)

Number Germinating vs. Water Level for Uncovered

      -
    90+
      -                                  2
      -                        *                   *
      -                        *         2         2
      -                        *                   *
    60+              *
      -                        *
      -              *
      -              *
      -              *
    30+    *
      -    3
      -
      -
      -
     0+                                                      4
       ----+---------+---------+---------+---------+---------+--
          1.0       2.0       3.0       4.0       5.0       6.0

Number Germinating vs. Water Level for Covered

      -
    90+
      -                        *
      -              3         *
      -                        2
      -              *
    60+                                  *
      -                                  2
      -    *                                       *
      -    3                             *
      -                                            *
    30+                                            *
      -
      -
      -
      -
     0+                                                      4
       ----+---------+---------+---------+---------+---------+--
          1.0       2.0       3.0       4.0       5.0       6.0
     N* = 1

From the graphs, we reached these conclusions, expressed here as
a report to the farmer.

For uncovered plants, use Level 4. Level 3 or 5
might be OK, too, but if you choose Level 5 be
sure to measure accurately as Level 6
is definitely not recommended.

For covered plants, use Level 3. Level 2 might
be OK, too.

If you have to pick a single level for both covered
and uncovered plants, pick Level 3.
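
The recommendations came from looking at the graphs, not from any
computation. As a rough numerical cross-check, the sketch below
takes the mean count at each water level and reports the level
with the highest mean. Using the mean as the summary is an
assumption of this example, and the names level_means and
best_level are made up for it.

```python
# Counts by water level (1-6), copied from the data set; None is the missing box.
uncovered = {1: [22, 25, 27, 23], 2: [41, 46, 59, 38], 3: [66, 72, 51, 78],
             4: [82, 73, 73, 84], 5: [79, 68, 74, 70], 6: [0, 0, 0, 0]}
covered = {1: [45, 41, 42, 43], 2: [65, 80, 79, 77], 3: [81, 73, 74, 76],
           4: [55, 51, 40, 62], 5: [31, 36, 45, None], 6: [0, 0, 0, 0]}


def level_means(groups):
    """Mean count per water level, skipping missing values."""
    means = {}
    for level, counts in groups.items():
        known = [c for c in counts if c is not None]
        means[level] = sum(known) / len(known)
    return means


def best_level(means):
    """Water level with the highest mean count."""
    return max(means, key=means.get)


unc_means = level_means(uncovered)
cov_means = level_means(covered)
both_means = level_means({k: uncovered[k] + covered[k] for k in uncovered})

for name, means in [("uncovered", unc_means), ("covered", cov_means),
                    ("both", both_means)]:
    rounded = {level: round(m, 1) for level, m in means.items()}
    print(name, rounded, "best level:", best_level(means))
```

The means pick out Level 4 for uncovered boxes and Level 3 for
covered boxes and for the two combined, but they say nothing about
how close the neighboring levels are; for that you still need to
look at the graphs.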

These recommendations illustrate the difference between data
analysis and "statistical methods" as traditionally presented. A
standard regression or Analysis of Covariance would assume that
the relationship between number germinating and watering level is
linear. A simple graph shows that any analysis based on such an
assumption will lead us far astray. One such analysis we tried
suggested that germination rate decreased with watering level,
while covering the plants made no difference. Thus the optimum
plan would be Level 1 for all plants, and giving them more would
be harmful. Look at the graphs and imagine trying to explain the
results of a traditional analysis to a farmer. An analysis of
variance would tell us that watering made a difference, while
covering made none. Again, try explaining this result to a
farmer as an example of the power of research and statistical
methods.
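
To see how a straight line misreads these data, here is a minimal
sketch that fits number germinating to water level by ordinary
least squares, pooling covered and uncovered boxes and dropping the
one missing value. This is not necessarily the analysis referred to
above (which also involved covering); it is the simplest linear fit
one could try, and the name least_squares is made up for this
example.

```python
# Counts for all boxes at each water level; the missing covered box
# at Level 5 is simply left out.
counts = {
    1: [22, 25, 27, 23, 45, 41, 42, 43],
    2: [41, 46, 59, 38, 65, 80, 79, 77],
    3: [66, 72, 51, 78, 81, 73, 74, 76],
    4: [82, 73, 73, 84, 55, 51, 40, 62],
    5: [79, 68, 74, 70, 31, 36, 45],
    6: [0, 0, 0, 0, 0, 0, 0, 0],
}
xs = [level for level, values in counts.items() for _ in values]
ys = [y for values in counts.values() for y in values]


def least_squares(xs, ys):
    """Slope and intercept of the ordinary least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = least_squares(xs, ys)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The fitted slope is negative, which read literally says that every
extra level of water costs germination. The four zeros at Level 6
drag the line down and hide the rise from Level 1 to Levels 3 and 4.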

We do not mean to suggest that traditional methods have no place.
They do. But traditional methods make certain assumptions, and
anyone who expects no surprises in data has never collected any.
We suggest that you _always_ explore your data, and test a
hypothesis only when it seems both reasonable and unavoidable.
There was a time when Data Analysis was considered a new branch
of statistics. We consider statistics to be an old branch of
Data Analysis.
----------------------------------------------------------------------

A Handbook of Small Data Sets

Reviewed by Robert W. Hayden, Plymouth State College, Plymouth,
NH, 03264, hayden@oz.plymouth.edu.

Real statisticians do not analyze fake data, so the data our
students see should usually be real. However, not many of us
have a mass of appropriate data of our own to share with our
students. There have been a number of attempts to provide
statistics teachers with data sets, as part of a textbook, in a
supplement to a textbook, or in separate collections. This book
is my own favorite collection. Unfortunately, it does have some
problems, and I feel a need to dwell on those. One of the things
that discourages people from using technology is the vast
collection of problems that crop up during your first attempt.
This collection is a bit clumsy to use, it has an error-prone
index, and the data disk is so badly scrambled that it would take
weeks to straighten it out. Still, my hope is that you will buy
this book, because it really is a wonderful collection of data
sets, but be forewarned that there may be some glitches,
especially with the data disk. I'll spend some time on the
problems with the disk, so you can check for and correct any
problems in data sets you might use with your students. With
those caveats in mind, let's turn to the book's virtues.

Among the things I like about it are

there are over 500 data sets

all data sets are provided on a disk with the book

the context of the data is usually clearly explained,
and meaningful to a layperson

many of the studies are ones that could easily be
replicated in class or as student projects

there are references to the source, and sometimes to
published analyses, of the data

some (too few) have suggestions on how the data
might be used in teaching

a wide range of application areas are included

On the down side, for too many of the data sets you are left to
guess what it is you are supposed to find out from the data. All
the rest of the things I don't like have to do with the mechanics
of locating and using the data sets you might want. The data sets
are listed in random order in the book. You will need to turn to a
table in the back of the book to discover the name of the
computer file containing the data. (There is no systematic
naming system.) When you try to read one of those files into
your software, you may be in for a surprise. One good use for
fake data is to create a very simple illustration. Here is an
illustration of the kinds of problems you will find on the data
disk. (The problems are not made up, and, yes, all these
problems did occur in a single file, and similar problems can be
found in most of the other files!) The made-up data on
pianists might appear in the book in a table like this. For each
pianist, we have measurements on two variables.

*Bachauer 23 51 Richter 32 52
*Haskil 12 33 Rubinstein 23 44
Lipatti 43 45

The asterisks denote female pianists. On the disk, the data might
look like this:

23 51 32 52
23 44 12 33
43 45

The names and sexes of the pianists have been lost. A less
obvious problem is that most statistical packages will interpret
this data file as having four measurements on each of three
subjects -- except that two measurements appear to be missing for
the last subject, which may cause an error message or may cause
the package to refuse the data. Even if it does not, and you ask
for the mean of the first variable, you will get the mean of
three numbers, not five. If you give data files like this to
students, you will need to drastically increase your life
insurance coverage.
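
A minimal sketch in Python makes the point concrete. It reads the
made-up disk file two ways: line by line, as a package would, and
pair by pair, as the layout in the book intends. The variable names
are invented for this illustration, and the reader here is a
stand-in for a statistics package, not any particular one.

```python
# The made-up pianist file as it might appear on the disk.
DISK = """\
23 51 32 52
23 44 12 33
43 45
"""

# Naive reading: each line is one subject, each column one variable.
rows = [[int(v) for v in line.split()] for line in DISK.splitlines()]
naive_first = [row[0] for row in rows]
naive_mean = sum(naive_first) / len(naive_first)

# Logical reading: each line holds up to two pianists, with two
# measurements apiece.
pianists = [tuple(row[i:i + 2]) for row in rows for i in range(0, len(row), 2)]
true_first = [p[0] for p in pianists]
true_mean = sum(true_first) / len(true_first)

print("naive:  ", naive_first, round(naive_mean, 2))
print("logical:", true_first, round(true_mean, 2))
```

The naive reading averages three numbers and the logical reading
averages five, so the two means disagree, and nothing in the output
of the first warns you that anything is wrong.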

To add the missing information, you could try typing in 0's and
1's to represent male and female respectively, but when you do,
you may discover (or worse, you may not!) that Haskil and
Rubinstein have been switched on the disk. Assuming you are
content to keep the order on the disk, the data file would look
like this after you finish editing it.

1 23 51
0 23 44
0 43 45
0 32 52
1 12 33

This is a lot of work! Perhaps those of us who use the book can
share cleaned up versions of the data files, and/or convince the
publisher to do some cleaning. Still, the book is great for
browsing, and it's great to have the data on disk in any form!

I could not find any clue on the disk or the book's cover as to what
kind of computer might be able to read the disk, but it looked
like a 720k DOS disk to my PC clone. I doubt you could read this
with an 800k Mac drive; I'm not sure about the higher density Mac
drives. The files take up about 500 times the smallest chunk of
disk space you can allocate -- about 0.5 MB on the floppy
provided, about 4 MB on my 340 MB hard disk. That's a lot of
space. I have heard tales of some systems going off to
never-never land when asked to convert the more than 500 DOS
files to Mac files. Since most of the files are unusable in
their current state anyway, it is probably best to work with just
one at a time, converting, editing, moving to your hard disk, and
importing to your stats package as needed.

Despite the problems, I still think it is a great book. The
publishers could rectify the problems by cleaning up the data
sets on the disk and adding to the disk a proofread (I found too
many errors) and corrected version of the data index in an
electronic form that one could sort and search. Even if you
threw out the disk and index and only used 20 of the given data
sets (typing them in yourself), the book would be worthwhile.


Robert W. Hayden
Department of Mathematics
Plymouth State College
Plymouth, New Hampshire 03264 USA
Rural Route 1, Box 10
Ashland, NH 03217
(603) 968-9914
hayden@oz.plymouth.edu