Last updated: Sept. 2, 2001
Downloading files and using them in S-Plus:
Datasets from the book by Fox (1997) can be found in the directory:
http://www.math.yorku.ca/~georges/Data/Fox/
Each dataset has two files. One, with extension .cbk, is a codebook with
information on the dataset and its variables; the other, with extension .dat,
contains the actual data.
Here are the steps involved in downloading a dataset and creating an
S-Plus data.frame:
1) Have a look at the codebook. For example, prestige.cbk contains:
1971 Canadian Occupational Prestige Data
[1] Occupational title
[2] Average education of incumbents, years
[3] Average income of incumbents, dollars
[4] Percent of incumbents who are women
[5] Pineo-Porter prestige score for occupation
[6] Canadian Census occupational code
[7] Type of occupation
prof = professional and technical
wc = white collar
bc = blue collar
? = missing (not classified)
Source: Census of Canada, 1971, Volume 3, Part 6, pp. 19-1--19-21;
and personal communication from B. Blishen, W. Carroll, and
C. Moore, Departments of Sociology, York University and
University of Victoria.
This tells you that the data set has 7 variables.
2) Have a look at the data set. A sample of lines from prestige.dat:
GOV_ADMINISTRATORS 13.11 12351 11.16 68.8 1113 prof
GENERAL_MANAGERS 12.26 25879 4.02 69.1 1130 prof
NURSES 12.46 4614 96.12 64.7 3131 prof
NURSING_AIDES 9.45 3485 76.14 34.9 3135 bc
PHYSIO_THERAPSTS 13.62 5092 82.66 72.1 3137 prof
PHARMACISTS 15.21 10432 24.71 69.3 3151 prof
MEDICAL_TECHNICIANS 12.79 5180 76.04 67.5 3156 wc
COMMERCIAL_ARTISTS 11.09 6197 21.03 57.2 3314 prof
RADIO_TV_ANNOUNCERS 12.71 7562 11.15 57.6 3337 wc
ATHLETES 11.44 8206 8.13 54.1 3373 ?
SECRETARIES 11.59 4036 97.51 46.0 4111 wc
TYPISTS 11.49 3148 95.97 41.9 4113 wc
....
Notice the following:
a) Fields are separated by white space.
b) Fields do not contain white space: '_' has been used to separate
words within a field.
c) "?" is used for missing data instead of the S-Plus default, NA.
3) To enter this dataset into S-Plus you must first download the data file
to the computer on which you use S-Plus. Suppose your 'home drive' is
Q: on a PC and suppose that you have copied the file (e.g. by right-clicking
on 'prestige.dat' in Netscape) to Q:\prestige.dat.
4) You then prepare a script file under S-Plus to read in the data set.
Here is a command that would read the prestige.dat file:
> prestige <- read.table( 'Q:\\prestige.dat',
col.names = c('Title','Education','Income','Percent.women',
'Prestige','Code','Type'),
row.names = NULL,
na.strings = c('?'))
The 'col.names' argument specifies the variable names. You may not
have blanks or '_' in variable names but '.' is allowed.
Note the double '\' in the name of the file. A single '\' is an
'escape' character, so you need to type two '\'s for S-Plus to read
them as one '\'.
Note also that 'row.names = NULL' prevents 'Title' from becoming the
variable that supplies row.names.
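To see the effect of the doubled '\' described above, you can store the file name in a variable and look at it. This is only a small sketch; the variable name 'path' is made up for the illustration.

```r
# Hypothetical variable 'path' holding the file name used in step 4.
path <- 'Q:\\prestige.dat'

# nchar counts the characters actually stored in the string,
# so the pair '\\' typed above counts as a single character.
nchar(path)

# cat writes the string as it is stored, i.e. with one backslash.
cat(path, "\n")
```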
5) You now have a data.frame called 'prestige' that you can use in
S-Plus like any other data.frame.
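A quick way to check the import is to inspect the resulting data.frame. The sketch below is only an illustration. So that it can run on its own, it writes four of the sample lines shown earlier to a temporary file (an assumption; with the real file you would give 'Q:\\prestige.dat' instead) and reads them with the same arguments as in step 4. The name 'sample.lines' is made up for the example.

```r
# Four lines copied from the sample of prestige.dat shown in step 2.
sample.lines <- c(
    "GOV_ADMINISTRATORS 13.11 12351 11.16 68.8 1113 prof",
    "NURSING_AIDES 9.45 3485 76.14 34.9 3135 bc",
    "ATHLETES 11.44 8206 8.13 54.1 3373 ?",
    "SECRETARIES 11.59 4036 97.51 46.0 4111 wc")

# Assumption: a temporary file stands in for Q:\prestige.dat.
tmp <- tempfile()
cat(paste(sample.lines, "\n", sep = ""), file = tmp, sep = "")

prestige <- read.table(tmp,
    col.names = c('Title','Education','Income','Percent.women',
                  'Prestige','Code','Type'),
    row.names = NULL,
    na.strings = c('?'))

dim(prestige)                # number of rows, number of variables
names(prestige)              # the names given in col.names
sum(is.na(prestige$Type))    # number of occupations with missing Type
summary(prestige)            # a summary of each variable
```

If the number of variables is not 7, or if 'Type' shows no missing values for a file that contains '?', the arguments to read.table probably need another look.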
NOTES:
1) If the first row of your dataset contained variable names, e.g.
Title Education Income Percent.women Prestige Code Type
GOV_ADMINISTRATORS 13.11 12351 11.16 68.8 1113 prof
GENERAL_MANAGERS 12.26 25879 4.02 69.1 1130 prof
NURSES 12.46 4614 96.12 64.7 3131 prof
then you could read the dataset with
> prestige <- read.table( 'Q:\\prestige.dat',
header = T,
row.names = NULL,
na.strings = c('?'))
2) Look up the help page for 'read.table' to find out what to do if
a) fields are separated by tabs or some other character, or
b) fields start in fixed columns.
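As an illustration of case a), here is a minimal sketch for a tab-separated file whose first row holds the variable names. It assumes that the 'sep' argument of read.table is given the tab character '\t'; check the help page (help(read.table)) for your version before relying on it. The temporary file and the name 'prestige.tab' are made up for the example, and case b) is not covered here.

```r
# Assumption: a small tab-separated file, written to a temporary
# location so that the sketch runs on its own.
tmp <- tempfile()
cat("Title\tEducation\tIncome\tPercent.women\tPrestige\tCode\tType\n",
    "NURSES\t12.46\t4614\t96.12\t64.7\t3131\tprof\n",
    "ATHLETES\t11.44\t8206\t8.13\t54.1\t3373\t?\n",
    file = tmp, sep = "")

# sep = "\t" says that fields are separated by single tabs;
# header = T takes the variable names from the first row.
prestige.tab <- read.table(tmp,
    header = T,
    sep = "\t",
    row.names = NULL,
    na.strings = c('?'))

names(prestige.tab)
dim(prestige.tab)
```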