Last updated: Sept. 2, 2001

Downloading files and using them in S-Plus:

Datasets from the book by Fox (1997) can be found in the directory:
http://www.math.yorku.ca/~georges/Data/Fox/

Each dataset has two files. One, with extension .cbk, is a codebook with
information on the dataset and its variables; the other, with extension .dat,
contains the actual data.

Here are the steps involved in downloading a dataset and creating an
S-Plus data.frame:

1) Have a look at the codebook. For example, prestige.cbk contains:

        1971 Canadian Occupational Prestige Data
        [1] Occupational title 
        [2] Average education of incumbents, years 
        [3] Average income of incumbents, dollars
        [4] Percent of incumbents who are women
        [5] Pineo-Porter prestige score for occupation 
        [6] Canadian Census occupational code
        [7] Type of occupation
                prof =  professional and technical
                wc   =  white collar
                bc   =  blue collar
                ?    =  missing (not classified)
        Source: Census of Canada, 1971, Volume 3, Part 6, pp. 19-1--19-21;
        and personal communication from B. Blishen, W. Carroll, and
        C. Moore, Departments of Sociology, York University and
        University of Victoria.

  This tells you that the data set has 7 variables.

2) Have a look at the data set. A sample of lines from prestige.dat:

        GOV_ADMINISTRATORS            13.11 12351 11.16 68.8  1113  prof
        GENERAL_MANAGERS              12.26 25879  4.02 69.1  1130  prof
        NURSES                        12.46  4614 96.12 64.7  3131  prof
        NURSING_AIDES                  9.45  3485 76.14 34.9  3135  bc
        PHYSIO_THERAPSTS              13.62  5092 82.66 72.1  3137  prof
        PHARMACISTS                   15.21 10432 24.71 69.3  3151  prof
        MEDICAL_TECHNICIANS           12.79  5180 76.04 67.5  3156  wc
        COMMERCIAL_ARTISTS            11.09  6197 21.03 57.2  3314  prof
        RADIO_TV_ANNOUNCERS           12.71  7562 11.15 57.6  3337  wc
        ATHLETES                      11.44  8206  8.13 54.1  3373  ?
        SECRETARIES                   11.59  4036 97.51 46.0  4111  wc
        TYPISTS                       11.49  3148 95.97 41.9  4113  wc
        ....

   Notice the following:
      1) Fields are separated by white space.
      2) Fields do not contain white space. Note that '_' has been used
         to separate words within fields to avoid having white space
         within fields.
      3) "?" is used for missing data instead of NA, the S-Plus default.

3) To enter this dataset into S-Plus, you must first download the data file
   to the computer on which you use S-Plus.  Suppose your 'home drive' is
   Q: on a PC and suppose that you have copied the file (e.g. by right-clicking
   in Netscape on 'prestige.dat') to Q:\prestige.dat.

4) You then prepare a script file under S-Plus to read in the data set.
   Here is a command that would read the prestige.dat file:

   > prestige <- read.table( 'Q:\\prestige.dat',
                col.names = c('Title','Education','Income','Percent.women',
                        'Prestige','Code','Type'), 
                row.names = NULL,
                na.strings = c('?'))
                
   The 'col.names' argument specifies the variable names.  You may not
   have blanks or '_' in the variable names but '.' is allowed.
   Note the double '\' in the name of the file. A single '\' is an
   'escape' character so you need two '\'s to make S-Plus think you
   want one '\'.
   Note also that 'row.names = NULL' prevents 'Title' from becoming the
   variable that supplies row.names.
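
   To check what 'na.strings' and 'row.names = NULL' do without touching
   the real file, the following is a minimal sketch that writes three of
   the sample lines shown earlier to a temporary file and reads them back.
   The temporary file and the name 'sample3' are assumptions made for
   illustration only; the sketch uses only tempfile, cat, read.table and
   unlink, which should behave the same way in any S dialect.

```r
# Three lines copied from the prestige.dat sample; the temporary file
# stands in for Q:\prestige.dat (illustrative only).
tmp <- tempfile()
cat("GOV_ADMINISTRATORS            13.11 12351 11.16 68.8  1113  prof\n",
    "NURSING_AIDES                  9.45  3485 76.14 34.9  3135  bc\n",
    "ATHLETES                      11.44  8206  8.13 54.1  3373  ?\n",
    file = tmp, sep = "")

sample3 <- read.table(tmp,
        col.names = c('Title','Education','Income','Percent.women',
                'Prestige','Code','Type'),
        row.names = NULL,
        na.strings = c('?'))

dim(sample3)           # number of rows and number of variables
is.na(sample3$Type)    # the '?' for ATHLETES should show up as missing
unlink(tmp)            # remove the temporary file
```

   If 'Type' does not show a missing value for ATHLETES, the 'na.strings'
   argument is the first thing to check.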

5) You now have a data.frame called 'prestige' that you can use like any
   other S-Plus data.frame.

NOTES:
   1) If the first row of your dataset contained variable names, e.g.

        Title          Education Income Percent.women Prestige Code Type
        GOV_ADMINISTRATORS            13.11 12351 11.16 68.8  1113  prof
        GENERAL_MANAGERS              12.26 25879  4.02 69.1  1130  prof
        NURSES                        12.46  4614 96.12 64.7  3131  prof

      then you could read the dataset with

      > prestige <- read.table( 'Q:\\prestige.dat',
                        header = T,
                        row.names = NULL,
                        na.strings = c('?'))

   2) Look up the help page for 'read.table' to find out what to do if
      a) fields are separated by tabs or some other character, or
      b) fields start in fixed columns.
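
   As an illustration of case 2a) combined with a header line, here is a
   minimal sketch that reads a small tab-separated file. The file contents
   are a made-up four-column subset of the prestige data, and the
   temporary file and the name 'tabbed' are assumptions for illustration;
   the help page for 'read.table' remains the reference for the 'sep'
   argument.

```r
# A small tab-separated file whose first line holds the variable names
# (illustrative subset of the prestige data).
tmp <- tempfile()
cat("Title\tEducation\tIncome\tType\n",
    "NURSES\t12.46\t4614\tprof\n",
    "ATHLETES\t11.44\t8206\t?\n",
    file = tmp, sep = "")

tabbed <- read.table(tmp,
        header = T,          # first line supplies the variable names
        sep = "\t",          # fields are separated by tabs
        row.names = NULL,
        na.strings = c('?'))

names(tabbed)          # variable names taken from the header line
is.na(tabbed$Type)     # '?' is again read as missing
unlink(tmp)            # remove the temporary file
```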