Fall 2001 Math 3330 3.0 BF: Regression Analysis

Getting Started with S-Plus in the Gauss Lab

Lesson 2: Downloading and viewing data

Last update: September 21, 2001
Please send comments, problems or suggestions to: Georges Monette

Preliminaries

This lesson assumes that you have completed Lesson 1: Using S-Plus under Windows and Creating a Report in Word.

Sources of Data

The data you analyze as a statistical analyst will come from many sources. Sometimes you will enter the data yourself by hand. For small to medium-size rectangular data sets, Microsoft Excel is a reasonable choice for data entry: type the variable names in the first row, enter the data below them, and save the spreadsheet. Then start S-Plus and import the spreadsheet using the S-Plus menu: File -> Import Data -> From File ...

In this lesson we will learn how to import data from the internet into S-Plus. The exact method depends on the format of the data set being imported. The most general type of data set is an 'ASCII' file (raw text) with variable values separated by a character (e.g. : or , or TAB) or by arbitrary amounts of white space.
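
As a rough illustration of reading a delimited ASCII file, the sketch below writes a tiny comma-separated file and reads it back with read.table. The file name 'example.csv' and the values in it are made up for this example. The sep argument names the separating character; leaving sep out makes read.table split on arbitrary amounts of white space.

```r
# Made-up values and a hypothetical file name, for illustration only.
# The trailing "" makes the file end with a newline.
cat("Title,Education,Income",
    "clerk,11.5,4000",
    "teacher,15.2,?",
    "",
    file = "example.csv", sep = "\n")

# header = TRUE: the first row holds the variable names
# sep = ","    : values are separated by commas
# na.strings   : '?' is the code for a missing value in this file
example <- read.table("example.csv", header = TRUE, sep = ",",
                      na.strings = "?")
example
```

For a file separated by colons or TABs, only the sep argument would change (sep = ":" or sep = "\t").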

The data sets used in the textbook can be downloaded from  http://www.math.yorku.ca/~georges/Data/Fox/.
They are in ASCII format. Each data set has two files associated with it: a '.dat' file that contains the data and a '.cbk' file, a codebook that describes the data set and its variables. Other sources might present their data in very different ways. With experience you will learn how to transform data into a form suitable for your statistical package, be it S-Plus or some other package like SAS. There are commercially available programs (e.g. DBMS/COPY) that facilitate the transformation of data from one format to another.
 

Downloading ASCII data and creating an S-Plus data frame

Four steps are required:
  1. Download the data to your computer.
  2. Examine the structure of the data: the names and types of the variables and the codes used for missing values.
  3. If necessary, edit the data so it conforms to a format that can be read into S-Plus.
  4. Prepare an S-Plus command (generally read.table) to read the data into S-Plus, and run the command.

1. Downloading the data

2. Examine the data

3. Editing the data

This data can be read as is.

4. Write and run an S-Plus command to read the data

Start S-Plus and open a script file. Write and run the following command:
 
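# Read the ASCII file; '?' is treated as the code for a missing value
prestige <- read.table( 'Q:\\prestige.dat',
        col.names = c('Title','Education','Income','Women',
            'Prestige','Code','Type'),
        na.strings = c('?'))
prestige              # print the whole data frame
summary(prestige)     # summarize each variable
# Income against Education, one panel for each Type
xyplot(Income ~ Education | Type, prestige)
# Same plot, passed to identify() so that points can be labelled
identify(xyplot(Income ~ Education | Type, prestige))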
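
Before going further it can help to check how many values were read as missing in each variable. The sketch below is one possible way to do this; count.missing is a hypothetical helper written for this lesson, not a built-in S-Plus function, and the small data frame 'toy' holds made-up values.

```r
# Hypothetical helper: number of missing values (NA) in each column
# of a data frame.
count.missing <- function(df) {
    sapply(df, function(x) sum(is.na(x)))
}

# Small made-up data frame, for illustration only.
toy <- data.frame(Education = c(11.5, 15.2, 9.8),
                  Income = c(4000, NA, 6500))
count.missing(toy)
```

Once the read.table command above has been run, count.missing(prestige) gives the same kind of count for the imported data, which can be compared with the '?' entries seen in the '.dat' file.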

Exercise:

Consider how the relationship between Income and Education differs between occupations with different proportions of women. Later we will learn how to do this with regression models. Here we explore the ideas graphically.

Create a categorical variable that splits occupations into four approximately equal groups depending on the proportion of women in each occupation.

quantile( prestige$Women )

will show the minimum, the maximum and the three quartiles.  Suppose the three quartiles are 10, 30, 60. Then

prestige$Women.quartile <- cut( prestige$Women, breaks = c(-1, 10, 30, 60, 101))

will create a new variable in the 'prestige' data frame. Now, try

identify(xyplot( Income ~ Education | Women.quartile, prestige) )
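
The quartiles do not have to be copied by hand into the cut command. As a minimal sketch, the breaks can be built from the output of quantile and the resulting groups checked with table. The vector 'women' below holds made-up percentages so that the example runs on its own; with the real data, prestige$Women would take its place.

```r
# Made-up percentages of women, for illustration only.
women <- c(0, 3, 8, 15, 22, 31, 45, 52, 68, 77, 90, 97)

q <- quantile(women)            # minimum, three quartiles, maximum
breaks <- c(-1, q[2:4], 101)    # outer limits -1 and 101, as in the cut command above
women.quartile <- cut(women, breaks = breaks)

table(women.quartile)           # the four counts should be roughly equal
```

Keeping -1 and 101 as the outer limits ensures that occupations with 0 or 100 percent women fall inside the first and last groups.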
 

Now that you know how to produce basic graphs and copy output and graphs to a Word file, you can explore the relationships among the variables in this data set.

Write a report (a maximum of four double-spaced typed pages of comments plus appropriate graphs) on what you see.

Once you have completed this assignment, you can continue exploring S-Plus by working your way through an on-line tutorial originally written by Annie Dupuis at Dalhousie University:  http://www.utstat.toronto.edu/splus/contents.html . Expect to take about 10 hours to work your way through this tutorial. It would be a good idea for you to complete the tutorial by the end of September.