This lesson assumes that you have completed Lesson 1: Using S-Plus under Windows and Creating a Report in Word.
The data you analyze as a statistical analyst will come from many sources. Sometimes you will enter it yourself by hand. For small to medium-size rectangular data sets, Microsoft Excel is a reasonable choice: type the variable names in the first row and save the spreadsheet. Then start S-Plus and import the spreadsheet using the S-Plus menu: File -> Import Data -> From File ...
In this lesson we will learn how to import data from the internet into S-Plus. The exact method depends on the format of the data set being imported. The most general type of data set is an 'ASCII' file (raw text) with variable values separated by a delimiter character (e.g. ':' or ',' or TAB, or arbitrary amounts of white space).
The data sets used in the textbook can be downloaded from http://www.math.yorku.ca/~georges/Data/Fox/.
They are in ASCII format. Each data set has two files associated with it. A '.dat' file contains the data and a '.cbk' file is a codebook describing the data set and its variables. Other sources of data might present the data in very different ways. With experience you will know how to transform the data into a form suitable for your statistical package, be it S-Plus or some other package like SAS. There are commercially available programs (e.g. DBMSCOPY) that facilitate the transformation of data from one format to another.
Four steps are required:
- Download the data to your computer.
- Examine the structure of the data, names and types of variables and codes for missing values.
- If necessary, edit the data so it conforms to a format that can be read into S-Plus.
- Prepare an S-Plus command (generally read.table) to read the data into S-Plus and run the command.
1. Downloading the data
- This can be done with Netscape, Microsoft Internet Explorer or an FTP program. We will illustrate the procedure with Netscape Communicator 4.7.
- First navigate to the directory or web page with links to the data. In our case, that is http://www.math.yorku.ca/~georges/Data/Fox/. (To open this link in a new window, right-click on the link and choose Open in New Window.)
- We will download a data set containing data on the prestige of a selection of Canadian occupations. The data was collected through the 1971 Canadian census. Have a look at 'prestige.dat' and 'prestige.cbk'. Depending on the settings of your browser, this might involve simply clicking on these file names in the window http://www.math.yorku.ca/~georges/Data/Fox/.
- To download 'prestige.dat', right-click on the file name and select 'Save Link As'. Enter a name in your permanent data directory, for example 'Q:\prestige.dat', and click on 'Save'.
2. Examine the data
Look at 'prestige.dat' and 'prestige.cbk'. Note the following:
- There are 7 variables:
  - Occupational title: We'll call it 'Title'. Note that the blanks between words have been replaced by underscores '_', so that blanks only separate variables.
  - Average education of incumbents, years: 'Education'
  - Average income of incumbents, dollars: 'Income'
  - Percent of incumbents who are women: 'Women'
  - Pineo-Porter prestige score for occupation: 'Prestige'
  - Canadian Census occupational code: 'Code'
  - Type of occupation: 'Type'. Note that this is a 'categorical' variable and that '?' represents missing values.
- Blanks separate variables. There are no blanks within values of individual variables.
- There is no 'header' row of variable names in the data set.
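The points above can also be checked from within S-Plus before importing. The following is only a sketch under assumptions: it writes two invented lines, laid out like 'prestige.dat', to a temporary file so that it can be run on its own. The lines are not taken from the real data set, and the names 'f', 'n.fields' and 'first.lines' are chosen here for illustration. It assumes that count.fields and tempfile are available in your version of S-Plus.

```r
# Two invented lines in the same layout as 'prestige.dat' (hypothetical values)
f <- tempfile()
cat("some_occupation 12.5 9000 10.0 60.0 1111 prof\n",
    "another_occupation 8.0 4000 50.0 30.0 2222 ?\n",
    file = f, sep = "")

# Number of blank-separated fields on each line; every entry should be 7
n.fields <- count.fields(f)
n.fields

# The first raw lines of the file, one character string per line
first.lines <- scan(f, what = "", sep = "\n", n = 2)
first.lines
```

For the real file, replace f by 'Q:\\prestige.dat'. A line with a different number of fields usually points to a blank inside a value or to a missing entry.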
3. Editing the data
This data can be read as is.
4. Write and run an S-Plus command to read the data
Start S-Plus and open a script file. Write and run the following command:
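# Read the blank-separated file; it has no header row, so supply the names
prestige <- read.table('Q:\\prestige.dat',
    col.names = c('Title', 'Education', 'Income', 'Women',
                  'Prestige', 'Code', 'Type'),
    na.strings = c('?'))   # '?' marks a missing value
# Print the data frame and a summary of each variable
prestige
summary(prestige)
# Plot Income against Education in a separate panel for each Type
xyplot(Income ~ Education | Type, prestige)
# Draw the plot again and click on points to label them
identify(xyplot(Income ~ Education | Type, prestige))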
Consider how the relationship between Income and Education differs between occupations with different proportions of women. Later we will learn how to do this with regression models. Here we explore the ideas graphically. Create a categorical variable that splits occupations into four approximately equal groups depending on the proportion of women in each occupation.
quantile( prestige$Women )
will show the minimum, the maximum and the three quartiles. Suppose the three quartiles are 10, 30, 60. Then
prestige$Women.quartile <- cut( prestige$Women, breaks = c(-1, 10, 30, 60, 101))
will create a new variable in the 'prestige' data frame. Now, try
identify(xyplot( Income ~ Education | Women.quartile, prestige) )
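Rather than typing the quartiles by hand, the break points can be taken directly from quantile(). The following is a minimal sketch, assuming that the three quartiles are distinct. The vector 'women' is a small set of made-up percentages used only so that the example runs on its own, and the names 'q' and 'women.quartile' are chosen here for illustration.

```r
# Made-up percentages for illustration only (not values from prestige.dat)
women <- c(0, 5, 10, 20, 30, 45, 60, 80, 100)

# The three quartiles (25%, 50% and 75% points)
q <- as.numeric(quantile(women, probs = c(0.25, 0.50, 0.75)))

# Outer breaks of -1 and 101 enclose every possible percentage,
# as in the cut() command used earlier
women.quartile <- cut(women, breaks = c(-1, q, 101))

# Number of observations in each of the four groups
table(women.quartile)
```

With the real data, replace 'women' by prestige$Women and store the result as prestige$Women.quartile. If two quartiles coincide, the breaks are no longer distinct and will need to be adjusted by hand.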
Now that you know how to produce basic graphs and copy output and graphs to a Word file, you can explore the relationships among the variables in this data set.
Write a report (a maximum of four double-spaced typed pages of comments plus appropriate graphs) on what you see.
Once you have completed this assignment, you can continue exploring S-Plus by working your way through an on-line tutorial originally written by Annie Dupuis at Dalhousie University: http://www.utstat.toronto.edu/splus/contents.html . Expect to take about 10 hours to work your way through this tutorial. It would be a good idea for you to complete the tutorial by the end of September.