York University Statistical Consulting Service S-Plus short course October 8, 1999 Getting dirty data into S-Plus using sed ========================================= Syntax for sed: % sed 'list of ed commands' filenames ... > output % sed -f commandfile filenames ... > output Examples: % sed 's/na/NA/g' dirty > clean changes all 'na' to 'NA' % sed 's/na/NA/g s/,// /^#/d' dirty > clean changes 'na' to 'NA', gets rid of commas and deletes lines starting with # Why sed: A convenient way to read data sets into S-Plus is with the "read.table" command. For example, the S-Plus command: > dd <- read.table("dataset", header=T, sep="\t") would create a "data.frame" from the the file named dataset: name height weight age sex income Brown Mary NA 130 NA Female 1500.00 Wong Jane 62 110 26 Female 150.00 Bruce James 70 180 21 Male NA Caron Jean Paul 65 130 22 Male 100.00 Most data comes in a form that is too dirty to feed to S-Plus uncleaned. We can edit small datasets by hand but with medium sized datasets, especially datasets we need to reprocess a number of times, it is much more efficient to use a tool like sed. Imagine we received the following "dirty" dataset: # here's the data: name height weight age sex income Brown Mary N/A 130 N/A F $1,500.00 Wong Jane 62 110 26 F 150.00 Bruce James 70 180 21 m . - recheck Caron Jean Paul 65 130 22 m $100.00 (the blanks are all spaces) What needs to be changed to feed this to S-Plus? Summary of sed commands: Commands in sed operate on each line of the input. They can have the form: command applied to every line /pattern/command applied to lines matching command line1,line2command applied to range from line1 to line2 where line1 and line2 can be line numbers or have the form /pattern/ Line numbers and patterns: n absolute line number $ last line /pat/ line matching regular expression pat Commands: s/old/new/ substitute first "old" for "new" (see regular expressions and replacement patterns for vi) s/old/new/g substitute all "old" in line y/string1/string2/ replace each character from string1 with the corresponding character from string2 d delete line w file write line to file (can create multiple output files) r file read file and copy to output a\ append lines to output until one not ending in \ c\ change line by replacing with next line i\ insert before line /pat/!cmd apply cmd if line does NOT match : label label a line for gotos b label goto label t label goto label is substitution made to current line { ... } treat commands as a group EXERCISE: Use vi to enter the "dirty" file above. Make lots of mistakes and practice correcting them. Write a sed script "in.sed" to clean "dirty". Try it out with % sed -f in.sed dirty until it looks okay then create "dataset" with % sed -f in.sed dirty > dataset Test the result by reading dataset into S-Plus: % Splus -e > dd <- read.table("dataset",header=T) > dd > is.na(dd$height) > is.na(dd$income) > sapply(dd,data.class)