Re: [S] extraction from big datafile

Douglas Bates (bates@stat.wisc.edu)
30 Apr 1998 14:05:40 -0500


srosenfeld@nesdis.noaa.gov writes:

> I have a 127MB file containing about 2,000,000 lines of the following
> type:
>
> julian.day latitude longitude TB1 TB2 TB3.......
>
> I need to extract and to store as an S+ object the lines related to specific
> list of locations (i.e. pairs lat&lon)
>
> Normally, I do this kind of work in FORTRAN. Is there any good way (comparable
> in speed) to perform this kind of extraction in s+?. I work with S+ 3.3,
> pent/166/32MGB RAM.

Assuming that the data are in an ASCII file it may be easiest to use a
short perl script to extract the subsets of the data then read those
into S+. As one other respondent has mentioned, you really don't want
to try to use the S-PLUS read.table function on a file of 2,000,000
lines.

The perl approach is easy because the split function in perl splits
the line into white-space-delimited fields. Assuming you wanted
latitude 43.1 and longitude -89.3 only, the perl script would be a
single loop of the form

    while (<>) {
        # split with no arguments breaks the current line on whitespace
        my ( $day, $lat, $long ) = split;
        next unless $lat == 43.1;
        next unless $long == -89.3;
        print;    # echo the matching line unchanged
    }

This prints only the records from that latitude and longitude to
standard output, so redirect the output to a new file (for example,
with placeholder file names, perl extract.pl bigfile > subset). You
could then use read.table on that file. To select on several
latitude/longitude combinations you could change the logic to
something like

    while (<>) {
        my ( $day, $lat, $long ) = split;
        print if $lat == 43.1 && $long == -89.3;     # Madison, WI
        print if $lat == 39.8 && $long == -105.0;    # Denver, CO
        # ...
        print if $lat == -18.0 && $long == 178.1;    # Fiji
    }
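
When the list of locations gets long, one line of code per location
becomes tedious. A minimal sketch of an alternative, assuming the
latitude and longitude are always the second and third fields, is to
keep the wanted pairs in a hash and test each line with one lookup.
The names loc_key, keep_line and %wanted are made up for this
illustration, and the three locations are just the ones used above.

```perl
use strict;
use warnings;

# Build a hash key from a latitude/longitude pair. Adding 0 forces
# numeric context, so "43.10" and "43.1" produce the same key.
sub loc_key {
    my ( $lat, $long ) = @_;
    no warnings 'numeric';    # tolerate a non-numeric header line
    return join ',', $lat + 0, $long + 0;
}

# Wanted locations (illustrative; replace with your own list).
my %wanted = map { loc_key(@$_) => 1 } (
    [ 43.1,  -89.3 ],     # Madison, WI
    [ 39.8,  -105.0 ],    # Denver, CO
    [ -18.0, 178.1 ],     # Fiji
);

# Return 1 if the line's 2nd and 3rd fields form a wanted pair, else 0.
sub keep_line {
    my ($line) = @_;
    my ( $day, $lat, $long ) = split ' ', $line;
    return 0 unless defined $long;    # skip short or blank lines
    return exists $wanted{ loc_key( $lat, $long ) } ? 1 : 0;
}

# Filter the files named on the command line. The @ARGV guard only
# makes the script do nothing when no file name is given.
if (@ARGV) {
    while ( my $line = <> ) {
        print $line if keep_line($line);
    }
}
```

Each input line then costs one split and one hash lookup regardless of
how many locations are in the list, and the list itself could be read
from a second file instead of being written into the script.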

Other examples are given in "Data Manipulation in Perl", in
Proceedings of Computer Science and Statistics: Twenty-fourth
Symposium on the Interface, ed. J. Newton, 456-462, Interface
Foundation, Fairfax, VA, 1992.

Since you are at NOAA I was a little surprised that you are not
storing the data in netCDF format. The advantage of netCDF format
(the CDF is for "Common Data Form", not cumulative distribution
function) is that it provides a more compact representation of the
data but still retains a reasonable amount of meta-data. It is also
portable across computer systems. Several systems can use netCDF
datasets directly.

If I could add something to the wishlist of capabilities for S-PLUS it
would be the ability to read netCDF datasets and to write at least an
S-PLUS data.frame as a netCDF dataset.

More information about netCDF can be found at:
http://www.unidata.ucar.edu/packages/netcdf

-- 
Douglas Bates                            bates@stat.wisc.edu
Statistics Department                    608/262-2598
University of Wisconsin - Madison        http://www.stat.wisc.edu/~bates/