[S] Producing a data frame of summary statistics

Frank E Harrell Jr (fharrell@virginia.edu)
Sun, 26 Apr 1998 21:05:21 -0400

Thanks to John Wallace and Todd Taylor for providing using information.
But for the general case where the function used to summarize the data may
be multi-valued (e.g., quantile), aggregate will not work; in this case "dfify" will not
work for arrays of lists such as what tapply returns when there are multiple stratification
variables. Also, results from
tapply must be re-formatted carefully because non-existent combinations
will result in NULL as an element of the returned array. I chose to use
sapply (which is what tapply calls) to have more control, and then to
construct a data frame from the combinations of summary stats produced
by sapply. The resulting function is below. I'll soon add it to the Hmisc
library, along with online help. Comments/suggestions welcome.
Note that summarize is a (faster) special case of Hmisc's summary.formula
function when method='cross' overall=F.


summarize(temperature, month, function(x)c(Mean=mean(x,na.rm=T),Median=median(x,na.rm=T)))

w <- summarize(temperature, month, mean, na.rm=T)
xyplot(temperature ~ month, data=w) # plot mean temperature by month

w <- summarize(temperature, llist(year,month), quantile, probs=c(.25,.5,.75), na.rm=T)
xYplot(Cbind(temperature[,2],temperature[,-2]) ~ month | year, data=w)

xYplot is in Hmisc, to handle "error bars".
llist is a little function that preserves the names of its arguments. It also
works with the Hmisc label function, e.g.:

for(v in llist(temperature,rainfall)) { hist(v); title(label(v)) } #label(v)=attr(v,'label')
# If no labels pre-defined for the two variables, will use the variable names as the labels

summarize <- function(X, by, FUN, ...,
type=c('matrix','variables')) {
type <- match.arg(type)
if(!is.list(by)) {
nameby <- as.character(deparse(substitute(by)))
by <- list(by)
names(by) <- nameby
nby <- length(by)
typical.computation <- FUN(X, ...)
nc <- length(typical.computation)
byc <- do.call('paste',c(by,sep='|'))
r <- if(nc==1)sapply(split(X, byc), FUN, ..., simplify=T) else
as.matrix(sapply(split(X, byc), FUN, ..., simplify=T))
ans <- unpaste(if(nc==1)names(r) else dimnames(r)[[2]], sep='|')
names(ans) <- names(by)
if(nc>1 && (nc != nrow(r))) stop('program logic error')
snames <- names(typical.computation)
if(length(snames)) snames <- paste(stat.name, snames) else
snames <- if(nc==1) stat.name else paste(stat.name,1:nc)
wrn <- .Options$warn
.Options$warn <- -1
notna <- rep(T, length(ans[[1]]))
for(i in 1:length(by)) {
byi <- by[[i]]
ansi <- ans[[i]]
ansi <- structure(as.numeric(ansi),
levels=levels(byi), class='factor')
else if(is.numeric(byi)) ansi <- as.numeric(ansi)
names(ansi) <- NULL
ans[[i]] <- ansi
notna <- notna & !is.na(ansi)
if(type=='matrix' || nc==1) {
ans[[stat.name]] <- if(nc==1) r else
structure(t(r), dimnames=list(NULL, snames))
} else {
snames <- make.names(snames)
for(i in 1:length(snames)) ans[[snames[i]]] <- r[i,]
notna <- notna & !is.na(if(nc==1) r else t(r) %*% rep(1,nc))
ans <- structure(ans, class='data.frame',
iorder <- do.call('order', structure(unclass(ans)[1:nby],names=NULL))
## order can bomb if data frame given (preserves names)

llist <- function(..., labels = T)
dotlist <- list(...)
lname <- names(dotlist)
name <- vname <- as.character(sys.call()[-1])
for(i in 1:length(dotlist)) {
vname[i] <- lname[i]
if(length(vname[i]) == 0 || vname[i] == "")
vname[i] <- name[i]
lab <- vname[i]
if(labels) {
lab <- attr(dotlist[[i]], "label")
if(length(lab) == 0)
lab <- vname[i]
label(dotlist[[i]]) <- lab
names(dotlist) <- vname[1:length(dotlist)]

Frank E Harrell Jr
Professor of Biostatistics and Statistics
Director, Division of Biostatistics and Epidemiology
Dept of Health Evaluation Sciences
University of Virginia School of Medicine

This message was distributed by s-news@wubios.wustl.edu. To unsubscribe
send e-mail to s-news-request@wubios.wustl.edu with the BODY of the
message: unsubscribe s-news