[S] New additions to Hmisc library

Frank E Harrell Jr (fharrell@virginia.edu)
Sun, 24 May 1998 15:09:00 -0400


The Hmisc library for UNIX or Windows S-Plus versions 3.3 through 4.0 (but not 4.5)
has several new functions which may be obtained from our web page or from statlib
(when it is updated tonight). Here are excerpts from the help files.

------------areg.boot-----------
areg.boot uses avas or ace to fit additive regression models allowing all variables in the model (including the right-hand-side) to
be transformed, with transformations chosen so as to optimize certain criteria. The default method uses avas, which explicity tries
to transform the response variable so as to stabilize the variance of the residuals. All-variables-transformed models tend to
inflate R^2 and it can be difficult to get confidence limits for each transformation. areg.boot solves both of these problems using
the bootstrap. As with the validate function in the Design library, the Efron bootstrap is used to estimate the optimism in the
apparent R^2, and this optimism is subtracted from the apparent R^2 to optain a bias-corrected R^2.

For method="avas" the response transformation is restricted to be monotonic. You can specify restrictions for transformations of
predictors (and linearity for the response, or its monotonicity if using ace). When the first argument is a formula, the function
automatically determines which variables are categorical (i.e., factor, category, or character vectors). Specify linear
transformations by enclosing variables by the identify function (I()), and specify monotonicity by using monotone(variable).

There is a plot method for plotting transformation estimates, transformations for individual bootstrap re--samples, and pointwise
confidence limits for transformations. There is a Function method for producing a list of S-PLUS functions that perform the final
fitted transformations. There is also a print method for areg.boot objects.

> x1 <- rnorm(200)
> x2 <- runif(200) # a variable that is really unrelated to y
> y <- exp(x1 + rnorm(200)/3)
> f <- areg.boot(y ~ x1 + x2, B=40)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 >
> f

avas Additive Regression Model

areg.boot(x = y ~ x1 + x2, B = 40)

n= 200 p= 2 Apparent R2: 0.865 Bootstrap validated R2: 0.842

f <- areg.boot(response ~ I(age) + monotone(blood.pressure) + race)
# use I(response) to not transform the response variable
plot(f, conf.int=.9)

----------summarize----------
summarize is a fast version of Hmisc's summary(formula, method="cross",overall=F) for producing stratified summary statistics and
storing them in a data frame for plotting (especially with trellis xyplot and dotplot and Hmisc xYplot). Unlike aggregate,
summarize
accepts a multi-valued FUN argument and summarize also labels the variables in the new data frame using their original names.
Unlike methods based on tapply, summarize stores the values of the stratification variables using their original types, e.g., a
numeric by variable will remain a numeric variable in the collapsed data frame. summarize also retains "label" attributes for
variables. summarize works especially well with the Hmisc xYplot function for displaying multiple summaries of a single variable on
each panel, such as means and upper and lower confidence limits.
summarize has similar aims as John Wallace's agg function.

A number of statistical summary functions are also provided for use with summary.formula and summarize (as well as tapply and by
themselves). smean.cl.normal computes 3 summary variables: the sample mean and lower and upper Gaussian confidence limits based on
the t-distribution. smean.sd computes the mean and standard deviation. smean.sdl computes the mean plus or minus a constant times
the standard deviation. smean.cl.boot is a very fast implementation of the basic nonparametric bootstrap for obtaining confidence
limits for the population mean without assuming normality. smedian.hilow computes the median and outer quantiles.
These functions all delete NAs automatically.

set.seed(111)
dfr <- expand.grid(month=1:12, year=c(1997,1998), reps=1:100)
attach(dfr)
y <- abs(month-6.5) + 2*runif(length(month)) + year-1997
s <- summarize(y, llist(month,year), smedian.hilow, conf.int=.5)
# llist is in Hmisc - it is like list but preserves names of arguments
s
month year y Lower Upper

7 1 1997 6.55 6.074 7.09
8 1 1998 7.51 7.021 8.00
9 2 1997 5.52 5.234 6.08
10 2 1998 6.40 5.910 6.99
11 3 1997 4.60 4.045 5.05
12 3 1998 5.43 5.001 5.88
13 4 1997 3.32 2.922 3.75
14 4 1998 4.69 4.159 5.04
15 5 1997 2.55 1.964 2.99
16 5 1998 3.27 2.933 3.64
17 6 1997 1.63 1.085 2.11
18 6 1998 2.51 2.009 3.04
19 7 1997 1.24 0.899 1.88
20 7 1998 2.47 1.892 2.98
21 8 1997 2.51 2.117 2.97
22 8 1998 3.36 2.919 3.93
23 9 1997 3.55 3.128 3.96
24 9 1998 4.56 4.139 5.08
1 10 1997 4.52 3.895 5.05
2 10 1998 5.45 5.087 6.01
3 11 1997 5.43 5.014 6.02
4 11 1998 6.42 5.986 6.91
5 12 1997 6.55 5.907 7.09
6 12 1998 7.40 6.989 7.96

xyplot(y ~ month | year, data=s)
xYplot(Cbind(y,Lower,Upper) ~ month, groups=year, data=s,
keys='lines', method='alt')
# Can also do:
s <- summarize(y, llist(month,year), quantile, probs=c(.5,.25,.75),
stat.name=c('y','Q1','Q3'))
xYplot(Cbind(y, Q1, Q3) ~ month, groups=year, data=s, keys='lines')
# To display means and bootstrapped nonparametric confidence intervals
# use for example:
s <- summarize(y, llist(month,year), smean.cl.boot)
xYplot(Cbind(y, Lower, Upper) ~ month | year, data=s)

---------------------------------------------------------------------------
Frank E Harrell Jr
Professor of Biostatistics and Statistics
Director, Division of Biostatistics and Epidemiology
Dept of Health Evaluation Sciences
University of Virginia School of Medicine
http://www.med.virginia.edu/medicine/clinical/hes/biostat.htm

-----------------------------------------------------------------------
This message was distributed by s-news@wubios.wustl.edu. To unsubscribe
send e-mail to s-news-request@wubios.wustl.edu with the BODY of the
message: unsubscribe s-news