[S] Enhancements to S-Plus

Frank E Harrell Jr (fharrell@virginia.edu)
Tue, 31 Mar 1998 11:00:42 -0500

Here is my vote for what not to expend great efforts adding to S-Plus:
exact methods. We have so many bigger things to worry about such
as non-normal errors, non-linear covariable effects, unaccounted-for
heterogeneity, that I've never been very concerned about getting
an "exact" P-value for an over-simplified model. A classic saying of
Tukey about exact solutions to the wrong problem comes to mind.
Even in the case of a 2x2 table, the presence of strong risk factors can cause a
heterogeneity of risks great enough to make unadjusted analyses
incorrect. I would rather use the bootstrap or a full Bayesian approach
to get confidence intervals or probabilities of positive effects. And I'm
still not a fan of conditioning when marginal cell counts were not
pre-specified by the experimental design (and mine never are). Lastly,
exact methods don't always extend well. On the other hand it is very
easy to extend the bootstrap to account for intra-cluster correlation,
for example.
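
To make the last point concrete, here is a minimal sketch of a cluster bootstrap, written in Python rather than S and using only the standard library. The function name cluster_bootstrap_ci, the percentile interval, and the toy data are illustrative assumptions, not part of any S-Plus library. The only change from the ordinary bootstrap is that whole clusters are resampled instead of individual observations, so intra-cluster correlation is preserved in every replicate.

```python
import random
import statistics

def cluster_bootstrap_ci(clusters, stat, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval that resamples whole clusters.

    clusters: list of lists, one inner list of observations per cluster.
    stat: function taking a flat list of observations and returning a number.
    """
    rng = random.Random(seed)
    k = len(clusters)
    estimates = []
    for _ in range(n_boot):
        # Draw k clusters with replacement; keep every observation in each
        resampled = [rng.choice(clusters) for _ in range(k)]
        flat = [y for cluster in resampled for y in cluster]
        estimates.append(stat(flat))
    estimates.sort()
    alpha = 1.0 - level
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical data: five clusters (e.g., clinics) of unequal size
clusters = [
    [5.1, 4.9, 5.3],
    [6.2, 6.0],
    [4.0, 4.2, 4.1, 3.9],
    [5.5, 5.7],
    [6.8, 7.0, 6.9],
]
lo, hi = cluster_bootstrap_ci(clusters, statistics.mean)
print("95% cluster-bootstrap interval for the mean:", lo, hi)
```

With only a handful of clusters the percentile interval is crude; the sketch is meant to show how little has to change, not to recommend a particular interval.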

My second vote on what not to implement is type III sums of squares
and F-tests, which are more problematic than most statisticians assume.

Here are my votes on what would be worth doing, not in any particular order:

1. Handle NAs in a smart way for all modeling functions. For example,
the survival modeling functions written by Terry Therneau keep track of
which observations were deleted by NAs so that for example
plot(age, resid(fit)) will work, by making sure that resid(fit) properly
aligns with age. [On our web page there is a document "Supplemental
Notes" to my biostatistical modeling course that gives several hints
for dealing with NAs while using the lm function.] Modeling functions
in my Design library use Therneau's technique. This needs to be built in
to other S-Plus functions.
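
The bookkeeping described in item 1 can be sketched as follows. This is a Python illustration with hypothetical names (fit_line_with_na, padded_residuals), not Therneau's code or the Design library's: the fit records which rows survived NA deletion, and the residual extractor pads the deleted rows back in, so the residuals line up with the original variables.

```python
def fit_line_with_na(x, y):
    """Least-squares line fit that remembers which rows were used."""
    keep = [i for i, (xi, yi) in enumerate(zip(x, y))
            if xi is not None and yi is not None]
    xs = [x[i] for i in keep]
    ys = [y[i] for i in keep]
    n = len(keep)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((xi - xbar) ** 2 for xi in xs)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys)) / sxx
    return {
        "intercept": ybar - slope * xbar,
        "slope": slope,
        "kept": keep,             # row indices that entered the fit
        "n_original": len(x),     # length of the data before NA deletion
    }

def padded_residuals(fit, x, y):
    """Residuals with None re-inserted for rows dropped because of NAs."""
    res = [None] * fit["n_original"]
    for i in fit["kept"]:
        res[i] = y[i] - (fit["intercept"] + fit["slope"] * x[i])
    return res

# Hypothetical data with missing values in both variables
age = [30, 45, None, 60, 52, 38]
sbp = [118, 130, 125, None, 135, 121]

fit = fit_line_with_na(age, sbp)
res = padded_residuals(fit, age, sbp)
# res has the same length as age, so a plot of age against res pairs
# each residual with the right observation.
```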

2. Sample size and power calculations for the normal-errors model, accounting
for uncertainty in the estimate of sigma. For example, the user could
provide the data (or sufficient statistics) used to estimate sigma
and the program could compute
an entire power 'distribution' taking the uncertainty into account. Sample size
calculations to achieve certain precision (e.g., width of confidence intervals)
would also be welcome. A deluxe help system (see item 6 below) would allow
users to quickly find example simulation programs for handling non-normal
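
One way the power 'distribution' in item 2 might be computed is sketched below in Python. The assumptions are mine: a two-sample comparison of means, a normal approximation to the power of a two-sided test, and a pilot standard deviation s on df degrees of freedom whose uncertainty is represented by drawing sigma^2 = df * s^2 / chi-square(df). The function names are hypothetical.

```python
import random
from statistics import NormalDist

def power_two_sample(delta, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test.

    Ignores the (usually negligible) rejection probability in the wrong tail.
    """
    z = NormalDist()
    se = sigma * (2.0 / n_per_group) ** 0.5
    return z.cdf(abs(delta) / se - z.inv_cdf(1 - alpha / 2))

def power_distribution(delta, s, df, n_per_group, n_draws=5000, seed=0):
    """Sorted power values over plausible values of sigma.

    Each draw sets sigma^2 = df * s^2 / X, with X ~ chi-square(df),
    which reflects the uncertainty in a variance estimated on df
    degrees of freedom under normality.
    """
    rng = random.Random(seed)
    powers = []
    for _ in range(n_draws):
        # chi-square(df) is a gamma variate with shape df/2 and scale 2
        chi2 = rng.gammavariate(df / 2.0, 2.0)
        sigma = s * (df / chi2) ** 0.5
        powers.append(power_two_sample(delta, sigma, n_per_group))
    return sorted(powers)

# Hypothetical planning values: detect a difference of 5 with s = 10
# estimated on 15 degrees of freedom, 64 subjects per group
powers = power_distribution(delta=5.0, s=10.0, df=15, n_per_group=64)
plug_in = power_two_sample(5.0, 10.0, 64)

n = len(powers)
print("plug-in power:", round(plug_in, 3))
print("10th, 50th, 90th percentiles of power:",
      round(powers[n // 10], 3),
      round(powers[n // 2], 3),
      round(powers[(9 * n) // 10], 3))
```

The spread between the lower and upper percentiles shows how much a single plug-in power figure can hide when sigma comes from a small pilot sample.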

3. Continue to expand capabilities for random effects models, with various
post-fit estimation, multi-level hierarchies, and other analytic capabilities.
Some of this can be done by having an elegant interface with the WINBUGS
Bayesian modeling package from Cambridge.

4. Bootstrap and multiple imputation methods that account for the imputation
of missing values when making inferences. Some new na.action functions
would also be welcome. These functions could develop imputation rules
(using tree, nonparametric regression, nearest neighbor, etc.) that
could be saved and re-executed on demand. Imputations can be tedious
and it's a shame to have to re-develop imputation models for each
analysis. The imputation function could save enough information to
be able to repeat the development of the imputation rule as quickly as
possible, so that you could put this step inside a bootstrap loop in order
to be able to properly account for this component of variation. Interested
users may want to look at the impute and transcan functions in my
Hmisc library for some other ideas.
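
The idea in item 4 of re-developing the imputation rule inside the bootstrap loop can be sketched as follows. This is a Python illustration with hypothetical names (fit_imputer, impute, bootstrap_with_imputation) and a deliberately simple rule (regress y on x using the complete rows); it is not how impute or transcan work.

```python
import random
import statistics

def fit_imputer(rows):
    """Develop an imputation rule for y from x using the complete rows.

    Returns a function x -> predicted y, or None if no row has y observed.
    """
    complete = [(x, y) for x, y in rows if y is not None]
    if not complete:
        return None
    xbar = statistics.mean(x for x, _ in complete)
    ybar = statistics.mean(y for _, y in complete)
    sxx = sum((x - xbar) ** 2 for x, _ in complete)
    if sxx == 0:
        # All observed x identical: fall back to the observed mean of y
        return lambda x: ybar
    slope = sum((x - xbar) * (y - ybar) for x, y in complete) / sxx
    return lambda x: ybar + slope * (x - xbar)

def impute(rows, rule):
    """Fill in missing y values using a previously developed rule."""
    return [(x, y if y is not None else rule(x)) for x, y in rows]

def bootstrap_with_imputation(rows, stat, n_boot=500, seed=0):
    """Bootstrap a statistic, re-developing the imputation rule each time."""
    rng = random.Random(seed)
    n = len(rows)
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in range(n)]
        rule = fit_imputer(sample)   # rule is rebuilt on every resample
        if rule is None:
            continue                 # no observed y in this resample
        estimates.append(stat(impute(sample, rule)))
    return estimates

def mean_y(rows):
    return statistics.mean(y for _, y in rows)

# Hypothetical data: y is missing (None) for three of twelve rows
rows = [(1.0, 2.3), (2.0, 3.9), (3.0, None), (4.0, 8.1),
        (5.0, 9.8), (6.0, None), (7.0, 14.2), (8.0, 15.9),
        (9.0, None), (10.0, 20.1), (11.0, 22.0), (12.0, 23.8)]

estimates = bootstrap_with_imputation(rows, mean_y)
print("bootstrap SE of mean(y), imputation included:",
      statistics.stdev(estimates))
```

Because the rule is refitted on each resample, the spread of the estimates includes the variation due to developing the imputation model, which is lost if the data are imputed once and then bootstrapped.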

5. Anything that helps with non-randomly missing serial data.

6. A world-class online help facility that allows users to navigate in many
ways, e.g., getting to a comprehensive set of examples of managing
and recoding data. For Windows users, where installing an add-on
library is as easy as unzipping a .zip file, it would be nice to have a
help button that updates the local PC from a master table of contents of
libraries available from statlib; another button would automatically
download and install a library. See how Microsoft (yes they do a few
things right) allows users to easily update Office products.

When deciding on future directions for software all of the debates about statistics
come alive. I know that many will criticize my point of view. I just wanted to give my
$.02 worth from the standpoint of an applied biostatistician.

Frank E Harrell Jr
Professor of Biostatistics and Statistics
Director, Division of Biostatistics and Epidemiology
Dept of Health Evaluation Sciences
University of Virginia School of Medicine
