[S] Summary: Statistical Feature Requests

Charles Roosen (roosen@statsci.com)
Tue, 31 Mar 1998 12:56:52 -0800


Last week I asked for input regarding statistical enhancements for future
versions of S-PLUS. As promised, I've compiled the responses and am
attaching them as a text file. I've anonymized them, with the exception of
retaining a few names of those mentioning their algorithms.

The summary is that there isn't a consensus as to what the primary needs
are. The most-mentioned direction was probably MCMC methods, with mixed
effect models a close second. The techniques suggested span the
spectrum of statistical areas.

Thanks to all of those who responded. I'll circulate these anonymized
responses internally to the relevant parties here.

Charlie Roosen

[Attachment: statreq.txt]

Time series stuff: functions used in Box-Jenkins type analyses (i.e.
ARIMA)
Limited Dependent Variable and Duration models (truncated/censored
models)
Lag models
Discrete dependent variable models like multinomial logit and
multivariate probit

I know that most if not all of the above are already implemented, some
within the basic library, the rest as user-defined functions. As a
novice S+ user it would be nice to get thoroughly weaned from SAS PROCs
and DOS applications like LIMDEP without having to search through all of
the many S+ function libraries out there in the information jungle to
find them.

Thanks for asking, Charlie.

##########

>1) What statistical techniques which you use regularly are not available
>in S-PLUS ?

I wasn't aware of any such animal. :)

>2) What "standard" techniques does S-PLUS need to be an even more
>well-rounded basic statistics package?

There is no more "well-rounded" package.

>3) What hot new techniques should we add to S-PLUS as part of
>commitment to stay on the cutting edge of statistical computing?

If there is a way to execute a bootstrap which doesn't
take ten hours or eat up all the memory, go for it.
I do everything in S-Plus, except bootstrapping. Bootstrapping
not only is here to stay, it's getting bigger every day.

>4) What user-contributed functions in StatLib have you found to be of
>both high value and high quality, and thus contenders for inclusion in
>S-PLUS?

(1) derivatives for the lgamma function.

(2) deriv3

I can't believe these didn't make it into 4.0

Here's another one. For hist, allow the user to specify
the number of bins. This "recommendation" stuff is
nonsense. S-PLUS does NOT know better than I,
how many bins should be in a histogram. This is
INFURIATING.

###############

I use Jack Lee's blip plots and event charts; Harrell's restricted cubic
splines for logit and Cox PH models, and Bowman and Azzalini's survival
smoothing programs for censored data.

I would like to see you add kernel-based hazard function estimation for
possibly right-censored and/or delayed-entry data. Preferably it should
include local bandwidths and left boundary correction (e.g., Mueller and
Wang, Biometrics, 1994). Egret has this built in, but I'm not aware of any
other programs which offer it.

Below is a simple S+ function for global bandwidth kernel estimation
(without boundary correction) of the hazard function from possibly
right-censored data.

kernhaz <- function(times, status, estgrid, bandwidth) {
  # times is a vector of survival times
  # status is a vector of censoring indicators (0=alive, 1=dead)
  # estgrid is a vector of points at which to estimate the function
  # bandwidth is the global kernel bandwidth
  n <- length(times)         # num of times
  d <- sum(status)           # num of events
  e <- length(estgrid)       # num of points in estimation grid
  ord <- order(times)
  times <- times[ord]
  status <- status[ord]
  hazest <- rep(NA, e)
  dt <- times[status == 1]   # death times
  udt <- unique(dt)          # unique death times
  nudt <- length(udt)        # num of unique death times
  nd <- rep(NA, nudt)        # num of deaths at ith unique death time
  nr <- rep(NA, nudt)        # num at risk at ith unique death time
  for (i in 1:nudt) {
    nd[i] <- sum(dt == udt[i])
    nr[i] <- sum(times >= udt[i])
  }
  for (i in 1:e) {
    total <- 0               # renamed from "sum" to avoid masking sum()
    x0 <- estgrid[i]
    for (j in 1:nudt) {
      arg <- ((udt[j] - x0) / bandwidth)^2
      # Epanechnikov kernel weight 0.75*(1-u^2) applied to each hazard increment
      if (arg < 1) total <- total + (1 - arg) * nd[j] / nr[j]
    }
    hazest[i] <- total * 0.75 / bandwidth
  }
  # a named list carries the same components as the multi-argument return()
  return(list(estgrid = estgrid, hazest = hazest, nr = nr, nd = nd, udt = udt))
}

##########

Hi

Suggestions for improvement:

Splus is far behind in Econometrics, compared to packages such as
SAS and GAUSS. Procedures such as Seemingly Unrelated Regressions,
the Multivariate Linear Regression, Simultaneous Equations should be
incorporated in a user friendly way.

A good Econometrics package should have optimization routines geared
at the standard functions that are optimized in Econometrics, namely
GMM and Likelihood. These should allow for either analytical derivatives
or numerical derivatives.

Multivariate Simulations should be doable simply on the basis of
a vector of means and a Variance Covariance Matrix for the normal
distribution.
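
[Editor's sketch, not part of the original reply.] The request above could
look roughly like the following, assuming a Cholesky factorization of the
covariance matrix is acceptable. The function name rmvnorm.sketch and its
arguments are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: draw n multivariate normal vectors from a mean
# vector mu and a variance-covariance matrix Sigma.
rmvnorm.sketch <- function(n, mu, Sigma) {
  p <- length(mu)
  R <- chol(Sigma)                        # upper triangular, t(R) %*% R equals Sigma
  Z <- matrix(rnorm(n * p), nrow = n, ncol = p)   # independent N(0,1) draws
  sweep(Z %*% R, 2, mu, "+")              # rows now have covariance Sigma and mean mu
}

# Usage: 1000 draws from a bivariate normal (illustrative values only)
Sigma <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
x <- rmvnorm.sketch(1000, c(1, -1), Sigma)
```

The sketch assumes Sigma is symmetric and positive definite; chol() fails
otherwise.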

The slow performance of SPLUS is also a problem as is documented in
a recent Journal of Applied Econometrics article. Several of my
Graduate Students this semester refused to switch to Splus and stayed with
Gauss on account of the slow performance.

The help is not intuitive. It may be for ATT rocket scientists but it
is difficult to find topics using standard econometrics terminology.

Splus is still not able to easily produce WYSIWYG graphics in Unix.
What I see on my X11 window is not what I will see on the page.

> 3) What hot new techniques should we add to S-PLUS as part of commitment to
> stay on the cutting edge of statistical computing?

Bayesian Analysis

#########

On Fri, 27 Mar 1998 13:02:06 -0800, you wrote:

>2) What "standard" techniques does S-PLUS need to be an even more
>well-rounded basic statistics package?
>
>3) What hot new techniques should we add to S-PLUS as part of commitment to
>stay on the cutting edge of statistical computing?

I'm not sure which category these fall into, but MCMC methods can't
reasonably be done in S-PLUS now because of the loop speed. If you
added a module for those that covered the sorts of models BUGS covers,
and made user-built simulations feasible, that would be great!

For cutting edge, you should use Propp and Wilson or other methods to
do exact MCMC simulation (i.e. no burn-in problems). (This isn't
quite serious: I don't think anyone knows how to do these on general
purpose models yet. But I think it's coming within the next year or
two, and I'd love to be able to develop these methods in S-PLUS.)

##########

I occasionally need a conditional likelihood logistic regression package for
matched case-control studies. I use a free program called PECAN.

#########

Dear Dr Roosen

>1) What statistical techniques which you use regularly are not available in
>S-PLUS ?
>
>2) What "standard" techniques does S-PLUS need to be an even more
>well-rounded basic statistics package?
>
>3) What hot new techniques should we add to S-PLUS as part of commitment to
>stay on the cutting edge of statistical computing?

I have been for some time writing software implementing a large family
of multivariate statistical models. It is a natural extension of glm()
and gam(). Details can be found in
http://www.stat.auckland.ac.nz/yee/vgam/latest.html. Completion is in
sight.

I think the software is very significant since it fits many important
multivariate models in a single program/algorithm. The software addresses
and fits in with the three aims above. Indeed, I'm sure the software
will one day be a part of Splus itself. Ross Ihaka has expressed
interest in putting it in R.

I plan to submit a paper to a journal at the same time the software is
released (Journal of Computational Statistics and Graphics?; this is
because there are implementation details worth publishing). Altogether
I'm excited about it all.

This is to notify you of its existence and its release in the near
future; it wouldn't be good for other developers to duplicate the work
done in this area already. My last day here at Stanford will be next
Monday, and I will return to NZ on 6/7 April. I have been collaborating
with Trevor Hastie (the author of glm() and gam()) while I have been
here.

Yours sincerely

######

hi, here are my inputs

> 1) What statistical techniques which you use regularly are not available in
> S-PLUS ?
better model based clustering, i.e. EM, mixture models not limited to
Gaussian/Bernoulli

> 2) What "standard" techniques does S-PLUS need to be an even more
> well-rounded basic statistics package?
again, EM

> 3) What hot new techniques should we add to S-PLUS as part of commitment to
> stay on the cutting edge of statistical computing?
graphical models, bayesian networks

> 4) What user-contributed functions in StatLib have you found to be of both
> high value and high quality, and thus contenders for inclusion in S-PLUS?
possibly a neural network toolbox?

##########

Charles,

Thanks for thinking of us users.

I've used S[+] since the 'old S' language was all you could get.

A few years ago, there was a thread about the SAS'ification of Splus.
The notion was that addition of a feature here and a feature
there---creeping featurism---could eventually turn Splus into a burly
monster like SAS.

Having lots of features is generally less useful to me than having a few
features I can manipulate to good effect. I almost always make minor
'enhancements' of my own in the course of performing even fairly simple
analyses, and I prefer to have a smallish base set of functions to work
with. And I also take on some large programming tasks (e.g. MCMC
approaches to genetic linkage problems and mixed models that don't fit
raov, varcomp, or lme) from time to time. In each of these, it really
helps me to be able to recall from my own memory the tool I need to use
and the syntax/semantics of the tool. Then I can write up the code I
need and produce a useful result in short order.

I recognize that many Splus users are fairly weak programmers. And
nobody wants to waste an afternoon coding up a procedure that is a
point-and-click away in some stat package. But sometimes adding a
package doesn't solve so many problems after all.

I was elated when nlme() came out, but quickly determined that it
doesn't handle most of the problems I would want to throw at it. It is
easy enough to write the code to do those procedures in Splus, but with
datasets of even the modest size I encounter it would take forever to
run the procedures in Splus. So, I spend a lot of time writing C code
after prototyping the functions in Splus. It would really help to be
able to compile the Splus functions, even in a limited way.

The graphical strategies in Trellis ARE cunning, but I find the
implementation even more awkward than the old S routines. I hardly use
the Trellis 'package', even though I have one or more graphics windows
on my screen most of the time. Working through the documentation and the
semantics of Trellis hasn't seemed worthwhile when I can use a few
simple primitives to get useful results.

Splus has a very capable community of users who already provide
extensions to the language. Rather than adding 'features', which you
must then support, it might be better to add functionality that would
allow those users to work more effectively.

A few examples along those lines:

Better documentation of lower level features of Splus would be helpful.

How does dqr differ in Splus from its ancestor in Lapack?

Why isn't there a man page for 'unpaste'?

Some generalizations of inner/outer products in efficient
implementations would be most welcome.

A generalized cumsum would also be useful (semantically, cumfun <-
function( x, fun ) { res <- x; tmpres <- res[1]; for (i in seq(along=x)[-1])
res[i] <- tmpres <- fun(tmpres,x[i]); res} ) if implemented efficiently.
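
[Editor's sketch, not part of the original reply.] The semantics of the
generalized cumsum described above can be illustrated with a plain
interpreted loop. The name cumfun.sketch is hypothetical, and this version
makes no attempt at the efficiency the respondent is asking for.

```r
# Generalized cumulative function: res[i] = fun(res[i-1], x[i]).
cumfun.sketch <- function(x, fun) {
  res <- x
  for (i in seq(along = x)[-1]) {     # indices 2..length(x); empty if length(x) <= 1
    res[i] <- fun(res[i - 1], x[i])
  }
  res
}

cumfun.sketch(c(3, 1, 4, 1, 5), max)                  # running maximum: 3 3 4 4 5
cumfun.sketch(c(1, 2, 3, 4), function(a, b) a * b)    # same as cumprod: 1 2 6 24
```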

#########

My initial reaction:

I think adding to the automatic visualization routines would be of interest.
I am fascinated by some of the graphic tools found in SPSS Diamond, particularly
Parallel Coordinates (which is developed in a couple of other Stats packages as
well).
I think in general SPlus should always be offering advanced graphical tools,
because this is one of its selling points to those who get really squeamish
about learning a command language.

##########

Dear Charles,

We have bought your splus 4.0 for win NT version. I do enjoy the product.

But one thing I keep thinking: Why don't you provide us with some API
functions that we can call from other languages? A good strategy is to
package all your functions into some .dll files, and you can make it a
commercial library to be sold worldwide. This would spare users a lot
of pain in developing statistical functions using other languages.

I'd like to hear your opinion.

##########

Mr. Roosen:

You should see what happens when I give the ls() command!

You are asking for suggestions. Here is a simple one.

Make a directory structure easily available in the standard system.

Make the process of doing scientific work once again an orderly and
historically traceable thing. I don't want to have to search forever for
that thing I did last year and whose name has been forgotten.

######

This is in response to Charlie Roosen's (of MathSoft) query regarding
product ideas...

1. In the context of time series analysis, the options for doing transfer
function modelling in the time domain seem to be too limited and
should be extended. This would be particularly useful in applications
to finance.

2. Looping in S-Plus still needs to be improved and constitutes a major
limitation of the language for certain applications such as image
processing, etc.

3. S-Plus should implement a *much* wider variety of the mathematical
special functions. The need for these constantly arises in many problems
(such as in statistical nonparametric inversion problems) and having to
link Fortran code to S-plus to handle such applications is unpleasant.
On a recent occasion I needed the error function for complex valued
arguments. This and a great many other mathematical special functions
should be available in S-plus as a matter of routine and would make the
language more useful to many users.

4. The on-line help pages could be significantly improved.

The 'suggestions' above may perhaps not all be exactly in line with what
you were seeking but: Thank you for asking!!

######

1. I would appreciate finding also the noncentral distributions
handled the same way as the central distributions in S-Plus.
2. Users coming from the applications field don't like to worry
about inclusion of C- or Fortran code - independently of whether it is
easy or not. The system should allow complete and effective solutions in
its own frame. As I understood - one of the basic objectives during the
development of S was: "Don't worry about programming - concentrate on
solving the statistical (scientific) problem."


########

Hello,

concerning the possible enhancements for S-Plus:

In my environment, we often use experimental designs.
I have a licence of the DOX module, and find it very
convenient. It is however quite expensive.

I would suggest that DOX should be part of future S-Plus
versions, to promote via S-Plus the use of experimental
designs. The surface, contour and image functions for
Response Surface Designs need to be "Trellis-ized".

Mixture designs (where the ingredient factors sum up to
a constant) should also be integrated, with a plot function
to visualize the response surface in triangles (see the
book by Khuri et al.).

##########

Charlie Roosen kindly invited users' suggestions on statistical
enhancements to S+, so here are some initial thoughts :-

1. Put the non-central distributions on the same basis
as the central.

2. Add an option to VARCOMP to obtain estimated variances
by method-of-moments; alternatively, enable RAOV to
deal with unbalanced data.

3. Provide a set of QC facilities along the lines of QTOOLBOX
and perhaps lose the existing qcc-object based stuff ;-)

4. Incorporate DOX facilities into mainstream S+

5. Not really a statistical one this - provide a method of
producing smoothly 'animated' graphics.

##########

fao Charles Roosen--

THESE ARE A FEW OF MY FAVOURITE THINGS:

In fitting GAMs:

(1) the ability to have independent smooths in different factor levels. EG
growth~ lo( temperature) %in% gender

(2) cyclic smoothers, e.g. depth~lo( julian.date, period=365). The "supsmu" function can already do this (but not in GAMs), and
there's absolutely no reason why "lo" and "s" couldn't be adapted. Although one can fudge a result via
depth~lo( sin( 2*pi*julian.date/365)) + lo( cos( 2*pi*julian.date/365)), this is less satisfactory for several reasons, e.g. how to interpret degrees-of-freedom.

In "survreg":
(3) the ability to handle negative responses without complaint. This behaviour could be governed by a
parameter "allow.negative=F" in the arguments, for consistency with the current version. I have used "survreg" to
analyze censored non-survival data (at the suggestion of s-news; I hadn't realized it was useful outside
survival analysis) and it was awkward to have to adjust my data to fit the program's requirements.
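
[Editor's sketch, not part of the original reply.] The sine/cosine
workaround mentioned under item (2) relies on mapping a date onto one cycle
of the chosen period, so that dates exactly one period apart get identical
covariate values. A minimal illustration follows, assuming a period of 365
days. The name cyclic.terms is hypothetical, and the step of passing these
terms to lo() inside a gam() formula is not shown.

```r
# Hypothetical helper: harmonic terms for a cyclic covariate.
cyclic.terms <- function(day, period = 365) {
  angle <- 2 * pi * day / period       # one full turn per period
  cbind(sin.term = sin(angle), cos.term = cos(angle))
}

# Day 10 and day 375 (one period later) map to the same point on the cycle,
# up to floating-point rounding.
this.year <- cyclic.terms(10)
next.year <- cyclic.terms(10 + 365)
```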

Unlike other respondents, I have had no problems linking external code to S-PLUS (in my case, DLLs written in Delphi); it's easy and incredibly useful. But I appreciate that things are not so straightforward on other platforms.

Thanks for asking--

#########

Charlie,

First, let me start off by saying that I am a most enthusiastic user of SPlus. (At my suggestion, given prior to 4.0, I have had SPlus installed in one bank and one insurance modeling firm.) As far as I'm concerned, SPlus can do more and prettier than any of the other packages out there. With one glaring exception: SPlus 4.0 is awful from a performance point of view. Just this evening I read something on the s-news about having to wait over ten minutes to open SPlus with a moderate data set. I hate SAS but still have to use it just because SPlus can't handle the large datasets that I use. At one of my work stations I have 4.0 but only use the 3.0 capabilities. At my other workstation I haven't even loaded 4.0 because I can't handle the extreme performance degradation.

In summary, I guess I would have to say that I think that you should spend all your time and effort on fixing the performance problems introduced in 4.0 and drop all the so-called GUI improvements (-- I still can't make a graph look the way I want unless I use the command line mode) before thinking about adding more statistical capabilities. I don't mind arcane syntax as long as it works, something the drag-and-drop paradigm doesn't!

I love what SPlus can do and does, but you've got to fix these problems before I consider recommending or buying again.

If there is any way at all in which I can help, please let me know.

Thanks

#######

Charlie,
A key concern of mine is to implement 'sliders' in SPLUS. Some of
us have begun to 'brew' our own, but it would be nice if sliders, 1 or
2 dimensional (?) at least, were to be implemented in successive versions
of the language. Can you confirm whether this feature is planned, and if
so when we might expect it?

##########

Dr. Roosen,

Recently I have been using pretty intensively the KernSmooth library
written by Matt Wand. Essentially it does binned kernel smoothing (for
2D-density as well). I think it's a worthwhile complement to the ksmooth
function of Splus (I am using 3.3 for Unix, I don't know if in 3.4 it is
better), especially in the speed when a lot of data is involved, although
I believe some improvements can be done.

Matt Wand has a newer version than the one available at Statlib. It is
available from his webpage at

http://www.biostat.harvard.edu/~mwand/software.html

I would like also to see some function that does GMM (generalized method
of moments), which is widely used in economics, especially with financial
data. The only time I used it I programmed it myself.

About new topics, I know some people here are working on functional data,
i.e. treating curves as data. I don't know much about it and don't know
if they have come out with anything that could be used in practice,
but I could give you the name and e-mails of those people if you are
interested. They use mainly Matlab, but some can use Splus as well.

I appreciate that you asked users about their needs.

#########

MCMC sampling. Yes, you can write your own algorithms, but from the
posts to this list, it appears people don't have time to write their
own algorithms for things they need.

Maybe just a connection with WinBUGS (BUGS) is enough. I believe BUGS
does MCMC sampling faster than algorithms written in Splus....at least,
_my_ algorithms. This could include the CODA library with updated
diagnostic tools for Splus 4.? .

Maybe also a "Bayesian module" in general...

###########

It's going to be difficult for any company to incorporate (and even more difficult to maintain,
as authors are constantly improving them) all the libraries Tim and others might want.
I suggest having a special help menu item that describes all available libraries for the
platform the user is using, and that has a 'free stuff on the internet' button that will automatically
download and install libraries on demand, much as Microsoft does with Word and other office
products. Some of the libraries come with extensive manuals (e.g., the new rpart (CART)
library). It would be nice to get these online too, say in .pdf format. -Frank

##########

Bernard Silverman and I would like to propose some extensions in the direction
that would be useful for the statistical analysis of curves, or what we
call functional data analysis. We have a suite of functions in preparation
for this, as well as in Matlab, and we've kept them all at the level of
SPLUS. But there are various extensions that would make SPLUS a more
convenient and efficient environment. See our book, Functional Data
Analysis (1997) Springer for more info.

[...]

Jim Ramsay

#########

Hi Charlie,

>1) What statistical techniques which you use regularly are not
>available in S-PLUS ?
>

See (2).

>2) What "standard" techniques does S-PLUS need to be an even more
>well-rounded basic statistics package?
>

I use a lot of Latent Variable Structural Equation Modeling (SEM).
While simple models of this type could be expressed in S-PLUS by
setting up the proper matrix formula and minimizing, there is no
mechanism that I know of for setting constraints (equality, boundary
or nonlinear) or fitting unbalanced multiple group models. For
this many people use LISREL or the plugin for SPSS, Amos. I use
a free program called Mx (Neale 1994).

There are a great variety of these modeling techniques and they
are widely used in psychometric, econometric and educational
statistics. S-PLUS has very little overlap with this field. You
might want to check into a newsgroup called SEMNET (send a message
to LISTSERV@UA1VM.UA.EDU with "SUBSCRIBE SEMNET Charles Roosen" as
the body of the message) that is at least as active as S-PLUS and
will give you a whole different outlook on statistics than the
tradition to which S-PLUS belongs. I subscribe to both lists because
it gives me a broader look at the field.

There is some overlap, for instance people on SEMNET spend a lot
of time discussing factor analysis. However, they make a distinction
between exploratory factor analysis (such as is available in S-PLUS)
and confirmatory factor analysis (in which constraints are placed
on coefficients during minimization).

Another technique that could be easily added that has become quite
popular in education and psychology is Hierarchical Linear Modeling
(see Raudenbush 1995). S-PLUS could add this easily. HLM is a
variant of a more general notion of multiple level modeling (see
Bock 1989, Muthen 1989, 1994). Adding multilevel modeling requires
adding SEM.

References

Bock, RD (1989). Multilevel Analysis of Educational Data. San Diego:
Academic Press.

Muthen, B. O. 1994. "Multilevel covariance structure analysis."
Sociological Methods and Research. Vol. 22, No. 3:376-398.

Muthen, B. O. and Satorra, A. 1989. "Multilevel aspects of varying
parameters in structural models." In Darrell Bock (ed) Multilevel
Analysis of Educational Data. New York: Academic Press.

Neale, M. C. 1994. Mx: Statistical modeling. Box 710 MCV, Richmond,
VA 223298: Department of Psychiatry. 2nd Edition

Raudenbush, S (1995). A multivariate hierarchical model for studying
change within married couples. Journal of Family Psychology, 9,
161-174.

>3) What hot new techniques should we add to S-PLUS as part of
>commitment to stay on the cutting edge of statistical computing?
>

I've been working in the field of dynamical systems modeling. This
is certainly a hot topic and quite new. On the other hand, there
are only a few methods that don't require very large data sets.
People have gravitated towards MATLAB for these methods because
S-PLUS has a memory management weakness and also because MATLAB
can translate code into C source for compilation. The sorts of
algorithms that are in use (such as mutual information calculation,
surrogate data techniques, false nearest neighbors, and nonlinear
noise reduction) involve _lots_ of CPU cycles spent in tight loops
so a compiled program becomes a necessity. See Abarbanel et al
(1993) for an overview of some of these methods.

Abarbanel, Brown, Sidorowich, & Tsimring 1993. The analysis of
observed chaotic data in physical systems. Reviews of Modern Physics,
65:4, 1331--1392.

>4) What user-contributed functions in StatLib have you found to be
>of both high value and high quality, and thus contenders for
>inclusion in S-PLUS?
>

###########

One question I would like to ask regards the general performance of S-Plus.
I have a colleague who has been evaluating S-Plus and he claims the system
is extremely slow when compared to other programs he is using such as Systat
and SPSS. Is this consistent with your tests in general, and if so, why is
it the case and is it going to be addressed in the future?

Thanks much,

##############

A suggestion:

A user friendly suite of additional function minimization routines would
be wonderful. Those in splus now work well, but it's not unusual to see
questions from people writing a (say mle) function and having problems
passing parameters or data to the call to nls or nlminb (despite the
helpful examples given by V&R).

There are a bunch of other routines which could be added: a fast fortran
or C implementation of the Nelder-Mead and simulated annealing routines
would be a big plus. Also, conjugate gradient and quasi-Newton routines (other
than those presently in Splus) would be helpful to anyone whose
Newton-Raphson routine has failed. If these ideas are implemented, I
would hope that nonlinear constraints and numerical approximations to
derivatives (which can be returned) would be included.

SAS has implemented a solid, robust suite of these types of functions,
and these are the only things keeping me from using Splus for 99% of my
computing right now.

############

The multivariate distribution free tests of Wei and Lachin and Wei and
Johnson are a good idea. There has been a fortran implementation of
these methods published by Davis (see below) which may save some time
in creating Splus functions.

Some references:

Wei, L. J. and Lachin, J. M. ``Two-Sample Asymptotically
Distribution-Free Tests for Incomplete Multivariate Observations'',
Journal of the American Statistical Association 79:653-661, 1984.

Lachin, J. L. ``Some Large Sample Distribution-Free Estimators
and Tests for Multivariate Partially Incomplete Data from Two Populations'',
Statistics in Medicine 11:1151-1170, 1996.

**** Davis, C. S ``A Computer Program for Nonparametric Analysis
of Incomplete Repeated Measures from Two Samples'', Computer Methods
and Programs in Biomedicine 42:39-52, 1994.

Palesch, Y. Y. and Lachin, J. M. ``Asymptotically
Distribution-Free Multivariate Rank Tests for Multiple Samples with
Partially Incomplete Observations'', Statistica Sinica 4:373-387,
1994.

Wei, L. J. and Johnson, W. E. ``Combining Dependent Tests with
Incomplete Repeated Measurements'', Biometrika 72:359-364, 1985.

###########

Charles,

I would hope that you might consider finally adding quantile regression to Splus.
Doug and I talked about this in the mid-80's, and again about 3 or 4 years ago.
The idea is now 20 years old and there are an increasing number of applications
in econometrics, biostatistics and other fields. Stata has pushed this aspect of
their product fairly hard and has attracted a considerable following in econometrics
partially because of it.

In the latest Stat Science, Steve Portnoy and I describe some new computational ideas
for these methods which make them comparable to least squares speed in large problems
and this, we hope, will expand the scope of their applicability. I would be happy
to talk further about this, provide references, etc.

##########

> 1) What statistical techniques which you use regularly are not available in
> S-PLUS ?
an analysis of agreement between multiple ratings.
reliability analyses

>
> 2) What "standard" techniques does S-PLUS need to be an even more
> well-rounded basic statistics package?
GAMs for Cox regression.
integration of GEE functions (currently available from statlib or
OSWALD) as standard Splus features.
GAMs for longitudinal data (from GEE)

>
> 3) What hot new techniques should we add to S-PLUS as part of commitment to
> stay on the cutting edge of statistical computing?
standard bootstrap functions that can be used with GAMs to estimate
variances comparing two values of the GAMs. (e.g., logistic regression in
which one wants to calculate the adjusted odds ratio between values of x,
where x is estimated from a GAM)

>
> 4) What user-contributed functions in StatLib have you found to be of both
> high value and high quality, and thus contenders for inclusion in S-PLUS?
gee functions

########

At 01:02 PM 3/27/1998 -0800, you wrote:
>Dear S-PLUS users,
>
[...]
>In particular, I'm taking the
>lead on sorting out what new statistical functionality to add. I'll compile
>replies made to me personally (roosen@statsci.com), or discuss amongst
>yourselves.
>
Please also think about contributing a version of the summary for our
soon-to-be updated S-News FAQ(:-)

>Some questions:
>
>1) What statistical techniques which you use regularly are not available in
>S-PLUS ?

sample size/power calculations for multi-stage designs using ranking and
selection approach [Peter Thall, Biometrics ...]

>
>2) What "standard" techniques does S-PLUS need to be an even more
>well-rounded basic statistics package?

Personally, it's not missing very much!!

>
>3) What hot new techniques should we add to S-PLUS as part of commitment to
>stay on the cutting edge of statistical computing?

Perhaps more methods for 'exact' tests (e.g. the sort of things in Cyrus
Mehta's outrageously expensive StatXact). For other methods, I'll defer
to my colleagues on the cutting edge!

>
>4) What user-contributed functions in StatLib have you found to be of both
>high value and high quality, and thus contenders for inclusion in S-PLUS?

[1] Clive Loader's locfit
[2] Alan Z's psfonts

Caveat: I have chosen to use S-Plus v 3.3 in Win 3.1, waiting for perhaps
S-Plus 4.5 (and probably Windows NT 5, or Linux) ... so I may be
recommending something already available without knowing it!

Good Luck! I think MathSoft has taken on a huge task in trying to expand
the user base by including the new gui (expanding the user base helps the
rest of us, if that helps to keep the prices down) and at the same time,
continue to incorporate cutting edge statistical methods.

########

Charlie:

I suspect you'll hear this from a lot of folks. I think the most needed new
functionality consistent with S philosophy is neural nets. The existing
libraries undoubtedly provide a base, but you should think hard about a
dynamic interface for setting/updating/investigating parameters and topology
of connections, a la tree().

Second choices: A MARS implementation; upgrade tree() to allow deviance of
mean or better yet median absolute deviations for regression trees. I
realize that this may present computational problems, however.

##########

This is not statistics, but a good tabulate procedure similar or
better than PROC TABULATE in SAS or TABLES in SPSS that can produce
high quality table reports would be very useful.

############

Charlie,
I consider graphical techniques as statistical enhancements.
Therefore, I really wonder if there are considerations of having a
rounded-off version of the interactive graphics in S-PLUS. Brush and spin is
really an old and not very nice implementation, and my feeling is that brush
and spin are only there to have some interactive graphs in S-PLUS. It is not
at all a nice and rounded-off implementation. Look at other programs like
JMP from SAS, they offer much more in terms of interaction with graphs.
As a very simple example, one cannot change the display style in brush and
spin, let alone colors. In our implementation (SGI 3.4), there are some bugs
as well. Zooming in into the 3d graph in brush lets points fly into the
other windows, for example.

S-PLUS used to be on the leading edge of modern graphics and visualization,
and Trellis contributed substantially to it. Nevertheless, there is not so
much emphasis on graphics any more.
To be amongst the most modern systems, it requires quite some enhancements
in terms of interactive data analysis and displaying methods as well. An
example for modern (static) but nevertheless already wide-spread displays is
the mosaic plots.
I would be willing to volunteer to come up with a more detailed proposal,
given there is sufficient interest on your side.

########

three suggestions:

(1)

On Sun, 29 Mar 1998 andrey@utstat.toronto.edu wrote:

> 2. Looping in S-Plus still needs to be improved and constitutes a major
> limitation of the language for certain applications such as image
> processing, etc.

I agree! Let me add: in case the general looping problem cannot be
solved in S+ as an interpreted language, an efficient function for a
special case of looping would be very helpful:

successive processing of list elements where processing element i depends
on the results of processing element i-1. This could be implemented in
an lapply-like fashion.

Currently we can do elementwise + with lapply(), but not the cumsum()
task.
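
The dependent list processing requested in (1) is what is often called a
"scan". As a minimal sketch, written in Python rather than S and using
the hypothetical helper name scan(), each result is computed from the
previous result and the current list element:

```python
from itertools import accumulate

def scan(items, step, initial):
    """Process items in order; each state depends on the previous state.

    step(previous_state, item) returns the next state. The list of states
    after each item is returned (the initial state itself is dropped).
    """
    return list(accumulate(items, step, initial=initial))[1:]

# The cumsum() task over list elements, expressed as a scan.
running = scan([1, 2, 3, 4], lambda state, x: state + x, 0)
```

The same helper would cover recursions such as the product-limit
estimator, where the state is a matrix rather than a number.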

(2) With not too much effort, the coxph() function could be extended to
allow for Aalen-Johansen-Regression, as described in

Andersen P K, Hansen L S, Keiding N: Non- and Semi-parametric Estimation
of Transition Probabilities from Censored Observation of a
non-homogeneous Markov Process. Scand J Statist 18: 153-167 (1991)

i.e. a proportional-hazards regression model for non-homogeneous Markov
models. This requires joining the results of (separate or joint) Cox models
for all transitions, calculating the Aalen-Johansen (product-limit)
estimator for the transition matrix and its variance (both have the
recursive structure of (1), by the way).

(3) Furthermore I would find it helpful to have an efficient read- and
write-access to referenced objects and especially to parts of referenced
objects, as (sub-efficiently) demonstrated in my library REF. Having this
would help in solving (1), besides other advantages.

Best regards

##############

Hi Charlie,

How are things going in Seattle?
We are promoting S-PLUS quite a lot in the Netherlands and are giving many
demonstrations. I saw your mail on s-news the other day; here is what we
hear from people here on product enhancements/improvements:

*) Improved data manipulation and calculation of simple stats via the GUI
*) Wizards to create or modify GUI objects (especially customized dialogs)
*) Improved general (constrained) optimization routines
*) Structural equations (systems of linear regression)
*) Improved ODE solver, with better documentation.

###############

Dear Dr.Roosen:
I am an author of a set of innovative techniques for generation of random time
series & fields (both scalar and vector) with predetermined stochastic structure
(for example, time series with a prescribed spectrum and probability distribution
function SIMULTANEOUSLY). I am working on a commercial package covering this
topic. Unfortunately, because I am overloaded with my current research, the work
is progressing slowly. In order to speed it up, I would be happy to consider
some form of cooperation with MathSoft. If you are interested in anything of the
kind, would you please advise what kind of framework is ordinarily accepted for
this kind of cooperation?

###########

For what it's worth, here are some suggestions:

1) expanded nonlinear optimization routines similar to the MAXLIK and
CML libraries in GAUSS-- in addition you may want to add a simulated
annealing algorithm for completeness

2) structural equation modeling capabilities similar to those found
in LISREL or SAS's PROC CALIS

3) A GHK probability simulator as well as the ability to fit
multinomial probit models using the GHK simulator

4) It would also be nice if users were given the option of fitting
models from a Bayesian perspective. Obviously, for some models this
isn't possible from either a theoretical or computational standpoint,
however, for the linear model with normal disturbances, the linear
model with student-t disturbances, and the basic probit model
estimation is extremely straightforward and MCMC algorithms converge
very quickly and reliably. If this is incorporated, you may want to
add one of the convergence diagnostic functions such as gibbsit()
available at statlib.

Thanks for asking for user input.

#############

Dear S-PLUS users,

For this note I'm wearing the hat of statistical developer looking for
product ideas, as opposed to company spokesmodel.

We are at the point in the development cycle where we are assessing what new
features to add to future versions of S-PLUS. In particular, I'm taking the
lead on sorting out what new statistical functionality to add. I'll compile
replies made to me personally (roosen@statsci.com), or discuss amongst
yourselves.

Some questions:

1) What statistical techniques which you use regularly are not available in
S-PLUS ?

optimization of convex quadratic problems.

2) What "standard" techniques does S-PLUS need to be an even more
well-rounded basic statistics package?

3) What hot new techniques should we add to S-PLUS as part of commitment to
stay on the cutting edge of statistical computing?

Friedman's MARS, any other techniques available from Ripley's pattern
recognition book, nnet from Venables and Ripley, etc.

4) What user-contributed functions in StatLib have you found to be of both
high value and high quality, and thus contenders for inclusion in S-PLUS?

rcorr: correlation despite missing values. I extended it to allow
covariance and also exponential weighting of values
latex.table: print tables in latex format
formatC: formatted printing like C's printf
abind: multi dimensional generalization of rbind/cbind
dapply: dyadic version of apply
xgobi: like brush, only a few more capabilities
kde2D: Two-dimensional gaussian kernel density estimation

##########

Regarding your questions about improving S-PLUS, here are some suggestions:

- Simple and stratified epidemiological analysis for case-control and
cohort studies (Relative Risk, Odds Ratio, Attributable Risk, Incidence
Density, Mantel-Haenszel, Rates, etc.)

- Double-censored survival analysis, both parametric and semi-parametric

- Kaplan-Meier Estimate with staggered-entry

Thanks
#############

1. Documentation, and software if necessary, on how to do and interpret
mixed models in SPlus. I still have to go to PROC MIXED in SAS to do this.

2. I have trouble selecting specific cases to plug into the columns of the
pull down menus. For example, I'd like to specify
myframe[myframe$myfield==1,"mycol"] as the independent variable in a t-test.
I guess usually, the example is more complicated like myframe$myfield==1 &
myframe$myfield2<5.

3. I'd like to be able to do all subsets regression models (both linear and
logistic) so that I could get a report like this:

MODEL# VAR1 VAR2 ... BootstrappedROC
1 pvalue pvalue ... rocmeasure

for all possible subsets in logistic models. If the variable were not
selected for that particular model, it would be blank or NA or something. I
can partially do this with SAS macros but not in SPlus.

4. More documentation on performing and interpreting parametric survival
analyses in SPlus.

###################

Hi Charles,
These are mostly graphics things, but I'd like to see:

1) A function to determine whether a point is in or out of a polygon.
I know there's a points-in-poly routine in spatial stats, but I don't
have that. May have to get it when I get to that project.

2) Map data for Mexico and Canada including state (province) boundaries.
Again I know there's a world map data set on the internet, but it's
Unix-based and after 2 days of messing with it I gave up.

3) Color-filled contours as an option to contours or images. NOT
trellis, though. Sorry for laziness, but I never have figured out
trellis graphics.
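
For item 1, one common approach is the ray-casting (even-odd) rule:
count how many polygon edges a horizontal ray from the point crosses.
What follows is a minimal sketch in Python, not the routine from the
spatial stats module; the function name point_in_polygon() is
hypothetical, and points lying exactly on an edge are not treated
specially.

```python
def point_in_polygon(x, y, vertices):
    """Even-odd test: True if (x, y) lies inside the polygon.

    vertices is a list of (x, y) pairs in order around the polygon;
    the last vertex is implicitly joined back to the first.
    """
    inside = False
    j = len(vertices) - 1
    for i in range(len(vertices)):
        xi, yi = vertices[i]
        xj, yj = vertices[j]
        # Does this edge straddle the horizontal line through the point?
        if (yi > y) != (yj > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside
        j = i
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
centre_inside = point_in_polygon(2, 2, square)
far_inside = point_in_polygon(5, 2, square)
```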

Thanks for asking!

############

Dear Dr. Roosen:

First, I want to say I'm an ardent S-Plus supporter and enthusiast and
a constant S-Plus user. Thank you for your efforts.

One thing I would like in S-Plus is more permutation techniques, a la
StatXact and LogXact. For example, Fisher's exact test on higher
dimensional tables (with only a few levels on each margin, obviously).
I recently finished writing a function to do non-asymptotic
approximations for the test of independence in general r-way
contingency tables, and this should suffice when scientists here come
to me and ask what to do when the software says, "chi-squared invalid".
But it would be nice to have the exact methods as standard S-Plus for
the cases where it's computationally feasible. (I could write them, I
suppose....but you asked.)

The ability to do non-parametric "stuff" with ties. Confidence
intervals with non-parametric methods, where that's sensible. (Again,
I've written a function to do CIs with two-sample Wilcoxon, but I don't
have ties either.)

If something else comes to mind that's as generally useful, I'll pass
it along.

Here's a marketing note:

You might consider a library/module tailored to
epidemiologic/biomedical applications, which might include things like
ROC curves, matched analyses, and sample size computations for epi
studies, in their language. I know one can do these already, if one's
comfortable programming in S-Plus, but some of the epidemiologists
around here are wary of starting to use S-Plus because, for instance,
the output from the "logistic regression" button/dialogue doesn't report
odds ratios and associated confidence intervals...some work needs to be
done to get them. This might seem not to be a big deal to you and me,
because those are actually easy to compute from what output is
given/available. But before the Windows interface that actually has a
"logistic regression" button to push, the epi folks would not have known
that they would use glm() to do the analysis.
I'm pushing the use of S-Plus here in our division, for a variety of
reasons, but I'll be writing functions to pretty-up the glm() output
and also the sample size things for specific applications here (though
generally epi). [Guess you'd say, "that's why they hired you!"]
This is possible here because there is a statistician to write this
stuff, but if there weren't, S-Plus wouldn't be in the building. If
such a library/module were available, the epi folks would certainly be
more apt to use S-Plus, and so they might even without a statistician in
the building.
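
The computation referred to above is short. As a sketch only, assuming
a Wald-type interval and a coefficient and standard error read off the
logistic regression output (the function name odds_ratio_ci() and the
example numbers are made up for illustration), in Python:

```python
import math
from statistics import NormalDist

def odds_ratio_ci(beta, se, level=0.95):
    """Odds ratio and Wald confidence limits from a logit coefficient.

    beta is the fitted log-odds coefficient and se its standard error.
    Returns (odds_ratio, lower, upper).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return (math.exp(beta),
            math.exp(beta - z * se),
            math.exp(beta + z * se))

# Hypothetical coefficient and standard error, for illustration only.
odds_ratio, lower, upper = odds_ratio_ci(0.693, 0.25)
```

Profile-likelihood intervals would be an alternative when the Wald
approximation is doubtful.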

Cheers and thank you,

###########

In answer to your questions:

> 1) What statistical techniques which you use regularly are not available in
> S-PLUS ?

SE calculations for overdispersed glm fits, both with modification of iterative
weights, and as a post-fit add-on. I have some home-brew functions for
this, but need to wrap it in something publishable to justify the effort of
making these presentable. Encouragement would take the form of a suggested
journal.

> 2) What "standard" techniques does S-PLUS need to be an even more
> well-rounded basic statistics package?

Ordered categorical methods such as cumulative logit.

> 3) What hot new techniques should we add to S-PLUS as part of commitment to
> stay on the cutting edge of statistical computing?

Continued friendliness to user/developers.

> 4) What user-contributed functions in StatLib have you found to be of both
> high value and high quality, and thus contenders for inclusion in S-PLUS?

gee

###########

1. I would like to see the output for some functions better organized.
The lm function for example. I have never understood your choice of lm
immediate and summary output. The immediate output should look more or
less like an ANOVA table.

2. Adaptive sampling estimates.

3. Confidence intervals for nonparametric correlation coefficients.

4. I have found the Windows interface in Splus 4 to be really hard to
work with, and rather bare-bones statistically. SPSS has a
near-excellent interface that is intuitive and fast. I still have not
figured out how to change text sizes in plot labels in splus v 4 and do
not care to put much effort into it as it is always crashing.

##########

A lot of us use the Gibbs sampler; a module similar to BUGS would be helpful.

########

Regarding features to add to future version of S-Plus, I request that you
provide a way for users to select a stripped-down version that combines
functionality (when you need it) with the speed and leanness of Version
3.3. I would not consider an upgrade if it demanded more computer than V.
4, which I am not using because it is too slow and cumbersome (P-200, 32MB).

#########

I would greatly appreciate the computation of exact P-values in
nonparametric testing when ties are present. Currently, the function
wilcox.test only returns a warning message that no exact P-values can
be computed. For small samples exact p-values can be obtained by
enumerating the permutation distribution of the test statistic under
the null hypothesis. However, functions like combn (Scott Chasalow)
are too slow to be of any help in this. Exact p-values are of
particular importance in drug safety evaluation when sample size is
small and no unrealistic assumptions about the distribution of the
data can be made. At present, the technique is implemented in StatXact
3.0 (Cytel Software Corporation) and in SAS (Version 6.12, PROC
UNIVARIATE and PROC NPAR1WAY, option EXACT). Cytel also offers a PROC
STATXACT add-on to SAS. As I believe, the StatXact implementation uses
the network-algorithm & importance sampling (Mehta, Patel, Tsiatis:
Biometrics 40:819-825, 1984).

At present, I am only interested in the one and two sample problem
(Wilcoxon signed-rank and Wilcoxon rank-sum tests). The availability
of exact nonparametric inference is essential for further involvement
of S-Plus based systems in drug safety assessment.
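
To make the request concrete, here is a minimal sketch of the brute-force
enumeration mentioned above for the two-sample rank-sum test with ties,
written in Python. It uses midranks and a two-sided p-value based on the
distance of the rank sum from its null expectation; it is not the network
algorithm, it is only practical for small samples, and the function names
are hypothetical.

```python
from itertools import combinations

def midranks(values):
    """Ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def exact_rank_sum_pvalue(x, y):
    """Two-sided exact p-value by enumerating all group assignments."""
    ranks = midranks(list(x) + list(y))
    n = len(x)
    expected = n * (len(ranks) + 1) / 2  # null mean of the rank sum
    observed_dev = abs(sum(ranks[:n]) - expected)
    total = extreme = 0
    for chosen in combinations(range(len(ranks)), n):
        total += 1
        deviation = abs(sum(ranks[i] for i in chosen) - expected)
        if deviation >= observed_dev - 1e-9:  # tolerance for float sums
            extreme += 1
    return extreme / total

p_separated = exact_rank_sum_pvalue([1, 2, 3], [4, 5, 6])
```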



##############

Charles Roosen writes:
> Dear S-PLUS users,
>
> For this note I'm wearing the hat of statistical developer looking for
> product ideas, as opposed to company spokesmodel.
>
> We are at the point in the development cycle where we are assessing what new
> features to add to future versions of S-PLUS. In particular, I'm taking the
> lead on sorting out what new statistical functionality to add. I'll compile
> replies made to me personally (roosen@statsci.com), or discuss amongst
> yourselves.
>
> Some questions:
>
> 1) What statistical techniques which you use regularly are not available in
> S-PLUS ?
>

spss reliability procedure

> 2) What "standard" techniques does S-PLUS need to be an even more
> well-rounded basic statistics package?
>
> 3) What hot new techniques should we add to S-PLUS as part of commitment to
> stay on the cutting edge of statistical computing?
>
> 4) What user-contributed functions in StatLib have you found to be of both
> high value and high quality, and thus contenders for inclusion in S-PLUS?
>

Hmisc, Design, Display libraries of Frank Harrell

##########

I agree completely. The two biggest improvements S-Plus could make to its
statistical functionality would be to

1) Ditch Object Browser completely, and
2) Fix the way loops are handled.

Just as an aside, I have a little program that I run for amusement from time
to time. It displays a beer bottle on the screen; as you move the mouse
cursor toward it, the beer bottle moves away from the mouse cursor. Thus
you can chase the beer bottle across the screen with the mouse. Object
Browser is just as useless as this program is, without being nearly as
amusing.

##########

> Fortunately people like Trevor Hastie and Brian Ripley made some wonderful
> stuff available in their mda and MASS libraries, but I think it's time that
> at least some of this gets integrated into standard S-plus. The last time I
> tried to use Hastie's mda functions under Windows I failed, because they
> called Fortran and Unix code.

There is a port to Windows at

http://www.stats.ox.ac.uk/pub/SWin

which does work. Actually, I had to correct the Fortran code to be
valid Fortran to get it past my compiler.

I agree with all your requests, none of which have been addressed in 4.x.

#############

Dear Charles,

I am writing in response to your request for possible enhancements
to Splus. There are a few major failings of Splus, which currently mean
that in our organisation I am one of only two people who ever use it, out
of a potential audience of over 20 people. I list the "failings" in order of
importance for us :

1) Time series handling is limited. For example try doing basic operations
on time series data like:

980201.1030 10565.2
980201.1430 10625.9
980201.1630 10431.6
980202.1030 N/A
980202.1430 10449.1
980202.1630 10496.0

in Splus. We handle this sort of thing all the time, and it is definitely
not intuitive in Splus, unless I am missing something obvious. I am not
talking fancy analytics here, merely basic time series manipulation.

2) Loops are FAR too slow. I know that many operations can be coded
as vector operations, especially if you're clever. While this may provide
a great amount of entertainment to academics, enabling them to point
out better and faster ways, it is not helpful when code that executes in
seconds in other 4GLs takes ages in Splus. Many things cannot be coded
up as vector operations.

3) Interactive graphics for Unix platforms are limited (non-existent even?).
Point and click zooming in and out is phenomenally useful for data
cleaning and validation, as well as for understanding the data better.
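
As an illustration of the "basic manipulation" meant in point 1, here is
a minimal sketch in Python (not Splus) that reads the six sample records
above, keeps N/A as a missing value, and averages by day. It assumes the
stamps are in YYMMDD.HHMM form, which is a guess from the sample; the
function names are hypothetical.

```python
from datetime import datetime

RAW = """\
980201.1030 10565.2
980201.1430 10625.9
980201.1630 10431.6
980202.1030 N/A
980202.1430 10449.1
980202.1630 10496.0
"""

def parse_series(text):
    """Return a list of (datetime, value) pairs; N/A becomes None."""
    series = []
    for line in text.splitlines():
        if not line.strip():
            continue
        stamp, value = line.split()
        # Assumed stamp layout: two-digit year, month, day, '.', hour, minute
        when = datetime.strptime(stamp, "%y%m%d.%H%M")
        series.append((when, None if value == "N/A" else float(value)))
    return series

def daily_means(series):
    """Mean of the non-missing values for each calendar day."""
    by_day = {}
    for when, value in series:
        if value is not None:
            by_day.setdefault(when.date(), []).append(value)
    return {day: sum(vals) / len(vals) for day, vals in by_day.items()}

series = parse_series(RAW)
means = daily_means(series)
```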

Were the three points I have listed addressed, I believe your product
would find a large and lucrative market more ready to invest in it, namely
in finance.

I hope this is of some assistance,

#########

Charlie,

Undoubtedly you are swamped! Most of the requests I have seen on S-news
are for very specialized things. Perhaps that does indicate that
contributed code is filling some of the gaps, although it is getting little
mention.

The things I see as most important are

Dynamic graphics, for example as in XGobi.

Better density estimation and smoothing code. Matt Wand's KernSoft
library would be a good start. Locfit looks nice, but seems to be
pretty appalling code inside, and seems to give different results on
different platforms and at each release. I find logspline and its
allies to be too unreliable.

Generalized least-squares, ridge regression, ....

Multiple logistic models, probably also ordinal logistic models.

Something on mixed-effects generalized models, although there are a
lot of possibilities. Carey's GEE library would not be a good start.

Multidimensional scaling (non-classical).
LDA, QDA and allies
Correspondence analysis and multiple correspondence analysis

I think RPART deserves serious consideration.

Then there are things that need to be got right. A very partial list:

step.glm and the `AIC'-based functions
histograms (especially in Trellis)
tree and allies (I've only fixed prune.tree).
A better interface and full-likelihood fitting in the time series module.
Standard errors for regression coefficients in arima.mle

##########

What is really needed is a Linux port, particularly one for 64 bit Linux on an Alpha.

###########

First of all I'd like to thank Charles for his initiative, I think he's
doing an excellent job in liaising between Mathsoft and the wider
S-community.

I believe S-plus definitely needs enhancement of its suite of multivariate
methods. Though with its matrix-oriented approach and powerful graphics
S-plus is ideally suited for descriptive multivariate data analysis,
surprisingly little functionality is built in. I must note here, though,
that I haven't upgraded to version 4.0 yet, which may or may not be better
endowed.

Fortunately people like Trevor Hastie and Brian Ripley made some wonderful
stuff available in their mda and MASS libraries, but I think it's time that
at least some of this gets integrated into standard S-plus. The last time I
tried to use Hastie's mda functions under Windows I failed, because they
called Fortran and Unix code. The discr() function is woefully inadequate
for descriptive work, and without Ripley's lda() I would have had to switch
to another program. Also, it would be nice to be able to go beyond the
standard PCA rotations of the data space as in projection pursuit.

Finally, I believe the spin plot in v3.3 for Windows is outdated (again v4.0
may have improved on this). For my research I'm using a custom developed,
experimental 3D data visualisation program, which allows interactive
'virtual reality' exploration of 3D data spaces and capturing of interesting
views in publication quality colour perspective 'stills'. It should be
possible to add some similar functionality to Splus.

I'm just a tinkerer, and I'd be interested in the opinion of others,
especially the real stats pros.

###########

Dear Charles Roosen

I would appreciate the inclusion of multivariate techniques that I use
for vegetation analysis, most notably:

1) Local and global nonmetric multidimensional scaling.
2) Correspondence analysis (with options for detrending and nonlinear
rescaling of gradients).
3) Twinspan (two way indicator species analysis).

########

It looks as if everyone is now sending messages to Statsoft,
but I support the request of Luc Wouters for 'stat-exact' solutions
too, including exact logistic regression. Exact tests are an
important tool in bio-medical statistics, regarding the small sample
sizes.

########

In bio-medical research some issues are important, and mostly not
handled in ONE package. So my request would be:

1. Exact techniques (as now are imported into
SAS)
2. Handling of missing values. e.g. Multiple imputation
techniques for Missing at random situations.
3. Bayesian methods: MCMC tools similar to BUGS (which has a more or
less S-Plus like language, and in fact uses S-Plus as a pre- and post-
data analysis tool).

#######

I am not sure what is implemented in the latest versions, but I would
welcome a uniform approach to NAs. Is it possible to estimate (impute)
them at the same time as parameters as a default, the degrees of freedom
taken care of? Recently there was an ad in the snail mail for a package
that had Rubin's likelihood methods implemented among others. So there is
competition. And perfect defaults are worth my money.

#######

As an epidemiologist, I would highly appreciate the ability to fit
generalized mixed effects linear models and to perform predictions and
diagnostics at any clustering level.

########

I would like to see mixed model analysis at least on the level of
SAS's PROC MIXED.

#######

I really appreciate your initiative.
I haven't upgraded to version 4.0 yet, I'm still using v. 3.3.
In v. 3.3 I have noticed that there is a great lack in multivariate methods.
Being matrix-oriented, S-plus is well suited for descriptive
multivariate data analysis but there are only few built in
functions to do it.
Library multiv from Fionn Murtagh and MASS from Brian Ripley are of help,
I think they should be integrated and "potentiated" into standard S-plus.

##########

I suspect we've deviated substantially from Charlie's topic of
statistical enhancements to Splus. Still, it's been interesting so I'll
add my 2 cents worth.

Based upon what I hear from about 30 Splus/Windows users at Koch, we're
generally happy with Version 4. I don't personally get much value out
of the object browser or point-and-click menus; but many of our new
Splus users prefer them. For us, the whole value of V4 was to attract
some folks who would otherwise analyze data via Excel (whether it was
capable or not). I always worry that point-and-click software just
makes it easy for folks to misinterpret data; but that's a risk I've got
to take given limited resources.

I've been surprised to hear complaints in snews about the speed of
Version 4. The complaints I hear at Koch involve stability.
Splus/Windows V4 almost never crashes for me; so I've got to believe
that the stability complaints either involve menus (which I seldom use)
or are the result of new users telling Splus to do silly things. I do
hear complaints about the time it takes to load Splus V4; but once
loaded it runs much faster than V3 for us. Possibly we just have good
hardware here. All in all, Splus/Windows V4 does everything that V3 did
and does it better. It may be true that its fancy new menus don't live
up to all the hype; but no one is telling me that V3 is better and some
think V4 is substantially easier to use.

My biggest complaint about Splus/Windows revolves around the data
management issues when compared to Splus/Unix. My hope is that future
versions of Splus/Windows will be more Unix-like when it comes to things
like long file names and project management. If it weren't for these
issues, we'd move completely away from Splus/Unix to Splus/Windows. Of
course, Splus/Unix is extremely stable and is my preference even though
my version of Splus/Windows is faster. I tend to use Splus/Unix on my
major data analyses. Then I move my data and functions to Splus/Windows
V4 where it's easy to do things like include my graphics in Word
documents or Powerpoint presentations. If I have a simple problem I may
do the entire analysis in Windows. When I need to provide data and
functions to share with our group, I use Splus/Unix because of its
easier data and project management -- though I do have a set of Koch
Splus functions on a shared drive accessible to our Windows users. Our
long term future at Koch will likely be Windows instead of Unix since
that's the preferred direction of most of our large data base systems.
As it gets easier to use Microsoft's tools in talking to data bases, the
future of Unix here at Koch will probably fade regardless of how future
versions of Splus change.

Someone requested that comments about "time" be specific, so I'll add
the results of a few time trials I did on four different systems about 2
months ago. Even our brand new Sun E450 (Unix) doesn't outperform our
typical Windows installation -- though I still personally prefer the
ease of use and stability of Splus/Unix.

TIMES IN SECONDS
WinV3   WinV4   Unix1V3   Unix2V3   Task
    4       3         6         4   Simple Graph
   10       6        10         6   More Complex Graph
   19      12        27        18   Very Complex Graph
Crash     130       300       153   Very Complex Function

Specifics:
WinV3: PC with 64 MB RAM, 200 MHz chip, NT 4.0 OS, Splus 3.3
WinV4: PC with 64 MB RAM, 200 MHz chip, NT 4.0 OS, Splus 4.0
Unix1V3: IBM RS/6000 Model 390, 192 MB RAM, 67 MHz chip, AIX V1.6 OS, Splus 3.4
Unix2V3: Sun E450, 1 GB RAM, 2 - 250 MHz chips, Solaris 2.6 OS, Splus 3.4

########

A lot is already mentioned, I'm not repeating that:

a. More consistency: all modelling functions should accept missing
data, for instance. Lowess, for instance, doesn't know how to remove
missing values. This is annoying.
b. All modelling functions should accept formulas. l1fit doesn't.

GAM: possibility to have different smooths within factor levels.

... running on small machines

#######

Here is my vote for what not to expend great efforts adding to S-Plus:
exact methods. We have so many bigger things to worry about such
as non-normal errors, non-linear covariable effects, unaccounted-for
heterogeneity, that I've never been very concerned about getting
an "exact" P-value for an over-simplified model. A classic saying of
Tukey about exact solutions to the wrong problem comes to mind.
Even in the case of a 2x2 table, the presence of strong risk factors can cause a
heterogeneity of risks great enough to make unadjusted analyses
incorrect. I would rather use the bootstrap or a full Bayesian approach
to get confidence intervals or probabilities of positive effects. And I'm
still not a fan of conditioning when marginal cell counts were not
pre-specified by the experimental design (and mine never are). Lastly,
exact methods don't always extend well. On the other hand it is very
easy to extend the bootstrap to account for intra-cluster correlation,
for example.

My second vote on what not to implement is type III sums of squares
and F-tests, which are more problematic than most statisticians assume.

Here are my votes on what would be worth doing, not in any particular
order:

1. Handle NAs in a smart way for all modeling functions. For example,
the survival modeling functions written by Terry Therneau keep track of
which observations were deleted by NAs so that for example
plot(age, resid(fit)) will work, by making sure that resid(fit) properly
aligns with age. [On our web page there is a document "Supplemental
Notes" to my biostatistical modeling course that gives several hints
for dealing with NAs while using the lm function.] Modeling functions
in my Design library use Therneau's technique. This needs to be built in
to other S-Plus functions.

2. Sample size and power calculations for the normal-errors model, accounting
for uncertainty in the estimate of sigma. For example, the user could
provide the data (or sufficient statistics) used to estimate sigma
and the program could compute
an entire power 'distribution' taking the uncertainty into account. Sample size
calculations to achieve certain precision (e.g., width of confidence intervals)
would also be welcome. A deluxe help system (see item 6 below) would allow
users to quickly find example simulation programs for handling non-normal
models.

3. Continue to expand capabilities for random effects models, with various
post-fit estimation, multi-level hierarchies, and other analytic capabilities.
Some of this can be done by having an elegant interface with the WINBUGS
Bayesian modeling package from Cambridge.

4. Bootstrap and multiple imputation methods for accounting for imputing
missing values when making inferences. Some new na.action functions
would also be welcome. These functions could develop imputation rules
(using tree, nonparametric regression, nearest neighbor, etc.) that
could be saved and re-executed on demand. Imputations can be tedious
and it's a shame to have to re-develop imputation models for each
analysis. The imputation function could save enough information to
be able to repeat the development of the imputation rule as quickly as
possible, so that you could put this step inside a bootstrap loop in order
to be able to properly account for this component of variation. Interested
users may want to look at the impute and transcan functions in my
Hmisc library for some other ideas.

5. Anything that helps with non-randomly missing serial data.

6. A world-class online help facility that allows users to navigate in many
ways, e.g., getting to a comprehensive set of examples of managing
and recoding data. For Windows users, where installing an add-on
library is as easy as unzipping a .zip file, it would be nice to have a
help button that updates the local PC from a master table of contents of
libraries available from statlib; another button would automatically
download and install a library. See how Microsoft (yes they do a few
things right) allows users to easily update Office products.
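
To illustrate the idea in item 1, here is a minimal sketch in Python,
not Therneau's implementation: a simple least-squares line is fitted on
the complete cases, and the residuals are padded with missing values at
the deleted positions so that they stay aligned with the original data.
The function name aligned_residuals() is hypothetical, and None stands
in for NA.

```python
def aligned_residuals(x, y):
    """Residuals from a least-squares line, aligned with the input.

    Observations where x or y is None are dropped for the fit, and the
    returned list has None in exactly those positions.
    """
    keep = [i for i, (a, b) in enumerate(zip(x, y))
            if a is not None and b is not None]
    xs = [x[i] for i in keep]
    ys = [y[i] for i in keep]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [None] * len(x)  # same length as the original data
    for i in keep:
        residuals[i] = y[i] - (intercept + slope * x[i])
    return residuals

age = [30, 40, None, 60, 70]
response = [1.0, 2.1, 2.9, 4.2, 4.8]
res = aligned_residuals(age, response)
```

Because res has the same length as age, a plot of one against the other
lines up without any manual bookkeeping.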

When deciding on future directions for software all of the debates about statistics
come alive. I know that many will criticize my point of view. I just wanted to give my
$.02 worth from the standpoint of an applied biostatistician.

--=====================_891406612==_
Content-Type: text/plain; charset="us-ascii"

**********************************************************************
Charles Roosen, PhD 1700 Westlake Ave N, Suite 500
Senior Statistician Seattle, WA 98109
Data Analysis Products Division (206) 283-8802 x254
MathSoft email: roosen@statsci.com
**********************************************************************

--=====================_891406612==_--
-----------------------------------------------------------------------
This message was distributed by s-news@wubios.wustl.edu. To unsubscribe
send e-mail to s-news-request@wubios.wustl.edu with the BODY of the
message: unsubscribe s-news