[S] Multiple imputation capability added to transcan in Hmisc

Frank E Harrell Jr (fharrell@virginia.edu)
Thu, 23 Jul 1998 12:13:14 -0400


The transcan function in the Hmisc library, which among other things
develops imputation models, now handles multiple imputation. An example
from its help file follows this note. There is also a new function,
fit.mult.impute, that runs an S-PLUS regression modeling function
separately for each imputation and then computes an averaged coefficient
vector together with an imputation-corrected variance-covariance matrix.
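
For background, the imputation-corrected variances are based on the usual
multiple-imputation combining rules (Rubin's rules): coefficients are
averaged over the imputations, and a between-imputation component is added
to the average within-imputation variance. Below is a minimal sketch in S of
that combining rule only, assuming hypothetical matrices betas and vars
(one row per imputation) that are not produced by the code in the example;
fit.mult.impute does all of this for you.

  # betas: m x p matrix of coefficient estimates, one row per imputation (hypothetical)
  # vars : m x p matrix of their estimated variances                    (hypothetical)
  m        <- nrow(betas)
  beta.bar <- apply(betas, 2, mean)   # averaged coefficients
  W        <- apply(vars,  2, mean)   # average within-imputation variance
  B        <- apply(betas, 2, var)    # between-imputation variance
  total    <- W + (1 + 1/m)*B         # imputation-corrected variance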

On another note:
The document "An Introduction to S-PLUS and the Hmisc and Design
Libraries" by Alzola & Harrell has been much improved. Thanks to the
Y & Y LaTeX system, the document now also has hyperlinks and bookmarks.
The libraries and documents are available from our web page and will be
updated on StatLib tonight.
---------------------------------------------------------------------------
Frank E Harrell Jr
Professor of Biostatistics and Statistics
Director, Division of Biostatistics and Epidemiology
Dept of Health Evaluation Sciences
University of Virginia School of Medicine
http://www.med.virginia.edu/medicine/clinical/hes/biostat.htm

> # Example with completely random missing data
> set.seed(1)
> x1 <- factor(sample(c('a','b','c'),100,T))
> x2 <- (x1=='b') + 3*(x1=='c') + rnorm(100)
> y <- x2 + 1*(x1=='c') + rnorm(100)
> x1[1:20] <- NA
> x2[18:23] <- NA
> n <- naclus(data.frame(x1,x2,y))
> plot(n); naplot(n) # Show patterns of NAs
> f <- transcan(~y + x1 + x2, n.impute=10, shrink=F)
> options(digits=3)
> summary(f)

transcan(x = ~ y + x1 + x2, n.impute = 10, shrink = F)

R-squared achieved in predicting each variable:

     y    x1    x2
 0.905  0.74 0.839

Adjusted R-squared:

     y    x1    x2
 0.898 0.718 0.826

Coefficients of canonical variates for predicting each (row) variable

        y    x1    x2
y          0.37  0.61
x1   1.23       -0.03
x2   1.06 -0.02

Summary of imputed values

x1
  n missing unique Mean
200       0      3 2.12

1 (55, 28%), 2 (66, 33%), 3 (79, 40%)

x2
 n missing unique Mean      .05      .10     .25     .50     .75     .90     .95
60       0     55 1.55 -1.62653 -1.38461 0.04995 1.66340 3.35163 3.81074 4.12569

lowest : -1.6265 -1.5601 -1.3651 -0.9406 -0.9054
highest:  3.9874  4.1248  4.1420  4.7891  5.0992

Starting estimates for imputed values:

     y  x1  x2
 0.814   1   1

> attr(f,'imputed')

$y:
NULL

$x1:
1 2 3 4 5 6 7 8 9 10
1 1 2 1 2 2 2 2 2 2 2
2 2 1 2 2 1 2 2 2 2 1
3 3 2 2 3 2 1 3 2 3 2
4 2 1 1 2 2 1 2 2 2 2
5 2 2 1 1 2 1 1 2 1 1
6 3 3 2 3 3 2 2 3 3 3
7 1 1 1 1 1 1 1 1 1 1
8 3 3 3 3 3 3 3 3 3 3
9 1 2 2 2 2 2 2 1 2 2
10 1 1 2 3 1 1 2 2 1 1
11 3 3 3 3 3 3 3 2 3 3
12 3 3 3 3 3 3 3 3 3 3
13 1 1 2 1 1 1 1 1 1 1
14 3 3 3 3 3 3 3 3 3 3
15 2 2 2 2 2 2 2 2 2 3
16 1 1 3 2 2 1 1 1 1 2
17 3 3 3 3 2 3 3 3 2 3
18 3 3 3 2 3 2 3 3 3 3
19 1 1 3 1 1 1 2 1 2 1
20 3 3 3 3 3 2 3 3 3 3

$x2:
1 2 3 4 5 6 7 8 9 10
18 1.866 2.37 1.79 1.995 2.517 2.153 2.668 2.259 3.527 1.786
19 -0.330 -1.56 -1.63 1.537 0.711 1.251 -1.365 0.217 -0.568 -0.144
20 4.142 3.72 5.10 3.840 3.371 3.987 3.345 2.779 3.424 2.573
21 0.115 1.15 -1.63 0.518 0.947 1.078 -1.627 0.935 1.448 -0.905
22 3.605 4.79 2.40 4.125 3.657 2.790 3.456 3.807 3.306 3.675
23 0.793 -1.63 1.54 -0.586 0.612 -0.941 0.731 -0.533 -0.324 -1.627

> f <- transcan(~y + x1 + x2, n.impute=10, shrink=T)
> summary(f)

transcan(x = ~ y + x1 + x2, n.impute = 10, shrink = T)

R-squared achieved in predicting each variable:

     y    x1    x2
 0.904 0.739  0.84

Adjusted R-squared:

     y    x1    x2
 0.897 0.718 0.826

Shrinkage factors:

     y    x1    x2
 0.937 0.952 0.939

Coefficients of canonical variates for predicting each (row) variable

        y    x1    x2
y         -0.35  0.57
x1   1.17       -0.03
x2   1.00  0.02

Summary of imputed values

x1
  n missing unique  Mean
200       0      3 2.195

1 (47, 24%), 2 (67, 34%), 3 (86, 43%)

x2
 n missing unique  Mean     .05     .10    .25    .50    .75    .90    .95
60       0     53 1.651 -1.6265 -0.5036 0.3479 1.7213 3.2052 3.7244 3.9763

lowest : -1.6265 -1.1599 -1.0086 -0.4474 -0.4165
highest:  3.8195  3.9647  4.1976  4.6530  4.7187

Starting estimates for imputed values:

     y  x1  x2
 0.814   1   1

> h <- fit.mult.impute(y ~ x1 + x2, lm, f)

Variance Inflation Factors Due to Imputation:

(Intercept)  x11  x12   x2
       1.24 1.27 1.26 1.28

> h

Coefficients:
(Intercept)    x11   x12    x2
      0.366 0.0892 0.494 0.954   # AVERAGE OVER IMPUTATIONS

Degrees of freedom: 100 total; 96 residual
Residual standard error: 0.888 #NOTE: 0.888 is from last imputation

> diag(Varcov(h))
[1] 0.02306 0.01699 0.01188 0.00897

> h.complete <- lm(y ~ x1 + x2, na.action=na.omit)
> h.complete

Coefficients:
(Intercept)    x11   x12    x2
       0.35 0.0689 0.465 0.934

Degrees of freedom: 77 total; 73 residual
Dropped 23 cases due to missing values
Residual standard error: 0.928

> diag(Varcov(h.complete))
[1] 0.0276 0.0182 0.0141 0.0108   # NOTE: larger than the variances from multiple imputation

# Note: had Design's ols function been used in place of lm, any
# function run on h (anova, summary, etc.) would have automatically
# used imputation-corrected variances and covariances
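
# For example (a sketch only, not run above; assumes the Design library
# is attached):
#
#   g <- fit.mult.impute(y ~ x1 + x2, ols, f)
#   anova(g)   # Wald tests then use the imputation-corrected covariance matrix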
