[S] quantiles

Frank E Harrell Jr (fharrell@virginia.edu)
Wed, 8 Apr 1998 22:19:54 -0400


Consider these 502 observations on a variable:

x <- c(72, 1, 40, 20, 65, 24, 46, 62, 61, 60, 60, 59, 59, 49, 20, 3, 58, 29, 26, 52,
20, 51, 51, 31, 42, 38, 69, 39, 33, 8, 13, 33, 9, 21, 66, 5, 27, 2, 20, 19, 60, 58, 32,
53, 53, 43, 21, 30, 74, 72, 14, 33, 8, 10, 51, 7, 63, 33, 3, 43, 37, 5, 6, 2, 5, 64, 1,
21, 16, 21, 12, 75, 74, 74, 54, 73, 36, 59, 6, 58, 16, 19, 39, 26, 60, 43, 7, 9, 67, 62,
17, 25, 0, 5, 34, 59, 31, 58, 30, 57, 5, 55, 55, 52, 0, 51, 17, 70, 74, 74, 20, 2, 8, 27,
23, 1, 52, 51, 6, 0, 26, 65, 52, 26, 70, 6, 6, 68, 33, 67, 58, 23, 6, 11, 6, 57, 57, 29,
9, 53, 51, 8, 0, 21, 27, 22, 12, 68, 21, 68, 0, 2, 14, 18, 5, 60, 40, 31, 51, 50, 46, 65,
9, 21, 27, 54, 52, 75, 75, 30, 70, 14, 0, 42, 12, 40, 2, 12, 53, 11, 18, 13, 45, 8, 28,
67, 67, 24, 64, 26, 57, 32, 71, 42, 20, 71, 54, 64, 51, 1, 2, 0, 54, 69, 68, 67, 66, 64,
63, 35, 62, 7, 35, 24, 57, 1, 4, 74, 0, 51, 36, 16, 32, 68, 17, 66, 65, 19, 41, 28, 0, 46,
63, 60, 59, 46, 63, 8, 74, 18, 33, 12, 1, 66, 28, 30, 57, 50, 39, 40, 24, 6, 30, 58, 68,
24, 33, 65, 2, 64, 19, 58, 15, 10, 12, 53, 51, 1, 40, 40, 66, 2, 21, 35, 29, 54, 37, 10,
29, 71, 12, 13, 27, 66, 28, 31, 12, 9, 21, 19, 51, 71, 76, 46, 47, 75, 75, 49, 75, 75, 31,
69, 74, 25, 72, 28, 36, 8, 71, 60, 14, 22, 67, 62, 68, 68, 27, 68, 68, 67, 67, 3, 49, 12,
30, 67, 5, 65, 24, 66, 36, 66, 40, 13, 40, 65, 0, 14, 45, 64, 13, 24, 15, 26, 5, 63, 35,
61, 61, 50, 57, 21, 26, 11, 59, 42, 27, 50, 57, 57, 0, 1, 54, 53, 23, 8, 51, 27, 52, 52,
52, 45, 48, 18, 2, 2, 35, 75, 75, 9, 39, 0, 26, 17, 43, 53, 47, 11, 65, 16, 21, 64, 7, 38,
55, 5, 28, 38, 20, 24, 27, 31, 9, 9, 11, 56, 36, 56, 15, 51, 33, 70, 32, 5, 23, 63, 30,
12, 53, 12, 58, 54, 36, 20, 74, 34, 70, 25, 65, 4, 10, 58, 37, 56, 6, 54, 0, 70, 70, 28,
40, 67, 36, 23, 23, 62, 62, 62, 2, 34, 4, 12, 56, 1, 7, 4, 70, 65, 7, 30, 40, 13, 22, 0,
18, 64, 13, 26, 1, 16, 33, 16, 22, 30, 53, 53, 7, 61, 40, 9, 14, 59, 59, 7, 12, 46, 50, 0,
52, 19, 52, 51, 51, 14, 27, 51, 5, 0, 41, 53, 19, 4)

The tables of frequencies and cumulative frequencies of x are:

table(x)

 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
16 10 11  3  5 11  9  8  8  9  4  5 13  7  7  3  6  4  5  7  8 11  4  6  8  3  9 10  7  4  9  6  4  9

34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
 3  5  7  3  3  4 11  2  4  4  3  6  2  1  3  5 16 10 11  8  3  4  9  9  8  7  4  7  6  8 10  8

67 68 69 70 71 72 73 74 75 76
10 10  3  8  5  3  1  9  9  1

ct <- cumsum(table(x))

  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26
 16  26  37  40  45  56  65  73  81  90  94  99 112 119 126 129 135 139 144 151 159 170 174 180 188 191 200

 27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  45  46  47  48  49  50  51
210 217 221 230 236 240 249 252 257 264 267 270 274 285 287 291 295 298 304 306 307 310 315 331

 52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75
341 352 360 363 367 376 385 393 400 404 411 417 425 435 443 453 463 466 474 479 482 483 492 501

 76
502

The quantile function takes the ith order statistic of the sample to be
an estimate of the (i-1)/(n-1) quantile of the distribution (see its
help file). The values of the first few
(cumulative frequencies - 1) / 501 are:

(ct - 1) / 501

         0         1          2          3          4         5         6         7         8
0.02994012 0.0499002 0.07185629 0.07784431 0.08782435 0.1097804 0.1277445 0.1437126 0.1596806

From this I think that the 0.05 quantile of x is just over 1. Using
linear interpolation it would be

1 + (.05 - .0499002)/(.07185629 - .0499002) = 1.004545. Yet
quantile(x,.05) yields

5%
1.05
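
As a cross-check, the order-statistic rule quoted from the help file can
be applied directly to the sorted data rather than to the table of
distinct values. The sketch below is written in R rather than S-PLUS. It
rebuilds a sorted copy of x from the frequency table above and assumes
that R's quantile(type = 7) follows the same (i-1)/(n-1) convention. The
names vals, counts, xs, h, q_manual and q_r are illustrative only.

```r
# Sorted copy of x rebuilt from the frequency table (the value 44 never occurs)
vals   <- c(0:43, 45:76)
counts <- c(16,10,11,3,5,11,9,8,8,9,4,5,13,7,7,3,6,4,5,7,8,11,4,6,8,3,9,10,7,4,9,6,4,9,
            3,5,7,3,3,4,11,2,4,4,3,6,2,1,3,5,16,10,11,8,3,4,9,9,8,7,4,7,6,8,10,8,
            10,10,3,8,5,3,1,9,9,1)
xs <- rep(vals, counts)
n  <- length(xs)                      # 502 observations

p  <- 0.05
h  <- 1 + (n - 1) * p                 # fractional index into the order statistics
lo <- floor(h)                        # order statistic just below that index
q_manual <- xs[lo] + (h - lo) * (xs[lo + 1] - xs[lo])

# Assumption: type = 7 in R matches the rule described in the S-PLUS help file
q_r <- quantile(xs, p, type = 7, names = FALSE)
```

Under these assumptions both q_manual and q_r come out at 1.05, because
the 26th order statistic is 1 and the 27th is 2. The tabulated
calculation instead spreads the step from 1 to 2 across all eleven tied
2s, which may be where the two answers part ways.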

Would someone shed some light? Assuming my calculation (1.0045) is
correct, one could use the approx function to easily compute quantiles
instead of trying to follow the logic inside quantile, and the result
would still be defensible. That would allow me to compute quantiles
that incorporate sampling weights, as all I would need are cumulative
weighted frequencies. (As an aside, I would appreciate a concise summary
of how to compute quantiles incorporating such weights; when the
weights represent frequencies it seems to be pretty clear. It would be
nice if the same method worked for general sampling weights.)
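
A minimal sketch of the approx idea described above, written in R and
assuming strictly positive weights w that behave like frequencies. The
function name wtd_quantile_sketch and the choice of
(cumulative weight - 1) / (total weight - 1) as interpolation positions
are illustrative assumptions, not an established method for general
sampling weights.

```r
# Hypothetical helper: quantiles by linear interpolation on cumulative weights
wtd_quantile_sketch <- function(x, w = rep(1, length(x)), probs) {
  vals <- sort(unique(x))
  # total weight carried by each distinct value of x
  wt   <- as.vector(tapply(w, factor(x, levels = vals), sum))
  cw   <- cumsum(wt)
  # same positions as (ct - 1) / 501 when all weights equal 1
  pos  <- (cw - 1) / (sum(wt) - 1)
  # rule = 2 returns the smallest/largest value outside the range of pos
  approx(pos, vals, xout = probs, rule = 2)$y
}

# Distinct values of x with their frequencies (from table(x)) used as weights
vals   <- c(0:43, 45:76)
counts <- c(16,10,11,3,5,11,9,8,8,9,4,5,13,7,7,3,6,4,5,7,8,11,4,6,8,3,9,10,7,4,9,6,4,9,
            3,5,7,3,3,4,11,2,4,4,3,6,2,1,3,5,16,10,11,8,3,4,9,9,8,7,4,7,6,8,10,8,
            10,10,3,8,5,3,1,9,9,1)
q05 <- wtd_quantile_sketch(vals, counts, 0.05)
```

With the frequencies as weights this reproduces the 1.0045 figure
computed by hand, not the 1.05 returned by quantile, so whether it is
defensible depends on which convention for ties is wanted.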

Thanks -Frank
---------------------------------------------------------------------------
Frank E Harrell Jr
Professor of Biostatistics and Statistics
Director, Division of Biostatistics and Epidemiology
Dept of Health Evaluation Sciences
University of Virginia School of Medicine
http://www.med.virginia.edu/medicine/clinical/hes/biostat.htm
