[S] Quickly generating multinomials with varying p: Summary

Frank E Harrell Jr (fharrell@virginia.edu)
Mon, 15 Jun 1998 09:50:47 -0400


Thanks for the many fast replies! Several people solved the
problem in good fashion but using the apply function. Here are
two solution's that are faster. Dave Krantz' appears to be
blazing fast.

By the way, as the respondents correctly inferred I am interested in generating the sampled classes, not counts.

Dave Krantz' solution:

To restate the problem, suppose P is an n x k matrix with non-negative
entries whose rows sum to 1; how can one generate a sample of size n,
where the ith element has value 1, 2, ..., k with probabilities given
by the ith row of P?

First convert P to cumulative probabilities within each row.
An efficient way (without external loops) is:

Q <- matrix(cumsum(t(P)) - rep(0:(n-1), rep(k,n)), byrow=T, ncol=k)

Next, generate runif(n) and compare it with each column of Q, summing
the rows of the comparison matrix (and adding 1):

R <- 1 + (Q < runif(n)) %*% rep(1, k)

R is the result you want. This can easily be adapted to give samples
of specified sizes greater than 1 for each multinomial: if Z is the
vector of n sample sizes, replace "Q < runif(n)" by

Q[rep(1:n, Z), ] < runif(sum(Z))

and then make R into a list by using split() (if Z is nonconstant) or
into an n x Z[1] matrix (if Z is constant).

Duncan Murdoch's solution (note: yet another application of the
wonderful and fast approx() function!)

Here's a function I've just written that might be efficient. It
samples from a discrete distribution on the integers 0:(n-1) by the
inverse cdf method. It doesn't do the totals to give you multinomial
counts; if you want those, you'll have to use table() on the result.

rdiscrete <- function(n, p)
{
if(length(n) == 1) {
cp <- c(0, cumsum(p/sum(p)))
x <- 0:length(p)
approx(cp, x, runif(n), "constant")$y
}
else {
p <- as.matrix(p)
if(length(n) != nrow(as.matrix(p)))
stop("Length of n must match rows of p")
cp <- cbind(rep(0, length(n)), t(apply(p, 1, function(
row)
cumsum(row)/sum(row))))
x <- matrix(0:ncol(p), length(n), ncol(cp), byrow = T)
offsets <- matrix(0:(length(n) - 1), length(n),
ncol(cp)
)
approx(as.vector(cp + offsets), as.vector(x),
runif(sum(
n)) + rep(0:(length(n) - 1), n), "constant")$y
}
}

If n is a scalar, it generates n random values from the distribution
defined by a normalized version of p. If n is a vector and p is a
matrix with nrow(p) = length(n), then it generates n[i] values from
the distribution given by row i of p.

The way it handles the matrix case is to concatenate all of the
cumulative sums of each row into a long vector with offsets to make it
an increasing vector. For example, if p is the matrix

0.1 0.5 0.4
0.2 0.2 0.6

then the cumulative sum would be

0.0 0.1 0.6 1.0
1.0 1.2 1.4 2.0

with corresponding values

0 1 2 2
0 1 2 2

An runif value is generated, the appropriate offset for that row is
added, and approx() is used to look up the simulated value.

-----------------------------------------------------------------------
This message was distributed by s-news@wubios.wustl.edu. To unsubscribe
send e-mail to s-news-request@wubios.wustl.edu with the BODY of the
message: unsubscribe s-news