I was asked to summarize to the list. Many thanks to Prof Brian Ripley,
James Pratt and Brian Cade who offered ideas and help!
I was interested in how to model proportions correctly with a "standard"
method. I had proposed two approaches:
Firstly, I fed the data into glm(Y ~ X, family = binomial).
Secondly, I transformed Y into Z = log(Y/(1-Y)) and then fitted a line with
lm(Z ~ X).
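For anyone who wants to try the two approaches side by side, here is a minimal sketch in S/R syntax. The data are made up for illustration only (a deterministic logistic curve with a small wiggle, so that all proportions lie strictly between 0 and 1); the names fit_glm and fit_lm are mine.

```r
# Hypothetical data: proportions strictly inside (0, 1), no zeros or ones.
X <- seq(-2, 2, length.out = 41)
Y <- plogis(0.5 + 1.2 * X + 0.3 * sin(5 * X))

# Approach 1: binomial glm on the raw proportions.
# R warns about non-integer "successes" here, so the warning is suppressed.
fit_glm <- suppressWarnings(glm(Y ~ X, family = binomial))

# Approach 2: logit-transform the response, then ordinary least squares.
Z <- log(Y / (1 - Y))
fit_lm <- lm(Z ~ X)

# Compare intercept and slope from the two fits.
print(rbind(glm = coef(fit_glm), lm = coef(fit_lm)))
```

Both fits estimate an intercept and a slope on the logit scale, so the coefficients are directly comparable; they differ because the two methods weight the observations differently.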
Brian RIPLEY (ripley@stats.ox.ac.uk) pointed out that glm(..., family =
binomial) does not require zeros and ones in the response: "The
`handbook' is wrong. It converts the response to a proportion."
On my two proposals he clarified: "Other way round. Actually, in your
case the second is least squares and the first is weighted least squares
with weights 1/(p*(1-p))".
Which of the two is more correct "Depends on how the proportions were
measured". He advised me: "I would use glm with a quasi model logit link, and an
appropriate variance function (and you will need to work out what
appropriate is)."
This will also help if there are many 0s and 1s: "log(Y/(1-Y)) will be
Inf or -Inf, and this will not work. A quasi glm() model will work".
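To illustrate this point, a small sketch in S/R syntax follows. The seven data points are invented for the example, and I have assumed the scaled binomial variance "mu(1-mu)" purely as a starting point; as noted above, the appropriate variance function has to be worked out for the data at hand. The call is written as accepted by R's quasi() family.

```r
# Hypothetical proportions that include an exact 0 and an exact 1.
X <- c(-3, -2, -1, 0, 1, 2, 3)
Y <- c(0, 0.05, 0.2, 0.5, 0.7, 0.95, 1)

# The empirical logit is infinite at the boundary values.
Z <- log(Y / (1 - Y))
print(Z)

# A quasi-likelihood glm with logit link works on Y directly.
# The variance function mu(1-mu) is an assumption made for this sketch.
fit_q <- glm(Y ~ X, family = quasi(link = "logit", variance = "mu(1-mu)"))
print(coef(fit_q))
```

The transformed response Z cannot be used for all observations, whereas the quasi fit uses every observation, including the 0 and the 1.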
James PRATT (jamesp@MOCR.OAPI.COM) proposed the same and pointed out a
reference (which can also be found in VR2):
"Have you tried quasilikelihood? Logistic regression does assume a
binomial distribution for the errors. With quasilikelihood, need only
define the variance function, but need not define a distribution. In
McCullagh & Nelder 2nd edition, they give an example where response is the
percentage of a leaf's area affected by blotch. This is example 9.2.4 on
page 328. (In the 1st ed., the example is 8.6.1 on page 173).
They first use a scaled version of the binomial variance (sigma*mu*(1-mu))
as the assumed variance function (with the logit link function for the
mean). After some residual plots, they settle on using mu^2*(1-mu)^2, due
to scaled binomial variance function is too large at the extremes of 0 and
1.
McCullagh & Nelder 'Generalized Linear Models' Chapman and Hall 1989 (2
ed), 1983 (1st ed)."
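To see why the second variance function behaves differently near the boundaries, the two can simply be tabulated. This is only a numerical sketch in S/R syntax; the grid of means is arbitrary, the scale factor is left out, and the names v_binom and v_sq are mine.

```r
# Arbitrary grid of mean values, including some close to 0 and 1.
mu <- c(0.01, 0.1, 0.5, 0.9, 0.99)

# The two variance functions mentioned above, without the scale factor.
v_binom <- mu * (1 - mu)       # scaled binomial form
v_sq    <- mu^2 * (1 - mu)^2   # squared form

# The ratio equals mu*(1-mu), so the squared form shrinks much faster
# as mu approaches 0 or 1.
ratio <- v_sq / v_binom
print(cbind(mu, v_binom, v_sq, ratio))
```

Since the ratio is mu*(1-mu), the squared form assumes far less variance for observations with means near 0 or 1, so those observations get relatively more weight in the fit.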
Brian CADE (Brian_Cade@usgs.gov) tends towards using a somewhat more
specialized method for dealing with compositional data:
"Perhaps you might want to work with the logratio approach advocated by
Aitchison for compositional data even though your testing and estimating
are focused on only 1 component. Briefly, the logratio approach for a
2-part composition (call them y and z and say y is the component of
interest) would involve either of the following 2 transformations (maximal
invariants I believe):
log(y/z) or log(y/(y*z)^(1/2)). The denominator in the first formula is
one of the individual components whereas in the second formula the
denominator is the geometric mean of the two components."
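The two transformations can be sketched in S/R syntax as follows. The proportions are invented, I have assumed that the second part is simply z = 1 - y, and the names lr1 and lr2 are mine.

```r
# Hypothetical 2-part composition: y is the component of interest,
# z is the remainder (assumed here to be 1 - y).
y <- c(0.1, 0.25, 0.5, 0.8)
z <- 1 - y

# First transformation: denominator is the other component.
lr1 <- log(y / z)

# Second transformation: denominator is the geometric mean of both parts.
lr2 <- log(y / sqrt(y * z))

print(cbind(y, z, lr1, lr2))
```

For a 2-part composition the second transformation is exactly half the first, since y/(y*z)^(1/2) = (y/z)^(1/2), and the first is the same logit as log(Y/(1-Y)) when z = 1 - y.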
-----------------------------------------------------------------------
This message was distributed by s-news@wubios.wustl.edu.