[S] S: Missing values in factors & generally

JOHN HILARY MAINDONALD (john.maindonald@anu.edu.au)
Tue, 3 Mar 1998 17:51:39 +1100 (EST)


Infuriating Factors

The Splus help(factor) states that a factor is a character vector.
Inconsistently with this statement, if f is a factor then (1) mode(f)
returns "numeric" and (2) is.character(f) returns F.

If f has any missing values then the default is to behave differently
from a character vector in factor(f) and, depending on the setting of
the parameter "exclude", in table(f).

Assign
> abc <- factor(c("Cm", "Lr", "Md", "Sm", "Sp", "Vn", ""), exclude="")
> abc
[1] Cm Lr Md Sm Sp Vn NA
> levels(abc)
[1] "Cm" "Lr" "Md" "Sm" "Sp" "Vn"

> factor(abc) # This is not nice
[1] NA NA NA NA NA NA NA
Levels (first 5 out of 6):
[1] "1" "2" "3" "4" "5"

If, on the other hand, a factor ab has no missing values, then
factor(ab) returns ab unchanged, except possibly for sorting the levels.

The problem is triggered by the na.last=T in the default setting
levels=sort(unique(x), na.last = T) in the arguments to factor().
[Type in args(factor)]. Thus compare
> factor(abc, levels=sort(unique(abc),na.last=T))
[1] NA NA NA NA NA NA NA
Levels (first 5 out of 6):
[1] "1" "2" "3" "4" "5"

with
> factor(abc, levels=sort(unique(abc)))
[1] Cm Lr Md Sm Sp Vn NA
[By default, sort() leaves off the NA.]

This behaviour of factor() is surely a bug.

If factors were character vectors and had character values then
the following, with abc a factor defined as above, ought to work:

> factor(abc, exclude=character(0))
[1] NA NA NA NA NA NA NA
Levels (first 5 out of 7):
[1] "1" "2" "3" "4" "5"

Again, I regard this as a bug; help(factor) does not say that exclude
is not to be used when the argument is a factor. Use of numeric(0) in
place of character(0) does not help. If we look further down in the
help for factor(), we find a function na.include() which provides a
workaround to turn f into a factor which has NA as a level.
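
A minimal sketch of that workaround, assuming na.include() accepts a
factor directly, as the help suggests. Here abc is the factor defined
above and abc.na is a name chosen only for illustration. Output is
omitted, since it may differ between versions.

> abc.na <- na.include(abc)   # abc.na is a hypothetical name
> levels(abc.na)              # "NA" should now appear as a level
> table(abc.na)               # so the missing value should be counted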

I consider that factors should behave exactly like character vectors,
including when NAs are present. It should not be necessary, as it is
for table(), to use one device to control exclusion of NAs when the
margin is a character vector (i.e. the parameter exclude, which has
no effect for factors) and another device (na.include(factor)) when
the margin is a factor.
[A side issue is that help(table) talks about, and the implementation
of table() actually uses, the deprecated function category().]

Here is my list of functions which raise consistency issues:

(1) In functions which expect numeric arguments, a common default
is for the calculation to fail, or perhaps (as with mean()) to
allow any NAs to propagate through to the return value(s).
V&R2 (p.35) note that there are "policy" differences for NAs between
different functions, and give examples.
(2) In sort() the default is to exclude NAs. For factors it is
irrelevant whether "NA" is included as a factor level. One can use
na.last=T to get NAs tagged on at the end of the sorted vector.
(3) In table() the default is to exclude them, unless the margin
is a factor with "NA" as a level. If necessary one can use
na.include to generate such a factor. If the argument is not a
factor, then one can set the parameter exclude (e. g. to character(0))
so that missing values are included.
(4) tapply() omits missing values of margins, unless "NA" appears as
a factor level, which can be arranged using na.include().
(5) merge() excludes rows of either data frame where the key
(identified using the by parameter) has a missing value, even if all.x
and/or all.y are set to T. (Under USAGE, help has all.x=all, but that
is wrong.) A consequence is that it is not in general true, as
help(merge) claims, that the all.x (cf. all.y) argument lets one
include all the rows in the x (cf. y) data frame in the output.
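
Points (1) and (2) can be seen side by side with a small numeric
vector. This is only a sketch: x is a made-up vector, and the comments
restate the defaults described above rather than quote actual output.

> x <- c(3, 1, NA, 2)    # hypothetical data with one missing value
> mean(x)                # the NA propagates to the result
> sort(x)                # the NA is dropped by default
> sort(x, na.last=T)     # the NA is tagged on at the end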

In summary:
(1) there are bugs & documentation issues that need fixing.
(2) Missing values in factors are particularly treacherous, and
handled somewhat inconsistently between different functions. This is
an area that needs flagging in the documentation.
(3) Given that S-PLUS is so picky in insisting that users say what
they want to do with missing values in mean() and in model formulae, I
find it anomalous that the default behaviour of table() omits them
without comment.
(4) There should be a set of guidelines on the syntax for handling
missing values, both for variable and factor values, in S-PLUS
functions that may in future become available.
John Maindonald email : john.maindonald@anu.edu.au
Statistical Consulting Unit, phone : (6249)3998
c/o CMA, SMS, fax : (6249)5549
Australian National University
Canberra ACT 0200
Australia