In the new missing-value support, and especially while implementing BooleanArray (#29555), the question comes up: what should any and all do in the presence of missing values?
edit from Tom: Here's a proposed table of behavior
| case | input | output |
|---|---|---|
| 1. | all([True, NA], skipna=False) | NA |
| 2. | all([False, NA], skipna=False) | False |
| 3. | all([NA], skipna=False) | NA |
| 4. | all([], skipna=False) | True |
| 5. | any([True, NA], skipna=False) | True |
| 6. | any([False, NA], skipna=False) | NA |
| 7. | any([NA], skipna=False) | NA |
| 8. | any([], skipna=False) | False |

| case | input | output |
|---|---|---|
| 9. | all([True, NA], skipna=True) | True |
| 10. | all([False, NA], skipna=True) | False |
| 11. | all([NA], skipna=True) | True |
| 12. | all([], skipna=True) | True |
| 13. | any([True, NA], skipna=True) | True |
| 14. | any([False, NA], skipna=True) | False |
| 15. | any([NA], skipna=True) | False |
| 16. | any([], skipna=True) | False |
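To make the proposed table concrete, here is a minimal pure-Python sketch of the rules it implies. The names `NA`, `reduce_all` and `reduce_any` are hypothetical and only for illustration (`NA` is a plain sentinel standing in for `pd.NA`); this is not how BooleanArray is or would be implemented.

```python
# Hypothetical stand-in for pd.NA: a plain sentinel object.
NA = object()


def reduce_all(values, skipna=True):
    """Sketch of all() following the proposed table."""
    if skipna:
        # Drop missing values first; an all-NA input becomes empty.
        values = [v for v in values if v is not NA]
    if any(v is False for v in values):
        return False  # a known False decides the result, whatever NA hides
    if any(v is NA for v in values):
        return NA  # no False seen, but an unknown value could be False
    return True  # also covers the empty case


def reduce_any(values, skipna=True):
    """Sketch of any() following the proposed table."""
    if skipna:
        values = [v for v in values if v is not NA]
    if any(v is True for v in values):
        return True  # a known True decides the result
    if any(v is NA for v in values):
        return NA  # no True seen, but an unknown value could be True
    return False  # also covers the empty case
```

With these definitions, `reduce_any([False, NA], skipna=False)` returns the `NA` sentinel (case 6), while `reduce_any([False, NA], skipna=True)` returns `False` (case 14).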
Some context:
Currently, if you have bools with NaNs, you end up with an object dtype, and the behaviour of any/all with object dtype has all kinds of corner cases. @xhochy recently opened #27709 for this (I am opening a new issue because I want to focus here on the behaviour for the boolean dtype; the behaviour for object dtype might still deviate).
The documentation of any says (https://dev.pandas.io/docs/reference/api/pandas.Series.any.html)
Return whether any element is True, potentially over an axis.
Returns False unless there at least one element within a series or along a Dataframe axis that is True or equivalent (e.g. non-zero or non-empty).
...
skipna : bool, default True
Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.
and similar for all (https://dev.pandas.io/docs/reference/api/pandas.Series.all.html).
Default behaviour with skipna=True
In the case of some NAs and some True/False values, I think the behaviour is clear: any/all are reductions, and in pandas we use skipna=True by default for reductions.
So you get something like this:
(I am still using np.nan here as the missing value, since the pd.NA PR is not yet merged / combined with the BooleanArray PR; but let's focus on the return value)
In [2]: pd.Series([True, False, np.nan]).any()
Out[2]: True
In [3]: pd.Series([True, False, np.nan]).all()
Out[3]: False
In [4]: pd.Series([True, True, np.nan]).all()
Out[4]: True
(although when interpreting NA as "unknown", it might look a bit strange to return True in the last case since the NA might still be True or False)
Behaviour for all-NA in case of skipna=True
This is a case that is described in the current docs: "If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column", and is indeed consistent with skipping all NAs -> any/all of empty set.
And then, we follow numpy's behaviour (False for any, True for all):
In [8]: np.array([], dtype=bool).any()
Out[8]: False
In [9]: np.array([], dtype=bool).all()
Out[9]: True
(although I don't find this necessarily very intuitive, this seems more a consequence of the algorithm starting with a base "identity" value of False/True for any/all)
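The "identity value" reading can be sketched as a fold in plain Python. The helper names `fold_any` and `fold_all` are hypothetical, and this illustrates the idea only, not numpy's actual implementation: the fold starts from `False` for any and `True` for all, so an empty input simply returns the starting value.

```python
from functools import reduce
import operator


def fold_any(values):
    # Start from False, the identity of "or": False | x == x.
    return reduce(operator.or_, values, False)


def fold_all(values):
    # Start from True, the identity of "and": True & x == x.
    return reduce(operator.and_, values, True)


empty_any = fold_any([])  # nothing to combine, so the identity False comes back
empty_all = fold_all([])  # nothing to combine, so the identity True comes back
```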
Behaviour with skipna=False
Here comes the more tricky part. Currently, with object dtype, we have some buggy behaviour (see #27709), and it depends on the order of the values and which missing value (np.nan or None) is used.
With BooleanArray we won't have this problem (there is only a single NA + we don't need to rely on numpy's buggy object dtype behaviour). But I am not sure we should follow what is currently in the docs:
If skipna is False, then NA are treated as True, because these are not equal to zero.
This follows from numpy's behaviour with floats:
In [10]: np.array([0, np.nan]).any()
Out[10]: True
and while this might make sense in float context, I am not sure we should follow this behaviour and our docs and do:
>>> pd.Series([False, pd.NA], dtype="boolean").any(skipna=False)
True
I think this should rather give False or NA instead of True.
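The "not equal to zero" rule quoted from the docs comes down to NaN being truthy as a float, which can be checked in plain Python without numpy or pandas (the variable names are for illustration only):

```python
nan = float("nan")

# NaN is a non-zero float, so it converts to True.
nan_is_truthy = bool(nan)

# The built-in any() therefore reports True for [0.0, NaN].
any_result = any([0.0, nan])
```

That is reasonable for floats, but it says nothing about what a missing boolean should mean.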
While for object dtype it might make sense to align the behaviour with float (as argued in #27709 (comment)), for a boolean dtype we can probably use the behaviour we defined for NA in logical operations (eg False | NA = NA, so in that case, the above should give NA).
But are we OK with any/all not returning a boolean in this case? (Note that you only get this if someone explicitly sets skipna=False.)
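As a rough sketch of what "use the behaviour we defined for NA in logical operations" would mean, the reduction can be written as a fold over Kleene-style binary operators. Everything here is hypothetical illustration: `NA` is a plain sentinel standing in for `pd.NA`, and `kleene_or`, `kleene_and`, `any_kleene` and `all_kleene` are made-up names, not pandas API.

```python
from functools import reduce

# Hypothetical stand-in for pd.NA.
NA = object()


def kleene_or(a, b):
    if a is True or b is True:
        return True  # True | anything is True
    if a is NA or b is NA:
        return NA  # False | NA is unknown
    return False


def kleene_and(a, b):
    if a is False or b is False:
        return False  # False & anything is False
    if a is NA or b is NA:
        return NA  # True & NA is unknown
    return True


def any_kleene(values):
    # Corresponds to skipna=False: NA takes part in the reduction.
    return reduce(kleene_or, values, False)


def all_kleene(values):
    return reduce(kleene_and, values, True)
```

Under these assumptions `any_kleene([False, NA])` returns the `NA` sentinel rather than a bool, which is exactly the return-type question raised above.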