BUG: Fix Series.astype and Categorical.astype to update existing Categorical data #18710
Codecov Report
@@ Coverage Diff @@
## master #18710 +/- ##
==========================================
- Coverage 91.61% 91.59% -0.03%
==========================================
Files 153 153
Lines 51363 51359 -4
==========================================
- Hits 47058 47044 -14
- Misses 4305 4315 +10
pandas/core/categorical.py
Outdated
return self
# GH 18593: keep current categories if None (ordered can't be None)
if dtype.categories is None:
    new_categories = self.categories
if you also set ordered=False here (for the dtype.categories is None case) and in the else branch take ordered from the dtype, then I believe you can remove lines 439-441. You would also need to make line 450:
dtype = CategoricalDtype(new_categories, ordered)
The problem is that astype('category') should not change the ordered attribute (so not set it always to False), so you would need to take the ordered from self. But then, you are not really sure if the user specified the order of the CategoricalDtype specifically, or if the ordered=False came from the default value.
To summarize, I think it is easier to leave it as is and treat 'category' as a special case.
(we might want to check if we can't let the ordered keyword have a default of None to make it easier to deal with this, but that is another issue)
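To make the rule under discussion concrete, here is a minimal pure-Python sketch of how the existing dtype and the requested dtype could be combined. It is not the PR's implementation: `DtypeSketch` and `update_dtype_sketch` are hypothetical names that only model the `categories` and `ordered` attributes, following the behaviour described in this thread.

```python
from typing import NamedTuple, Optional, Tuple, Union


class DtypeSketch(NamedTuple):
    """Hypothetical stand-in for CategoricalDtype: categories plus ordered."""
    categories: Optional[Tuple[str, ...]]
    ordered: bool = False


def update_dtype_sketch(current: DtypeSketch,
                        requested: Union[str, DtypeSketch]) -> DtypeSketch:
    """Combine the existing dtype with the one passed to astype."""
    if isinstance(requested, str):
        if requested != 'category':
            raise ValueError('only the string "category" is handled here')
        # Special case: the bare string keeps categories and ordered as-is.
        return current
    # An explicit dtype without categories keeps the current categories,
    # but its ordered flag always wins, even if it is only the default.
    categories = (current.categories if requested.categories is None
                  else requested.categories)
    return DtypeSketch(categories, requested.ordered)


current = DtypeSketch(('a', 'b'), ordered=True)
print(update_dtype_sketch(current, 'category'))
print(update_dtype_sketch(current, DtypeSketch(None)))
print(update_dtype_sketch(current, DtypeSketch(('b', 'a'), True)))
```

The second call shows the ambiguity raised above: with a default of ordered=False, the sketch cannot tell whether the caller asked for an unordered result or simply left the argument out.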
The problem is that astype('category') should not change the ordered attribute (so not set it always to False), so you would need to take the ordered from self. But then, you are not really sure if the user specified the order of the CategoricalDtype specifically, or if the ordered=False came from the default value.
not sure this is true. .astype('category') is clearly == CategoricalDtype(), which by definition has ordered=False. I don't know how you can reach any other conclusion. Furthermore, if this is NOT the case, then we should fix it immediately, as a special case for this is monumentally confusing. The very fact that we have to have this discussion attests to this.
Because that would be changing the existing behaviour:
In [2]: pd.Categorical(['a', 'b'], ordered=True)
Out[2]:
[a, b]
Categories (2, object): [a < b]
In [3]: pd.Categorical(['a', 'b'], ordered=True).astype('category')
Out[3]:
[a, b]
Categories (2, object): [a < b]
I personally think the above is the logical behaviour, but I can also see the argument for making the above ordered=False.
My main reason for preferring the above is that 'category' == CategoricalDtype(), and that CategoricalDtype has a default of ordered=False, should be more of an implementation detail for the user.
But let's maybe open a new issue to discuss that?
And keep this PR just fixing the bug without changing existing behaviour.
Can you refactor this to a common method that you can use in #18677 (or on that PR is ok too)
Will do that over in #18677, since it seems to be closer to being complete. Alternatively, I can close the two individual PRs and create a new PR that combines both, if that would be preferable. I didn't realize the fixes would be so similar until the first PR was already in review.
pandas/core/categorical.py
Outdated
.. versionadded:: 0.19.0
"""
if isinstance(dtype, compat.string_types) and dtype == 'category':
see my next comment
kwargs.setdefault('categories', categories)
kwargs.setdefault('ordered', ordered)
return self.make_block(Categorical(self.values, **kwargs))
if is_categorical_dtype(self.values):
I don't think you need all of this logic. Wouldn't
values = self.values.astype(dtype, copy=copy)
return self.make_block(values, dtype=dtype)
be enough? (If values is already a Categorical or dtype is a CDT, it will infer correctly, and if it's not, it will as well.)
I don't think that quite works, since self.values can be a different object depending on what self is: if self is already categorical, then self.values is a Categorical, otherwise self.values is a numpy array.
In the numpy case, self.values.astype raises TypeError: data type not understood when a CDT is passed as the dtype.
Likewise, self.make_block(Categorical(self.values, dtype=dtype)) also doesn't work by itself. In the Categorical case, the constructor ignores the dtype parameter when the input data is already Categorical, so no update occurs.
Seems like the two paths are necessary? Or am I overlooking something?
ok seems reasonable then
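The two paths described above can be sketched roughly as follows. This is an illustration, not the code from this PR: `cast_to_categorical_sketch` is a hypothetical helper, and it assumes a pandas version that exposes `pd.CategoricalDtype` at the top level and in which `Categorical.astype` already applies the requested dtype.

```python
import numpy as np
import pandas as pd


def cast_to_categorical_sketch(values, dtype):
    """Hypothetical helper: cast values to the given CategoricalDtype."""
    if isinstance(values, pd.Categorical):
        # Already categorical: Categorical.astype applies the new dtype.
        return values.astype(dtype)
    # Plain ndarray: build a Categorical, since numpy cannot cast to a CDT.
    return pd.Categorical(values, dtype=dtype)


dtype = pd.CategoricalDtype(['b', 'a'], ordered=True)
raw = np.array(['a', 'b', 'a'], dtype=object)

try:
    raw.astype(dtype)  # numpy does not know CategoricalDtype
except TypeError as exc:
    print('ndarray.astype failed:', exc)

from_array = cast_to_categorical_sketch(raw, dtype)
from_cat = cast_to_categorical_sketch(pd.Categorical(raw), dtype)
print(from_array.dtype == dtype, from_cat.dtype == dtype)
```

Branching on the type of the values, rather than on the requested dtype, keeps the numpy case away from `ndarray.astype`, which is where the TypeError mentioned above comes from.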
|
give this a rebase and use |
f9a1457 to
0fb9140
Compare
|
Rebased and used Will write up an issue in the next day or so to discuss the behavior of |
jorisvandenbossche
left a comment
Looks good to me, just had some small comments on the tests
expected = np.array(cat, dtype=np.float)
tm.assert_numpy_array_equal(result, expected)

@pytest.mark.parametrize('copy', [True, False])
you are not really testing the effect of the copy keyword
def test_astype_categorical(self):
@pytest.mark.parametrize('ordered', [True, False])
@pytest.mark.parametrize('copy', [True, False])
is this copy needed here? it is not doing anything in all those cases (they will copy anyway, so I think just using it once differently in the test is good enough)
pandas/tests/series/test_dtypes.py
Outdated
lambda x: x.astype('object').astype(Categorical)]:
    pytest.raises(TypeError, lambda: invalid(s))

@pytest.mark.parametrize('copy', [True, False])
same here
0fb9140 to
6702f90
Compare
|
Updated to remove the unnecessary |
return self.copy()
return self
# GH 10696/18593
dtype = self.dtype._update_dtype(dtype)
we might want to add some explicit tests for _update_dtype at some point (separate PR)
added them in the other PR during the initial implementation actually:
pandas/pandas/tests/dtypes/test_dtypes.py
Lines 127 to 152 in 265e327
@pytest.mark.parametrize('dtype', [
    CategoricalDtype(list('abc'), False),
    CategoricalDtype(list('abc'), True)])
@pytest.mark.parametrize('new_dtype', [
    'category',
    CategoricalDtype(None, False),
    CategoricalDtype(None, True),
    CategoricalDtype(list('abc'), False),
    CategoricalDtype(list('abc'), True),
    CategoricalDtype(list('cba'), False),
    CategoricalDtype(list('cba'), True),
    CategoricalDtype(list('wxyz'), False),
    CategoricalDtype(list('wxyz'), True)])
def test_update_dtype(self, dtype, new_dtype):
    if isinstance(new_dtype, string_types) and new_dtype == 'category':
        expected_categories = dtype.categories
        expected_ordered = dtype.ordered
    else:
        expected_categories = new_dtype.categories
        if expected_categories is None:
            expected_categories = dtype.categories
        expected_ordered = new_dtype.ordered
    result = dtype._update_dtype(new_dtype)
    tm.assert_index_equal(result.categories, expected_categories)
    assert result.ordered is expected_ordered
thanks @jschendel nice patches! keep em coming!
Change in pandas-dev/pandas#18710 caused a dask failure when reading CSV files, as our `.astype` relied on the old (broken) behavior. Closes dask#2996
* COMPAT: Pandas 0.22.0 astype for categorical dtypes
  Change in pandas-dev/pandas#18710 caused a dask failure when reading CSV files, as our `.astype` relied on the old (broken) behavior. Closes #2996
* Fix pandas version check
* Refactored
* update docs
* compat
* Simplify
* Simplify
* Update changelog.rst
git diff upstream/master -u -- "*.py" | flake8 --diff
Couldn't find an issue about it, but the same problem described with Series.astype in the linked issues was occurring with Categorical.astype. Put in a fix for that too with some code very similar to what was done in #18677 for CategoricalIndex.astype. Could probably consolidate the two into a single helper function, potentially as part of #18704.