-
-
Notifications
You must be signed in to change notification settings - Fork 19.4k
BUG: Ensure Index.astype('category') returns a CategoricalIndex #18677
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| pass | ||
|
|
||
| elif dtype.startswith('interval[') or dtype.startswith('Interval['): | ||
| elif dtype.startswith('interval') or dtype.startswith('Interval'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed this because switching to pandas_dtype caused a test to break since it was passing 'interval' as the dtype, which appears to be valid:
In [1]: from pandas.core.dtypes.common import is_interval_dtype
In [2]: is_interval_dtype('interval')
Out[2]: True
pandas/core/indexes/multi.py
Outdated
| if not is_object_dtype(np.dtype(dtype)): | ||
| if not is_object_dtype(pandas_dtype(dtype)): | ||
| raise TypeError('Setting %s dtype to anything other than object ' | ||
| 'is not supported' % self.__class__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made this change since np.dtype doesn't recognize 'category', and would raise a different error message than the one specified here; pandas_dtype ensures that this error message will be raised.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh yes, we should have very very limited np.dtype calls anywhere; pandas_dtype is the general version
I think it should be CategoricalIndex |
pandas/core/indexes/category.py
Outdated
| if is_interval_dtype(dtype): | ||
| from pandas import IntervalIndex | ||
| return IntervalIndex.from_intervals(np.array(self)) | ||
| elif is_categorical_dtype(dtype) and (dtype == self.dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should use dtype.equals(self.dtype) here I think
pandas/core/indexes/multi.py
Outdated
| if not is_object_dtype(np.dtype(dtype)): | ||
| if not is_object_dtype(pandas_dtype(dtype)): | ||
| raise TypeError('Setting %s dtype to anything other than object ' | ||
| 'is not supported' % self.__class__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh yes, we should have very very limited np.dtype calls anywhere; pandas_dtype is the general version
pandas/tests/indexes/common.py
Outdated
| with pytest.raises(ValueError): | ||
| index.putmask('foo', 1) | ||
|
|
||
| def test_astype_category(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you pass a CategoricalDtype here as well (with and w/o ordered) and make this parameterized
| assert result.equals(idx) | ||
|
|
||
| result = idx.astype('category') | ||
| def test_astype_category(self, closed): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
pandas/tests/indexes/test_multi.py
Outdated
| with tm.assert_raises_regex(TypeError, "^Setting.*dtype.*object"): | ||
| self.index.astype(np.dtype(int)) | ||
|
|
||
| def test_astype_category(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
Codecov Report
@@ Coverage Diff @@
## master #18677 +/- ##
==========================================
- Coverage 91.6% 91.59% -0.02%
==========================================
Files 153 153
Lines 51306 51339 +33
==========================================
+ Hits 46998 47022 +24
- Misses 4308 4317 +9
Continue to review full report at Codecov.
|
|
Question regarding how this should behave with a Setup: In [3]: ci = pd.CategoricalIndex(list('abca'))
In [4]: new_dtype = CategoricalDtype(ordered=True)
In [5]: new_dtype
Out[5]: CategoricalDtype(categories=None, ordered=True)
In [6]: ci.dtype
Out[6]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)Then In [7]: ci.dtype == new_dtype
Out[7]: TrueWhich appears to be intentional, per the comment prior to the implementation: pandas/pandas/core/dtypes/dtypes.py Lines 216 to 222 in 9629fef
In this case, should doing
|
373f9d4 to
b4f354f
Compare
|
Made review related updates and a few additional fixes/changes:
Also modified the The only slightly counter-intuitive thing is that |
| dtype = CategoricalDtype(new_categories, new_ordered) | ||
|
|
||
| # fastpath if dtypes are equal | ||
| if dtype == self.dtype: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that dtype.equals(self.dtype) raises AttributeError: 'CategoricalDtype' object has no attribute 'equals'.
The only other thing I found for dtype comparison is is_dtype_equal, but under the hood that basically just does == within a try/except to catch if the dtypes aren't comparable. We're within elif is_categorical_dtype(dtype) here though, so is_dtype_equal seem superfluous, but could still use it if preferred.
I agree that from a user point of view this should change the ordered attribute.
I am wondering if it would not be possible to change this. It is clear that we need to keep
That doesn't sound fully as the desired behaviour .. although I am not fully sure :-) |
Yes, agreed that it's unclear. In that case, I think it's less surprising for
That was for backwards comparability. I wouldn't use it as an argument of consistency, since it's a bad thing to be consistent about :) |
b4f354f to
8066780
Compare
|
Updates:
|
8066780 to
3697d6e
Compare
jreback
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. small comments.
pandas/core/indexes/category.py
Outdated
|
|
||
| @Appender(_index_shared_docs['astype']) | ||
| def astype(self, dtype, copy=True): | ||
| if isinstance(dtype, compat.string_types) and dtype == 'category': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you actually need this check as the dtype == self.dtype line below should pick this up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this, or some variation of it, is necessary in order to guarantee that CI.astype('category') doesn't change anything. The issue being that pandas_dtype('category') returns CDT(None, False), which would change a CI with ordered=True to False.
pandas/core/indexes/datetimes.py
Outdated
| return self.copy() | ||
| return self | ||
| elif is_categorical_dtype(dtype): | ||
| from pandas.core.indexes.category import CategoricalIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pattern is to import CI at the top, but ok here too
e.g. from pandas.core.indexes.category import CategoricalIndex should work
pandas/core/indexes/interval.py
Outdated
| elif is_categorical_dtype(dtype): | ||
| from pandas import Categorical | ||
| return Categorical(self, ordered=True) | ||
| from pandas.core.indexes.category import CategoricalIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
| 'is not supported' % self.__class__) | ||
| dtype = pandas_dtype(dtype) | ||
| if is_categorical_dtype(dtype): | ||
| msg = '> 1 ndim Categorical are not supported at this time' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test fo this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrote a test for it in test_multi.py, which overrides the test in common.py:
pandas/core/indexes/numeric.py
Outdated
| elif is_object_dtype(dtype): | ||
| values = self._values.astype('object', copy=copy) | ||
| elif is_categorical_dtype(dtype): | ||
| from pandas.core.indexes.category import CategoricalIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
| return self.to_timestamp(how=how).tz_localize(dtype.tz) | ||
| elif is_period_dtype(dtype): | ||
| return self.asfreq(freq=dtype.freq) | ||
| elif is_categorical_dtype(dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
side thing, I think that we could make a more generic astype in indexes.base and remove some boiler plate maybe (of course separate PR), you can make an issue if you want (or just PR!)
3697d6e to
31d4d62
Compare
|
Updated to move the |
|
Looks good to me |
31d4d62 to
6042131
Compare
|
Moved the categorical dtype update code to a new Not sure if the name/location is appropriate, but can rename/move if need be. For the time being, I've maintained the existing logic of |
| @Appender(_index_shared_docs['astype']) | ||
| def astype(self, dtype, copy=True): | ||
| if is_categorical_dtype(dtype): | ||
| from .category import CategoricalIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we have an import issue if we import this at the top (with the fully qualified path)?
|
thanks @jschendel nice patch! keep em coming! |
git diff upstream/master -u -- "*.py" | flake8 --diffNotes:
MultiIndex.astype('category')raises per @TomAugspurger's comment in the issue.IntervalIndex.astype('category')return aCategoricalwithordered=Trueinstead ofCategoricalIndex, since it looks like someone previously intentionally implemented it this way. I don't immediately see a reason why, but left it as is. Would be straightforward to make this consistent and return aCategoricalIndex.CategoricalIndex.