BUG: Ensure Index.astype('category') returns a CategoricalIndex #18677

jschendel · 2017-12-07T08:11:35Z

- closes Index astype('category') does not return a CategoricalIndex #18630
- tests added / passed
- passes git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry

Notes:

- MultiIndex.astype('category') raises per @TomAugspurger's comment in the issue.
- IntervalIndex.astype('category') returns a Categorical with ordered=True instead of a CategoricalIndex, since it looks like someone previously implemented it this way intentionally. I don't immediately see a reason why, but left it as is. It would be straightforward to make this consistent and return a CategoricalIndex.
- All other types of index should return a CategoricalIndex.
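
A quick way to check the behavior described in these notes is sketched below. It assumes a pandas version that already includes this fix, and the sample indexes are made up for illustration. The interval case reflects the later review update in this thread, where IntervalIndex.astype('category') was changed to return a CategoricalIndex.

```python
import pandas as pd

# Illustrative indexes of several kinds (the sample data is made up).
indexes = {
    "object": pd.Index(["a", "b", "c", "a"]),
    "integer": pd.Index([1, 2, 3]),
    "datetime": pd.date_range("2017-01-01", periods=3),
    "interval": pd.interval_range(start=0, end=3),
}

# With the fix in place, each conversion should produce a CategoricalIndex.
results = {kind: idx.astype("category") for kind, idx in indexes.items()}
for kind, result in results.items():
    print(kind, type(result).__name__)
```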

jschendel · 2017-12-07T08:14:43Z

pandas/core/dtypes/common.py

                pass

-        elif dtype.startswith('interval[') or dtype.startswith('Interval['):
+        elif dtype.startswith('interval') or dtype.startswith('Interval'):


Changed this because switching to pandas_dtype caused a test to break since it was passing 'interval' as the dtype, which appears to be valid:

In [1]: from pandas.core.dtypes.common import is_interval_dtype

In [2]: is_interval_dtype('interval')
Out[2]: True

jschendel · 2017-12-07T08:16:41Z

pandas/core/indexes/multi.py

-        if not is_object_dtype(np.dtype(dtype)):
+        if not is_object_dtype(pandas_dtype(dtype)):
            raise TypeError('Setting %s dtype to anything other than object '
                            'is not supported' % self.__class__)


Made this change since np.dtype doesn't recognize 'category', and would raise a different error message than the one specified here; pandas_dtype ensures that this error message will be raised.
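
The difference between the two dtype parsers can be seen in a small sketch like the following. It only uses the public `pandas.api.types` helpers; the variable names are illustrative.

```python
import numpy as np
from pandas.api.types import CategoricalDtype, is_object_dtype, pandas_dtype

# NumPy has no notion of a 'category' dtype, so parsing the string fails.
try:
    np.dtype("category")
    numpy_understands_category = True
except TypeError:
    numpy_understands_category = False

# pandas_dtype also understands pandas extension dtypes such as 'category'.
parsed = pandas_dtype("category")

print(numpy_understands_category)
print(type(parsed).__name__)
print(is_object_dtype(parsed))
```

Because `pandas_dtype` returns a dtype object instead of raising, the `is_object_dtype` check that follows it can reach the intended error message.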


jorisvandenbossche · 2017-12-07T09:46:56Z

> IntervalIndex.astype('category') returns a Categorical with ordered=True instead of CategoricalIndex, since it looks like someone previously intentionally implemented it this way.

I think it should be CategoricalIndex

jreback · 2017-12-07T11:25:21Z

pandas/core/indexes/category.py

        if is_interval_dtype(dtype):
            from pandas import IntervalIndex
            return IntervalIndex.from_intervals(np.array(self))
+        elif is_categorical_dtype(dtype) and (dtype == self.dtype):


you should use dtype.equals(self.dtype) here I think

jreback · 2017-12-07T11:26:41Z

pandas/core/indexes/multi.py

-        if not is_object_dtype(np.dtype(dtype)):
+        if not is_object_dtype(pandas_dtype(dtype)):
            raise TypeError('Setting %s dtype to anything other than object '
                            'is not supported' % self.__class__)


ahh yes, we should have very very limited np.dtype calls anywhere; pandas_dtype is the general version

jreback · 2017-12-07T11:27:43Z

pandas/tests/indexes/common.py

        with pytest.raises(ValueError):
            index.putmask('foo', 1)
+
+    def test_astype_category(self):


can you pass a CategoricalDtype here as well (with and w/o ordered) and make this parameterized
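
One possible shape for such a parametrized test is sketched below. The test name and sample data are hypothetical and this is not the test that was added in the PR; it assumes pytest is available and a pandas version that includes this fix.

```python
import pandas as pd
import pytest
from pandas.api.types import CategoricalDtype


@pytest.mark.parametrize("ordered", [True, False])
def test_astype_category_dtype(ordered):
    # Hypothetical sample index; the real tests run against shared fixtures.
    idx = pd.Index(["a", "b", "c", "a"])

    # Pass an explicit CategoricalDtype rather than the string 'category'.
    result = idx.astype(CategoricalDtype(ordered=ordered))

    assert isinstance(result, pd.CategoricalIndex)
    assert result.ordered == ordered
    assert list(result) == list(idx)
```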

jreback · 2017-12-07T11:27:53Z

pandas/tests/indexes/test_interval.py

        assert result.equals(idx)

-        result = idx.astype('category')
+    def test_astype_category(self, closed):


jreback · 2017-12-07T11:28:01Z

pandas/tests/indexes/test_multi.py

        with tm.assert_raises_regex(TypeError, "^Setting.*dtype.*object"):
            self.index.astype(np.dtype(int))

+    def test_astype_category(self):


codecov · 2017-12-07T13:46:06Z

Codecov Report

Merging #18677 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #18677      +/-   ##
==========================================
- Coverage    91.6%   91.59%   -0.02%     
==========================================
  Files         153      153              
  Lines       51306    51339      +33     
==========================================
+ Hits        46998    47022      +24     
- Misses       4308     4317       +9

Flag	Coverage Δ
#multiple	`89.45% <100%> (ø)`	⬆️
#single	`40.74% <31.7%> (-0.11%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexes/multi.py	`96.29% <100%> (+0.01%)`	⬆️
pandas/core/dtypes/dtypes.py	`95.27% <100%> (+0.13%)`	⬆️
pandas/core/indexes/numeric.py	`97.36% <100%> (+0.03%)`	⬆️
pandas/core/indexes/interval.py	`93.8% <100%> (ø)`	⬆️
pandas/core/indexes/datetimes.py	`95.7% <100%> (+0.01%)`	⬆️
pandas/core/indexes/base.py	`96.43% <100%> (ø)`	⬆️
pandas/core/indexes/category.py	`97.23% <100%> (+0.03%)`	⬆️
pandas/core/indexes/period.py	`92.93% <100%> (+0.03%)`	⬆️
pandas/core/dtypes/common.py	`94.45% <100%> (ø)`	⬆️
pandas/core/indexes/timedeltas.py	`91.26% <100%> (+0.05%)`	⬆️
... and 2 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ae74c2b...6042131.

jschendel · 2017-12-07T21:22:51Z

Question regarding how this should behave with a CategoricalIndex.

Setup:

In [3]: ci = pd.CategoricalIndex(list('abca'))

In [4]: new_dtype = CategoricalDtype(ordered=True)

In [5]: new_dtype
Out[5]: CategoricalDtype(categories=None, ordered=True)

In [6]: ci.dtype
Out[6]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)

Then ci.dtype and new_dtype are considered equal despite ordered being different:

In [7]: ci.dtype == new_dtype
Out[7]: True

Which appears to be intentional, per the comment prior to the implementation:

pandas/pandas/core/dtypes/dtypes.py

Lines 216 to 222 in 9629fef

    
        elif self.categories is None or other.categories is None:
            # We're forced into a suboptimal corner thanks to math and
            # backwards compatibility. We require that `CDT(...) == 'category'`
            # for all CDTs **including** `CDT(None, ...)`. Therefore, *all*
            # CDT(., .) = CDT(None, False) and *all*
            # CDT(., .) = CDT(None, True).
            return True

In this case, should doing ci.astype(new_dtype) change ordered to True?

- From a user standpoint I'd expect this to change ordered to True, as this would logically be the user's intention.
- From a codebase consistency standpoint, it seems like since the two dtypes are equal, doing an astype shouldn't change anything.
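
The comparisons involved can be reproduced with a short sketch like this one. The equality with the string 'category' is the backwards-compatibility requirement quoted above; the result of the last comparison may differ between pandas versions, so it is printed without any claim about its value.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

ci = pd.CategoricalIndex(list("abca"))
new_dtype = CategoricalDtype(ordered=True)

# Any CategoricalDtype compares equal to the string 'category'.
ci_equals_string = ci.dtype == "category"
new_equals_string = new_dtype == "category"
print(ci_equals_string, new_equals_string)

# Comparison discussed in this comment; the outcome is version dependent.
print(ci.dtype == new_dtype)
```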

jschendel · 2017-12-08T08:09:29Z

Made review related updates and a few additional fixes/changes:

- IntervalIndex.astype('category') has been changed to return a CategoricalIndex
- Modified MultiIndex.astype('category') to raise a NotImplementedError stating that >1 ndim categoricals aren't currently supported
  - Mirrors a similar error message raised by DataFrame.astype('category').

Also modified the CategoricalIndex.astype behavior to address my question above. I chose the first bulleted option. The logic I implemented is that if a CategoricalDtype with categories or ordered being None is passed, None is replaced by the corresponding value from the CategoricalIndex's dtype. Basically None corresponds to "don't change this attribute".

The only slightly counter-intuitive thing is that pandas_dtype('category') returns a CategoricalDtype with ordered=False, meaning that if you have an ordered CategoricalIndex and do .astype('category') it would keep the same categories but switch ordered to False. Should be straightforward to make this not change anything if that'd be preferred.
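
The "None means don't change this attribute" rule can be sketched roughly as follows. The helper `merge_categorical_dtype` is a hypothetical name used only for illustration and is not the code in this PR; it relies only on the public `categories` and `ordered` attributes of CategoricalDtype.

```python
from pandas.api.types import CategoricalDtype


def merge_categorical_dtype(current, requested):
    """Hypothetical helper: fill None attributes of `requested` from `current`."""
    categories = (
        requested.categories
        if requested.categories is not None
        else current.categories
    )
    ordered = requested.ordered if requested.ordered is not None else current.ordered
    return CategoricalDtype(categories, ordered)


current = CategoricalDtype(["a", "b", "c"], ordered=False)

# Only `ordered` is specified, so the existing categories are kept.
only_ordered = merge_categorical_dtype(current, CategoricalDtype(ordered=True))

# Only `categories` is specified, so the existing `ordered` flag is kept.
only_categories = merge_categorical_dtype(
    current, CategoricalDtype(["x", "y"], ordered=None)
)

print(only_ordered)
print(only_categories)
```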

jschendel · 2017-12-08T08:14:34Z

pandas/core/indexes/category.py

+            dtype = CategoricalDtype(new_categories, new_ordered)
+
+            # fastpath if dtypes are equal
+            if dtype == self.dtype:


Note that dtype.equals(self.dtype) raises AttributeError: 'CategoricalDtype' object has no attribute 'equals'.

The only other thing I found for dtype comparison is is_dtype_equal, but under the hood that basically just does == within a try/except to catch if the dtypes aren't comparable. We're within elif is_categorical_dtype(dtype) here though, so is_dtype_equal seems superfluous, but could still use it if preferred.

jorisvandenbossche · 2017-12-08T08:34:24Z

> should doing ci.astype(new_dtype) change ordered to True?

I agree that from a user point of view this should change the ordered attribute.

> From a codebase consistency standpoint, it seems like since the two dtypes are equal, doing an astype shouldn't change anything

I am wondering if it would not be possible to change this. It is clear that we need to keep cat_dtype == 'category' to be True. But that is already handled a few lines above the snippet you showed.
So I don't fully understand the comment why it is needed that all CDT(None, ordered=True/False) needs to be equal regardless of the ordered attribute.

cc @TomAugspurger

> if you have an ordered CategoricalIndex and do .astype('category') it would keep the same categories but switch ordered to False. Should be straightforward to make this not change anything if that'd be preferred.

That doesn't sound fully as the desired behaviour .. although I am not fully sure :-)

TomAugspurger · 2017-12-08T12:58:49Z

> > if you have an ordered CategoricalIndex and do .astype('category') it would keep the same categories but switch ordered to False. Should be straightforward to make this not change anything if that'd be preferred.
>
> That doesn't sound fully as the desired behaviour .. although I am not fully sure :-)

Yes, agreed that it's unclear. In that case, I think it's less surprising for 'category' to behave as CDT(None, None), i.e. don't change anything.

> So I don't fully understand the comment why it is needed that all CDT(None, ordered=True/False) needs to be equal regardless of the ordered attribute.

That was for backwards compatibility. I wouldn't use it as an argument of consistency, since it's a bad thing to be consistent about :)

jschendel · 2017-12-09T00:09:03Z

Updates:

- Modified CategoricalIndex.astype('category') to not change anything
- Fixed some tests that previously relied on IntervalIndex.astype('category') returning a Categorical with ordered=True instead of a CategoricalIndex
  - This only impacted the creation of expected in the tests; didn't change result.

jreback

lgtm. small comments.

jreback · 2017-12-09T15:39:08Z

pandas/core/indexes/category.py


    @Appender(_index_shared_docs['astype'])
    def astype(self, dtype, copy=True):
+        if isinstance(dtype, compat.string_types) and dtype == 'category':


I don't think you actually need this check as the dtype == self.dtype line below should pick this up.

I think this, or some variation of it, is necessary in order to guarantee that CI.astype('category') doesn't change anything. The issue being that pandas_dtype('category') returns CDT(None, False), which would change a CI with ordered=True to False.
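
The behavior settled on in this thread can be checked with a sketch along these lines, assuming a pandas version that includes this change. The variable names are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# An ordered CategoricalIndex should be left as-is by the bare string.
ordered_ci = pd.CategoricalIndex(list("abca"), ordered=True)
same = ordered_ci.astype("category")
print(same.ordered)

# An explicit dtype with only `ordered` set should update just that attribute.
unordered_ci = pd.CategoricalIndex(list("abca"))
changed = unordered_ci.astype(CategoricalDtype(ordered=True))
print(changed.ordered, list(changed.categories))
```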

jreback · 2017-12-09T15:40:29Z

pandas/core/indexes/datetimes.py

                return self.copy()
            return self
+        elif is_categorical_dtype(dtype):
+            from pandas.core.indexes.category import CategoricalIndex


pattern is to import CI at the top, but ok here too

e.g. from pandas.core.indexes.category import CategoricalIndex should work

jreback · 2017-12-09T15:40:38Z

pandas/core/indexes/interval.py

        elif is_categorical_dtype(dtype):
-            from pandas import Categorical
-            return Categorical(self, ordered=True)
+            from pandas.core.indexes.category import CategoricalIndex


jreback · 2017-12-09T15:40:53Z

pandas/core/indexes/multi.py

-                            'is not supported' % self.__class__)
+        dtype = pandas_dtype(dtype)
+        if is_categorical_dtype(dtype):
+            msg = '> 1 ndim Categorical are not supported at this time'


test for this?

Wrote a test for it in test_multi.py, which overrides the test in common.py:

https://github.com/jschendel/pandas/blob/31d4d62295035123453ab24f393176750661a283/pandas/tests/indexes/test_multi.py#L558-L568
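
A minimal check of the MultiIndex behavior, separate from the linked test, might look like the sketch below. It assumes a pandas version where MultiIndex.astype raises NotImplementedError for categorical dtypes, as described in this PR; the sample index is made up.

```python
import pandas as pd

# Illustrative two-level index.
mi = pd.MultiIndex.from_product([["a", "b"], [1, 2]])

try:
    mi.astype("category")
    raised = False
    message = ""
except NotImplementedError as err:
    raised = True
    message = str(err)

print(raised, message)
```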

jreback · 2017-12-09T15:41:03Z

pandas/core/indexes/numeric.py

        elif is_object_dtype(dtype):
            values = self._values.astype('object', copy=copy)
+        elif is_categorical_dtype(dtype):
+            from pandas.core.indexes.category import CategoricalIndex


jreback · 2017-12-09T15:41:55Z

pandas/core/indexes/period.py

            return self.to_timestamp(how=how).tz_localize(dtype.tz)
        elif is_period_dtype(dtype):
            return self.asfreq(freq=dtype.freq)
+        elif is_categorical_dtype(dtype):


same

side thing, I think that we could make a more generic astype in indexes.base and remove some boilerplate maybe (of course separate PR), you can make an issue if you want (or just PR!)

jschendel · 2017-12-09T20:14:12Z

Updated to move the CI imports, and replied to the other comments in the latest review.

jorisvandenbossche · 2017-12-10T13:07:03Z

Looks good to me

jschendel · 2017-12-11T07:48:34Z

Moved the categorical dtype update code to a new _update_dtype method of CategoricalDtype, per #18710 (comment)

Not sure if the name/location is appropriate, but can rename/move if need be. For the time being, I've maintained the existing logic of .astype('category') not updating anything.

jreback · 2017-12-11T11:02:12Z

pandas/core/indexes/base.py

    @Appender(_index_shared_docs['astype'])
    def astype(self, dtype, copy=True):
+        if is_categorical_dtype(dtype):
+            from .category import CategoricalIndex


I guess we have an import issue if we import this at the top (with the fully qualified path)?

jreback · 2017-12-11T11:06:39Z

thanks @jschendel nice patch! keep em coming!

jorisvandenbossche added the Categorical and Dtype Conversions labels Dec 7, 2017

jreback requested changes Dec 7, 2017

jschendel force-pushed the idx-astype-category branch from 373f9d4 to b4f354f Compare December 8, 2017 08:09

jschendel force-pushed the idx-astype-category branch from b4f354f to 8066780 Compare December 9, 2017 00:08

jschendel force-pushed the idx-astype-category branch from 8066780 to 3697d6e Compare December 9, 2017 01:38

jreback requested changes Dec 9, 2017

jreback mentioned this pull request Dec 9, 2017

CLN: consolidate Index.astype #18704

Closed

jschendel force-pushed the idx-astype-category branch from 3697d6e to 31d4d62 Compare December 9, 2017 20:10

jschendel mentioned this pull request Dec 10, 2017

BUG: Fix Series.astype and Categorical.astype to update existing Categorical data #18710

Merged

5 tasks

jschendel added 6 commits December 11, 2017 00:40

BUG: Ensure Index.astype('category') returns a CategoricalIndex

3c37bb7

review updates

6d953e4

Make CI.astype('category') not change anything

20b5504

Fix tests broken by II.astype('category') changes

a90acec

Move CI imports

afcc50a

refactor dtype update

6042131

jschendel force-pushed the idx-astype-category branch from 31d4d62 to 6042131 Compare December 11, 2017 07:40

jreback added this to the 0.22.0 milestone Dec 11, 2017

jreback approved these changes Dec 11, 2017

jreback merged commit 3821040 into pandas-dev:master Dec 11, 2017

jschendel deleted the idx-astype-category branch December 11, 2017 21:25

jschendel mentioned this pull request Dec 15, 2017

DISC: Behavior of .astype('category') on existing categorical data #18790

Closed

jschendel mentioned this pull request Apr 27, 2018

Index.astype('category') does not work #20843

Closed
