BUG: Fix Series.astype and Categorical.astype to update existing Categorical data #18710

jschendel · 2017-12-10T05:07:27Z

closes BUG: astype('category', categories=...) failes on a series of categorical type #10696
closes x.astype(categorical_dtype) ignored if x is already categorical #18593
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Couldn't find an issue about it, but the same problem described with Series.astype in the linked issues was occurring with Categorical.astype. Put in a fix for that too with some code very similar to what was done in #18677 for CategoricalIndex.astype. Could probably consolidate the two into a single helper function, potentially as part of #18704.

codecov · 2017-12-10T05:40:20Z

Codecov Report

Merging #18710 into master will decrease coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #18710      +/-   ##
==========================================
- Coverage   91.61%   91.59%   -0.03%     
==========================================
  Files         153      153              
  Lines       51363    51359       -4     
==========================================
- Hits        47058    47044      -14     
- Misses       4305     4315      +10

Flag	Coverage Δ
#multiple	`89.46% <100%> (-0.01%)`	⬇️
#single	`40.77% <84.61%> (-0.11%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/categorical.py	`95.79% <100%> (+0.14%)`	⬆️
pandas/core/internals.py	`94.42% <100%> (-0.02%)`	⬇️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/util/testing.py	`82.32% <0%> (-0.2%)`	⬇️
pandas/core/frame.py	`97.81% <0%> (-0.1%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9705a48...6702f90. Read the comment docs.

jreback · 2017-12-10T14:52:50Z

pandas/core/categorical.py

-            return self
+            # GH 18593: keep current categories if None (ordered can't be None)
+            if dtype.categories is None:
+                new_categories = self.categories


if you also set ordered=False here (for dtype.categories is None) and the else take the ordered from the dtype, then I believe you can remvoe 439-441 (also needt o make 450 be

dtype = CategoricalDtype(new_categories, ordered)

The problem is that astype('category') should not change the ordered attribute (so not set it always to False), so you would need to take the ordered from self. But then, you are not really sure if the user specified the order of the CategoricalDtype specifically, or if the ordered=False came from the default value.

To summarize, I think it is easier to leave it as is and treat 'category' as a special case.

(we might want to check if we can't let the ordered keyword have a default of None to make it easier to deal with this, but that is another issue)

so this is pretty much the same block of code as in #18677

this is way too special casey to exist in 2 places. Can you refator this to a common method that you can use in #18677 (or on that PR is ok too).

I'd like to do that before merging either one of these.

The problem is that astype('category') should not change the ordered attribute (so not set it always to False), so you would need to take the ordered from self. But then, you are not really sure if the user specified the order of the CategoricalDtype specifically, or if the ordered=False came from the default value.

not sure this is True. .astype('category') is clearly == CategoricalDtype() which by-definition has ordered=False. I don't know how you can have any other conclusion. Furthermore if this is NOT the case. Then we should immediately fix this. As a special case for this is monumentally confusing. The very fact that we have to have this discussion attests to this.

Because that would be changing the existing behaviour:

In [2]: pd.Categorical(['a', 'b'], ordered=True) Out[2]: [a, b] Categories (2, object): [a < b] In [3]: pd.Categorical(['a', 'b'], ordered=True).astype('category') Out[3]: [a, b] Categories (2, object): [a < b]

I personally think the above is the logical behaviour, but I can also see a point in to make the above ordered=False.
Main reason for liking the above is that 'category' == CategoricalDtype() and that CategoricalDtype has a default of ordered=False should be more an implementation detail to the user.

But let's maybe open a new issue to discuss that?
And keep this PR just fixing the bug without changing existing behaviour.

Can you refactor this to a common method that you can use in #18677 (or on that PR is ok too)

Will do that over in #18677, since it seems to be closer to being complete. Or can close the two individual PR's and create a new PR that combines both, if that would be preferable. Didn't realize the fixes would be so similar until the first PR was already in review.

jreback · 2017-12-10T14:52:58Z

pandas/core/categorical.py

            .. versionadded:: 0.19.0

        """
+        if isinstance(dtype, compat.string_types) and dtype == 'category':


see my next comment

jreback · 2017-12-10T14:54:46Z

pandas/core/internals.py

-            kwargs.setdefault('categories', categories)
-            kwargs.setdefault('ordered', ordered)
-            return self.make_block(Categorical(self.values, **kwargs))
+            if is_categorical_dtype(self.values):


I don't think you need all of this logic, wouldn't

values = self.values.astype(dtype, copy=copy) return self.make_block(values, dtype=dtype)

be enough (if values is a Categorical already or dtype is a CDT, it will infer correctly, and if its not it will as well).

I don't think that quite works, since self.values can be a different object depending on what self is: if self is already categorical, then self.values is a Categorical, otherwise self.values is a numpy array.

In the numpy case, self.values.astype raises TypeError: data type not understood when a CDT is passed as the dtype.

Likewise, self.make_block(Categorical(self.values, dtype=dtype)) also doesn't work by itself. In the Categorical case, the constructor ignores the dtype parameter when the input data is already Categorical, so no update occurs.

Seems like the two paths are necessary? Or am I overlooking something?

ok seems reasonable then

jreback · 2017-12-11T11:25:05Z

give this a rebase and use _update_dtype

jschendel · 2017-12-12T07:06:37Z

Rebased and used _update_dtype.

Will write up an issue in the next day or so to discuss the behavior of .astype('category') on existing categorical data, extending on the discussion that took place further up in the code review comments.

jorisvandenbossche

Looks good to me, just had some small comments on the tests

jorisvandenbossche · 2017-12-12T08:28:13Z

pandas/tests/categorical/test_dtypes.py

+        expected = np.array(cat, dtype=np.float)
+        tm.assert_numpy_array_equal(result, expected)
+
+    @pytest.mark.parametrize('copy', [True, False])


you are not really testing the effect of the copy keyword

jorisvandenbossche · 2017-12-12T08:29:02Z

pandas/tests/categorical/test_dtypes.py


-    def test_astype_categorical(self):
+    @pytest.mark.parametrize('ordered', [True, False])
+    @pytest.mark.parametrize('copy', [True, False])


is this copy needed here? it is not doing anything in all those cases (they will copy anyway, so I think just using it once differently in the test is good enough)

jorisvandenbossche · 2017-12-12T08:32:13Z

pandas/tests/series/test_dtypes.py

                        lambda x: x.astype('object').astype(Categorical)]:
            pytest.raises(TypeError, lambda: invalid(s))

+    @pytest.mark.parametrize('copy', [True, False])


…gorical data

jschendel · 2017-12-13T03:59:59Z

Updated to remove the unnecessary copy parametrize. Removed an additional instance of this that was introduced in #18677.

jreback · 2017-12-13T14:34:47Z

pandas/core/categorical.py

-                return self.copy()
-            return self
+            # GH 10696/18593
+            dtype = self.dtype._update_dtype(dtype)


we might want to add some explicit tests for _update_dtype at some point (separate PR)

added them in the other PR during the initial implementation actually:

pandas/pandas/tests/dtypes/test_dtypes.py

Lines 127 to 152 in 265e327

@pytest.mark.parametrize('dtype', [

CategoricalDtype(list('abc'), False),

CategoricalDtype(list('abc'), True)])

@pytest.mark.parametrize('new_dtype', [

'category',

CategoricalDtype(None, False),

CategoricalDtype(None, True),

CategoricalDtype(list('abc'), False),

CategoricalDtype(list('abc'), True),

CategoricalDtype(list('cba'), False),

CategoricalDtype(list('cba'), True),

CategoricalDtype(list('wxyz'), False),

CategoricalDtype(list('wxyz'), True)])

def test_update_dtype(self, dtype, new_dtype):

if isinstance(new_dtype, string_types) and new_dtype == 'category':

expected_categories = dtype.categories

expected_ordered = dtype.ordered

else:

expected_categories = new_dtype.categories

if expected_categories is None:

expected_categories = dtype.categories

expected_ordered = new_dtype.ordered

result = dtype._update_dtype(new_dtype)

tm.assert_index_equal(result.categories, expected_categories)

assert result.ordered is expected_ordered

jreback · 2017-12-13T14:35:27Z

thanks @jschendel nice patches! keep em coming!

Change in pandas-dev/pandas#18710 caused a dask failure when reading CSV files, as our `.astype` relied on the old (broken) behavior. Closes dask#2996

* COMPAT: Pandas 0.22.0 astype for categorical dtypes Change in pandas-dev/pandas#18710 caused a dask failure when reading CSV files, as our `.astype` relied on the old (broken) behavior. Closes #2996 * Fix pandas version check * Refactored * update docs * compat * Simplify * Simplify * Update changelog.rst

jreback requested changes Dec 10, 2017

View reviewed changes

jreback added Bug Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions labels Dec 10, 2017

jschendel mentioned this pull request Dec 11, 2017

BUG: Ensure Index.astype('category') returns a CategoricalIndex #18677

Merged

4 tasks

jschendel force-pushed the cat-astype-cat branch from f9a1457 to 0fb9140 Compare December 12, 2017 06:56

jorisvandenbossche approved these changes Dec 12, 2017

View reviewed changes

jschendel added 3 commits December 12, 2017 20:35

BUG: Fix Series.astype and Categorical.astype to update existing Cate…

fe30db4

…gorical data

rebase and use _update_dtype

821d79d

remove copy from tests

6702f90

jschendel force-pushed the cat-astype-cat branch from 0fb9140 to 6702f90 Compare December 13, 2017 03:37

jreback reviewed Dec 13, 2017

View reviewed changes

jreback approved these changes Dec 13, 2017

View reviewed changes

jreback added this to the 0.22.0 milestone Dec 13, 2017

jreback merged commit 73e73db into pandas-dev:master Dec 13, 2017

jschendel deleted the cat-astype-cat branch December 13, 2017 19:34

TomAugspurger mentioned this pull request Dec 14, 2017

read_csv failure on pandas master dask/dask#2996

Closed

TomAugspurger mentioned this pull request Dec 14, 2017

COMPAT: Pandas 0.22.0 astype for categorical dtypes dask/dask#2997

Merged

jschendel mentioned this pull request Dec 15, 2017

DISC: Behavior of .astype('category') on existing categorical data #18790

Closed

jorisvandenbossche mentioned this pull request Feb 17, 2018

ENH: Allow storing ExtensionArrays in containers #19520

Merged

	@pytest.mark.parametrize('dtype', [
	CategoricalDtype(list('abc'), False),
	CategoricalDtype(list('abc'), True)])
	@pytest.mark.parametrize('new_dtype', [
	'category',
	CategoricalDtype(None, False),
	CategoricalDtype(None, True),
	CategoricalDtype(list('abc'), False),
	CategoricalDtype(list('abc'), True),
	CategoricalDtype(list('cba'), False),
	CategoricalDtype(list('cba'), True),
	CategoricalDtype(list('wxyz'), False),
	CategoricalDtype(list('wxyz'), True)])
	def test_update_dtype(self, dtype, new_dtype):
	if isinstance(new_dtype, string_types) and new_dtype == 'category':
	expected_categories = dtype.categories
	expected_ordered = dtype.ordered
	else:
	expected_categories = new_dtype.categories
	if expected_categories is None:
	expected_categories = dtype.categories
	expected_ordered = new_dtype.ordered

	result = dtype._update_dtype(new_dtype)
	tm.assert_index_equal(result.categories, expected_categories)
	assert result.ordered is expected_ordered

Uh oh!

BUG: Fix Series.astype and Categorical.astype to update existing Categorical data #18710

BUG: Fix Series.astype and Categorical.astype to update existing Categorical data #18710

Uh oh!

Conversation

jschendel commented Dec 10, 2017

Uh oh!

codecov bot commented Dec 10, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback commented Dec 11, 2017

Uh oh!

jschendel commented Dec 12, 2017

Uh oh!

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jschendel commented Dec 13, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback commented Dec 13, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov bot commented Dec 10, 2017 •

edited

Loading