Skip to content

Conversation

@jankatins
Copy link
Contributor

If a series is a type category and the underlying Categorical has categories of type string or datetime, then make it possible to use the .str/.dt assessor on such a series.

The string/dt methods work on the categories (and therefore fast if we have
only a few categories), but return a Series with a dtype other than
category (integer, boolean, string,...), so that it is no different if we use
.str / .dt on a series of type string or of type category.

The main reason for that is that I think things like s.str.slice(...) + s.str.slice(...) should work.

Closes: #10661

@jankatins
Copy link
Contributor Author

I've no idea why .cat is not in dir(series_of_type_category)?!?

@jankatins jankatins changed the title Make .str available for Series of type category with strings Make .str/.dt available for Series of type category with string/datetime Nov 12, 2015
@kawochen
Copy link
Contributor

@jankatins
Copy link
Contributor Author

@kawochen Thanks! 7a20c36

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use self.cat.categories, because otherwise you could be forcing a conversion here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can do, but as I understand the code, I made sure that it is a categorical before using the categories on values: com.is_categorical_dtype(self.dtype) and com.is_object_dtype(self.values.categories)

@jankatins
Copy link
Contributor Author

Yay, it's green :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much better to have a loop here, that just looks at all of the StringMethods and tests against them, rather than writing out each one. (of course prob will need some special cases).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im not so sure: you would need to replace this 6 3liners by a much harder to read list:

tests = [("method", (args), {"kwarg":"whatever"}),...]

I still don't think it should be necessary to build tests for all string methods, as far as I can tell, I covered all code paths (_wrap_result and _wrap_result_expand)

@jreback
Copy link
Contributor

jreback commented Nov 13, 2015

can you update the doc as well (maybe a reference / example in the Categorical section to one in the TextMethods section)

@jankatins
Copy link
Contributor Author

Added docs in categorical.rst

@jankatins
Copy link
Contributor Author

Ok, @jreback, you were right, it needed more tests...

Rebased + new tests + fixes... Lets wait what travis says...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so instead of listing these out as this would be instantly out-dated if something else is added.

create a list of the ones that need special arguments, then programatically create the rest

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jankatins
Copy link
Contributor Author

Ok, two failures: https://travis-ci.org/pydata/pandas/builds/91286698, but I can't reproduce both errors (in py3.5 and py2.7, and it seems I'm too stupid to get a proper py3.4 environment which can run python setup.py build_ext --inplace :-( )`.

On py3.3/3.4:
nosetests pandas.io.tests.test_data

=====================================================================
ERROR: test_get_underlying_price (pandas.io.tests.test_data.TestYahooOptions)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/pydata/pandas/pandas/util/testing.py", line 1774, in wrapper
    return t(*args, **kwargs)
  File "/home/travis/build/pydata/pandas/pandas/io/tests/test_data.py", line 400, in test_get_underlying_price
    quote_price = options_object._underlying_price_from_root(root)
  File "/home/travis/build/pydata/pandas/pandas/io/data.py", line 737, in _underlying_price_from_root
    underlying_price = root.xpath('.//*[@class="time_rtq_ticker Fz-30 Fw-b"]')[0]\
IndexError: list index out of range

Py27, https://travis-ci.org/pydata/pandas/jobs/91286713
nosetests pandas.tseries.tests.test_timeseries

======================================================================
FAIL: test_datetime64_with_DateOffset (pandas.tseries.tests.test_timeseries.TestDatetimeIndex)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/pydata/pandas/pandas/tseries/tests/test_timeseries.py", line 2668, in test_datetime64_with_DateOffset
    assert_func(klass([op + x for x in s]), op + s)
  File "/opt/python/2.7.9/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/travis/build/pydata/pandas/pandas/util/testing.py", line 2042, in assert_produces_warning
    % expected_warning.__name__)
AssertionError: Did not see expected warning of class 'PerformanceWarning'.

@jreback
Copy link
Contributor

jreback commented Nov 16, 2015

ignore the first one, 2nd I am fixing now.

`.dt` and `.str` can be available when the Series is of type
category, which makes `.cat` available.
@jankatins
Copy link
Contributor Author

Ok, remodeled the tests and rebased. You could cancel the following travis run, the difference to the latest is just some reshuffeling in the commits: https://travis-ci.org/pydata/pandas/builds/91400953

@jankatins
Copy link
Contributor Author

Yay, https://travis-ci.org/pydata/pandas/builds/91400953 is green so hopefull this one (no code change in the PR, only moved some changes to a different commit) will also be green :-)

@jankatins
Copy link
Contributor Author

@jreback green... :-) Thanks for fixing whatever needed fixing :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string and datetime accessors

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you try/catching then raise? (for debugging?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes: if the methods failed due to missing args, then this was the easiest way to find out why/where they failed.

[Edit:] Ok, I remember that I couldn't figure out which func was failing but now I get propery errors for this:

getattr(s.str, "repeat")(*())

I'll take it out...

@jankatins
Copy link
Contributor Author

@jreback Hopefully addressed your comments. Thanks for the review!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not yet be a categorical series

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops...

@jorisvandenbossche
Copy link
Member

@JanSchulz Nice! I just added a few minor doc comments

@jankatins
Copy link
Contributor Author

@jorisvandenbossche Addressed your comments. Thanks for the review!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still not correct from a user perspective. What you want / need to know is that you are NOT getting a categorical type back.

The fact that this can equally be done ONLY on the categories and manually constructing a new categorical is important ,but should be separate from this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is explained in the text above the example?
But I agree it can maybe be repeated explicitly in the note, as this note is what stands out from the text

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to have this note only on the performance implications, nothing to do with the return type. IMO this could also go into a note in the string method docs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think this would be gr8 on the doc-strings as well (you can do a follow up) for that if you'd like

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, will add the same para on to of text.rst

@jreback jreback added this to the 0.17.1 milestone Nov 17, 2015
@jreback
Copy link
Contributor

jreback commented Nov 17, 2015

couple comments. pls squash when finished.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be ok to make (half of ) this paragraph into note and insert a (and not of type "category") like this:

.. note::

    The returned ``Series`` (or ``DataFrame``) is of the same type as if you used the 
    ``.str.<method>`` / ``.dt.<method>`` on a ``Series`` of that type (and not of type ``category``). 

That means, that in the above case, the returned values of ``str_cat.str.contains("a")`` and ``str_s.str.contains("a")`` or ``date_cat.dt.day`` and ``date_s.dt.day`` will be equal:

...

@jankatins
Copy link
Contributor Author

@jreback: please no squash: I really put some work into making each commit stand on its own, the unit of change is each commit (4 of them...) and not the complete PR.

@jankatins
Copy link
Contributor Author

Ok, addressed the comments, please review the doc change if that is clearer now.

@jreback
Copy link
Contributor

jreback commented Nov 17, 2015

ok, lgtm. ping on green.

on the squash; ok for this one :) normally we just squashem as a matter of course.

@jreback
Copy link
Contributor

jreback commented Nov 17, 2015

@JanSchulz ok, ping when green (after any more doc changes you have)

If a series is a type category and the underlying Categorical has
categories of type string, then make it possible to use the `.str`
assessor on such a series.

The string methods work on the categories (and therefor fast if we have
only a few categories), but return a Series with a dtype other than
category (boolean, string,...), so that it is no different if we use
`.str` on a series of type string or of type category.
If a series is a type category and the underlying Categorical has
categories of type datetime, then make it possible to use the .dt
assessor on such a series.

The string methods work on the categories (and therefore fast if we have
only a few categories), but return a Series with a dtype other than
category (integer,...), so that it is no different if we use
.dt on a series of type datetime or of type category.
@jankatins
Copy link
Contributor Author

Ok, added something to text.rst. Please review. It would also good if someone with access to travis would cancel the other builds so that we have a few hours less to wait for this :-)

Also add some docs in text.rst to mention the performance gains
when using ``s_cat.str`` vs ``s.str``.
@jankatins
Copy link
Contributor Author

@jreback ping :-)

@jreback
Copy link
Contributor

jreback commented Nov 18, 2015

merged via 6eefe75

thanks!

as usual, pls review built docs for accuracy

@jreback jreback closed this Nov 18, 2015
@jankatins
Copy link
Contributor Author

@jreback Thanks for all the reviews and for your patience! The PR got much better thanks to you and @jorisvandenbossche!

Will add a PR forthis:

In [124]: date_s = pd.Series(date_range('1/1/2015', periods=5))
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-124-f7c79f210480> in <module>()
----> 1 date_s = pd.Series(date_range('1/1/2015', periods=5))

NameError: name 'date_range' is not defined

@jreback
Copy link
Contributor

jreback commented Nov 18, 2015

@JanSchulz no thank you!

great PR as always from you.

I got the doc fix, was making some corrections anyhow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants