Skip to content

Conversation

@topper-123
Copy link
Contributor

This issue was found when looking for a solution to #20395. I've found that CategoricalIndex._create_categorical makes an unnecessary call to Categorical._set_dtype when passed a dtype that is equal to self.dtype:

>>> n = 100_000
>>> ci = pd.CategoricalIndex(list('a'*n + 'b'*n + 'c'*n))
>>> %timeit ci._create_categorical(ci, ci)
197 µs  # master and this PR
>>> %timeit ci._create_categorical(ci, ci, dtype=ci.dtype)
1.92 ms  # master
197 µs  # this PR

Internally, some operations in Pandas pass self.dtype to CategoricalIndex._create_categorical and onwards to _set_dtype, which is a very slow code path.

By avoiding calling _set_dtype unnecessarily, some operations become faster. For example:

>>> df = pd.DataFrame(dict(A=range(n*3)), index=ci)
>>> %timeit df.loc['b']
3.55 ms # master
2.14 ms  # this PR

As Categoricalndex._create_categorical is called directly or indirectly by various methods (e.g. CategoricalIndex._shallow_copy), there are probably other places where this speedup is relevant also.

@codecov
Copy link

codecov bot commented Jun 16, 2018

Codecov Report

Merging #21506 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21506      +/-   ##
==========================================
+ Coverage   91.92%   91.92%   +<.01%     
==========================================
  Files         153      153              
  Lines       49587    49589       +2     
==========================================
+ Hits        45583    45585       +2     
  Misses       4004     4004
Flag Coverage Δ
#multiple 90.32% <100%> (ø) ⬆️
#single 41.89% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/category.py 97.11% <100%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9e982e1...0db3e02. Read the comment docs.

@gfyoung gfyoung added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Jun 16, 2018
@gfyoung
Copy link
Member

gfyoung commented Jun 16, 2018

@topper-123 : Nice! Can you post asv benchmark timings (to make sure there were no regressions introduced) if you have a chance?

@topper-123
Copy link
Contributor Author

Hmm, there seem to be introduced some performance regressions by this PR that I don't quite understand. I will close this PR until I've figured it out, so as not to clutter PR quoue...

@topper-123 topper-123 closed this Jun 16, 2018
@topper-123 topper-123 deleted the CategoricalIndex._shallow_copy branch September 20, 2018 21:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants