BENCH: Fix CategoricalIndexing benchmark #38476
Conversation
jorisvandenbossche left a comment:
Thanks for fixing this!
```diff
 def time_get_indexer_list(self, index):
-    self.data.get_indexer(self.cat_list)
+    self.data_unique.get_indexer(self.cat_list)
```
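For context, a hedged sketch of how the benchmark's `setup` plausibly defines these attributes (only the names `data`, `data_unique`, and `cat_list` come from the diff; the sizes and construction are assumptions, not the actual asv source):

```python
import string
import numpy as np
import pandas as pd

N = 10 ** 5

# data: large CategoricalIndex with many repeated values; after GH#38372
# get_indexer raises on it because the index is not unique.
data = pd.CategoricalIndex(np.random.choice(list(string.ascii_lowercase), N))

# data_unique: small CategoricalIndex of unique values, which get_indexer
# still accepts.
data_unique = pd.CategoricalIndex(list(string.ascii_lowercase))

cat_list = ["a", "c"]
data_unique.get_indexer(cat_list)  # -> array([0, 2])
```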
What is the timing you get for this? Because `data_unique` is only a small index, while `data` uses `N = 10 ** 5`. We might want to make `data_unique` larger as well?
(There is a `rands` helper in `pandas._testing` to create random strings of a certain length.)
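For reference, a minimal sketch of how such a larger unique index could be built, assuming the `rands_array` helper that sits alongside `rands` in `pandas._testing` (the sizes here are illustrative, not from the PR):

```python
import pandas as pd
import pandas._testing as tm

# rands_array(nchars, size) draws `size` random strings of length `nchars`.
# With 10-character strings there are vastly more than 10 ** 5 possible
# values, so after dropping rare collisions the index stays near full size.
N = 10 ** 5
values = pd.unique(tm.rands_array(10, N))
data_unique = pd.CategoricalIndex(values)
assert data_unique.is_unique
```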
After #38372, `CategoricalIndex.get_indexer` raises an error if the index has duplicate values; so to keep this benchmark I think we need to use this smaller index.
You can still have more unique strings by combining multiple characters (which is what `rands` does). I just want to be sure that it's not a tiny benchmark that might not catch actual regressions.
Ah good point. Here's a short timeit of this benchmark.
```
In [3]: %timeit idx.get_indexer(['a', 'c'])
736 µs ± 3.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
If you think this is too tiny I can bump it up, though I think asv reports regressions as a percentage, correct?
I think that's too little. Based on a quick profile, we are mostly measuring the overhead here (converting ['a', 'c'] to an Index, checking the dtypes, etc.), which doesn't necessarily scale linearly with the size of the data, while the actual indexing operation is only about 10% of the time. So if we had a regression there that made it 2x slower, we would hardly notice it in this benchmark's timing.
Now, we could maybe also solve that by making `cat_list` longer instead of making `idx` longer (e.g. `int_list` has 10000 elements, while `cat_list` has only 2). Doing `cat_list = ['a', 'c'] * 5000` would probably already help, as sketched below.
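A minimal sketch of that suggestion (the class and setup shape are assumptions matching the diff above, not the actual asv source):

```python
import pandas as pd

class Indexing:
    def setup(self):
        # Small unique index, as required for get_indexer after GH#38372.
        self.data_unique = pd.CategoricalIndex(list("abcdef"))
        # A longer lookup list shifts the measured time toward the actual
        # indexing work and away from fixed per-call overhead.
        self.cat_list = ["a", "c"] * 5000

    def time_get_indexer_list(self):
        self.data_unique.get_indexer(self.cat_list)
```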
thanks @mroeschke
- `black pandas`
- `git diff upstream/master -u -- "*.py" | flake8 --diff`

This benchmark behavior changed after #38372.
Additionally, exploring whether it's feasible to always run the benchmarks to catch these changes earlier.