Skip to content

Conversation

@l3robot
Copy link
Contributor

@l3robot l3robot commented Aug 25, 2023

Checklist

Explanation

So after a bit of research for #54550, I found that the source of the behavior is coming from PR #22826 at this line and particularly this comment here.

In the comment, @rtlee9 makes an excellent explanation for the reason behind the threshold. The default behavior is a linear search (O(n)) in the pandas index and #22826 added a binary search optimization for monotonically increasing index which is in O(m*log(n)) where m is the number of target indexes.

Proposed Solution

I'm proposing we increase the limit, which was set conservatively to 5. We could at least increase it to n / (2 * log2[n]) which is derivable from m*log2[n]<n and dividing per 2 for keeping a certain margin. I'll open a PR for that.

New results of benchmarking + homemade script

Here is a screenshot of asv continuous -f 1.1 -E virtualenv upstream/main HEAD -b ^indexing.*Index* output (done on Macbook Air m1, inside the docker container provided by pandas):

Capture d’écran, le 2023-08-23 à 09 05 41

Here is also a new output for the script use to demonstrate the problem in #54700:

With all at once: num_indexes=1 => 0.00048s
With all at once: num_indexes=2 => 0.00045s
With all at once: num_indexes=3 => 0.00038s
With all at once: num_indexes=4 => 0.00035s
With all at once: num_indexes=5 => 0.00035s
With all at once: num_indexes=6 => 0.00045s
With all at once: num_indexes=7 => 0.00042s
With all at once: num_indexes=8 => 0.00066s
With all at once: num_indexes=9 => 0.00049s
With all at once: num_indexes=10 => 0.00048s

Thanks in advance for the review!

@l3robot l3robot requested a review from WillAyd as a code owner August 25, 2023 02:31
if (
stargets and
len(stargets) < 5 and
len(stargets) < (n / (2 * n.bit_length())) and
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bit_lenght is just a fast way of computing the ceiling of log2.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! I think this is good

Performance improvements
~~~~~~~~~~~~~~~~~~~~~~~~
-
- Increase the threshold to use binary search instead of a linear search in IndexEngine :issue:`54550`
Copy link
Member

@mroeschke mroeschke Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Increase the threshold to use binary search instead of a linear search in IndexEngine :issue:`54550`
- Performance improvement when indexing with more than 4 keys (:issue:`54550`)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks a lot for the review.
done!

@mroeschke mroeschke added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Aug 25, 2023
@l3robot l3robot changed the title Increase threashold for using binary search in IndexEngine PERF: Increase threashold for using binary search in IndexEngine Aug 25, 2023
@mroeschke mroeschke added this to the 2.2 milestone Aug 26, 2023
@mroeschke mroeschke merged commit 48f5a96 into pandas-dev:main Aug 26, 2023
@mroeschke
Copy link
Member

Thanks @l3robot

@l3robot
Copy link
Contributor Author

l3robot commented Aug 26, 2023

Thanks @l3robot

Glad to contribute!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PERF: Unreliable performance of .loc with big non-unique index dataframe

4 participants