PERF: fix some of .clip() performance regression by using numpy arrays where possible #24735
pandas/core/generic.py
with np.errstate(all='ignore'):
    if upper is not None:
        subset = self.values <= upper
Curious, does this play nice when self is the new Int64 dtype?
It appears to - let me know if there's a more extensive test you have in mind:
In [2]: s = pd.Series(range(5)).astype('Int64')
In [3]: s.clip(1, 3)
Out[3]:
0    1
1    1
2    2
3    3
4    3
dtype: Int64
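A slightly broader check could also include a missing value. The sketch below assumes missing entries should pass through `.clip()` unchanged and that the result should stay `Int64`; the data is made up for illustration and is not from the PR.

```python
import pandas as pd

# Nullable-integer Series with one missing entry (illustrative data only).
s = pd.Series([0, None, 2, 4], dtype="Int64")

# Clip to the closed interval [1, 3]; the missing entry is expected to stay missing.
clipped = s.clip(1, 3)
print(clipped)
```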
result = result.where(subset, lower, axis=None, inplace=False)
mask = isna(self.values)

with np.errstate(all='ignore'):
What's the point of this?
This is simply reverting back to what this block used to do; it's needed in the event values <= upper would otherwise raise a type error.
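For context, a minimal sketch of what `np.errstate(all='ignore')` does around a comparison like this one. NumPy documents `np.errstate` as a context manager for floating-point error handling; whether it also covers the type-error case mentioned above is not something this sketch demonstrates. The array and the division are illustrative assumptions, not code from the PR.

```python
import warnings

import numpy as np

values = np.array([1.0, np.nan, 5.0])  # illustrative data, not from the PR
upper = 3.0

with warnings.catch_warnings():
    # Turn every warning into an exception so a leaked warning would be visible.
    warnings.simplefilter("error")
    with np.errstate(all="ignore"):
        # NaN compares as False against any bound.
        subset = values <= upper
        # Division by zero normally emits a RuntimeWarning; errstate silences it.
        scaled = values / 0.0

print(subset)
```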
pandas/core/generic.py
with np.errstate(all='ignore'):
    if upper is not None:
        subset = self.values <= upper
Idiomatic approach here would now be to_numpy instead of .values
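A small sketch of the suggested spelling, assuming a plain float Series (the data is illustrative). `Series.to_numpy()` was added in pandas 0.24 and returns a NumPy array, so the comparison below yields a plain boolean `ndarray`.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 5.0])  # illustrative data, not from the PR

# to_numpy() hands back the underlying values as a NumPy array.
arr = s.to_numpy()

# The comparison therefore stays in NumPy and produces a boolean ndarray.
subset = arr <= 3.0
print(type(arr).__name__, subset.dtype)
```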
Updated
@qwhelan can you update
@qwhelan can you merge master and ping when passing
thanks @qwhelan
A recent change to respect dtypes in .clip() (#24458) introduced a decent overhead of ~2ms to the call.

This PR cuts the overhead from ~2ms to ~0.6ms by keeping subset as a numpy array; it's entirely boolean regardless of underlying dtype, so a DataFrame only adds overhead here.

git diff upstream/master -u -- "*.py" | flake8 --diff
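The idea in the description can be sketched roughly as follows. This is not the code in pandas/core/generic.py: `clip_upper_sketch` is a hypothetical helper, the frame is made-up data, and only the upper bound is handled. It builds the boolean mask as a NumPy array and only hands it to `DataFrame.where` at the end.

```python
import numpy as np
import pandas as pd

def clip_upper_sketch(frame, upper):
    """Hypothetical helper: clip an all-float frame from above via a NumPy mask."""
    values = frame.to_numpy()
    # Missing entries must be kept, so they count as "inside the bound".
    mask = np.isnan(values)
    with np.errstate(all="ignore"):
        # subset is a plain boolean ndarray, never a DataFrame.
        subset = (values <= upper) | mask
    # where() keeps entries where subset is True and writes `upper` elsewhere.
    return frame.where(subset, upper)

df = pd.DataFrame({"a": [0.0, 2.0, np.nan], "b": [5.0, 1.0, 3.0]})  # illustrative
result = clip_upper_sketch(df, 3.0)
print(result)
```

Because the mask is boolean whatever the column dtypes are, wrapping it in a DataFrame would add construction and alignment cost without adding information, which is the overhead the description refers to.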