-
-
Notifications
You must be signed in to change notification settings - Fork 19.4k
BUG: Rolling groupby should not maintain the by column in the resulting DataFrame #32332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WillAyd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a look! @mroeschke might also have thoughts
3be9d0e to
1ca73fd
Compare
|
@WillAyd thanks for your review - I've written it a different way (so no side-effects this time) |
WillAyd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - this looks like a better approach. Just wondering if there's something even cleaner out there (a lot of the groupby stuff is a mess as is...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my comments. The OP is only partially correct. If you do .rolling.groupby (or any of the variants) and .apply then you will get back the grouping column, just like groupby.apply does; so this should not change.
e.g.
In [1]: df = pd.DataFrame({'A': [1, 1, 2], 'B': [1, 2, 3]})
In [2]: df.groupby('A').sum()
Out[2]:
B
A
1 3
2 3
In [3]: df.groupby('A').apply(lambda x: x.sum())
Out[3]:
A B
A
1 2 3
2 2 3
The issue here is that .rolling.groupby happens to use .apply as an implementation detail, so that's the behavior you get. I am open to patching this only there.
|
Thanks @jreback - that probably simplifies the patch then. What about if we use This is the output that's currently being tested for in |
|
@jreback have pushed a new fix. Have had to update some tests which check output against |
jreback
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will affect groupby.rolling and groupby.resample but NOT groupby.apply
pandas/core/window/rolling.py
Outdated
| def _shallow_copy(self, obj: FrameOrSeries, **kwargs) -> ShallowMixin: | ||
| if isinstance(obj, ABCDataFrame): | ||
| try: | ||
| new_obj = super()._shallow_copy( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WAT?
lint wip clean re-write add comment
pandas/core/window/rolling.py
Outdated
| self.exclusions = kwargs.get("exclusions", set()) | ||
|
|
||
| def _shallow_copy(self, obj: FrameOrSeries, **kwargs) -> ShallowMixin: | ||
| if isinstance(obj, ABCDataFrame): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WAT? i would much rather properly copy in the super class. really avoid if/then cases unless absolutely necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IOW we can always just copy exclusions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback sure, I can remove the if-then cases here, but then I need to potentially remove elements from self.exclusions in
def _obj_with_exclusions(self)
, otherwise this could throw an error if there are elements of self.exclusions that are no longer part of self.obj.columns:
return self.obj.drop(self.exclusions, axis=1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway, that's what I've done in the latest commit
pandas/core/window/rolling.py
Outdated
| def _shallow_copy(self, obj: FrameOrSeries, **kwargs) -> ShallowMixin: | ||
| exclusions = self.exclusions | ||
| new_obj = super()._shallow_copy(obj, exclusions=exclusions, **kwargs) | ||
| new_obj.obj = new_obj._obj_with_exclusions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you sure this line is actually needed? this is very very odd to do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copying the exclusions should be enough. you maybe be able to move this shallow copy to pandas/core/groupby/groupby.py which is better than here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for doing this was that RollingGroupby._groupby._python_apply_general operates on RollingGroupby._groupby._selected_obj.
And RollingGroupby._groupby._selected_obj returns RollingGroupby._groupby.obj, possibly sliced by RollingGroupby._groupby._selection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you maybe be able to move this shallow copy to pandas/core/groupby/groupby.py
As in, define _GroupBy._shallow_copy? I don't understand how that would work: when we call Rolling.sum, we go to _Rolling_and_Expanding.sum and then WindowGroupByMixin._apply. Inside there, we have
def f(x, name=name, *args):
x = self._shallow_copy(x)
if isinstance(name, str):
return getattr(x, name)(*args, **kwargs)
return x.apply(name, *args, **kwargs)
Here, self is WindowGroupByMixin. In the case when we get here from Rolling.sum, then self._shallow_copy will take us directly to ShallowMixin._shallow_copy. Therefore, even if we define _GroupBy._shallow_copy, the execution won't get there during this operation.
EDIT
Anyway, am trying a slightly different approach, will ping if/when green and once I've thought over why corr and cov currently use ._selected_obj instead of ._obj_with_exclusions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback thanks for your review. I've pushed a new attempt at a solution.
I patches obj before we get to _GroupBy.apply, and only for Rolling so that previous behaviour in DataFrame.groupby.apply.
…play dataframe in whatsnew entry separately from before vs after
jreback
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not going well. there is mutation all over the place. This needs a good clean solution to get merged.
| GroupBy.rolling no longer returns grouped-by column in values | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add an expl here, a couple of sentences; this the issue number.
|
|
||
| # there may be elements in self.exclusions that are no longer | ||
| # in self.obj, see GH 32468 | ||
| nlevels = self.obj.columns.nlevels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh? why are you doing all of this
| def f(x): | ||
| x = self._shallow_copy(x, groupby=self._groupby) | ||
| # patch for GH 32332 | ||
| x.obj = x._obj_with_exclusions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really don't like this
|
Yes, I see what you mean - to be honest I can't think of a clean solution, and have spent a while on it already. I'll close this for now so someone else can take it up |
black pandasgit diff upstream/master -u -- "*.py" | flake8 --diffDoes this seem like the correct idea? If so, I'll expand on tests / write whatsnew / think about a cleaner patch
EDIT
lots of unwanted changes here, just seeing what tests fail
EDIT
need to check the types make sense and see what's the proper way to write