-
-
Notifications
You must be signed in to change notification settings - Fork 27
Support median in Groupby.aggregate #766
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
15 commits
Select commit
Hold shift + click to select a range
ea0bd1e
Adjust aggregate
hendrikmakait eb63cf7
Fixed column order
hendrikmakait b9fcb7e
Refactor
hendrikmakait fda7135
Docstring
hendrikmakait b5db3df
Fix
hendrikmakait 3c4cb84
Remove comment
hendrikmakait cdd51dd
Fix
hendrikmakait 695c824
Merge branch 'main' into median-aggregate
crusaderky e763874
Merge branch 'main' into median-aggregate
crusaderky 568b790
Wipe cache and rerun
crusaderky b483493
Move sort
hendrikmakait d453b65
Fixed column ordering in column projection
hendrikmakait 5ee6887
Minor
hendrikmakait 10f09f8
Reduce cyclomatic complexity
hendrikmakait 530fbe4
Merge branch 'main' into median-aggregate
hendrikmakait File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -447,6 +447,13 @@ def _divisions(self): | |
| def _chunk_cls_args(self): | ||
| return [] | ||
|
|
||
| @property | ||
| def should_shuffle(self): | ||
| sort = getattr(self, "sort", False) | ||
| return not ( | ||
| not isinstance(self.split_out, bool) and self.split_out == 1 or sort | ||
| ) | ||
|
|
||
| def _lower(self): | ||
| # Normalize functions in case not all are defined | ||
| chunk = self.chunk | ||
|
|
@@ -465,12 +472,11 @@ def _lower(self): | |
| combine = aggregate | ||
| combine_kwargs = aggregate_kwargs | ||
|
|
||
| sort = getattr(self, "sort", False) | ||
| split_every = getattr(self, "split_every", None) | ||
| chunked = self._chunk_cls( | ||
| self.frame, type(self), chunk, chunk_kwargs, *self._chunk_cls_args | ||
| ) | ||
| if not isinstance(self.split_out, bool) and self.split_out == 1 or sort: | ||
| if not self.should_shuffle: | ||
|
Comment on lines
-473
to
+479
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I like the change. I think we can remove |
||
| # Lower into TreeReduce(Chunk) | ||
| return TreeReduce( | ||
| chunked, | ||
|
|
@@ -496,7 +502,7 @@ def _lower(self): | |
| split_by=self.split_by, | ||
| split_out=self.split_out, | ||
| split_every=split_every, | ||
| sort=sort, | ||
| sort=getattr(self, "sort", False), | ||
| shuffle_by_index=getattr(self, "shuffle_by_index", None), | ||
| shuffle_method=getattr(self, "shuffle_method", None), | ||
| ignore_index=getattr(self, "ignore_index", True), | ||
|
|
||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this logic is just copied, but is it correct? We want to shuffle if split_out > 1 or if
sort is True(we can't use a tree reduction ifsort=True). Maybe this should be something like:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn`t cover the bool case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry my suggestion was off, but I was pointing out the sort check is wrong (I think).
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I remember operator precedence correctly,
is the same as
i.e., if
sort is True, the expression evaluates toTrueFalse. Yeah, that seems problematic.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
XREF #817
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we use a tree reduction? The modus is to reduce to one partition and then sort, so we can not shuffle if sort=True, we have to reduce to one partition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry for leaving a "this is a bug" comment, and then getting pulled away.
I think some of my confusion comes from that fact that dask/dask allows the tree reduction path to be used with
split_out > 1, but that algorithm will produce incorrect results ifsort=True.In dask-expr, we do not allow the tree reduction to be used with
split_out > 1, so the sort check seems unnecessary/confusing. Ifsplit_out ==, then we don't really care aboutsort.