-
-
Notifications
You must be signed in to change notification settings - Fork 19.4k
ENH: Add sort parameter to set operations for some Indexes and adjust… #24521
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add sort parameter to set operations for some Indexes and adjust… #24521
Conversation
| return self._shallow_copy(result, name=name, freq=None, tz=self.tz) | ||
|
|
||
| def intersection(self, other): | ||
| def intersection(self, other, sort=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sort parameter is not fully implemented for intersection on DatetimeIndex yet. I just put this in for now to fix some failing tests
Codecov Report
@@ Coverage Diff @@
## master #24521 +/- ##
==========================================
+ Coverage 31.88% 31.9% +0.01%
==========================================
Files 166 166
Lines 52434 52444 +10
==========================================
+ Hits 16718 16730 +12
+ Misses 35716 35714 -2
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #24521 +/- ##
==========================================
- Coverage 92.38% 92.38% -0.01%
==========================================
Files 166 166
Lines 52382 52392 +10
==========================================
+ Hits 48395 48404 +9
- Misses 3987 3988 +1
Continue to review full report at Codecov.
|
jreback
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks pretty good.
pandas/core/indexes/base.py
Outdated
| try: | ||
| taken = taken.sort_values() | ||
| except TypeError as e: | ||
| warnings.warn("%s, sort order is undefined for " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this actually get hit? I really hate all of these sort warnings. maybe can you factor them out in a helper function instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use .format here
|
so we can followup up on the 2 cases you mentioned at the top of the PR. pls roll them into an issue (you can expand with check boxes if more clear). |
f3b517a to
f393803
Compare
| join_index = self.intersection(other, sort=False) | ||
| elif how == 'outer': | ||
| join_index = self.union(other, sort=sort) | ||
| join_index = self.union(other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On Master union does generally sort by default so I have left the default sort=True here for now for compatibility reasons so the current behaviour of join(..., how='outer') does not change for now and all of the tests expecting the results of this type of join to be sorted don't break.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a TODO
|
Updated the original issue (#24471) with some of the corner cases I mention above. I will try to address these in subsequent PRs once this PR is finished. |
| labels = ensure_index(labels.unique()) | ||
| if other is not None: | ||
| labels = ensure_index(other.unique()) & labels | ||
| labels = ensure_index(other.unique()).intersection(labels, sort=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, on Master intersection does not generally sort by default so I have added sort=False here for compatibility
pandas/tests/indexes/test_base.py
Outdated
| if PY3 and sort: | ||
| # unorderable types | ||
| warn_type = RuntimeWarning | ||
| expected = Index([0, 'a', 1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment Index.intersection does not use sorting.safe_sort which tries to sort arrays of mixed type (int and str) by ordering int before str in Python 3. I could try to change this either in this PR or in a separate one if this is getting too big already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decided to use sort_safe in Index.intersection in this PR in the next commit
jreback
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small comments. ping on green.
| pass | ||
| if sort: | ||
| try: | ||
| uniques.sort() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might see if using safe_sort makes sense here (you would have to import here, otherwise this would e circular)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried using safe_sort here but it was causing some problems. The issue seems to be that uniques here is a list of tuples which are then used to construct a MultiIndex in MultiIndex.union. However, when we use safe_sort it turns uniques into an np.array and doesn't sort it correctly so we get the wrong results. I can have a closer look at trying to resolve this if you want but it might involve changing safe_sort a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, let's followup up later then (new PR).
pandas/core/indexes/base.py
Outdated
| if sort: | ||
| try: | ||
| taken = sorting.safe_sort(taken.values) | ||
| if self.name != other.name: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use
_get_reconciled_name_object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried using get_reconciled_name_object but from the docstring it says that it's used when the "result of the set operation will be self, return self, unless the name changes, in which case make a shallow copy of self". But the result of the set operation is not self in this case, I think.
pandas/core/indexes/base.py
Outdated
| try: | ||
| taken = taken.sort_values() | ||
| except TypeError as e: | ||
| warnings.warn("%s, sort order is undefined for " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
| join_index = self.intersection(other, sort=False) | ||
| elif how == 'outer': | ||
| join_index = self.union(other, sort=sort) | ||
| join_index = self.union(other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a TODO
| pass | ||
| if sort: | ||
| try: | ||
| uniques.sort() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
pandas/core/indexes/base.py
Outdated
| result = sorting.safe_sort(result) | ||
| except TypeError as e: | ||
| warnings.warn("%s, sort order is undefined for " | ||
| "incomparable objects" % e, RuntimeWarning, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use .format() here
pandas/core/indexes/base.py
Outdated
| try: | ||
| taken = taken.sort_values() | ||
| except TypeError as e: | ||
| warnings.warn("%s, sort order is undefined for " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use .format here
d9e68be to
9549661
Compare
|
Hello @reidy-p! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on January 19, 2019 at 18:22 Hours UTC |
9549661 to
7b15248
Compare
jreback
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couple of comments, but let's do in a followup if you can (otherwise pls make an issue).
| pass | ||
| if sort: | ||
| try: | ||
| uniques.sort() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, let's followup up later then (new PR).
|
|
||
| taken = other.take(indexer) | ||
|
|
||
| if sort: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I still think we can use _get_reconcile_name_object here (maybe need to wrap taken in Index), but can do in a followup.
|
thanks @reidy-p |
|
@jreback thanks. I hope to do a few follow-up PRs on this. |
|
Was it intended to change the default behavior of |
|
all of the set ops always sorted; the parameter now allows u to turn it off (default should be the same as before) |
|
This is definitely changed behavior from 0.23.x for intersection (it turned up in a failure in xarray's test suite), but I'll raise another issue about that. |
|
See #24959 for discussion about the behavior change. |
… tests
git diff upstream/master -u -- "*.py" | flake8 --diffThis PR makes some progress towards adding a
sortparameter with a default ofTrueto the set operations (union,intersection,differenceandsymmetric_difference) onIndexclasses. Adding this parameter into every type ofIndexwould result in a very big PR so I have decided to break it up and try to add it in stages. I have tried to focus onIndex,MultiIndex,PeriodIndexandIntervalndexin this PR but have made some very small changes to set operations on other indices where necessary.Some issues to consider:
Indextypes. For example, because of the way some of the set operations are implemented on someIndextypes the results may always be sorted even ifsort=Falseis passed (e.g., aunionoperation onDatetimeIndexs may always return a sorted result in some cases even ifsort=False). Perhaps this is not really a problem as long as we document it.sortparameter at the moment. For example, theintersectionof an unsortedIndexwith itself will return the original unsortedIndexeven ifsort=Trueis passed because thesortparameter is simply ignored in this case. Similarly, theintersectionof an unsortedIndexwith an emptyIndexwill also return the original unsortedIndexeven ifsort=True. There is similar behaviour for the other set operations.