-
-
Notifications
You must be signed in to change notification settings - Fork 19.4k
REF: Back IntervalArray by array instead of Index #36310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/arrays/interval.py
Outdated
| right = self.right.fillna(value=value.right) | ||
| from pandas import Index | ||
|
|
||
| left = Index(self.left).fillna(value=value.left) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
umm why do you need to coerce to .fillna?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ATM self.left is an ndarray which doesnt have fillna
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh
|
AFAICT remaining test failures are on from arrow |
|
cc @jorisvandenbossche are the pyarrow failures here something we can address on our end? |
|
It are our own tests (testing the conversion that is implemented in pandas itself), so that's something that need to be fixed in this PR. Looking at the failure, it seems that setting a missing value in an IntervalArray does not/no longer set a missing value in both left and right arrays. But looking at the changes in |
jorisvandenbossche
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For backwards compatibility, we could keep .left and .right returning an Index? (since the arrays are actually stored as ._left and ._right)
|
Nice idea, +1 to changing the storage. We shouldn't change |
pandas/_testing.py
Outdated
| assert_index_equal(left.right, right.right, exact=exact, obj=f"{obj}.left") | ||
| if left._left.dtype.kind in ["m", "M"]: | ||
| # We have a DatetimeArray or Timed | ||
| # TODO: `exact` keyword? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe better to
kwargs = {}
if left._left.dtype.kind in ["m", "M"]:
kwargs['check_freq'] = False
....
pandas/core/arrays/interval.py
Outdated
| new_right = self.right.astype(dtype.subtype) | ||
| # We need to use Index rules for astype to prevent casting | ||
| # np.nan entries to int subtypes | ||
| new_left = Index(self._left).astype(dtype.subtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy=False ?
pandas/core/arrays/interval.py
Outdated
| fill_value = self.left._na_value | ||
| from pandas import Index | ||
|
|
||
| fill_value = Index(self._left)._na_value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't we get the fill value rather than doing this? if we need to do this add copy=False
pandas/core/indexes/interval.py
Outdated
|
|
||
| @cache_readonly | ||
| def left(self) -> Index: | ||
| return Index(self._values.left) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy=False on these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add this?
|
i think comments have been addressed, LMK if i missed anything |
| _mask = None | ||
|
|
||
| _data: IntervalArray | ||
| _values: IntervalArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
umm, now i am confused, what is different about these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
without this, mypy thinks _values is ExtensionArray and has a bunch of new complaints since we access self._values.left below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kk, should try to remove this at some point
pandas/core/indexes/interval.py
Outdated
|
|
||
| @cache_readonly | ||
| def left(self) -> Index: | ||
| return Index(self._values.left) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add this?
|
this is marked POC, but i assume want to merge it? |
Yes, updated the title. |
|
any more comments here? |
TomAugspurger
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good generally. Just one change needed in the whatsnew.
This should improve the performance of IntervalArray.__setitem__, right? That could be added in the release notes.
doc/source/whatsnew/v1.2.0.rst
Outdated
| - `Styler` now allows direct CSS class name addition to individual data cells (:issue:`36159`) | ||
| - :meth:`Rolling.mean()` and :meth:`Rolling.sum()` use Kahan summation to calculate the mean to avoid numerical problems (:issue:`10319`, :issue:`11645`, :issue:`13254`, :issue:`32761`, :issue:`36031`) | ||
| - :meth:`DatetimeIndex.searchsorted`, :meth:`TimedeltaIndex.searchsorted`, :meth:`PeriodIndex.searchsorted`, and :meth:`Series.searchsorted` with datetimelike dtypes will now try to cast string arguments (listlike and scalar) to the matching datetimelike type (:issue:`36346`) | ||
| - :func:`pandas._testing.assert_datetime_array_equal` and :func:`pandas._testing.assert_timedelta_array_equal` now have a ``check_freq=True`` keyword that allows disabling the check for matching ``freq`` attribute (:issue:`36310`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docs should only reference public functions. How would users actually pass this through a public API? I suspect it's impossible, since check_freq in assert_frame_equal applies to the index rather than values of an array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So assert_extension_array_equal could perhaps take kwargs and pass it through. But that's maybe not worth the effort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so remove this note?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, unless this is user facing in some way (e.g. is assert_index_equal changed)?
| if self.closed != other.closed: | ||
| return np.zeros(len(self), dtype=bool) | ||
| return (self.left == other.left) & (self.right == other.right) | ||
| return (self._left == other.left) & (self._right == other.right) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is other here known to be a particular type (like IntervalArray), or is it something like Union[IntervalArray, Series, Index,Interval]. If it's just IntervalArray it'd be a bit faster to compare with other._left and right._left.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at this point other can be an Interval or IntervalArray
note added to whatsnew. |
doc/source/whatsnew/v1.2.0.rst
Outdated
| - `Styler` now allows direct CSS class name addition to individual data cells (:issue:`36159`) | ||
| - :meth:`Rolling.mean()` and :meth:`Rolling.sum()` use Kahan summation to calculate the mean to avoid numerical problems (:issue:`10319`, :issue:`11645`, :issue:`13254`, :issue:`32761`, :issue:`36031`) | ||
| - :meth:`DatetimeIndex.searchsorted`, :meth:`TimedeltaIndex.searchsorted`, :meth:`PeriodIndex.searchsorted`, and :meth:`Series.searchsorted` with datetimelike dtypes will now try to cast string arguments (listlike and scalar) to the matching datetimelike type (:issue:`36346`) | ||
| - :func:`pandas._testing.assert_datetime_array_equal` and :func:`pandas._testing.assert_timedelta_array_equal` now have a ``check_freq=True`` keyword that allows disabling the check for matching ``freq`` attribute (:issue:`36310`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, unless this is user facing in some way (e.g. is assert_index_equal changed)?
pandas/_testing.py
Outdated
| # We have a DatetimeArray or TimedeltaArray | ||
| kwargs["check_freq"] = False | ||
|
|
||
| # TODO: `exact` keyword? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this TODO needs to be solved first? It was there before, but you now removed it? (so is not being ignored?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we're ok without the keyword, will remove the comment
| from pandas.core.ops.array_ops import maybe_upcast_datetimelike_array | ||
|
|
||
| left = maybe_upcast_datetimelike_array(left) | ||
| left = extract_array(left, extract_numpy=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Above we are first ensuring that the arrays passed to _simple_new are an index, and then we extract the array again. Is this roundtrip to index and back needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can avoid the roundtrip eventually, will be best accomplished by being stricter in what we pass to _simple_new
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, would be a nice follow-up
pandas/core/arrays/interval.py
Outdated
| right._values[key] = value_right | ||
| self._right = right | ||
| self._left[key] = value_left | ||
| self._right[key] = value_right # TODO: needs tests for not breaking views |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the un-xfail-ed test doing that?
(or if not, can you add a test now?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, will remove comment
|
updated per comments |
jreback
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comment
| new_right = self.right.astype(dtype.subtype) | ||
| # We need to use Index rules for astype to prevent casting | ||
| # np.nan entries to int subtypes | ||
| new_left = Index(self._left, copy=False).astype(dtype.subtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could add copy=False to .astype (not sure how much any of this matters though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
merge is ok now and can see if this matters on followup
| _mask = None | ||
|
|
||
| _data: IntervalArray | ||
| _values: IntervalArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kk, should try to remove this at some point
The benefit I have in mind here is that we could back it by a single 2xN array and a) avoid the kludge needed to make
__setitem__atomic, b) do a view to get native types for e.g uniqueness checks, c) possibly share some methods with NDarrayBackedExtensionArray.Also just in principle having EAs not depend on Index is preferable dependency-structure-wise.
cc @jschendel