-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-10663: [C++] Fix is_in and index_in behaviour #9164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
2110abd to
ba20782
Compare
* Reject duplicates in SetLookupOptions::value_set, because otherwise the indices returned by index_in would be relative to a deduplicated value_set. * Honour SetLookup::skip_nulls in is_in. * Revamp tests.
jorisvandenbossche
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently, in the Python bindings we used skip_null instead of skip_nulls, maybe could take the opportunity to fixup that as well
cpp/src/arrow/compute/api_scalar.h
Outdated
| @@ -290,17 +290,19 @@ Result<Datum> KleeneAndNot(const Datum& left, const Datum& right, | |||
| /// \brief IsIn returns true for each element of `values` that is contained in | |||
| /// `value_set` | |||
| /// | |||
| /// If null occurs in left, if null count in right is not 0, | |||
| /// it returns true, else returns null. | |||
| /// Behaviour of nulls is governed by SetLookupOptions::skip_nulls. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could mention here that the default is skip_nulls=False (or in the docstring of SetLookupOptions)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, could maybe add a similar sentence "Behaviour of nulls is governed by SetLookupOptions::skip_nulls" to the is_in_doc ?
| @@ -272,10 +278,9 @@ struct IsInVisitor { | |||
|
|
|||
| Status Visit(const DataType&) { | |||
| const auto& state = checked_cast<const SetLookupState<NullType>&>(*ctx->state()); | |||
| // XXX should skip_nulls be taken into account? | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit strange corner case, but for consistency probably yes?
Right now for null array the skip_nulls argument has no effect (using this PR):
In [27]: pc.is_in(pa.array([None]), value_set=pa.array([None]), skip_null=False)
Out[27]:
<pyarrow.lib.BooleanArray object at 0x7fac1d800880>
[
true
]
In [28]: pc.is_in(pa.array([None]), value_set=pa.array([None]), skip_null=True)
Out[28]:
<pyarrow.lib.BooleanArray object at 0x7fac1d800ca0>
[
true
]
while if you add a non-null type to the array creation, but use the same values, you get a different result:
In [30]: pc.is_in(pa.array([None], type="int64"), value_set=pa.array([None], type="int64"), skip_null=False)
Out[30]:
<pyarrow.lib.BooleanArray object at 0x7fac1d82f880>
[
true
]
In [31]: pc.is_in(pa.array([None], type="int64"), value_set=pa.array([None], type="int64"), skip_null=True)
Out[31]:
<pyarrow.lib.BooleanArray object at 0x7fac1d7c2040>
[
false
]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
|
Thanks for the PR! So something like If you see "isin" as a shortcut to write multiple equality comparisons ( |
ba20782 to
447c1c3
Compare
|
@jorisvandenbossche I have no idea. Perhaps @bkietz or @michalursa can share their opinion. |
|
Looking at the behavior of The SQL |
|
(BTW, since this is existing behaviour, and this PR doesn't change that, that should certainly not hold up this PR, and should probably move the discussion to a JIRA) |
bkietz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks for rewriting the tests
| RETURN_NOT_OK(AddArrayValueSet(*options.value_set.array())); | ||
| } else if (options.value_set.kind() == Datum::CHUNKED_ARRAY) { | ||
| const ChunkedArray& value_set = *options.value_set.chunked_array(); | ||
| for (const std::shared_ptr<Array>& chunk : value_set.chunks()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a lot of code like this, maybe later we should add Datum::chunks() so we can write
if (!options.values_set.is_arraylike()) {
return Status::Invalid("value_set should be an array or chunked array");
}
for (const std::shared_ptr<ArrayData>& chunk : options.value_set.chunks()) {
RETURN_NOT_OK(AddArrayValueSet(*chunk->data()));
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're not too bothered by the cost of a temporary vector then it may be nice indeed.
|
The CI failure are spurious. Green Travis-CI build: https://travis-ci.com/github/pitrou/arrow/builds/212852090 |
Reject duplicates in SetLookupOptions::value_set, because otherwise the indices returned by index_in would be relative to a deduplicated value_set.
Honour SetLookup::skip_nulls in is_in.
Revamp tests.