@pitrou pitrou commented Jan 11, 2021

  • Reject duplicates in SetLookupOptions::value_set, because otherwise the indices returned by index_in would be relative to a deduplicated value_set.

  • Honour SetLookup::skip_nulls in is_in.

  • Revamp tests.
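
To illustrate the first point, here is a minimal pure-Python sketch (not Arrow's actual implementation) of how a hash-based lookup that silently deduplicates value_set would return indices that no longer match positions in the original value_set. The helper name `index_in_dedup` is hypothetical and exists only for this illustration.

```python
def index_in_dedup(values, value_set):
    """Toy model: map each value to its position in a deduplicated value_set."""
    positions = {}
    for v in value_set:
        # Only the first occurrence gets a slot, as a hash-based memo table would do.
        if v not in positions:
            positions[v] = len(positions)
    # Values that are absent from value_set map to None.
    return [positions.get(v) for v in values]


value_set = [10, 20, 20, 30]
result = index_in_dedup([30], value_set)
# 30 sits at position 3 of value_set, but at position 2 once duplicates are dropped.
print(result, value_set.index(30))
```

Rejecting duplicates up front avoids having to choose between these two numberings.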

@pitrou pitrou requested a review from bkietz January 11, 2021 19:17
@pitrou pitrou force-pushed the ARROW-10663-set-lookup-null branch from 2110abd to ba20782 Compare January 11, 2021 19:33

@jorisvandenbossche jorisvandenbossche left a comment

Apparently the Python bindings use skip_null instead of skip_nulls; maybe we could take the opportunity to fix that as well.

@@ -290,17 +290,19 @@ Result<Datum> KleeneAndNot(const Datum& left, const Datum& right,
/// \brief IsIn returns true for each element of `values` that is contained in
/// `value_set`
///
/// If null occurs in left, if null count in right is not 0,
/// it returns true, else returns null.
/// Behaviour of nulls is governed by SetLookupOptions::skip_nulls.

Could mention here that the default is skip_nulls=False (or in the docstring of SetLookupOptions)?

Also, could we maybe add a similar sentence, "Behaviour of nulls is governed by SetLookupOptions::skip_nulls", to is_in_doc?

@@ -272,10 +278,9 @@ struct IsInVisitor {

Status Visit(const DataType&) {
const auto& state = checked_cast<const SetLookupState<NullType>&>(*ctx->state());
// XXX should skip_nulls be taken into account?

It's a bit of a strange corner case, but for consistency, probably yes?

Right now, for a null-type array, the skip_nulls argument has no effect (using this PR):

In [27]: pc.is_in(pa.array([None]), value_set=pa.array([None]), skip_null=False)
Out[27]: 
<pyarrow.lib.BooleanArray object at 0x7fac1d800880>
[
  true
]

In [28]: pc.is_in(pa.array([None]), value_set=pa.array([None]), skip_null=True)
Out[28]: 
<pyarrow.lib.BooleanArray object at 0x7fac1d800ca0>
[
  true
]

whereas if you give the arrays a non-null type but use the same values, you get a different result:

In [30]: pc.is_in(pa.array([None], type="int64"), value_set=pa.array([None], type="int64"), skip_null=False)
Out[30]: 
<pyarrow.lib.BooleanArray object at 0x7fac1d82f880>
[
  true
]

In [31]: pc.is_in(pa.array([None], type="int64"), value_set=pa.array([None], type="int64"), skip_null=True)
Out[31]: 
<pyarrow.lib.BooleanArray object at 0x7fac1d7c2040>
[
  false
]
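
As a rough model of the semantics being discussed, here is a pure-Python sketch in which None stands in for null. It is not how Arrow implements is_in; the function below is a hypothetical stand-in that only reproduces the int64 results shown above, which is presumably what the null-type case should also return for consistency.

```python
def is_in(values, value_set, skip_nulls=False):
    """Toy model of lookup semantics for is_in; None stands in for null."""
    lookup = {v for v in value_set if v is not None}
    null_in_set = any(v is None for v in value_set)
    out = []
    for v in values:
        if v is None:
            # A null input matches only if value_set holds a null
            # and nulls are not being skipped.
            out.append(null_in_set and not skip_nulls)
        else:
            out.append(v in lookup)
    return out


print(is_in([None], [None], skip_nulls=False))
print(is_in([None], [None], skip_nulls=True))
```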

Member Author

Good point.

jorisvandenbossche commented Jan 12, 2021

Thanks for the PR!
One additional thing I am wondering about while testing it out: would we ever want a behaviour where a null in the input gives a null in the output? Right now it's only possible to get false (if there is no null in the value_set, or if skip_nulls=True) or true (if there is a null in the value_set and skip_nulls=False).

So something like isin([1, 2, null], value_set=[1, 3]) -> [true, false, null] (instead of [true, false, false]).

If you see "isin" as a shortcut for writing multiple equality comparisons (isin(input, value_set=[1, 3, ...]) -> (input == 1) | (input == 3) | ...), then you would get such behaviour.
So the question is whether "isin" should use "equality" semantics or "identity/lookup" semantics for nulls. Given that it's now called "SetLookup" in the function names, we clearly go for the second, but I am not fully sure which of the two is most useful / expected in practice.
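
For comparison, the "equality" reading can be sketched in pure Python as a chain of null-propagating comparisons combined with a Kleene OR. This is only an illustration of the alternative semantics, not something Arrow provides under this name; `kleene_or`, `eq` and `is_in_equality` are hypothetical helpers, and None stands in for null.

```python
def kleene_or(a, b):
    """Three-valued OR: true wins, otherwise null is contagious."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def eq(a, b):
    """Null-propagating equality: comparing with null gives null."""
    return None if a is None or b is None else a == b


def is_in_equality(values, value_set):
    """isin as (input == s1) | (input == s2) | ... under Kleene logic."""
    out = []
    for v in values:
        acc = False
        for s in value_set:
            acc = kleene_or(acc, eq(v, s))
        out.append(acc)
    return out


print(is_in_equality([1, 2, None], [1, 3]))
```

Under this reading the null input yields null rather than false, which is the difference being asked about.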

pitrou commented Jan 12, 2021

@jorisvandenbossche I have no idea. Perhaps @bkietz or @michalursa can share their opinion.

@jorisvandenbossche

Looking at the behaviour of %in% in R (cc @nealrichardson), NAs also get matched there (e.g. c(1, 2, NA) %in% c(1, 3) gives true, false, false and c(1, 2, NA) %in% c(1, 3, NA) gives true, false, true), so that is consistent with the behaviour we have in Arrow right now.

The SQL IN operator does not seem to match nulls, because there it is shorthand for multiple comparisons. But in practice (as far as I know, with only limited SQL knowledge) you can only use it in a WHERE clause, so whether a null in the column gives false or null doesn't matter much: in both cases the row is not preserved by a WHERE filter.
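
A quick way to check the SQL side is Python's built-in sqlite3 module. This is a small sketch against SQLite only, so other databases may differ in detail; the table name `t` and column `x` are made up for the example. SQLite, at least, also accepts IN in a projection, which makes the three-valued result visible next to the WHERE filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,)])

# IN evaluated per row: 1 for a match, 0 for no match, NULL for a NULL input.
projected = conn.execute(
    "SELECT x, x IN (1, 3) FROM t ORDER BY rowid"
).fetchall()

# In a WHERE clause, rows where IN gives 0 or NULL are both dropped.
filtered = conn.execute(
    "SELECT x FROM t WHERE x IN (1, 3) ORDER BY rowid"
).fetchall()

print(projected)
print(filtered)
conn.close()
```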

@jorisvandenbossche

(BTW, since this is existing behaviour and this PR doesn't change it, this should certainly not hold up the PR; we should probably move the discussion to a JIRA issue.)

@bkietz bkietz left a comment

LGTM

Thanks for rewriting the tests

RETURN_NOT_OK(AddArrayValueSet(*options.value_set.array()));
} else if (options.value_set.kind() == Datum::CHUNKED_ARRAY) {
const ChunkedArray& value_set = *options.value_set.chunked_array();
for (const std::shared_ptr<Array>& chunk : value_set.chunks()) {

We have a lot of code like this; maybe later we should add Datum::chunks() so we can write

if (!options.value_set.is_arraylike()) {
  return Status::Invalid("value_set should be an array or chunked array");
}
for (const std::shared_ptr<ArrayData>& chunk : options.value_set.chunks()) {
  RETURN_NOT_OK(AddArrayValueSet(*chunk));
}

Member Author

If we're not too bothered by the cost of a temporary vector then it may be nice indeed.

pitrou commented Jan 12, 2021

The CI failures are spurious. Green Travis-CI build: https://travis-ci.com/github/pitrou/arrow/builds/212852090
