ARROW-12431: [Python] Mask is inverted when creating FixedSizeBinaryArray #10199
Not necessarily related to this PR (just noticed while looking at the code and testing a few things), but the conversion for fixed size binary is currently (on master) incorrect for the strided case:
Good that you pointed that out, because I just added a test for that case and it fails (differently) also with my changes in place. Throws a
pitrou
left a comment
Thanks for the updates @amol- ! Here are some additional comments.
python/pyarrow/tests/test_array.py
Outdated
I don't understand what this is supposed to test. The fact that a copy is made is just an implementation detail.
It verifies that the behaviour matches what we get from variable-length binary arrays, which do not reuse the NumPy array's memory. I don't think it's an implementation detail, because it changes the user experience.
Whether the underlying NumPy array is shared matters to the user: if it is shared, the user can't modify the original NumPy array without also (probably unexpectedly) modifying the Arrow array.
That led me to create https://issues.apache.org/jira/browse/ARROW-12666, because in some cases we reuse the NumPy memory (all basic types) and in other cases we don't (string, binary, etc.). The follow-up ticket suggests making that behaviour explicit, as NumPy does, by adding a copy=True/False argument to the pyarrow.array function.
We can discuss the best way forward in that dedicated ticket; here I wanted to make sure we are consistent with what happens when pa.binary() and pa.binary(N) are used. So the test verifies that if you modify the NumPy array, the Arrow array doesn't change.
Does this solve the strided conversion case? If so, perhaps you can add a test for it?
Sadly not. I expected it would, but I wrote some tests and it wasn't enough. That's why I created https://issues.apache.org/jira/browse/ARROW-12667 as a follow-up issue, so that I can test it for all the various types and make sure it works in all cases.
Also added tests and fix for strided binary arrays (with and without mask)
@pitrou @jorisvandenbossche did you have a chance to do a final pass? Given that the solution is comparable to what we already do for variable-length arrays, doesn't seem to introduce new issues, and is isolated enough, I think it makes sense to ship the fix to contain the bug while we work on possible performance improvements and the other two related issues.
This reverts commit b3e0a37049dffdcac63b5ce446907be60523e651.
pitrou
left a comment
Sorry for the delay and thanks for the fix, @amol- . I've rebased and will merge if CI is green.
…rray Closes apache#10199 from amol-/ARROW-12431 Authored-by: Alessandro Molina <amol@turbogears.org> Signed-off-by: Antoine Pitrou <antoine@python.org>