Skip to content

Conversation

@amol-
Copy link
Member

@amol- amol- commented Apr 29, 2021

No description provided.

@github-actions
Copy link

@amol- amol- marked this pull request as ready for review April 30, 2021 13:18
@jorisvandenbossche jorisvandenbossche requested a review from pitrou May 3, 2021 10:14
@jorisvandenbossche
Copy link
Member

Not necessarily related to this PR (just noticed while looking at the code and testing a few things), but the conversion for fixed size binary is currently (on master) incorrect for the strided case:

In [38]: arr = np.array([b"ab", b"cd", b"ef"])

In [39]: arr[::2]
Out[39]: array([b'ab', b'ef'], dtype='|S2')

In [41]: pa.array(arr[::2], pa.binary(2)).to_pylist()
Out[41]: [b'ab', b'cd']

@amol-
Copy link
Member Author

amol- commented May 5, 2021

Not necessarily related to this PR (just noticed while looking at the code and testing a few things), but the conversion for fixed size binary is currently (on master) incorrect for the strided case:

Good that you pointed that out, because I just added a test for that case and it fails (differently) also with my changes in place. Throws a

NumPy type not implemented: unrecognized type (18) in GetNumPyTypeName

@pitrou pitrou changed the title ARROW-12431: [C++] Mask is inverted when creating FixedSizeBinaryArray ARROW-12431: [Python] Mask is inverted when creating FixedSizeBinaryArray May 12, 2021
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates @amol- ! Here are some additional comments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what this is supposed to test. The fact that a copy is made is just an implementation detail.

Copy link
Member Author

@amol- amol- May 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It verifies that the behaviour is the same that we get from variable length binary arrays, which do not reuse the numpy array memory. I don't think it's an implementation detail because it changes the user experience.

The fact that the underlying numpy array is shared or not changes the user experience as it means the user can't modify the original numpy array without indirectly modifying (probably unexpectedly) the Arrow array too.

which lead me to create https://issues.apache.org/jira/browse/ARROW-12666 because in some cases we reuse the numpy memory (all basic types) and in other cases we don't (the string, binary etc... types). The follow up ticket suggests to make that behaviour clear as numpy does by adding a copy=True/False argument to the pyarrow.array  function.

We can discuss further what's the best way to go in that dedicated ticket, here I wanted to make sure we were consistent with that happens when pa.binary() and pa.binary(N) are used. So the test verifies that if you modify the numpy array the arraw array doesn't change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this solve the strided conversion case? If so, perhaps you can add a test for it?

Copy link
Member Author

@amol- amol- May 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly not, I expected it would, but I wrote some tests and it wasn't enough. That's why I made https://issues.apache.org/jira/browse/ARROW-12667 as a follow up issue. So that I can test it for all various types and make sure it works in all cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also added tests and fix for strided binary arrays (with and without mask)

@amol-
Copy link
Member Author

amol- commented May 21, 2021

@pitrou @jorisvandenbossche did you have a chance to have a final pass?

Given that the solution is comparable to what we are already doing for variable length arrays, it doesn't seem to introduce new issues and is isolated enough, I think it could make sense to ship a fix to contain the bug while we work on eventual performance improvements and the other two related issues.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay and thanks for the fix, @amol- . I've rebased and will merge if CI is green.

@pitrou pitrou closed this in 4b3f6c3 Jun 15, 2021
sjperkins pushed a commit to sjperkins/arrow that referenced this pull request Jun 23, 2021
…rray

Closes apache#10199 from amol-/ARROW-12431

Authored-by: Alessandro Molina <amol@turbogears.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants