Skip to content
15 changes: 13 additions & 2 deletions cpp/src/arrow/python/numpy_to_arrow.cc
Original file line number Diff line number Diff line change
Expand Up @@ -594,9 +594,20 @@ Status NumPyConverter::Visit(const FixedSizeBinaryType& type) {

if (mask_ != nullptr) {
Ndarray1DIndexer<uint8_t> mask_values(mask_);
RETURN_NOT_OK(builder.AppendValues(data, length_, mask_values.data()));
RETURN_NOT_OK(builder.Reserve(length_));
for (int64_t i = 0; i < length_; ++i) {
if (mask_values[i]) {
RETURN_NOT_OK(builder.AppendNull());
} else {
RETURN_NOT_OK(builder.Append(data));
}
data += stride_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this solve the strided conversion case? If so, perhaps you can add a test for it?

Copy link
Member Author

@amol- amol- May 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly not, I expected it would, but I wrote some tests and it wasn't enough. That's why I made https://issues.apache.org/jira/browse/ARROW-12667 as a follow up issue. So that I can test it for all various types and make sure it works in all cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also added tests and fix for strided binary arrays (with and without mask)

}
} else {
RETURN_NOT_OK(builder.AppendValues(data, length_));
for (int64_t i = 0; i < length_; ++i) {
RETURN_NOT_OK(builder.Append(data));
data += stride_;
}
}

std::shared_ptr<Array> result;
Expand Down
45 changes: 45 additions & 0 deletions python/pyarrow/tests/test_array.py
Original file line number Diff line number Diff line change
Expand Up @@ -2714,6 +2714,51 @@ def test_array_masked():
assert arr.type == pa.int64()


def test_binary_array_masked():
# ARROW-12431
masked_basic = pa.array([b'\x05'], type=pa.binary(1),
mask=np.array([False]))
assert [b'\x05'] == masked_basic.to_pylist()

# Fixed Length Binary
masked = pa.array(np.array([b'\x05']), type=pa.binary(1),
mask=np.array([False]))
assert [b'\x05'] == masked.to_pylist()

masked_nulls = pa.array(np.array([b'\x05']), type=pa.binary(1),
mask=np.array([True]))
assert [None] == masked_nulls.to_pylist()

# Variable Length Binary
masked = pa.array(np.array([b'\x05']), type=pa.binary(),
mask=np.array([False]))
assert [b'\x05'] == masked.to_pylist()

masked_nulls = pa.array(np.array([b'\x05']), type=pa.binary(),
mask=np.array([True]))
assert [None] == masked_nulls.to_pylist()

# Fixed Length Binary, copy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what this is supposed to test. The fact that a copy is made is just an implementation detail.

Copy link
Member Author

@amol- amol- May 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It verifies that the behaviour is the same that we get from variable length binary arrays, which do not reuse the numpy array memory. I don't think it's an implementation detail because it changes the user experience.

The fact that the underlying numpy array is shared or not changes the user experience as it means the user can't modify the original numpy array without indirectly modifying (probably unexpectedly) the Arrow array too.

which lead me to create https://issues.apache.org/jira/browse/ARROW-12666 because in some cases we reuse the numpy memory (all basic types) and in other cases we don't (the string, binary etc... types). The follow up ticket suggests to make that behaviour clear as numpy does by adding a copy=True/False argument to the pyarrow.array  function.

We can discuss further what's the best way to go in that dedicated ticket, here I wanted to make sure we were consistent with that happens when pa.binary() and pa.binary(N) are used. So the test verifies that if you modify the numpy array the arraw array doesn't change.

npa = np.array([b'aaa', b'bbb', b'ccc']*10)
arrow_array = pa.array(npa, type=pa.binary(3),
mask=np.array([False, False, False]*10))
npa[npa == b"bbb"] = b"XXX"
assert ([b'aaa', b'bbb', b'ccc']*10) == arrow_array.to_pylist()


def test_binary_array_strided():
# Masked
nparray = np.array([b"ab", b"cd", b"ef"])
arrow_array = pa.array(nparray[::2], pa.binary(2),
mask=np.array([False, False]))
assert [b"ab", b"ef"] == arrow_array.to_pylist()

# Unmasked
nparray = np.array([b"ab", b"cd", b"ef"])
arrow_array = pa.array(nparray[::2], pa.binary(2))
assert [b"ab", b"ef"] == arrow_array.to_pylist()


def test_array_invalid_mask_raises():
# ARROW-10742
cases = [
Expand Down
2 changes: 1 addition & 1 deletion python/pyarrow/tests/test_pandas.py
Original file line number Diff line number Diff line change
Expand Up @@ -1705,7 +1705,7 @@ def test_numpy_string_array_to_fixed_size_binary(self):
expected = pa.array(list(arr), type=pa.binary(3))
assert converted.equals(expected)

mask = np.array([True, False, True])
mask = np.array([False, True, False])
converted = pa.array(arr, type=pa.binary(3), mask=mask)
expected = pa.array([b'foo', None, b'baz'], type=pa.binary(3))
assert converted.equals(expected)
Expand Down