@pcmoritz
Contributor

@pcmoritz pcmoritz commented Oct 13, 2017

This is a temporary workaround until the Arrow tensor supports zero-copy handling of byte-width booleans.

Contributor Author

@pcmoritz pcmoritz Oct 13, 2017

I'm not at all sure if this is the right fix; maybe we need a separate field for the width if the type is contained in a tensor? Standardizing around numpy for tensors seems the way to go.

Member

Boolean in Arrow is 1 bit, so don’t make this change. We may need to get creative about dealing with NumPy’s metadata.
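
To make the size mismatch concrete, here is a minimal pure-Python sketch of the two layouts: one byte per value (how numpy.bool_ data is stored) versus eight values per byte, least-significant bit first (how Arrow packs booleans). The helper names `pack_bools_lsb` and `unpack_bools_lsb` are made up for illustration and are not part of Arrow or NumPy.

```python
def pack_bools_lsb(values):
    """Pack booleans eight per byte, least-significant bit first."""
    packed = bytearray((len(values) + 7) // 8)
    for i, flag in enumerate(values):
        if flag:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_bools_lsb(packed, length):
    """Inverse of pack_bools_lsb; the length is needed to drop padding bits."""
    return [bool((packed[i // 8] >> (i % 8)) & 1) for i in range(length)]


flags = [True, False, True, True, False, False, False, False, True, False]

# One byte per value, as in a numpy.bool_ buffer.
byte_per_value = bytes(int(f) for f in flags)

# One bit per value, as in an Arrow boolean buffer.
bit_packed = pack_bools_lsb(flags)

print(len(byte_per_value), len(bit_packed))
```

Because the two buffers differ in layout and not just in metadata, a byte-width boolean buffer cannot be reinterpreted as a bit-packed one without a conversion pass, which is why zero copy is not possible without a byte-width type.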

Contributor Author

Fair enough. For our Python serialization we could hack around it by defining a custom serializer. However, for Tensor.from_numpy() we can't do that, because the type needs to be fully encoded in the Tensor type. Would it be acceptable to introduce a new type for this? Let me know which solution you prefer.

Contributor Author

In particular I'm thinking of introducing a "bool8" type, which is a bool that is encoded as a single byte.

Member

I created https://issues.apache.org/jira/browse/ARROW-1674. This is probably the right way to handle this at the format level. We can separately add a data type in C++. It will be useful to be able to receive numpy.bool_ data with zero copy in Arrow.

@pcmoritz pcmoritz force-pushed the ndarray-bool branch 5 times, most recently from 097a78b to 1de11a4 Compare October 17, 2017 22:52
@robertnishihara
Contributor

This just passes bool arrays to the custom serializer, right? Does it make sense to register the custom serializer in the default serialization context or not?

@pcmoritz
Contributor Author

Magically, this is already taken care of. The custom serializer we already have is generic: it converts the array to nested lists, and the custom deserializer makes a numpy array out of them again. Not the most efficient solution, but it fixes the problem until we have the proper solution :)

@robertnishihara
Contributor

Oh, I see. Could be made efficient by having the custom serializer special case bool arrays. But if this is just temporary then no need to.
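
One way such a special case could look, as a sketch only: reinterpret the bool array as uint8 (same item size, so no copy for a contiguous array) and let the uint8 data take whatever fast path already exists. The function names `bool_array_to_uint8` and `uint8_to_bool_array` are hypothetical and are not part of pyarrow; this PR does not implement anything like it.

```python
import numpy as np


def bool_array_to_uint8(arr):
    """Reinterpret a bool array as uint8 without copying the data."""
    if arr.dtype != np.bool_:
        raise TypeError("expected a numpy.bool_ array")
    # ascontiguousarray is a no-op (returns the input) for contiguous arrays.
    return np.ascontiguousarray(arr).view(np.uint8)


def uint8_to_bool_array(data):
    """Reinterpret uint8 data produced by bool_array_to_uint8 as bool."""
    return data.view(np.bool_)


mask = np.array([[True, False, True], [False, False, True]])
as_bytes = bool_array_to_uint8(mask)
restored = uint8_to_bool_array(as_bytes)

print(as_bytes.dtype, as_bytes.shape, np.shares_memory(mask, as_bytes))
```

The round trip is only safe for data that originated as bool, since NumPy expects bool bytes to hold 0 or 1.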

@pcmoritz
Contributor Author

+1 this is ready to merge as a workaround for ray-project/ray#1121

@wesm
Member

wesm commented Oct 18, 2017

I haven't looked too deeply, but could you explain how this fix works?

@pcmoritz
Contributor Author

pcmoritz commented Oct 18, 2017

Yeah, with the switch case I removed gone, bool arrays fall through to the default branch, which uses the custom serializer. That in turn ends up calling the function

def _serialize_numpy_array(obj):

which converts the array to a nested list upon serialization and back upon deserialization.
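
For readers unfamiliar with that fallback, the round trip described here can be sketched roughly as follows. This is an illustration under assumptions, not the body of pyarrow's `_serialize_numpy_array`; the function names `serialize_via_nested_list` and `deserialize_from_nested_list` are invented for the example.

```python
import numpy as np


def serialize_via_nested_list(arr):
    """Turn an array into plain Python lists plus a dtype string."""
    return arr.tolist(), arr.dtype.str


def deserialize_from_nested_list(payload):
    """Rebuild an array from the (nested list, dtype string) pair."""
    nested, dtype_str = payload
    return np.array(nested, dtype=np.dtype(dtype_str))


mask = np.array([[True, False], [False, True]])
payload = serialize_via_nested_list(mask)
restored = deserialize_from_nested_list(payload)

print(payload[0], restored.dtype, np.shares_memory(mask, restored))
```

Every element becomes a Python object on the way out and is copied into a fresh buffer on the way back, which is why this path is correct but neither fast nor zero copy.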

@wesm
Member

wesm commented Oct 18, 2017

Got it, so the workaround is slower / not zero copy. No big deal. I will work to get this fixed more properly + zero copy reads in time for 0.8.0

Member

@wesm wesm left a comment

+1

@asfgit asfgit closed this in 298e343 Oct 18, 2017
@wesm wesm deleted the ndarray-bool branch October 18, 2017 21:08