Skip to content

Conversation

@pcmoritz
Copy link
Contributor

@pcmoritz pcmoritz commented Oct 20, 2017

This is taking a stab at exposing UnionArray to pyarrow. Tasks to be done:

  • Support UnionType::SPARSE
  • Add doc strings

@wesm
Copy link
Member

wesm commented Oct 23, 2017

Can you rebase? What more work needs to be done here?

@wesm
Copy link
Member

wesm commented Nov 4, 2017

@pcmoritz anything I can do to help on this? Would be great to get this into 0.8.0

@pcmoritz
Copy link
Contributor Author

pcmoritz commented Nov 4, 2017

@wesm: Agreed it should be part of 0.8.0. I'll take a stab at the remaining items now and let you know how things go.

@pcmoritz pcmoritz force-pushed the pyarrow-union-array branch 5 times, most recently from df611cb to fc73439 Compare November 5, 2017 18:31
@pcmoritz pcmoritz changed the title [WIP] ARROW-972: UnionArray in pyarrow ARROW-972: UnionArray in pyarrow Nov 5, 2017
@pcmoritz
Copy link
Contributor Author

pcmoritz commented Nov 5, 2017

This is now ready to review

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made some minor comments about names and a few other things. Thank you for doing this!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be better with the type on the LHS (std::vector<std::shared_ptr<Field>> types;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be useful as an alternate version of arrow::union_? Maybe with the same API (the mode as the last argument) but omitting the type codes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May also want to assert that value_offsets has 0 null count

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this is naming similar to ListArray::FromArrays, but because there are two versions, I wonder if calling this UnionArray::MakeDense (or FromArraysDense, more verbose) would be clearer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wasn't initially clear that this is the child type, maybe call this child_type instead (and add doxygen comment)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now gone and using the Array's type to get this information. It's more consistent with how things are done for ListArray and StructArray.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it true that using an underscore saves you from declaring this method in the pxd file? If so I've been wasting my time a bunch in the past :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite expensive on a per-value basis. Could these wrapped types be accessed from the parent pyarrow UnionArray?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, good point! I'm constructing the Python objects for children types once per array in the UnionType.

@pcmoritz pcmoritz force-pushed the pyarrow-union-array branch from 296de0a to be13b58 Compare November 8, 2017 20:30
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't do this and plug mode in directly below, clang complains:

[ 33%] Building CXX object CMakeFiles/lib.dir/lib.cxx.o
/Users/pcmoritz/arrow/python/build/temp.macosx-10.7-x86_64-3.5/lib.cxx:100873:100: error: invalid operands to binary expression ('enum arrow::UnionMode' and 'int')
                            return (enum  arrow::UnionMode) (((((enum  arrow::UnionMode)digits[1]) << PyLong_SHIFT) | (enum  arrow::UnionMode)digits[0]));
                                                               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~
/Users/pcmoritz/arrow/python/build/temp.macosx-10.7-x86_64-3.5/lib.cxx:100882:102: error: invalid operands to binary expression ('enum arrow::UnionMode' and 'int')
                            return (enum  arrow::UnionMode) (((((((enum  arrow::UnionMode)digits[2]) << PyLong_SHIFT) | (enum  arrow::UnionMode)digits[1]) << PyLong_SHIFT) ...
                                                                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~
/Users/pcmoritz/arrow/python/build/temp.macosx-10.7-x86_64-3.5/lib.cxx:100891:104: error: invalid operands to binary expression ('enum arrow::UnionMode' and 'int')
                            return (enum  arrow::UnionMode) (((((((((enum  arrow::UnionMode)digits[3]) << PyLong_SHIFT) | (enum  arrow::UnionMode)digits[2]) << PyLong_SHIFT...
                                                                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~
/Users/pcmoritz/arrow/python/build/temp.macosx-10.7-x86_64-3.5/lib.cxx:100929:130: error: invalid operands to binary expression ('enum arrow::UnionMode' and 'int')
                            return (enum  arrow::UnionMode) (((enum  arrow::UnionMode)-1)*(((((enum  arrow::UnionMode)digits[1]) << PyLong_SHIFT) | (enum  arrow::UnionMode)d...
                                                                                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~
/Users/pcmoritz/arrow/python/build/temp.macosx-10.7-x86_64-3.5/lib.cxx:100938:101: error: invalid operands to binary expression ('enum arrow::UnionMode' and 'int')
                            return (enum  arrow::UnionMode) ((((((enum  arrow::UnionMode)digits[1]) << PyLong_SHIFT) | (enum  arrow::UnionMode)digits[0])));
                                                                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~
/Users/pcmoritz/arrow/python/build/temp.macosx-10.7-x86_64-3.5/lib.cxx:100947:132: error: invalid operands to binary expression ('enum arrow::UnionMode' and 'int')
                            return (enum  arrow::UnionMode) (((enum  arrow::UnionMode)-1)*(((((((enum  arrow::UnionMode)digits[2]) << PyLong_SHIFT) | (enum  arrow::UnionMode...
                                                                                               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~
/Users/pcmoritz/arrow/python/build/temp.macosx-10.7-x86_64-3.5/lib.cxx:100956:103: error: invalid operands to binary expression ('enum arrow::UnionMode' and 'int')
                            return (enum  arrow::UnionMode) ((((((((enum  arrow::UnionMode)digits[2]) << PyLong_SHIFT) | (enum  arrow::UnionMode)digits[1]) << PyLong_SHIFT) ...
                                                                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~
/Users/pcmoritz/arrow/python/build/temp.macosx-10.7-x86_64-3.5/lib.cxx:100965:134: error: invalid operands to binary expression ('enum arrow::UnionMode' and 'int')
                            return (enum  arrow::UnionMode) (((enum  arrow::UnionMode)-1)*(((((((((enum  arrow::UnionMode)digits[3]) << PyLong_SHIFT) | (enum  arrow::UnionMo...
                                                                                                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~
/Users/pcmoritz/arrow/python/build/temp.macosx-10.7-x86_64-3.5/lib.cxx:100974:105: error: invalid operands to binary expression ('enum arrow::UnionMode' and 'int')
                            return (enum  arrow::UnionMode) ((((((((((enum  arrow::UnionMode)digits[3]) << PyLong_SHIFT) | (enum  arrow::UnionMode)digits[2]) << PyLong_SHIFT...
                                                                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~
9 errors generated.

@pcmoritz pcmoritz force-pushed the pyarrow-union-array branch 2 times, most recently from 2007736 to 02bdfc8 Compare November 8, 2017 23:08
@pcmoritz
Copy link
Contributor Author

pcmoritz commented Nov 9, 2017

@wesm This is ready except for the Windows compile error, do you think you could take a quick look at that? I'm not sure what is going on.

@wesm
Copy link
Member

wesm commented Nov 9, 2017

enum class is a huge pain in Cython. I'll go ahead and change UnionMode to be a scoped enumeration here quickly

@wesm wesm force-pushed the pyarrow-union-array branch from 02bdfc8 to 7f3ca31 Compare November 9, 2017 05:17
@wesm
Copy link
Member

wesm commented Nov 9, 2017

Done. Will merge this once the build is green

@pcmoritz
Copy link
Contributor Author

pcmoritz commented Nov 9, 2017

Great, thanks :)

@wesm wesm closed this in 65a9055 Nov 9, 2017
@wesm wesm deleted the pyarrow-union-array branch November 9, 2017 16:32
@wesm
Copy link
Member

wesm commented Nov 9, 2017

Thanks @pcmoritz!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants