-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-9992: [C++][Python] Refactor python to arrow conversions based on a reusable conversion API #8088
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Are there some related JIRAs that will be fixed by this? |
|
Not directly, but this will help to solve multiple ones. I'm going to create one. |
|
Just executed the ASV benchmarks on this PR and the master branch with the two sample sizes and the results showed slighty better performance in both cases: Also compared the So far there is no performance degradation - if we can trust the ASV benchmarks. |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why separate PyArrayConverter and ArrayConverter?
My hope is that the classes in |
bkietz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a neat cleanup of the conversion code, and potentially reusable as well. Thanks for doing this! I have a few comments on the overall structure below
|
@romainfrancois since the conversion code in Could you please take a look at the base classes in converter.h to see whether would it help/improve the R side implementation? I have not documented it yet because it's still under discussion but you can take a look how I'm using it in the python bindings. |
|
@github-actions crossbow submit test-spark |
|
Revision: 79e8229d6828f7f63f80db75a4ef89af9deb03d3 Submitted crossbow builds: ursa-labs/crossbow @ actions-541
|
|
@ursabot build |
bkietz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Chunker is neat, thanks for working on this
My main concern here is overcomplexity of the Converter<> interface, since it currently bears responsibilities both as the abstract base class for conversion and as a trait mapping from DataType subclasses to concrete converters (used to enable generic construction).
I think Converter<>, Chunker<> and the mixins PrimitiveConverter<> etc can remain in util/converter.h but costruction logic (MakeConverterImpl) should be moved back to python_to_arrow.h. Fully generic construction is definitely a worthwhile goal but I also think it's worth doing so in a dedicated follow up. Establishing the base classes for generic conversion is already a great step forward
33c5aa7 to
f3be748
Compare
bkietz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking much better, just a few more simplifications to construction logic and I think this will be ready to go
cpp/src/arrow/util/converter.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The converter's Reserve() might fail due to overflow, but this shouldn't be an error for Chunker::Reserve since an overflow can be averted by finishing the chunk. See for example the logic surrounding ChunkedBinaryBuilder::extra_capacity_. This doesn't need to be handled in this PR since Reserve is a performance hint; for now I think we can just ignore errors emitted by this call to Reserve
bkietz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few more comments:
jorisvandenbossche
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(only briefly looked at the python code + tests, impressive set of changes ;-))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't raise on "d" being present here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It tolerates extra items (similarly when converting from dicts). I can restrict it if you think that would be more appropiate. We may want to alter this behavior with specific conversion flags.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the order matters when constructing from pairs (based on the test below), I would personally expect additional elements would also not be allowed (BTW, what actually happens if the additional element is not in the last position? Maybe add a test for that as well?).
While when constructing from dicts also the order does not matter I suppose? So this could be more tolerant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will raise since the key field equality is checked explicitly. It also supports converting less elements than the number of fields.
bkietz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merging
Targets of the refactoring:
Resultreturn values and containis_utf8flagFixes
ARROW-9997 [Python] StructScalar.as_py() fails if the type has duplicate field namesBackward incompatibility
Since a struct type can contain duplicated field names we cannot return a struct scalar as a mapping, so I had to change the.as_py()representation to return with a list of key-value pairs.TODOs:
Library size
Before:
After:
Benchmarks
Executed the following ASV benchmark:
After some optimization: