ARROW-12983: [C++][Python][R] Properly overflow to chunked array in Python-to-Arrow conversion #10470
Conversation
Not yet! But we hope to in ARROW-9293. cc @romainfrancois
Looks good. Though I don't think R is using the chunker. I wonder if R can handle very large string arrays.
Edit: Ah, I see you beat me to the observation :)
Did you mean to check these log statements in?
Whoops, didn't mean to reveal my amazing debugging secrets :) (thanks for catching this)
@pitrou would you like to check over the Python changes here?
pitrou left a comment
I'm curious: wasn't the conversion code from @kszucs already supposed to handle this?
cpp/src/arrow/python/iterators.h (Outdated)
How about always passing offset? That will simplify the overloads a bit (this should not be considered a public API).
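For context, the suggestion is to collapse the near-identical VisitSequence overloads into a single entry point that always takes an offset. A minimal sketch of the idea in Python (the actual code is C++ in iterators.h; the names here are illustrative):

```python
# Hypothetical sketch: one visitor that always accepts an offset
# (defaulting to zero) replaces two near-identical overloads, so the
# "whole sequence" and "resume from offset" callers share one code path.
def visit_sequence(seq, func, offset=0):
    for i in range(offset, len(seq)):
        func(seq[i], i)
```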
Can you try to reconcile the PyPrimitiveConverter for binary and string types? They really look very similar.
It was. That PR landed in 2.0, and according to the issue this regression appeared in version 4.0, so I assume another patch reintroduced the issue. If I recall correctly, we regularly exercised the big memory tests on buildbot, since we didn't have the necessary resources available on hosted CI services. We didn't catch the regression because we had decommissioned the buildbot machines. I'm going to investigate further.
I suspect that this commit has introduced the regression, since it is the only one applied since version 3.0.
Ah sorry @kszucs, I should've linked the commit somewhere; I also think it was that commit that introduced this. I've cleaned up the VisitSequence overloads and consolidated the binary/string converters.
Confirmed. I executed the large memory tests for the suspected commit and instantly got a segfault. Sorry, it was my bad. I quickly drafted an experimental PR to include the Extend methods in the converter API so the R bindings can use the chunker as well. I probably assumed that the builder rewinds. I simply forgot to execute the big memory tests locally, and the CI passed since we didn't have the buildbot builds running anymore, so Romain could have assumed that this would work as expected in the R converter PR. Though I assume there are no test cases covering the chunking on the R side (or they are marked as large memory as well).
@lidavidm the …
Hmm, thanks, I'll take a look. We could do that too. That'll clean things up.
Argh, it's because I only looked at the string/binary case and not the list case, thanks for pointing that out. (Is there a crossbow build that runs these tests?)
No, we don't have the necessary resources available on any of the CI providers. |
Hmm, I'm not sure if ExtendAsMuchAsPossible should replace Extend everywhere. For one, wherever it replaces Extend, you'll then have to check that the number of items added is equal to the number of items you gave it, and in that case, you'll also lose the actual error (or we'll have to declare …). Maybe what we could do instead is expose a …
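The trade-off being weighed here, in a minimal Python sketch (the function names and the CapacityError type are hypothetical stand-ins; the real API is the C++ converter's Extend/ExtendAsMuchAsPossible):

```python
class CapacityError(Exception):
    """Hypothetical stand-in for Arrow's capacity-overflow status."""

def extend(builder, items):
    # All-or-error semantics: any failure propagates to the caller intact.
    for item in items:
        builder.append(item)

def extend_as_much_as_possible(builder, items):
    # Best-effort semantics: stop at the first item that doesn't fit and
    # report how many were consumed -- the triggering error is swallowed,
    # and every caller must compare the count against len(items).
    consumed = 0
    for item in items:
        try:
            builder.append(item)
        except CapacityError:
            break
        consumed += 1
    return consumed
```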
I think I have a solution, a rather simple one actually. I'm going to share it tomorrow.
Closing in favor of #10556
…ython-to-Arrow conversion

Still need to port the R changes from #10470

Tested locally using:

```
PYARROW_TEST_SLOW=ON PYARROW_TEST_LARGE_MEMORY=ON ./run_test.sh -sv pyarrow/tests/
```

Closes #10556 from kszucs/fff

Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
This fixes a regression from when this code was last refactored: when building a chunked array, we now properly build multiple chunks.
This also fixes a case Weston pointed out: an array of 2**31 strings (which won't fit in the offsets buffer) now gets chunked as well, instead of failing with an out-of-memory error.
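A quick way to exercise the new behavior from Python (sizes here are illustrative, and this needs a few GB of free memory, which is why the real test is gated out of CI):

```python
import pyarrow as pa

# Once the cumulative string data exceeds the 2**31 - 1 byte capacity of a
# single chunk's int32 offsets buffer, pa.array() should return a
# ChunkedArray instead of raising an error.
data = ["x" * 1024] * (2**21 + 1)   # just over 2 GiB of character data
result = pa.array(data)
assert isinstance(result, pa.ChunkedArray)
print(result.num_chunks)            # expect more than one chunk
```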
Unfortunately the test is rather slow so I've marked it such that it doesn't run in CI.
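For reference, this is roughly how such a test is kept out of default CI runs in pyarrow's suite (the test name below is hypothetical; the large_memory marker is only enabled when PYARROW_TEST_LARGE_MEMORY is set, as in the command above):

```python
import pytest

@pytest.mark.large_memory
def test_auto_chunking_on_binary_overflow():
    # Exercises the overflow-to-chunked-array path; skipped unless the
    # large_memory test group is explicitly enabled.
    ...
```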
I've updated R, but note that R never builds a chunked array, so there should be no effect.