Conversation

@lidavidm
Member

@lidavidm lidavidm commented Jun 7, 2021

This fixes an error from when this code was last refactored/improved. Now, when building a chunked array, we properly build multiple chunks.

This also fixes a case Weston pointed out: an array of 2**31 strings (which won't fit in the offsets buffer) now also gets chunked instead of just failing with an out-of-memory error.

Unfortunately the test is rather slow, so I've marked it so that it doesn't run in CI.

I've updated R, but note that R never builds a chunked array, so there should be no effect.
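
For illustration, a minimal C++ sketch of the chunking idea described above; this is not the conversion code touched by this PR, and the function name and limit checks are invented for the example:

```
#include <arrow/api.h>

#include <limits>
#include <memory>
#include <string>
#include <vector>

// A String array stores its value offsets as int32, which caps both the
// element count and the total character bytes of a single chunk at roughly
// 2**31. Past that point a new chunk has to be started instead of erroring.
arrow::Result<std::shared_ptr<arrow::ChunkedArray>> BuildChunkedStrings(
    const std::vector<std::string>& values) {
  constexpr int64_t kInt32Max = std::numeric_limits<int32_t>::max();
  arrow::StringBuilder builder;
  arrow::ArrayVector chunks;
  int64_t bytes_in_chunk = 0;
  for (const auto& value : values) {
    const bool chunk_full =
        builder.length() >= kInt32Max ||
        bytes_in_chunk + static_cast<int64_t>(value.size()) > kInt32Max;
    if (chunk_full) {
      std::shared_ptr<arrow::Array> chunk;
      ARROW_RETURN_NOT_OK(builder.Finish(&chunk));  // Finish() resets the builder
      chunks.push_back(std::move(chunk));
      bytes_in_chunk = 0;
    }
    ARROW_RETURN_NOT_OK(builder.Append(value));
    bytes_in_chunk += static_cast<int64_t>(value.size());
  }
  std::shared_ptr<arrow::Array> final_chunk;
  ARROW_RETURN_NOT_OK(builder.Finish(&final_chunk));
  chunks.push_back(std::move(final_chunk));
  return std::make_shared<arrow::ChunkedArray>(std::move(chunks), arrow::utf8());
}
```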

@nealrichardson
Member

> I've updated R, but note that R never builds a chunked array, so there should be no effect.

Not yet! But we hope to in ARROW-9293. cc @romainfrancois

Member

@westonpace westonpace left a comment

Looks good. Though I don't think R is using the chunker. I wonder if R can handle very large string arrays.

Edit: Ah, I see you beat me to the observation :)

Member

Did you mean to check these log statements in?

Member Author

Whoops, didn't mean to reveal my amazing debugging secrets :) (thanks for catching this)

@lidavidm
Member Author

@pitrou would you like to check over the Python changes here?

Member

@pitrou pitrou left a comment

I'm curious: wasn't the conversion code from @kszucs already supposed to handle this?

Member

How about always passing offset? That will simplify the overloads a bit (this should not be considered a public API).

Member

Can you try to reconcile the PyPrimitiveConverter for binary and string types? They really look very similar.
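
As a rough illustration only (not the actual PyPrimitiveConverter), one implementation can be shared between binary- and string-like values by templating on the builder type, since arrow::BinaryBuilder and arrow::StringBuilder expose the same Append/Finish surface; ConvertValues is a made-up name:

```
#include <arrow/api.h>

#include <memory>
#include <string>
#include <vector>

// Shared conversion body: the builder type is the only thing that differs
// between the binary and string cases in this sketch.
template <typename BuilderType>
arrow::Result<std::shared_ptr<arrow::Array>> ConvertValues(
    const std::vector<std::string>& values) {
  BuilderType builder;
  for (const auto& value : values) {
    ARROW_RETURN_NOT_OK(builder.Append(value));
  }
  std::shared_ptr<arrow::Array> out;
  ARROW_RETURN_NOT_OK(builder.Finish(&out));
  return out;
}

// Usage: ConvertValues<arrow::StringBuilder>(values) or
//        ConvertValues<arrow::BinaryBuilder>(values).
```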

@kszucs
Member

kszucs commented Jun 17, 2021

> I'm curious: wasn't the conversion code from @kszucs already supposed to handle this?

It was. That PR landed in 2.0, and according to the issue this regression appeared in version 4.0. I assume another patch has reintroduced the issue.

If I recall correctly, we regularly exercised the big-memory tests on buildbot, since we didn't have the necessary resources available on hosted CI services. We didn't catch the regression because the buildbot machines have since been decommissioned.

I'm going to investigate further.

@kszucs
Member

kszucs commented Jun 17, 2021

I suspect that this commit introduced the regression, since it is the only one applied since version 3.0.

@lidavidm
Member Author

Ah, sorry @kszucs, I should've linked the commit somewhere; I also think it was that commit that introduced this.

I've cleaned up the VisitSequence overloads and consolidated the binary/string converters.

@kszucs
Member

kszucs commented Jun 17, 2021

Confirmed. I executed the large memory tests for the suspected commit and I instantly got a segfault.

Sorry, it was my bad. I quickly drafted an experimental PR to include the Extend methods in the converter API so that the R bindings can use the chunker as well. I probably assumed that the builder rewinds* to its previous length on a faulty Extend() call, hence my comment.

I simply forgot to execute the big-memory tests locally, and CI passed since the buildbot builds weren't running anymore, so Romain could reasonably have assumed that this would work as expected in the R converter PR. Though I assume there are no test cases covering the chunking on the R side (or they are just marked as large-memory as well).

* Which reminds me that we should implement a Rewind() method for the builders to shrink them instead of slicing (or perhaps it is not needed with the changes in this PR).
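
For context, a rough sketch of the rewind-versus-slice idea from that footnote, using only the public C++ builder API; the rollback strategy shown (Finish plus Slice back to a checkpoint) is illustrative, and a builder Rewind() does not exist today:

```
#include <arrow/api.h>

#include <memory>
#include <string>
#include <vector>

// Without a Rewind(), a partially failed batch is rolled back by finishing
// the builder and slicing the result down to the length it had before the
// batch (the checkpoint). A hypothetical Rewind(checkpoint) would instead
// shrink the builder in place so appending could continue in the next chunk.
arrow::Result<std::shared_ptr<arrow::Array>> AppendWithRollback(
    arrow::StringBuilder* builder, const std::vector<std::string>& batch) {
  const int64_t checkpoint = builder->length();
  for (const auto& value : batch) {
    if (!builder->Append(value).ok()) {
      std::shared_ptr<arrow::Array> built;
      ARROW_RETURN_NOT_OK(builder->Finish(&built));
      // Keep only the values that were present before this batch started.
      return built->Slice(0, checkpoint);
    }
  }
  std::shared_ptr<arrow::Array> built;
  ARROW_RETURN_NOT_OK(builder->Finish(&built));
  return built;
}
```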

@kszucs
Member

kszucs commented Jun 17, 2021

@lidavidm the list test cases are failing for me locally. BTW, couldn't we use only the ExtendAsMuchAsPossible methods by passing a zero offset by default?

@lidavidm
Member Author

Hmm, thanks, I'll take a look.

We could do that too. That'll clean things up.

@lidavidm
Member Author

Argh, it's because I only looked at the string/binary case and not the list case; thanks for pointing that out. (Is there a crossbow build that runs these tests?)

@kszucs
Member

kszucs commented Jun 17, 2021

> Is there a crossbow build that runs these tests?

No, we don't have the necessary resources available on any of the CI providers.

@lidavidm
Member Author

lidavidm commented Jun 17, 2021

Hmm, I'm not sure if ExtendAsMuchAsPossible should replace Extend everywhere. For one, anywhere you'd use Extend, you'd then have to check that the number of items added equals the number of items you gave it, and in that case you'd also lose the actual error (or we'd have to declare Status ExtendAsMuchAsPossible(..., int64_t* appended)).

Maybe what we could do instead is expose an int64_t Converter::MaxItemsPerChunk(int64_t) so that the chunked converter can handle all the logic internally, and the converters themselves could assume everything fits in a chunk. Though that's much less flexible; it's basically the current situation.

@kszucs
Member

kszucs commented Jun 17, 2021

I think we need to return a tuple/pair, something like Result<(status, num_converted)>, from the extend methods. I'm looking at it too.

I think I have a solution, a rather simple one actually. I'm going to share it tomorrow.
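
For reference, a rough sketch of the signature shapes floated in the last few comments; every name and the toy capacity below are invented, and this is not Arrow's actual converter interface:

```
#include <arrow/result.h>
#include <arrow/status.h>

#include <algorithm>
#include <cstdint>

// Toy capacity standing in for "what still fits in the current chunk".
constexpr int64_t kToyChunkCapacity = 1 << 20;

// (a) Out-parameter shape: callers learn how many items were appended while
//     still receiving a real Status for genuine failures.
arrow::Status ExtendAsMuchAsPossible(int64_t num_items, int64_t* num_appended) {
  *num_appended = std::min(num_items, kToyChunkCapacity);
  return arrow::Status::OK();
}

// (b) Result shape: arrow::Result already bundles a Status with the value, so
//     "(status, num_converted)" collapses to Result<int64_t>.
arrow::Result<int64_t> ExtendAsMuchAsPossibleResult(int64_t num_items) {
  return std::min(num_items, kToyChunkCapacity);
}

// (c) Capacity query: the chunked converter asks how much fits and drives the
//     chunking itself, so per-type converters can assume everything fits.
int64_t MaxItemsPerChunk(int64_t num_items) {
  return std::min(num_items, kToyChunkCapacity);
}
```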

@kszucs
Member

kszucs commented Jun 21, 2021

Closing in favor of #10556

@kszucs kszucs closed this Jun 21, 2021
kszucs added a commit that referenced this pull request Jun 22, 2021
…ython-to-Arrow conversion

Still need to port the R changes from #10470

Tested locally using:

```
 PYARROW_TEST_SLOW=ON PYARROW_TEST_LARGE_MEMORY=ON ./run_test.sh -sv pyarrow/tests/
```

Closes #10556 from kszucs/fff

Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
sjperkins pushed a commit to sjperkins/arrow that referenced this pull request Jun 23, 2021