Conversation

@lidavidm
Member

@lidavidm lidavidm commented Jun 7, 2021

This fixes an error from when this code was last refactored/improved. Now, when building a chunked array, we properly build multiple chunks.

This also fixes a case Weston pointed out: an array of 2**31 strings (which won't fit in the offsets buffer) now also gets chunked instead of just failing with an out-of-memory error.

Unfortunately the test is rather slow, so I've marked it so that it doesn't run in CI.

I've updated R, but note that R never builds a chunked array, so there should be no effect.
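
For illustration, a minimal C++ sketch of the chunking idea described above; this is not the conversion code touched by this PR, and the function name and limit checks are invented for the example:

```
#include <arrow/api.h>

#include <limits>
#include <memory>
#include <string>
#include <vector>

// A String array stores its value offsets as int32, which caps both the
// element count and the total character bytes of a single chunk at roughly
// 2**31. Past that point a new chunk has to be started instead of erroring.
arrow::Result<std::shared_ptr<arrow::ChunkedArray>> BuildChunkedStrings(
    const std::vector<std::string>& values) {
  constexpr int64_t kInt32Max = std::numeric_limits<int32_t>::max();
  arrow::StringBuilder builder;
  arrow::ArrayVector chunks;
  int64_t bytes_in_chunk = 0;
  for (const auto& value : values) {
    const bool chunk_full =
        builder.length() >= kInt32Max ||
        bytes_in_chunk + static_cast<int64_t>(value.size()) > kInt32Max;
    if (chunk_full) {
      std::shared_ptr<arrow::Array> chunk;
      ARROW_RETURN_NOT_OK(builder.Finish(&chunk));  // Finish() resets the builder
      chunks.push_back(std::move(chunk));
      bytes_in_chunk = 0;
    }
    ARROW_RETURN_NOT_OK(builder.Append(value));
    bytes_in_chunk += static_cast<int64_t>(value.size());
  }
  std::shared_ptr<arrow::Array> final_chunk;
  ARROW_RETURN_NOT_OK(builder.Finish(&final_chunk));
  chunks.push_back(std::move(final_chunk));
  return std::make_shared<arrow::ChunkedArray>(std::move(chunks), arrow::utf8());
}
```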

@nealrichardson
Member

> I've updated R, but note that R never builds a chunked array, so there should be no effect.

Not yet! But we hope to in ARROW-9293. cc @romainfrancois

Member

@westonpace westonpace left a comment

Looks good. Though I don't think R is using the chunker. I wonder if R can handle very large string arrays.

Edit: Ah, I see you beat me to the observation :)

Member

Did you mean to check these log statements in?

Member Author

Whoops, didn't mean to reveal my amazing debugging secrets :) (thanks for catching this)

@lidavidm
Member Author

@pitrou would you like to check over the Python changes here?

Member

@pitrou pitrou left a comment

I'm curious: wasn't the conversion code from @kszucs already supposed to handle this?

Member

How about always passing offset? That will simplify the overloads a bit (this should not be considered a public API).

Member

Can you try to reconcile the PyPrimitiveConverter for binary and string types? They really look very similar.
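
As a rough illustration only (not the actual PyPrimitiveConverter), one implementation can be shared between binary- and string-like values by templating on the builder type, since arrow::BinaryBuilder and arrow::StringBuilder expose the same Append/Finish surface; ConvertValues is a made-up name:

```
#include <arrow/api.h>

#include <memory>
#include <string>
#include <vector>

// Shared conversion body: the builder type is the only thing that differs
// between the binary and string cases in this sketch.
template <typename BuilderType>
arrow::Result<std::shared_ptr<arrow::Array>> ConvertValues(
    const std::vector<std::string>& values) {
  BuilderType builder;
  for (const auto& value : values) {
    ARROW_RETURN_NOT_OK(builder.Append(value));
  }
  std::shared_ptr<arrow::Array> out;
  ARROW_RETURN_NOT_OK(builder.Finish(&out));
  return out;
}

// Usage: ConvertValues<arrow::StringBuilder>(values) or
//        ConvertValues<arrow::BinaryBuilder>(values).
```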

@kszucs
Member

kszucs commented Jun 17, 2021

> I'm curious: wasn't the conversion code from @kszucs already supposed to handle this?

It was. That PR landed in 2.0, and according to the issue this regression appeared in version 4.0. I assume another patch has reintroduced the issue.

If I recall correctly, we regularly exercised the big-memory tests on buildbot, since we didn't have the necessary resources available on hosted CI services. We didn't catch the regression because the buildbot machines have since been decommissioned.

I'm going to investigate further.

@kszucs
Member

kszucs commented Jun 17, 2021

I suspect that this commit introduced the regression, since it is the only one applied since version 3.0.

@lidavidm
Member Author

Ah, sorry @kszucs, I should've linked the commit somewhere; I also think it was that commit that introduced this.

I've cleaned up the VisitSequence overloads and consolidated the binary/string converters.

@kszucs
Member

kszucs commented Jun 17, 2021

Confirmed. I executed the large memory tests for the suspected commit and I instantly got a segfault.

Sorry, it was my bad. I quickly drafted an experimental PR to include the Extend methods in the converter API so that the R bindings can use the chunker as well. I probably assumed that the builder rewinds* to its previous length on a faulty Extend() call, hence my comment.

I simply forgot to execute the big-memory tests locally, and CI passed since the buildbot builds weren't running anymore, so Romain could reasonably have assumed that this would work as expected in the R converter PR. Though I assume there are no test cases covering the chunking on the R side (or they are just marked as large-memory as well).

* Which reminds me that we should implement a Rewind() method for the builders to shrink them instead of slicing (or perhaps it is not needed with the changes in this PR).
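
For context, a rough sketch of the rewind-versus-slice idea from that footnote, using only the public C++ builder API; the rollback strategy shown (Finish plus Slice back to a checkpoint) is illustrative, and a builder Rewind() does not exist today:

```
#include <arrow/api.h>

#include <memory>
#include <string>
#include <vector>

// Without a Rewind(), a partially failed batch is rolled back by finishing
// the builder and slicing the result down to the length it had before the
// batch (the checkpoint). A hypothetical Rewind(checkpoint) would instead
// shrink the builder in place so appending could continue in the next chunk.
arrow::Result<std::shared_ptr<arrow::Array>> AppendWithRollback(
    arrow::StringBuilder* builder, const std::vector<std::string>& batch) {
  const int64_t checkpoint = builder->length();
  for (const auto& value : batch) {
    if (!builder->Append(value).ok()) {
      std::shared_ptr<arrow::Array> built;
      ARROW_RETURN_NOT_OK(builder->Finish(&built));
      // Keep only the values that were present before this batch started.
      return built->Slice(0, checkpoint);
    }
  }
  std::shared_ptr<arrow::Array> built;
  ARROW_RETURN_NOT_OK(builder->Finish(&built));
  return built;
}
```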

@kszucs
Member

kszucs commented Jun 17, 2021

@lidavidm the list test cases are failing for me locally. BTW, couldn't we use only the ExtendAsMuchAsPossible methods by passing a zero offset by default?

@lidavidm
Member Author

Hmm, thanks, I'll take a look.

We could do that too. That'll clean things up.

@lidavidm
Member Author

Argh, it's because I only looked at the string/binary case and not the list case; thanks for pointing that out. (Is there a crossbow build that runs these tests?)

@kszucs
Member

kszucs commented Jun 17, 2021

> Is there a crossbow build that runs these tests?

No, we don't have the necessary resources available on any of the CI providers.

@lidavidm
Member Author

lidavidm commented Jun 17, 2021

Hmm, I'm not sure if ExtendAsMuchAsPossible should replace Extend everywhere. For one, anywhere you'd use Extend, you'd then have to check that the number of items added equals the number of items you gave it, and in that case you'd also lose the actual error (or we'd have to declare Status ExtendAsMuchAsPossible(..., int64_t* appended)).

Maybe what we could do instead is expose an int64_t Converter::MaxItemsPerChunk(int64_t) so that the chunked converter can handle all the logic internally, and the converters themselves could assume everything fits in a chunk. Though that's much less flexible; it's basically the current situation.

@kszucs
Member

kszucs commented Jun 17, 2021

I think we need to return a tuple/pair, something like Result<(status, num_converted)>, from the extend methods. I'm looking at it too.

I think I have a solution, a rather simple one actually. I'm going to share it tomorrow.
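
For reference, a rough sketch of the signature shapes floated in the last few comments; every name and the toy capacity below are invented, and this is not Arrow's actual converter interface:

```
#include <arrow/result.h>
#include <arrow/status.h>

#include <algorithm>
#include <cstdint>

// Toy capacity standing in for "what still fits in the current chunk".
constexpr int64_t kToyChunkCapacity = 1 << 20;

// (a) Out-parameter shape: callers learn how many items were appended while
//     still receiving a real Status for genuine failures.
arrow::Status ExtendAsMuchAsPossible(int64_t num_items, int64_t* num_appended) {
  *num_appended = std::min(num_items, kToyChunkCapacity);
  return arrow::Status::OK();
}

// (b) Result shape: arrow::Result already bundles a Status with the value, so
//     "(status, num_converted)" collapses to Result<int64_t>.
arrow::Result<int64_t> ExtendAsMuchAsPossibleResult(int64_t num_items) {
  return std::min(num_items, kToyChunkCapacity);
}

// (c) Capacity query: the chunked converter asks how much fits and drives the
//     chunking itself, so per-type converters can assume everything fits.
int64_t MaxItemsPerChunk(int64_t num_items) {
  return std::min(num_items, kToyChunkCapacity);
}
```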

@kszucs
Member

kszucs commented Jun 21, 2021

Closing in favor of #10556

@kszucs kszucs closed this Jun 21, 2021
kszucs added a commit that referenced this pull request Jun 22, 2021
…ython-to-Arrow conversion

Still need to port the R changes from #10470

Tested locally using:

```
 PYARROW_TEST_SLOW=ON PYARROW_TEST_LARGE_MEMORY=ON ./run_test.sh -sv pyarrow/tests/
```

Closes #10556 from kszucs/fff

Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
sjperkins pushed a commit to sjperkins/arrow that referenced this pull request Jun 23, 2021