Skip to content

Conversation

@bkietz
Copy link
Member

@bkietz bkietz commented Jun 25, 2019

ChunkedArrayBuilder should never request BinaryBuilder reserve space for more strings than it can hold (kListMaximumElements); with this change it will instead of begin a new chunk

@wesm wesm self-requested a review June 25, 2019 19:55
@wesm wesm force-pushed the 3762-Parquet-arrowTable-reads-error-when-over branch from 095862e to fd35c7f Compare June 26, 2019 01:02
Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks okay, but I think we need to add some unit tests (that aren't disabled) to exercise the various code paths in ChunkedBinaryBuilder::Reserve to assert that they do what we expect. Otherwise if we refactor later we're going to be playing whackamole

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a reasonable improvement to me. Unfortunately it'll be difficult to come up with a test that doesn't blow 16+GB of memory.

@bkietz
Copy link
Member Author

bkietz commented Jun 26, 2019

I'll add some tests to src/arrow/array-binary-test.cc

@bkietz bkietz force-pushed the 3762-Parquet-arrowTable-reads-error-when-over branch from fd35c7f to 9ae140f Compare June 26, 2019 19:11
@bkietz
Copy link
Member Author

bkietz commented Jun 26, 2019

@pitrou @wesm This new test is pretty minimal but still takes ~4 minutes (or 20s in release) to pass on my machine

@wesm
Copy link
Member

wesm commented Jun 26, 2019

Hm. Can we make the 2GB limit something that's configurable rather than hard-coded? In theory that's the purpose of the max chunk size. We can't have such a long-running test in CI

@bkietz
Copy link
Member Author

bkietz commented Jun 26, 2019

max_chunk_size_ is a restriction on the character data (string_array.buffer[2]) and not on the number of strings in an array; the test I added appends 2 ** 31 empty strings so chunk_data_size_ never rises from 0. (see also: ArrayBuilder::Reserve which allocates space to store array elements vs BinaryBuilder::ReserveData which allocates space to store character data)

We could certainly add another to apply to the # of strings in the string_array, or alternatively use max_chunk_size_ / sizeof(int32_t) as that limit

@wesm
Copy link
Member

wesm commented Jun 26, 2019

Oh okay, sorry for my confusion. Can we make the number of elements per chunk configurable, then? I can see a use case for having no more than 64K elements per chunk, no matter what their size. Then this is much easier to test

@bkietz
Copy link
Member Author

bkietz commented Jun 26, 2019

Will do

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks @bkietz! Will merge once CI has run

@wesm wesm force-pushed the 3762-Parquet-arrowTable-reads-error-when-over branch from 16baa10 to 95a4e30 Compare June 27, 2019 04:36
@wesm
Copy link
Member

wesm commented Jun 27, 2019

Merging, the failure is the cloud.r-project.org DNS issue

@wesm wesm closed this in a634f92 Jun 27, 2019
nealrichardson pushed a commit to nealrichardson/arrow that referenced this pull request Jun 27, 2019
ChunkedArrayBuilder should never request BinaryBuilder reserve space for more strings than it can hold (`kListMaximumElements`); with this change it will instead of begin a new chunk

Author: Benjamin Kietzman <bengilgit@gmail.com>

Closes apache#4695 from bkietz/3762-Parquet-arrowTable-reads-error-when-over and squashes the following commits:

95a4e30 <Benjamin Kietzman> add configurable maximum element count to ChunkedBinaryBuilder
5331a2c <Benjamin Kietzman> add test for ChunkedBinaryBuilder's new reserve behavior
3eb5719 <Benjamin Kietzman> manage ChunkedBinaryBuilder's capacity
d626c07 <Benjamin Kietzman> attempted fix, killed: OOM
cb000ab <Benjamin Kietzman> add (failing) test which repros issue in c++
@bkietz bkietz deleted the 3762-Parquet-arrowTable-reads-error-when-over branch February 25, 2021 16:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants